Schedule for: 17w5131 - Statistical and Computational Challenges in Large Scale Molecular Biology

Arriving in Banff, Alberta on Sunday, March 26 and departing Friday March 31, 2017
Sunday, March 26
16:00 - 17:30 Check-in begins at 16:00 on Sunday and is open 24 hours (Front Desk - Professional Development Centre)
17:30 - 19:30 Dinner
A buffet dinner is served daily between 5:30pm and 7:30pm in the Vistas Dining Room, the top floor of the Sally Borden Building.
(Vistas Dining Room)
20:00 - 22:00 Informal gathering (Corbett Hall Lounge (CH 2110))
Monday, March 27
07:00 - 08:45 Breakfast
Breakfast is served daily between 7 and 9am in the Vistas Dining Room, the top floor of the Sally Borden Building.
(Vistas Dining Room)
08:45 - 09:00 Introduction and Welcome by BIRS Station Manager (TCPL 201)
09:00 - 10:00 Eric Xing: Large scale machine learning tutorial (TCPL 201)
10:00 - 10:20 Coffee Break (TCPL Foyer)
10:20 - 11:35 GWAS methods 1 (Chair: Jeff Leek) (TCPL 201)
10:20 - 10:45 Gen Li: QRank: A novel quantile regression tool for eQTL discovery
Over the past decade, there has been a remarkable improvement in our understanding of the role of genetic variation in complex human diseases, especially via genome-wide association studies. However, the underlying molecular mechanisms are still poorly characterized, impeding the development of therapeutic interventions. Identifying genetic variants that influence the expression level of a gene, i.e. expression quantitative trait loci (eQTLs), can help us understand how genetic variants influence traits at the molecular level. While most eQTL studies focus on identifying mean effects on gene expression using linear regression, evidence suggests that genetic variation can impact the entire distribution of the expression level. Motivated by the potential of higher-order associations, several studies investigated variance eQTLs. In this paper, we develop a Quantile Rank-score based test (QRank), which provides an easy way to identify eQTLs that are associated with the conditional quantile functions of gene expression. We have applied the proposed QRank to the Genotype-Tissue Expression (GTEx) project, and found that the proposed method complements the existing methods, and identifies new eQTLs with heterogeneous effects across different quantile levels. Notably, we show that the eQTLs identified by QRank but missed by linear regression are associated with greater enrichment in genome-wide significant SNPs from the GWAS catalog, and are also more likely to be tissue specific than those identified by linear regression.
(TCPL 201)
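The core idea behind quantile-level eQTL effects can be illustrated with a toy simulation (this is a sketch of the concept, not the authors' rank-score test; the genotype coding, sample size, and effect sizes below are invented): a variant that changes the *spread* of expression shows opposite slopes at the lower and upper quantiles, while the median effect stays near zero and a mean-based test would see little.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 3000
geno = rng.integers(0, 3, size=n)                 # 0/1/2 minor-allele counts
# expression whose spread, not mean, grows with allele count
expr = rng.normal(0.0, 1.0 + 0.5 * geno, size=n)

def quantile_slope(tau):
    # empirical tau-quantile of expression within each genotype group,
    # then the per-allele slope across groups (a crude stand-in for a
    # proper quantile-regression test)
    qs = np.array([np.quantile(expr[geno == g], tau) for g in (0, 1, 2)])
    return np.polyfit([0, 1, 2], qs, 1)[0]

for tau in (0.25, 0.5, 0.75):
    print(f"tau={tau}: per-allele slope = {quantile_slope(tau):+.3f}")
```

The upper-quantile slope is positive and the lower-quantile slope negative, the "heterogeneous effects across quantile levels" pattern the abstract describes.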
10:45 - 11:10 Laurent Jacob: Representing Genetic Determinants in Bacterial GWAS with Compacted De Bruijn Graphs
Antimicrobial resistance has become a major worldwide public health concern, calling for a better characterization of existing and novel resistance mechanisms. GWAS methods applied to bacterial genomes have shown encouraging results for new genetic marker discovery. Most existing approaches either look at SNPs obtained by sequence alignment or consider sets of kmers, whose presence in the genome is associated with the phenotype of interest. While the former approach can only be performed when genomes are similar enough for an alignment to make sense, the latter can lead to redundant descriptions and to results which are hard to interpret. We propose an alignment-free GWAS method detecting haplotypes of variable length associated with resistance, using compacted De Bruijn graphs. Our representation is flexible enough to deal with very plastic genomes subject to gene transfers while drastically reducing the number of features to explore compared to kmers, without loss of information. It accommodates polymorphisms in core genes, accessory genes and non-coding regions. Using our representation in a GWAS leads to the selection of a small number of entities which are easier to visualize and interpret than fixed length kmers. We illustrate the benefit of our approach by describing known as well as potential novel determinants of antimicrobial resistance in Pseudomonas aeruginosa, a pathogenic bacterium with a highly plastic genome. Pre-print available at http://biorxiv.org/content/early/2017/03/03/113563.
(TCPL 201)
11:10 - 11:35 Pierre Neuvial: Post hoc inference for multiple testing
When testing a large number of hypotheses simultaneously, a common practice (e.g., in genomic applications) is to (i) select a subset of candidate hypotheses and (ii) refine this selection using domain-based knowledge. Unfortunately, it is generally not possible to provide a statistical guarantee (e.g. controlled False Discovery Rate) for the resulting set of candidates. This gap between statistical theory and applications has motivated the development of post hoc procedures, for which the candidate sets can be selected "after having seen the data". Goeman and Solari (Stat. Science, 2011) have proposed a construction of post hoc procedures based on "closed testing". Their main procedure is sharp when the hypotheses are independent, but may be conservative under positive dependence. We introduce an alternative framework for post hoc inference, based on the control of a multiple testing risk called the joint Family-Wise Error Rate (JFWER). We propose JFWER-controlling procedures tailored to the case where the joint distribution of the test statistics under the null hypothesis is known, or can be sampled from. We discuss their performance and their link to the procedures proposed by Goeman and Solari. This is joint work with Gilles Blanchard and Etienne Roquain.
(TCPL 201)
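As background for this talk, the Simes-based post hoc bound of Goeman and Solari (the baseline the JFWER procedures refine) can be computed directly from the p-values of any user-selected set: it upper-bounds the number of false positives in that set with confidence 1 - alpha, whenever the Simes inequality holds (e.g. independence or PRDS). This is a sketch of that baseline, not of the talk's JFWER procedures.

```python
import numpy as np

def simes_posthoc_bound(p_selected, m, alpha=0.05):
    """Upper confidence bound on the number of false positives in a
    user-selected set of hypotheses, out of m tested in total.
    Simes-based (Goeman & Solari style) bound:
        min over k of  #{p_i > k*alpha/m} + k - 1."""
    p = np.asarray(p_selected)
    s = len(p)
    bounds = [int(np.sum(p > k * alpha / m)) + k - 1 for k in range(1, s + 1)]
    return min(bounds)

m = 1000
print(simes_posthoc_bound(np.full(10, 1e-6), m))  # 10 very strong hits
print(simes_posthoc_bound(np.full(10, 0.5), m))   # 10 null-looking p-values
```

For the ten strong hits the bound is 0 (all discoveries certified), while for the null-looking set the bound equals the set size: nothing is certified, exactly the "selection after seeing the data" guarantee the abstract describes.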
11:35 - 13:00 Lunch (Vistas Dining Room)
13:00 - 14:00 Guided Tour of The Banff Centre
Meet in the Corbett Hall Lounge for a guided tour of The Banff Centre campus.
(Corbett Hall Lounge (CH 2110))
14:00 - 14:20 Group Photo
Meet in foyer of TCPL to participate in the BIRS group photo. The photograph will be taken outdoors, so dress appropriately for the weather. Please don't be late, or you might not be in the official group photo!
(TCPL Foyer)
14:20 - 15:55 GWAS methods 2 (Chair: Stephen Montgomery) (TCPL 201)
14:20 - 14:45 Aaron Quinlan: Inferring function from constrained coding regions in the human genome
An established approach to revealing essential genes and critical protein domains is measuring the degree of genomic conservation between species. More recently, statistical models have been developed to estimate functional “constraint” on each human gene by assessing the extent and frequency of genetic variation among thousands of human exomes. While gene-wide predictions of constraint are valuable, a single measure does not capture the often-extreme variability in constraint within a gene. Clearly, constraint can vary dramatically depending on the specific function and structural properties of the resulting protein structure. We have developed a statistical model to identify significantly constrained coding regions (CCRs) by leveraging the genetic variation observed among 60,706 exomes from the Exome Aggregation Consortium. Constrained coding regions identified by our model are significantly enriched for pathogenic mutations in Mendelian disorders and developmental delay, demonstrating its power to capture true biological constraint. I will present our efforts to use regions with similar degrees of constraint to infer function in poorly understood genes. Furthermore, I will illustrate how CCRs can be used to identify critical protein domains not previously identified from phylogenetic conservation. Lastly, I will discuss ongoing efforts to leverage CCRs in the clinical interpretation of variants in Long QT syndrome genes.
(TCPL 201)
14:45 - 15:10 James Zou: Modeling the rare and missing variants quantifies constraints in the human genome
I will describe our recent project collaborating with the Exome Aggregation Consortium (ExAC) to model the landscape of harmful genetic variations in healthy individuals. We developed an algorithm that uses the variants identified in ExAC to accurately estimate statistics of the variants that are not in this cohort but exist in the general population. The inferred statistics of rare and unobserved variants provide a framework to quantify the discovery power of future sequencing projects. Our model also quantifies constraints on pathways, genes, protein domains and individual codons, and we estimated the selection coefficients corresponding to the observed constraints. Our metric of genomic constraint provides complementary information to evolutionary conservation.
(TCPL 201)
15:10 - 15:30 Coffee Break (TCPL Foyer)
15:30 - 15:55 Jason Ernst: Systematic Discovery of Conservation States for Single-Nucleotide Annotation of the Human Genome
Genome-wide association studies have identified a large number of non-coding genomic loci in the human genome associated with disease, whose biological significance is poorly understood. Additional annotations largely based on either functional genomics or comparative genomics data have been used to gain insights into such locations and potentially prioritize likely causal variants among those in linkage disequilibrium. A widely used representation of the functional genomics data is through chromatin states produced by methods such as ChromHMM, which provides cell type specific annotations based on the combinatorial and spatial patterns in epigenomic data. Comparative genomic data provides complementary information as it is not dependent on having data from the appropriate cell or tissue type and can provide single nucleotide resolution information. Recent analyses have suggested constrained elements are among the genomic annotations most enriched for disease heritability. However the currently widely used representations of conservation information focus on either binary calls or a single univariate score from phylogenetic models, and thus do not capture potentially valuable information contained in the multi-species alignments of an increasing number of available species. Here we develop a novel method based on a multivariate hidden Markov model, ConsHMM, to annotate the human genome at single nucleotide resolution into a large number of different conservation states based on the combinatorial patterns of which species align to and which match the human reference genome within a 100-way multi-species alignment. The various conservation states show distinct enrichment properties for other genomic annotations such as regions of open chromatin, CpG islands, transcription start sites, and exons. 
Using our conservation states we can isolate subsets of existing constrained elements that show enrichments for disease-associated heritability and putative regulatory regions identified by functional genomics data from those that do not, as well as identify additional subsets of bases showing these enrichments outside of the constrained elements.
(TCPL 201)
15:55 - 17:35 GWAS results (Chair: Stephen Montgomery) (TCPL 201)
15:55 - 16:20 Kasper Hansen: Brain region-specific DNA methylation and chromatin accessibility
Epigenetic modifications confer stable transcriptional patterns in the brain, and both normal and abnormal brain function involve specialized brain regions, yet little is known about brain region-specific epigenetic differences. We study DNA methylation and chromatin accessibility in multiple brain regions of the normal human brain using whole-genome bisulfite sequencing and ATAC sequencing. Flow sorting reveals differences between neuronal and glial cells, as well as between brain regions. We show that differences in DNA methylation and accessibility between brain regions are enriched for explained heritability of multiple neurological traits.
(TCPL 201)
16:20 - 16:45 Brunilda Balliu: Longitudinal study of gene expression in a Swedish population sample
Background: While DNA sequence is more or less stable throughout life, gene regulation and expression is dynamic, fluctuating in response to different exposures. Increased knowledge about how genes are regulated and expressed, and how this changes over time, can give important insights into fundamental questions about gene function. Methods: We sequenced the blood transcriptomes of 65 individuals at both age 70 and 80 (130 samples) for which genotype information was available. We investigated genome-wide changes in gene expression and regulation with age. Results: We identified 4,414 genes showing differential expression with age (2,388 up- and 2,026 down-regulated with age), and strong enrichment for age related GO terms (immune and stress response) and KEGG pathways (longevity and metabolic pathways). Moreover, we found 1,562 and 1,489 eGenes (genes with an eQTL) at ages 70 and 80, respectively. The difference in the proportion of eGenes between 70 and 80 year old samples was significant (p-value=8.8e-03). Together, these results indicate that (1) gene expression changes with age and (2) while genetic regulation of gene expression is stable over time for most genes, it is reduced for other genes at a later age. Significance: In summary, we have performed one of the first long-term longitudinal studies of genetics of gene expression that highlights transcriptome dynamics late in life.
(TCPL 201)
16:45 - 17:10 Yoav Gilad: Impact of regulatory variation across human iPSCs and differentiated cells
Induced pluripotent stem cells (iPSCs) are an essential tool for studying cellular differentiation and cell types that are otherwise difficult to access. We investigated the use of iPSCs and iPSC-derived cells to study the impact of genetic variation across different cell types and as models for studies of complex disease. We established a panel of iPSCs from 58 well-studied Yoruba lymphoblastoid cell lines (LCLs); 14 of these lines were further differentiated into cardiomyocytes. We characterized regulatory variation across individuals and cell types by measuring gene expression, chromatin accessibility and DNA methylation. Regulatory variation between individuals is lower in iPSCs than in the differentiated cell types, consistent with the intuition that developmental processes are generally canalized. While most cell type-specific regulatory quantitative trait loci (QTLs) lie in chromatin that is open only in the affected cell types, we found that 20% of cell type-specific QTLs are in shared open chromatin. Finally, we developed a deep neural network to predict open chromatin regions from DNA sequence alone and were able to use the sequences of segregating haplotypes to predict the effects of common SNPs on cell type-specific chromatin accessibility.
(TCPL 201)
17:10 - 17:35 Tim Hughes: Investigating the source and function of most of the genome: endogenous retroelements and the proteins that bind them (TCPL 201)
17:30 - 19:30 Dinner (Vistas Dining Room)
Tuesday, March 28
07:00 - 09:00 Breakfast (Vistas Dining Room)
09:00 - 11:25 Inferring hidden structures (Chair: Kasper Hansen) (TCPL 201)
09:00 - 09:25 Wenyi Wang: Cell type-specific Deconvolution of Heterogeneous Tumor Samples
Tumor tissue samples are comprised of a mixture of cancerous and surrounding stromal cells. Understanding tumor heterogeneity is crucial to analyzing gene signatures associated with cancer prognosis and treatment decisions. Compared with the experimental approach of laser micro-dissection to isolate different tissue components, in silico dissection of mixed cell samples is faster and cheaper. Numerous computational approaches developed previously all have limitations in deconvoluting heterogeneous tumor samples. We have developed a three-component deconvolution model, DeMixT, that accounts for the immune cell compartment explicitly and is able to address the challenging setting in which the observed signals are assumed to come from a mixture of three cell compartments: infiltrating immune cells, tumor microenvironment, and cancerous tissue. DeMixT involves a novel two-stage method and yields accurate estimates of cell purities as well as compartment-specific expression profiles. Simulations and real data validations have demonstrated the good performance of our method. Compared with other deconvolution tools, DeMixT can be applied more broadly. It allows for a further understanding of immune infiltration in cancer, in order to assist in the development of novel prognostic markers and therapeutic strategies.
(TCPL 201)
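DeMixT itself is a two-stage likelihood method, but the basic three-compartment deconvolution idea can be sketched in a much simpler reference-based form: if expression signatures for the three compartments are assumed known, mixing proportions can be recovered by non-negative least squares (the signatures, gene count, and proportions below are simulated for illustration).

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(1)
n_genes = 200
# hypothetical reference signatures: immune, microenvironment, tumor
S = rng.gamma(2.0, 2.0, size=(n_genes, 3))
true_pi = np.array([0.2, 0.3, 0.5])           # true mixing proportions
mixed = S @ true_pi + rng.normal(0, 0.05, n_genes)  # observed bulk sample

coef, _ = nnls(S, mixed)                      # non-negative mixing weights
pi_hat = coef / coef.sum()                    # renormalize to proportions
print(np.round(pi_hat, 2))
```

The harder problem DeMixT solves is doing this without known signatures, estimating compartment-specific profiles and purities jointly; the sketch above only shows why the mixture is identifiable when references exist.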
09:25 - 09:50 Elizabeth Purdom: Estimation of Lineages from Single cell sequencing data
Recently-developed methods for assaying individual cells afford researchers a highly detailed view of cellular transcription. One common target for these studies has been stem cells and their descendants, with analyses focused on charting the progression from multipotent cells to differentiated populations. We introduce a novel method, Slingshot, for inferring multiple developmental lineages from single-cell gene expression data. Slingshot is a uniquely robust and flexible tool for inferring developmental lineages and ordering cells to reflect continuous differentiation processes. It constructs a differentiation tree using clusters of cells as nodes, which provides stability and reduces the complexity of the inferred lineages. This tree is then used to assign individual cells to one or more developmental lineages, which are represented by smooth curves in a reduced-dimensionality space. These curves provide discerning power not found in methods based on piecewise linear trajectories, while also adding stability over a range of possible dimensionality reduction and clustering techniques.
(TCPL 201)
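The cluster-level tree at the heart of Slingshot can be mimicked in a few lines with a minimum spanning tree over cluster centroids (a sketch of the general idea only; Slingshot additionally fits smooth principal curves through the tree, and the centroids below are invented).

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.sparse.csgraph import minimum_spanning_tree

# toy 2-D cluster centroids: a progenitor splitting into two branches
centroids = np.array([[0.0, 0.0],    # 0: progenitor
                      [1.0, 0.0],    # 1: intermediate
                      [2.0, 1.0],    # 2: branch A
                      [2.0, -1.0]])  # 3: branch B

D = cdist(centroids, centroids)                 # pairwise distances
mst = minimum_spanning_tree(D).toarray()        # tree over clusters
edges = sorted((min(i, j), max(i, j)) for i, j in zip(*np.nonzero(mst)))
print(edges)   # lineages = root-to-leaf paths in this tree
```

Working on cluster centroids rather than individual cells is what gives the stability the abstract emphasizes: the tree topology is insensitive to cell-level noise.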
09:50 - 10:15 Christina Curtis: Delineating the mode and tempo of human tumor evolution
Cancer results from the acquisition of somatic alterations in a microevolutionary process that typically occurs over many years, much of which is occult. Understanding the evolutionary dynamics that are operative at different stages of progression in individual tumors might inform the earlier detection, diagnosis, and treatment of cancer. Although these processes cannot be directly observed, the resultant spatiotemporal patterns of genetic variation amongst tumor cells encode their evolutionary histories. However, there remains a need for the systematic evaluation of different modes of evolution in diverse solid tumors. In particular, while selection is fundamental to tumorigenesis, methods to infer the relative strength of selection within established human tumors are lacking. I will describe an extensible framework for simulating spatial tumor growth and evaluating evidence for different modes of tumor evolution, ranging from effective neutrality to strong positive selection. Further, I will show how application of this approach to multi-region sequencing data reveals different modes of evolution both within and between solid tumor types. These findings have implications for defining the drivers of tumor growth and inform practical guidelines for characterizing human tumor evolution.
(TCPL 201)
10:15 - 10:35 Coffee Break (TCPL Foyer)
10:45 - 11:10 Nelle Varoquaux: Studying the 3D structure of the P. falciparum genome by modeling contact counts as random Negative Binomial variables
The spatial and temporal organization of the 3D structure of chromosomes is thought to have an important role in genomic function, but is poorly understood. For example, there is a relative paucity of specific transcription factors, and an abundance of chromatin remodeling enzymes, in the deadly human parasite P. falciparum. This points towards the involvement of global and local chromatin structure in controlling gene expression. Recent advances in chromosome conformation capture (3C) technologies, initially developed to assess interactions between specific pairs of loci, allow one to simultaneously measure multiple contacts on a genome scale, paving the way for more systematic and genome-wide analysis of the 3D architecture of the genome. These new Hi-C techniques result in a genome-wide contact map, a matrix indicating the contact frequency between pairs of loci. I will present here the computational methods we developed to study the 3D organization of the P. falciparum genome using these contact maps, as well as how we uncovered the 3D structure of the parasite as a critical regulator for transcription and virulence factors in the human malaria parasite. I will discuss how appropriate modeling of contact counts as random Negative Binomial variables allowed us to build robust and accurate 3D models of the genome, as well as to perform differential analysis of the contact maps across different timepoints.
(TCPL 201)
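The overdispersion that motivates a Negative Binomial rather than Poisson model for contact counts is easy to see in simulation: an NB count (here generated as a Gamma-Poisson mixture; mean and dispersion values are invented) has variance well above its mean, and a method-of-moments fit recovers the dispersion.

```python
import numpy as np

rng = np.random.default_rng(2)
# simulated contact counts for one locus pair: NB(mean=20, dispersion r=5)
r_true, mean = 5.0, 20.0
lam = rng.gamma(shape=r_true, scale=mean / r_true, size=20000)
counts = rng.poisson(lam)

m, v = counts.mean(), counts.var()
# NB: var = mean + mean^2 / r  =>  method-of-moments dispersion estimate
r_hat = m**2 / (v - m)
print(f"mean={m:.1f}  var={v:.1f}  r_hat={r_hat:.1f}")
```

Under a Poisson model the variance would equal the mean (about 20 here); the simulated variance is roughly 100, which is the kind of mean-variance mismatch that makes Poisson-based differential analysis of Hi-C maps anti-conservative.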
11:15 - 11:40 Ben Raphael: Algorithms for Inferring Evolution and Migration of Tumors
Cancer is an evolutionary process driven by somatic mutations that accumulate in a population of cells that form a primary tumor. In later stages of cancer progression, cells migrate from a primary tumor and seed metastases at distant anatomical sites. I will describe algorithms to reconstruct this evolutionary process from DNA sequencing data of tumors. These algorithms address challenges that distinguish the tumor phylogeny problem from classical phylogenetic tree reconstruction, including challenges due to mixed samples and complex migration patterns. Joint work with Mohammed El-Kebir and Gryte Satas.
(TCPL 201)
11:25 - 13:25 Lunch (Vistas Dining Room)
13:30 - 16:45 Regularized estimation (Chair: Laurent Jacob) (TCPL 201)
13:30 - 13:55 Jean-Philippe Vert: Cancer stratification from mutation profiles
Genome-wide somatic mutation profiles of tumours can now be assessed efficiently and promise to move precision medicine forward. Statistical analysis of mutation profiles is however challenging due to the low frequency of most mutations, the varying mutation rates across tumours, and the presence of a majority of passenger events that hide the contribution of driver events. Here we propose a method, NetNorM, to represent whole-exome somatic mutation data in a form that enhances cancer-relevant information using a gene network as background knowledge. We evaluate its relevance for two tasks: survival prediction and unsupervised patient stratification. Using data from 8 cancer types from The Cancer Genome Atlas (TCGA), we show that it improves over the raw binary mutation data and network diffusion for these two tasks.
(TCPL 201)
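The network-diffusion baseline mentioned in the abstract can be sketched as a random walk with restart that smooths a sparse binary mutation profile over a gene network, spreading the signal of a mutated gene to its neighbors (toy adjacency matrix; NetNorM itself applies a different, rank-based normalization on top of the network).

```python
import numpy as np

# toy gene network (adjacency) and a sparse binary mutation profile
A = np.array([[0, 1, 1, 0, 0],
              [1, 0, 1, 0, 0],
              [1, 1, 0, 1, 0],
              [0, 0, 1, 0, 1],
              [0, 0, 0, 1, 0]], dtype=float)
x = np.array([1.0, 0, 0, 0, 0])        # only gene 0 is mutated

W = A / A.sum(axis=0, keepdims=True)   # column-stochastic transition matrix
alpha = 0.7                            # diffusion strength (restart = 1-alpha)
f = x.copy()
for _ in range(100):                   # iterate f = (1-a) x + a W f to a fixed point
    f = (1 - alpha) * x + alpha * W @ f
print(np.round(f, 3))
```

After diffusion the mutated gene keeps the largest score, direct neighbors gain intermediate scores, and distant genes little, so two patients mutated in neighboring genes end up with similar smoothed profiles even though their raw binary profiles are orthogonal.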
13:55 - 14:20 Yang Li: The links between DNA variation and complex traits
A central goal of genetics is to understand the links between genetic variation and disease. Intuitively, one might expect disease risk to be explained by a small number of disease-causing variants that cluster in or near core genes and pathways. However, recent GWASs have revealed that most complex traits, including height and schizophrenia risk, are highly polygenic. While the strongest of these associations sometimes map to genes directly linked to the trait, the majority of association signals are found across much of the genome, including near genes with housekeeping-like functions. For example, we found that over half of the genomic SNPs are in high linkage disequilibrium with a SNP that has an estimated effect of increasing height by an average of 0.145 mm. To better understand how these widespread signals contribute to complex traits, we focused on three diseases for which causal cell-types are relatively well defined: schizophrenia, Crohn’s disease, and rheumatoid arthritis. Again, we found that association signals were widely dispersed. We further found that the causal signal was present, sometimes exclusively (45% to ~100%), in regions marked by active chromatin in the relevant cell types, but was vastly depleted or absent from regions that are generally inactive across cell-types. While variation in cell type-specific gene networks contributes to complex disease risk, we show evidence that genes with housekeeping-like functions cumulatively account for a greater fraction of total SNP heritability. As expected, we found that relevant gene sets exhibited the greatest enrichment in trait heritability. However, we also observed a strong linear relationship between the size of the gene sets and the proportion of heritability they explained, further supporting the hypothesis that most if not all transcribed genes in the relevant cell-type(s) contribute to disease risk. Together, these findings imply a need for rethinking models of complex traits.
We propose that gene regulatory networks are sufficiently interconnected for all genes expressed in disease-relevant cells to be liable to affect the functions of core disease-related genes. Consequently, the bulk of the genetic effects on disease are mediated through genes without any direct relationship to disease function, and variation in non-disease genes previously thought to be innocuous may in fact drive complex disease risk in human populations.
(TCPL 201)
14:20 - 14:45 Yves Moreau: Bayesian matrix factorization with side information and application to drug-target activity prediction and gene prioritization
Matrix factorization/completion methods provide an attractive framework to handle sparsely observed data, such as the prediction of biological activity of chemical compounds against drug targets, where only 0.1% to 1% of all compound-target pairs are measured. Matrix factorization searches for latent representations of compounds and targets that allow an optimal reconstruction of the observed measurements. These methods can be further combined with linear regression models to create multitask prediction models. In our case, fingerprints of chemical compounds are used as “side information” to predict target activity. By contrast with classical Quantitative Structure-Activity Relationship (QSAR) models, matrix factorization with side information naturally accommodates the multitask character of compound-target activity prediction. This methodology can be further extended to a fully Bayesian setting to handle uncertainty optimally, which is of great value in this pharmaceutical setting where experiments are costly. We have developed a significant innovation in this setting, which consists in the reformulation of the Gibbs sampler for the Markov Chain Monte Carlo Bayesian inference of the multilinear model of matrix factorization with side information. This reformulation shows that executing the Gibbs sampler only requires performing a sequence of linear regressions with a specific noise injection scheme. This reformulation thus allows scaling up this MCMC scheme to millions of compounds, thousands of targets, and tens of millions of measurements, as demonstrated on a large industrial data set from a pharmaceutical company. We have developed a Python/C++ library, called Macau, implementing this method and which can be applied to many modeling tasks, well beyond our pharmaceutical setting. We discuss the application of our method to drug-target activity prediction using compound structure fingerprints as side information. 
We also discuss the application of this method to drug-target activity prediction using high-content imaging assays as side information. Our results suggest that high-content imaging assays can be broadly repurposed for drug-target activity prediction and the broad exploration of chemical space.
(TCPL 201)
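The key trick in the abstract, running the Gibbs sampler as a sequence of linear regressions with noise injection, can be sketched for a plain Bayesian linear regression (a minimal single-task analogue of the multilinear Macau model; dimensions, noise level, and prior precision below are invented). Perturbing the response and the prior mean and then solving one regularized least-squares system yields an exact draw from the Gaussian posterior over the weights.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, sigma, lam = 200, 5, 0.5, 1.0          # samples, features, noise sd, prior precision
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + sigma * rng.normal(size=n)

A = X.T @ X / sigma**2 + lam * np.eye(d)     # posterior precision
mu = np.linalg.solve(A, X.T @ y / sigma**2)  # posterior mean

def sample_weights():
    # noise injection: perturb y with observation noise and draw a prior
    # sample, then solve ONE linear regression; the solution is an exact
    # draw from the Gaussian posterior N(mu, A^{-1})
    e = rng.normal(size=n)
    w0 = rng.normal(size=d) / np.sqrt(lam)
    return np.linalg.solve(A, X.T @ (y + sigma * e) / sigma**2 + lam * w0)

draws = np.array([sample_weights() for _ in range(2000)])
print(np.abs(draws.mean(axis=0) - mu).max())  # small: draws center on mu
```

Because each draw is just a regularized linear solve, the scheme scales to the millions-of-compounds regime the abstract describes: any fast least-squares solver (including iterative ones) becomes a posterior sampler.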
14:45 - 15:10 John Platig: Predicting regulatory function from network structure
Network representations of large datasets provide a number of advantages: they tend to scale well, they’re intuitive, and—most importantly—they make for pretty pictures. However, using the quantitative properties of biological networks as a predictive tool is often under-appreciated. In this talk I will discuss two examples where such network properties—in an appropriate statistical testing framework—informed functional roles of biological regulators. In the first, we construct bipartite eQTL networks in thirteen tissues with data collected by the GTEx consortium. In the second, we explore the response of the gene regulatory network in Mycobacterium tuberculosis to a targeted drug treatment. In both cases, we find that the network structural properties reflect constraints operating in that biological system, and that each network has a unique, informative feature set. For example, in the eQTL networks, we find that local community hubs (but not global network hubs) are predictive of SNP regulatory roles within tissues, while in the tuberculosis regulatory network degree is very useful in predicting treatment-specific transcription factor activity.
(TCPL 201)
15:10 - 15:30 Coffee Break (TCPL Foyer)
15:30 - 15:55 Anna Bonnet: Heritability estimation in high-dimensional sparse linear mixed models
The heritability of a biological quantitative feature is defined as the proportion of its variation that can be explained by genetic factors. We propose an estimator for heritability in high-dimensional sparse linear mixed models and we study its theoretical properties. We highlight the fact that in the case where the size N of the random effects is too large compared to the number n of observations, a precise estimation of heritability cannot be provided. Since in practice almost all datasets satisfy the condition N >> n, we perform a variable selection method to reduce the size of the random effects and to improve the accuracy of heritability estimations. However, as shown by our simulations, this kind of approach only works when the number of non-zero components in the random effects (i.e. the genetic variants which have an impact on the phenotypic variations) is small enough. In the face of this limitation, we define an empirical criterion to determine whether or not it is possible to apply a variable selection approach. As an example of its use, we applied our method to estimate the heritability of the volume of several regions of the human brain.
(TCPL 201)
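For intuition about what is being estimated, a classical moment-based heritability estimator (Haseman-Elston regression, not the authors' high-dimensional procedure) can be sketched on simulated data: regress off-diagonal phenotype products on the genetic relatedness matrix. Gaussian variables stand in for standardized genotypes, and all dimensions and the true h2 below are invented.

```python
import numpy as np

rng = np.random.default_rng(4)
n, N, h2 = 1000, 2000, 0.5              # individuals, variants, true heritability
Z = rng.normal(size=(n, N)) / np.sqrt(N)  # Gaussian stand-in for standardized genotypes
u = rng.normal(size=N) * np.sqrt(h2)      # random (genetic) effects
y = Z @ u + rng.normal(size=n) * np.sqrt(1 - h2)
y = (y - y.mean()) / y.std()

K = Z @ Z.T                               # genetic relatedness matrix
iu = np.triu_indices(n, k=1)              # off-diagonal pairs only
# Haseman-Elston: regress y_i * y_j on K_ij (no-intercept least squares)
h2_hat = (K[iu] @ np.outer(y, y)[iu]) / (K[iu] @ K[iu])
print(round(float(h2_hat), 2))
```

The N >> n difficulty discussed in the talk shows up here as noise in the off-diagonal K entries (each is an average over N variants), which is exactly what degrades the precision of moment and likelihood estimators alike when N grows relative to n.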
15:55 - 16:20 Anna Goldenberg: Bayesian deep learning models for drug response prediction
Most of the computational methods that predict drug response rely on high throughput drug screens in patient-derived cell lines. However, drug sensitivity data alone is not enough to understand what happened to a cell line in response to a drug. A database of perturbations in response to a set of drug treatments was generated (Duan, 2014) to address this problem. Combining sensitivity and perturbation data is expected to yield better and more biologically relevant models of drug response. We developed Variational Autoencoder (VAE) models, which are generative graphical models whose parameters and matching inference models are trained simultaneously using Stochastic Variational Inference. In its supervised extension, a latent representation of the data is learned jointly with a classifier in the latent space. Semi-Supervised Variational Autoencoders (SSVAEs) (Kingma, 2014) extend VAE models to utilize unlabeled data, thus enabling the model to learn better latent representations from more data. We apply SSVAE to drug response prediction in cell lines, to learn a latent representation of the gene expression that is also predictive of drug response. We have also developed the Drug Response Variational Autoencoder (DrVAE), which learns a latent representation of the underlying gene states before and after drug application that depends on both the cell line's overall response to the drug and the expression change of the landmark genes from perturbation experiments. DrVAE is conceptually a SSVAE extended by a "drug effect" that learns jointly the response/non-response classification and the reconstruction of gene expression. In our preliminary experiments, we compared the above-mentioned models for sensitivity prediction against well established baselines: Ridge Regression (RR) and SVM with RBF kernel (SVM). We trained a separate model for 19 different drugs, each in 5-fold cross validation.
SSVAE improves drug response AUROC by up to 13% compared to RR and 8% compared to SVM for the DNA-PK inhibitor NU-7441, and DrVAE improves AUROC by up to 10% compared to RR and 4% compared to SVM for afatinib. Our preliminary experiments show that Stochastic Variational Inference-based models are promising, achieving performance similar to SVM and clearly outperforming linear models. Our models allow us to incorporate a variety of diverse data types, and while they may be hard to train, we believe VAE models hold great potential for the prediction of drug response and other tasks.
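As a small aside on the machinery involved: the VAE objective regularizes the encoder's Gaussian posterior toward a standard normal via a closed-form KL term. A minimal sketch (an illustrative helper of our own, not the authors' DrVAE code):

```python
import math

def gaussian_kl_to_std_normal(mu, log_var):
    """KL( N(mu, sigma^2) || N(0, 1) ) for one dimension of a diagonal
    Gaussian posterior, the regularizer in the VAE/SSVAE objective.
    Closed form: 0.5 * (mu^2 + sigma^2 - log sigma^2 - 1)."""
    return 0.5 * (mu ** 2 + math.exp(log_var) - log_var - 1.0)
```

The term is zero exactly when the posterior matches the prior (mu = 0, log_var = 0) and grows as the encoder drifts away from it, which is what keeps the latent space well-behaved for the downstream classifier.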
(TCPL 201)
16:20 - 16:45 Manolis Kellis: Mediation analysis, multi-tissue factor QTLs, EHRs, transfer learning in deep CNNs for genetics (TCPL 201)
16:45 - 17:10 Athma Pai: The major determinants of genome-wide splicing efficiency in flies
The dynamics of gene expression may impact regulation, and the processing of nascent RNA molecules into mature RNA can be a rate-limiting step for establishing gene expression equilibrium. To assess the rates of pre-mRNA splicing, we used a short, progressive metabolic labeling strategy followed by RNA sequencing to estimate the intron half-lives of ~30,000 introns in Drosophila melanogaster S2 cells. We find that splicing rates are strongly correlated with several gene features. Splicing rates varied with intron length (independent of splice site strength) and were fastest for introns of length 60-70 nt, which is the most abundant intron length class in the Drosophila genome. Using our nascent sequencing data, we also identified hundreds of novel recursively spliced segments, where long introns are spliced in multiple segments rather than one unit. We expanded the catalog of known recursively spliced introns in flies by 4-fold, though sub-sampling and saturation analyses indicated that we are still underestimating the true number of recursive sites in the Drosophila genome. We find that recursive splicing is associated with much faster and also more accurate splicing of the ultra-long introns in which it occurs. Together, intron length accounts for ~30% of variance in splicing rates, and the presence of recursive sites is associated with a two-fold reduction in half-life. Building on these observations, we developed a model that accounts for greater than 50% of the variability in splicing rates across Drosophila introns. Surprisingly, introns within the same gene tend to have similar splicing half-lives, and longer first introns are associated with faster splicing of subsequent introns. Our results indicate that genes have different intrinsic rates of splicing, and suggest that these rates are influenced by gene architecture and molecular events at gene 5' ends, likely tuning the dynamics of developmental gene expression.
(TCPL 201)
17:10 - 17:35 Debrief (TCPL 201)
17:35 - 19:35 Dinner (Vistas Dining Room)
Wednesday, March 29
07:00 - 09:00 Breakfast (Vistas Dining Room)
09:00 - 11:25 Deep Learning for computational biology (Chair: Barbara Engelhardt) (TCPL 201)
09:00 - 09:25 Sara Sheehan: Towards automated population genetic inference using deep neural networks
This talk will focus on a novel deep learning algorithm, evoNet, that can jointly estimate parameters from genomic data such as DNA from many individuals. In population genetics, the evolutionary factors that shape variation often leave signatures that are difficult to disentangle. This makes joint inference both necessary and challenging, especially in the case of demographic history and natural selection. Deep learning automatically teases out important features of the data, which makes it useful for biological problems where the underlying models are computationally intractable and appropriate summary statistics are unknown. In particular, convolutional neural networks show great promise for making large-scale population genetic inference more automated and flexible.
(TCPL 201)
09:25 - 09:50 Anshul Kundaje: Deep learning approaches to denoise, impute, integrate and decode functional genomic data
We present interpretable deep learning approaches to address three key challenges in integrative analysis of functional genomic data. (1) Data denoising: the quality of functional genomic data is affected by a myriad of experimental parameters, making accurate inference from chromatin profiling experiments challenging. We introduce a convolutional denoising algorithm that learns a mapping from suboptimal to high-quality datasets, overcoming various sources of noise and variability and substantially enhancing and recovering signal when applied to low-quality chromatin profiling datasets across individuals, cell types, and species. Our method has the potential to improve data quality at reduced cost. (2) Data imputation: it is largely infeasible to perform hundreds of genome-wide assays targeting diverse transcription factors and epigenomic marks in hundreds of cellular contexts due to cost and material constraints. We have developed multi-task, multi-modal deep neural networks to predict chromatin marks and in vivo binding events of hundreds of TFs by integrating regulatory DNA sequence with just two assays, namely ATAC-seq (or DNase-seq) and RNA-seq, performed in a target cell type of interest. We train our models on large reference compendia from ENCODE/Roadmap Epigenomics and obtain high prediction accuracy in new cellular contexts, thereby significantly expanding the context-specific annotation of the non-coding genome. (3) Decoding the context-specific regulatory architecture of the genome: finally, we develop novel, efficient interpretation engines for extracting predictive and biologically meaningful patterns from integrative deep learning models of TF binding and chromatin accessibility. We obtain new insights into TF binding sequence affinity models (e.g. 
significance of flanking sequences and fusion motifs), infer high-resolution point binding events of TFs, dissect higher-order cis-regulatory sequence grammars (including density and spatial constraints), learn chromatin architectural features correlated with chromatin marks, unravel the dynamic regulatory drivers of cellular differentiation and score the regulatory influence of non-coding genetic variants. We provide early access to all associated code and frameworks at https://github.com/kundajelab
(TCPL 201)
09:50 - 10:15 David Knowles: Learning a mapping from pre-mRNA sequence to splice site usage to understand RNA splicing variation
RNA splicing is a complex process carefully regulated by the coordinated action of the hundreds of proteins and associated small nuclear RNAs that comprise the core spliceosome, together with trans-acting splicing factors. To determine which regions to remove as introns, this splicing machinery must interpret information encoded in the pre-mRNA sequence: consensus splice site and branch point motifs, as well as splicing regulatory elements (SREs) such as exonic splicing enhancers. We approach computational modeling of the splicing process by learning a mapping from pre-mRNA sequence to splice site (SS) usage levels. Following our recent work on LeafCutter, which we developed to study intron splicing variation, we use spliced reads from RNA-seq to quantify local intron usage in a straightforward, annotation-agnostic manner. For each 5’ SS we predict the proportion of corresponding spliced reads mapping to each possible 3’ SS (specifically, every AG dinucleotide within 100kb) as a function of the 3’ SS sequence context. An analogous model is used for 5’ SS choice for each 3’ SS. Compared to previous work predicting exon inclusion, we model more splicing events, including 3’ and 5’ extensions, and additionally leverage signal from constitutive splicing. We choose a deep neural network (NN) to represent the mapping from sequence to SS usage proportions. Convolutional NNs (CNNs) can naturally detect regulatory elements: the first convolutional layer corresponds to scanning learnt PWMs along the sequence, and subsequent layers allow combinatorial and spatial logic on top of the resulting detections. Max-pooling layers confer limited translational invariance on where motifs are detected, which is appropriate for SREs but undesirable for the SS consensus. We therefore use a CNN on a large sequence context (~800bp) combined with a dense network locally around the SS (~70bp). A Dirichlet-multinomial likelihood is used to appropriately account for overdispersion in RNA-seq read counts. 
A multi-output extension readily allows modeling of tissue-specific splicing patterns. We assess model performance using 110 male muscle RNA-seq samples from GTEx, training on odd chromosomes and testing on even chromosomes. Out of the typically ~2000 canonical dinucleotides within 100kb we are able to correctly predict the most frequently used 68% of the time (for 3’ SS choice, 65% for 5’ SS), compared to 9.6% when picking the strongest SS by MaxEntScore. The model learns to detect expected features, including the branchpoint consensus and polypyrimidine tract, and can distinguish between canonical dinucleotides which are never spliced, noisily spliced (<1% of reads) and constitutively spliced (>99%). Using GTEx WGS data we are able to predict which SNVs will create cryptic splice sites with an AUC of 99%. Using 12 diverse tissues from GTEx we predict tissue-specific SS usage, with an average correlation between predicted and observed differences in SS usage across all pairs of tissues of 0.31 (compared to 0.07 for existing work on exonic PSI).
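The Dirichlet-multinomial likelihood mentioned above has a closed form that makes the overdispersion explicit. A minimal sketch (our own illustrative helper, not the authors' implementation):

```python
from math import lgamma

def dirichlet_multinomial_logpmf(counts, alphas):
    """Log-probability of read counts under a Dirichlet-multinomial
    with concentration parameters alphas. A larger total concentration
    behaves like a plain multinomial; a smaller one allows the extra
    variance (overdispersion) seen in RNA-seq read counts."""
    n = sum(counts)
    a0 = sum(alphas)
    # multinomial coefficient: log n! - sum_i log k_i!
    logp = lgamma(n + 1) - sum(lgamma(k + 1) for k in counts)
    # ratio of Dirichlet normalizers after/before observing the counts
    logp += lgamma(a0) - lgamma(n + a0)
    logp += sum(lgamma(a + k) - lgamma(a) for a, k in zip(alphas, counts))
    return logp
```

In this setting, `counts` would be the spliced-read counts over the candidate 3’ (or 5’) splice sites and `alphas` would come from the network's predicted usage proportions scaled by a learned dispersion parameter.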
(TCPL 201)
10:15 - 10:35 David Kelley: Sequential regulatory activity prediction with long-range convolutional neural networks
Functional genomics approaches to better model genotype-phenotype relationships have important applications toward understanding genomic function and improving human health. In particular, methods to predict transcription factor (TF) binding and chromatin attributes from DNA sequence show promise for determining mechanisms for the plethora of noncoding variants statistically associated with disease in human populations. However, TF binding and chromatin are primarily interesting insofar as they affect gene expression. Thus, such modeling frameworks would likely prove more valuable if they could predict gene expression from DNA sequence. In large mammalian genomes, gene expression depends on very large regions of sequence with complex rules that have been established as high-level principles, but only rarely described in detail for individual loci. Here, I will suggest solutions to these challenges and describe an initial machine learning system to predict transcription across large genomes from DNA sequence using deep convolutional neural networks.
(TCPL 201)
11:25 - 13:30 Lunch (Vistas Dining Room)
13:30 - 17:30 Free Afternoon (Banff National Park)
17:30 - 19:30 Dinner (Vistas Dining Room)
Thursday, March 30
07:00 - 09:25 Breakfast (Vistas Dining Room)
09:25 - 09:50 Jeff Leek: Data science as a science
We all know that any genomic data analysis involves hundreds of decisions by any analyst. We have good theoretical methods for controlling error rates and preventing false discoveries for single methods. But what happens when humans get their hands on our methods and code? In this talk I propose a new framework for modeling data analysis and show some early experimental results in our effort to make data science a rigorous empirical science.
(TCPL 201)
09:50 - 10:15 Christopher Brown: Statistical and experimental methods for causal inference at complex trait associated loci
Genome-wide association studies (GWAS) have identified thousands of loci that contribute to risk for complex diseases. The majority of the heritability of complex disease risk lies within the noncoding regions of the genome. This has led to the hypothesis that the causal variants at GWAS-associated loci lead to changes in local gene expression. As a result of linkage disequilibrium and the fact that cis-regulatory elements (CREs) may target genes over large distances, it is often unclear which variant or gene affects disease risk. However, their identification will improve understanding of disease etiology and identify targets for novel therapeutic development. Recent work from efforts such as GTEx has identified genetic variation associated with gene expression variation for essentially every gene. Despite this wealth of data, the characterization of causal mechanisms at complex trait associated loci remains a significant challenge. To address this challenge, we have developed and applied high-throughput computational and experimental approaches to identify candidate disease genes and the functional regulatory variants that mediate disease risk. We have focused on cardiovascular disease (CVD) and molecular trait mapping in the liver as model systems. Existing studies have focused on easily ascertained cell types, while the liver, which plays a critical role in regulating cholesterol and lipid metabolism, and where many CVD-associated variants likely affect gene expression, has remained understudied. We have deeply phenotyped liver biopsies and iPSC-derived hepatocytes from more than 400 donors, collecting RNA-seq along with histone modification and transcription factor ChIP-seq data. We have used these data to identify thousands of genetic variants associated with allele-specific transcription factor binding, histone modification, gene expression, and splicing. 
Comparison to data from the GTEx and Roadmap Epigenomics projects demonstrates that many of these associations are specific to the liver. We demonstrate that multi-phenotype molecular trait mapping improves statistical power to detect associations and improves resolution at identified loci. We have integrated these data with CVD GWAS data using a novel multi-phenotype causal inference framework based on Mendelian randomization to predict the precise variants, CREs, and genes that underlie CVD risk. Using a combination of massively parallel reporter assays, genome-edited stem cells, CRISPR interference, and in vivo mouse models, we establish rs2277862-CPNE1, rs10889356-ANGPTL3, rs10889356-DOCK7, and rs10872142-FRK as causal SNP-gene sets for CVD. These results demonstrate that a molecular trait mapping framework can rapidly identify causal genes and variants contributing to complex human traits, and show that, at many GWAS loci, candidate genes have been falsely implicated based on proximity to the lead SNP.
(TCPL 201)
10:15 - 10:35 Coffee Break (TCPL Foyer)
10:35 - 11:25 Large scale resources (TCPL 201)
10:35 - 11:00 Ben Langmead: Summarizing tens of thousands of RNA-seq samples: themes and lessons
The Sequence Read Archive contains RNA-seq data for over 450K samples, including over 140K from humans. Large-scale projects like GTEx and ICGC are generating RNA-seq data on many thousands of samples. Such huge datasets are valuable, but unwieldy for typical researchers. I will describe work toward the goal of making it easy for researchers to use the archived RNA-seq data available today. I will highlight Rail-RNA (http://rail.bio), its dbGaP-protected version (http://docs.rail.bio/dbgap/), as well as the recount resource (https://jhubiostatistics.shinyapps.io/recount/) and the Snaptron service/API (http://snaptron.cs.jhu.edu). Besides showcasing these tools and resources, I'll expound on three themes: (a) public data is valuable but not easy to use, and computationalists should attack this; (b) scalability is not just about scaling software to be distributed and multi-threaded, but also about making the best use of many datasets at once; (c) "strategically unplugging" from gene annotations can lead to clearer statements about splicing and differential expression.
(TCPL 201)
11:00 - 11:25 Shannon Ellis: In silico phenotyping to improve the usefulness of public data
In this talk I will describe the recount2 resource (https://jhubiostatistics.shinyapps.io/recount/) and discuss our effort to computationally re-phenotype samples in order to integrate and analyze all public RNA-seq data.
(TCPL 201)
11:25 - 13:30 Lunch (Vistas Dining Room)
13:30 - 17:10 Heterogeneous data integration (Chair: Anna Goldenberg) (TCPL 201)
13:30 - 13:55 Ronglai Shen: Integrating omics data for cancer classification and prognosis
In this talk, I will present a pan-cancer analysis of multiple omic platforms. We identified shared molecular alterations in different cancer types that may indicate related disease etiology and provide unique opportunities to compare treatments and outcomes across cancer types. Furthermore, we developed a kernel learning approach to systematically investigate the prognostic value of germline and somatic mutations, DNA copy number, DNA methylation, mRNA, miRNA, and protein expression for predicting patient survival outcomes. We found that mRNA expression and DNA methylation are among the most informative data types for cancer prognosis, alone or in combination with clinical factors. The integration of omic profiles with clinical variables further improved prognostic performance over the clinical models alone. Moreover, the kernel learning method provides an efficient way to integrate a large number of moderate effects, and thus consistently outperformed sparse methods such as lasso Cox regression.
(TCPL 201)
13:55 - 14:20 Josh Stuart: Unmasking all forms of cancer: toward integrated maps of all tumor subtypes
The varieties of cancer seem numberless: from classic tell-tale genomic alterations like the Philadelphia chromosome in CML to the recurrent and specific BRAF V600E mutations in melanoma, and from HER2 amplifications in some breast cancers to hypermutated colorectal tumors linked to epigenetic changes. Are tumors that arise in different tissues distinct? Is every patient's tumor distinct? Or are there underlying connections to help construct a molecular taxonomy of cancer's forms? In this talk, I will present results from the TCGA Pan-Cancer analysis project to investigate cancer's forms in the most comprehensive study of tumor subtypes attempted to date. We derived a map of tumor classes encompassing an integrated view of six different omics datasets. While most tumors (90%) cluster with others from the same tissue of origin, a significant fraction (10%) are reclassified into groups of multiple tissue types. Data on patient outcomes suggest the reclassification could provide important information to consider for treatment. I will also present novel pathway analysis methods and landscape visualization techniques that help probe further into these results.
(TCPL 201)
14:20 - 14:45 Natasa Przulj: Predictive Integration of Networked Big Data
We are faced with a flood of molecular, clinical, economic and other data. Various biomolecules interact in a cell to perform biological function, forming large, complex systems. The challenge is how to mine these molecular systems to answer fundamental questions, including gaining new insight into diseases and improving therapeutics. Just as computational approaches for analyzing genetic sequences have revolutionized biological understanding, the expectation is that analyses of large-scale, networked “omics” data will have similar ground-breaking impacts. However, dealing with these data is nontrivial, since many methods for analyzing large networks fall into the category of computationally intractable problems. We develop methods for extracting new biomedical knowledge from the wiring patterns of large molecular and patient data, linking molecular network wiring with biological function and disease, hence translating the information hidden in the wiring patterns into domain knowledge. We apply our methods to other domains, including tracking the dynamics of the world trade network and finding new insights into the origins of wealth and economic crises. Our new methods stem from network science approaches coupled with graph-regularized non-negative matrix tri-factorization, a machine learning technique for co-clustering heterogeneous datasets.
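The tri-factorization at the core of this approach can be written generically as follows (one common formulation; the symbols here are our assumptions, and the actual objective in this line of work couples many such relation matrices at once):

```latex
% Graph-regularized non-negative matrix tri-factorization (generic form):
% R: relation matrix between two entity types (e.g. genes x diseases),
% G_1, G_2 >= 0: low-dimensional cluster indicator factors,
% S: compressed relation matrix,
% L_1, L_2: graph Laplacians of within-type networks.
\min_{G_1 \ge 0,\; G_2 \ge 0}\;
\left\| R - G_1 S G_2^{\top} \right\|_F^2
+ \lambda\,\mathrm{tr}\!\left(G_1^{\top} L_1 G_1\right)
+ \lambda\,\mathrm{tr}\!\left(G_2^{\top} L_2 G_2\right)
```

The Laplacian trace terms are what make the factorization "graph-regularized": entities wired together in a network are pushed toward the same clusters, which is how wiring patterns are translated into co-clustering structure.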
(TCPL 201)
14:45 - 15:10 Benjamin Haibe-Kains: Integrative cancer pharmacogenomics to infer large-scale drug taxonomy
Identification of drug targets and mechanism of action (MoA) for new and uncharacterized anticancer drugs is important for optimizing treatment efficacy. Current MoA prediction approaches largely rely on prior information, including side effects, therapeutic indications, and/or chemo-informatics. Such information is not transferable or applicable to newly identified, previously uncharacterized small molecules. Therefore, a paradigm shift in MoA prediction is needed, towards unbiased approaches that can elucidate drug relationships and efficiently classify new compounds from basic input data. I will describe a new integrative computational pharmacogenomic approach, referred to as Drug Network Fusion (DNF), that relies only on basic drug characteristics to infer scalable drug taxonomies and elucidate drug-drug relationships. DNF is the first framework to integrate drug structural information, high-throughput drug perturbation, and drug sensitivity profiles, enabling drug classification of new experimental compounds with minimal prior information. I will show that the DNF taxonomy succeeds in identifying pertinent and novel drug-drug relationships, making it suitable for investigating experimental drugs with potential new targets or MoA. I will highlight how the scalability of DNF facilitates identification of key drug relationships across different drug categories, and how DNF serves as a flexible tool for potential clinical applications in precision medicine. Our results support DNF as a valuable resource to the cancer research community by providing new hypotheses on compound MoA and potential insights for drug repurposing.
(TCPL 201)
15:10 - 15:30 Coffee Break (TCPL Foyer)
15:30 - 15:55 Barbara Engelhardt: Intersecting pathology images and gene expression data to understand drivers of complex phenotypes
Understanding the correlations between genotype, gene expression levels, and high dimensional complex traits has been essential to studying the drivers of human complex disease and identifying effective therapeutic strategies for these traits. However, some complex traits are difficult to characterize in such a way as to make the quantification of correlations possible; one such complex trait is pathology imaging data. In this work, we use a type of deep learning, a convolutional autoencoder, to automatically extract one thousand features from each pathology image, and we use sparse canonical correlation analysis to correlate these pathology images with paired gene expression data on the same samples. Across three data sets, including two cancer tissue data sets and the GTEx data that include paired pathology imaging data and gene expression data, we find that our approach identifies the subset of genes that are differentially expressed with respect to specific image features, including cell size, extracellular matrix organization, cell wall thickness, and cell shape. We also pursue genotype association with pathology features in the GTEx data. We validate these associated genes and genotypes correlated with pathology image features using various approaches including gene ontology enrichment, tissue specific expression, and Mendelian randomization, allowing us to identify the drivers of cellular phenotypes. This work begins to explore the possibility of association mapping with phenotype data automatically derived from images.
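Sparse canonical correlation analysis, as used here, finds paired sparse weight vectors that maximally correlate the two data views. A standard formulation (notation assumed: X is the matrix of extracted image features, Y the paired gene expression matrix, and c_1, c_2 sparsity budgets):

```latex
% Sparse CCA: paired sparse directions u (image features) and
% v (genes) that maximize cross-view covariance.
\max_{u,\, v}\; u^{\top} X^{\top} Y\, v
\quad \text{subject to} \quad
\|u\|_2^2 \le 1,\;\; \|v\|_2^2 \le 1,\;\;
\|u\|_1 \le c_1,\;\; \|v\|_1 \le c_2
```

The L1 constraints are what produce the interpretable output described above: a small subset of genes whose expression co-varies with a small subset of image features.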
(TCPL 201)
15:55 - 16:20 James Taylor: Chromosome Conformation in Context
Chromosome conformation capture (3C) techniques have revealed many features about the structure of chromatin: Large scale organization into two compartments (A and B), relatively stable partitioning into Topologically Associating Domains (TADs) at the megabase scale, and more dynamic and cell type specific interactions associated with various architectural proteins at the sub-megabase scale (sub-TADs). However, these assays are limited in that they only interrogate interactions between chromatin without any connection to location in the nucleus, and are an ensemble measurement over thousands of cells. Here we use integrated analysis of chromosome conformation capture (Hi-C) data, DamID, and single cell imaging to understand the relationship of chromosome conformation features with a specific nuclear compartment, the periphery. Using a new high-resolution compartment scoring algorithm, we show that the B (or generally less active) compartment corresponds to Lamin Associated Domains (LADs) which are positioned at the nuclear lamina. These regions have a histone modification profile typical of heterochromatin, but are also depleted of the architectural protein CTCF, suggesting their organization is CTCF independent. However within the LADs we find many small regions (less than 25kb) that have low signal for Lamin association and a dramatic change in compartment score. These regions are highly enriched for histone states and DNA binding proteins associated with active elements, and appear to contain both transcription start sites and a large number of putative cis-regulatory modules. This suggests regions much smaller than a stereotypical TAD can be organized into a different nuclear compartment from their surrounding DNA, and that this organization is involved in gene regulation at a distance.
(TCPL 201)
16:20 - 16:45 Hae Kyung Im: Propagating consequences of molecular mechanisms into complex phenotypes
To understand the biological mechanisms underlying thousands of genetic variants robustly associated with complex traits, scalable methods that integrate GWAS and functional data generated by large-scale efforts are needed. We have proposed a method termed MetaXcan that addresses this need by inferring the downstream consequences of genetically regulated components of molecular traits on complex phenotypes using summary data only. MetaXcan allows multiple causal variants and flexible multivariate models extending the capabilities of existing methods and enabling the testing of complex processes. The application to prediction models of gene expression levels in 44 human tissues and 100+ complex phenotypes revealed many novel genes and re-identified known ones with patterns of regulation in expected as well as unexpected tissues. Prediction models of miRNA showed the potential to identify novel mRNA targets.
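The summary-statistics-based test at the heart of this approach combines SNP-level GWAS z-scores through the expression prediction weights, roughly as follows (a commonly quoted form of the statistic; symbols annotated below are our gloss):

```latex
% Gene-level association from SNP-level summary statistics:
% w_{lg}: weight of SNP l in the expression prediction model of gene g,
% \hat\sigma_l: standard deviation of SNP l,
% \hat\sigma_g: standard deviation of predicted expression of gene g,
% Z_l: GWAS z-score of SNP l.
Z_g \;\approx\; \sum_{l \,\in\, \mathrm{model}_g} w_{lg}\,
\frac{\hat{\sigma}_l}{\hat{\sigma}_g}\, Z_l
```

Because only the weights, reference variances, and GWAS z-scores are needed, the test runs on summary data alone, without individual-level genotypes.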
(TCPL 201)
16:45 - 17:35 Debrief (TCPL 201)
17:35 - 19:30 Dinner (Vistas Dining Room)
Friday, March 31
07:00 - 09:00 Breakfast (Vistas Dining Room)
09:00 - 10:00 Open discussion (TCPL 201)
10:00 - 10:30 Coffee Break (TCPL Foyer)
10:30 - 12:00 Open discussion (TCPL 201)
11:30 - 12:00 Checkout by Noon
5-day workshop participants are welcome to use BIRS facilities (BIRS Coffee Lounge, TCPL and Reading Room) until 3 pm on Friday, although participants are still required to checkout of the guest rooms by 12 noon.
(Front Desk - Professional Development Centre)
12:00 - 13:30 Lunch from 11:30 to 13:30 (Vistas Dining Room)