Statistical and Computational Challenges in Large Scale Molecular Biology (17w5131)
Stephen Montgomery (Stanford University)
Manolis Kellis (Massachusetts Institute of Technology and Broad Institute)
Jeff Leek (Johns Hopkins Bloomberg School of Public Health)
Anna Goldenberg (University of Toronto)
Barbara Engelhardt (Princeton University)
Laurent Jacob (Centre national de la recherche scientifique)
Over the past few years, an increasing number of large-scale datasets have been made available in molecular biology. GTEx, for example, produced approximately 10,000 RNA-Seq assays for multiple tissues in hundreds of individuals, Mindact generated gene expression data from about 7000 breast tumors in a single study, and 23andMe claims to have genotyped about 900,000 individuals. This growth in the available genomic data is expected to increase our capacity to identify cancer subtypes, regulatory genes, SNPs associated with phenotypes of interest, and biomarkers for many human traits. It also makes it worthwhile to explore more complex feature representations when analyzing these datasets.
However, increasing the number of samples and features leads to a set of interrelated statistical and computational problems. Accordingly, the objectives of our workshop will be to:
- Systematically identify the statistical and computational problems arising during the analysis of large scale data in molecular biology;
- Bring together experts in computational biology, molecular biology, computer science, and statistics to propose innovative solutions to these problems, by leveraging recent advances in each of these fields.
Relevance, importance and timeliness
A number of studies generating high-throughput molecular data for a large number of biological samples have been completed over the past five years. Our workshop is important because the availability of these datasets holds great promise for improving health and for understanding molecular biology. First of all, if exploited correctly, larger sample sizes should improve our ability to predict phenotypes of interest from molecular data. This entails very important applications such as improving the survival of cancer patients by better predicting which treatment they should receive, or decreasing bacterial resistance by predicting which antibiotic is effective against a new strain. Correctly exploiting large-scale datasets should also allow us to better identify genetic and epigenetic determinants of these phenotypes, yielding a better understanding of human diseases and potentially guiding the development of new treatments and prevention policies. In particular, more samples should allow the detection of less frequent variants in the human genome, or of more complex features involving several modalities (copy number, expression, methylation, etc.) associated with diseases. Finally, larger sample sizes should help with essential unsupervised tasks such as the inference of regulatory networks or the identification of cancer subtypes.
Our workshop is relevant because all of these promises depend on our solving new statistical and computational challenges. First (Challenge 1), we need to build new feature spaces and estimators whose complexity is adapted to these larger sample sizes, which involves designing novel, potentially more complex descriptors of the samples while still controlling the bias/variance trade-off. Second (Challenge 2), we need to build models which correctly integrate different modalities, such as copy number variation and gene expression. Third (Challenge 3), larger scale studies are more prone to unwanted variation, because they typically involve different labs and technical changes which can affect the measurements and become confounders in retrospective analyses. Similar or worse problems arise when trying to combine several existing datasets. We need methods which take this unwanted variation into account. Finally (Challenge 4), we need new algorithms that make existing statistical tools scalable to the new sample sizes, and that make estimation over the larger and more complex features of Challenge 1 tractable.
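To make Challenge 3 concrete, the following is a minimal sketch of one deliberately naive way to account for unwanted variation: subtracting each batch's per-feature mean when the batch label of every sample is known. The function name `center_by_batch` and the simulated data are hypothetical illustrations, not a method proposed by the workshop, and this approach would also remove any biological signal that is confounded with batch.

```python
import numpy as np

def center_by_batch(X, batch):
    """Subtract each batch's per-feature mean from the samples in that batch.

    X     : (n_samples, n_features) array of measurements.
    batch : length-n_samples array of known batch labels (an assumption:
            in practice the sources of unwanted variation may be unknown).
    """
    X = np.asarray(X, dtype=float)
    batch = np.asarray(batch)
    out = X.copy()
    for b in np.unique(batch):
        mask = batch == b
        out[mask] -= X[mask].mean(axis=0)
    return out

# Simulated example: two batches, the second shifted by a constant offset.
rng = np.random.default_rng(0)
signal = rng.standard_normal((6, 4))
batch = np.array([0, 0, 0, 1, 1, 1])
shift = np.where(batch[:, None] == 1, 5.0, 0.0)
X = signal + shift
X_centered = center_by_batch(X, batch)
```

After centering, the per-feature mean within each batch is zero, so the constant offset between the two simulated batches no longer separates them.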
We also believe our workshop is very timely because some of these statistical and computational challenges are starting to be addressed in other application fields of statistics. It is crucial to recognize that the orders of magnitude are still very different in molecular biology and in other data science application fields because of the cost and complexity of the data generation process: current large-scale high-throughput sequencing datasets typically contain a few thousand samples but millions of features, while computer vision, web, or astronomy datasets can involve billions or trillions of samples and relatively fewer features. A first consequence is that not all recent developments in machine learning are immediately transferable to computational biology. For example, so-called deep learning methods have gained a lot of popularity and now represent the state of the art in computer vision, but may not be the most appropriate tool for predicting cancer outcome from molecular data. However, the fact that other fields already have much larger sample sizes also means that they have had to develop efficient and scalable algorithms for basic tasks like feature selection, classification, or clustering. These recent developments are a great source of inspiration for computational biology, where large-scale computation is still an emerging challenge.
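As a small illustration of why the regime with few samples n and many features p calls for its own algorithmic choices, the sketch below solves ridge regression through its dual form, which only requires an n x n linear system instead of a p x p one. Ridge regression is chosen here purely as a familiar example; the function names, sizes, and simulated data are assumptions made for illustration.

```python
import numpy as np

def ridge_dual(X, y, lam):
    """Ridge regression via the dual: solve an n x n system, cheap when n << p."""
    n = X.shape[0]
    K = X @ X.T                                   # n x n Gram matrix
    alpha = np.linalg.solve(K + lam * np.eye(n), y)
    return X.T @ alpha                            # weights in the p-dimensional feature space

def ridge_primal(X, y, lam):
    """Ridge regression via the primal: solve a p x p system, costly when p is large."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Simulated data with far fewer samples than features.
rng = np.random.default_rng(0)
n, p = 50, 2000
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

w_dual = ridge_dual(X, y, lam=1.0)
w_primal = ridge_primal(X, y, lam=1.0)
```

Both functions return the same weight vector up to numerical error, since (X^T X + lam I)^{-1} X^T y = X^T (X X^T + lam I)^{-1} y. The dual route scales with the number of samples, whereas the primal route becomes impractical once p reaches the millions of features mentioned above.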
We believe having a small-scale workshop involving international experts in machine learning, statistics, computational biology, and molecular biology is of utmost importance for three main reasons. The first reason is that the technical advances we are referring to are very recent, are often unknown to computational biologists, and involve paradigms such as online optimization, accelerated gradient methods, and network flow optimization, with which they are sometimes unfamiliar. The second reason is that it is not always obvious to non-statisticians which novel methods are appropriate given the current n/p regime, that is, the ratio of the number of samples n to the number of features p. Conversely, the third reason is that statisticians are often unaware of the recent challenges in molecular biology. Having them work on abstract versions of the problems is often unsatisfactory, as it is necessary to be aware of the technical realities and of the underlying biology of the problem to come up with useful solutions.