Schedule for: 23w5090 - Single-Cell Plus – Data Science Challenges in Single-Cell Research

Beginning on Sunday, July 2 and ending Friday, July 7, 2023

All times in Banff, Alberta time, MDT (UTC-6).

Sunday, July 2
16:00 - 17:30 Check-in begins at 16:00 on Sunday and is open 24 hours (Front Desk - Professional Development Centre)
17:30 - 19:30 Dinner
A buffet dinner is served daily between 5:30pm and 7:30pm in Vistas Dining Room, top floor of the Sally Borden Building.
(Vistas Dining Room)
20:00 - 22:00 Informal gathering
PDC Lounge & Reading Room
(Other (See Description))
Monday, July 3
07:00 - 09:00 Breakfast
Breakfast is served daily between 7 and 9am in the Vistas Dining Room, the top floor of the Sally Borden Building.
(Vistas Dining Room)
09:00 - 09:20 Introduction and Welcome by BIRS Staff
A brief introduction to BIRS with important logistical information, technology instruction, and opportunity for participants to ask questions.
(TCPL 201)
09:20 - 09:30 Sunduz Keles: Day 1: System biology, gene regulation and epigenomics (TCPL 201)
09:30 - 09:45 Jake Yeung: Greater than the sum of the parts: Learning relationships between histone modifications in single cells
Detecting histone modifications in single cells by sequencing is still in its infancy, but has the potential to unlock the full spectrum of different chromatin states in the genome of individual cells. More established single-cell sequencing technologies, such as mRNA-seq and ATAC-seq, interrogate only a tiny fraction of the genome. Progress therefore hinges on both new measurement methods that can map multiple epigenetic marks in single cells as well as new analysis methods that connect different modalities together. I will first present sortChIC, a method to map histone modifications in single cells from rare cell populations. During hematopoiesis, we find that repressive chromatin dynamics are qualitatively different than active ones: active chromatin states mainly distinguish mature blood cell types, while changes in repressive states mainly distinguish HSCs and mature cell types. These findings suggest that hematopoiesis requires overcoming heterochromatin barriers in a cell fate-independent manner. Next I will present scChIX-seq, a framework to generate multimodal histone modification data in single cells. We develop multimodal analysis methods that reveal genome regulation that would otherwise be missed from unimodal data. Overall, these integrated experimental and computational methods reveal dynamic relationships between chromatin states, and how those relationships change during differentiation.
(TCPL 201)
09:45 - 10:00 Keegan Korthauer: Probabilistic modelling of single-cell methylation sequencing data reveals regions that are informative of cell type and cell state
Regions that exhibit heterogeneity in DNA methylation (DNAm) across cells may play a role in processes such as gene regulation, disease susceptibility and environmental influences. They may also act as predictive signatures of cell type or cell state. Single-cell bisulfite sequencing (scBS-seq) provides measurements of DNAm in individual cells, but the data are extremely sparse, typically with greater than 80% missing rate. We propose a novel computational tool for detection of variably methylated regions (VMRs) in scBS-seq data. Our approach uses a probabilistic model to (1) leverage the correlation structure of nearby DNAm sites, and (2) pool information across cells to overcome the challenges of sparsity. Compared to VMRs detected by previous methods, our approach demonstrates increased clustering accuracy in simulations and a case study of mouse neuronal cells.
(TCPL 201)
10:00 - 10:15 Zhana Duren: Modelling gene regulation via integrative analysis of single cell multi-omics data
The accurate inference of context-specific Gene Regulatory Networks (GRNs) from genomics data is a crucial task in computational biology. However, current methods have limitations, including relying solely on gene expression data, lower resolution from bulk data, and limited data availability for certain cellular systems. To address these challenges, we developed a new method based on a lifelong neural network to infer high accuracy GRNs (LINGER) from single cell gene expression and chromatin accessibility data by leveraging atlas-scale external data. The LINGER model proposes a metric called the "pioneer index" to quantify the ability of transcription factors (TFs) to initiate chromatin remodeling, improving the accuracy and interpretability of the GRN. The LINGER method achieved 3 times higher accuracy than currently available methods and provided insights into the interpretation of disease-associated variants and genes, offering a comprehensive tool for inferring gene regulation from genomics data.
(TCPL 201)
10:15 - 10:45 Coffee Break (TCPL Foyer)
10:45 - 11:00 Sunyoung Shin: Scalable test of statistical significance for protein-DNA binding changes with insertion and deletion of bases in the genome
Mutations in noncoding DNA, which represents approximately 99% of the human genome, have been crucial to understanding disease mechanisms through dysregulation of disease-associated genes. One key element in gene regulation that noncoding mutations mediate is the binding of proteins to DNA sequences. Insertion and deletion of bases (InDels) are the second most common type of mutations, following single nucleotide polymorphisms, that may impact protein-DNA binding. However, no existing methods can estimate and test the effects of InDels on the process of protein-DNA binding. We develop a novel test of statistical significance, namely the binding change test (BC test), using a Markov model to evaluate the impact and identify InDels altering protein-DNA binding. The test predicts binding changer InDels of regulatory significance with an efficient importance sampling algorithm generating background sequences in favor of large binding affinity changes. Simulation studies demonstrate its excellent performance. The application to human leukemia data uncovers candidate pathological InDels that modulate MYC binding in leukemic patients. We develop an R package atIndel, which is available on GitHub.
(TCPL 201)
11:00 - 11:15 Rachel Wang: scTIE: a unified framework for data integration and inference of gene regulation using single-cell temporal multimodal data
Single-cell technologies offer unprecedented opportunities to dissect gene regulatory mechanisms in context-specific ways. Although an increasing number of computational methods have been developed for inferring gene regulatory relationships from scRNA-seq and scATAC-seq data, the data integration problem, essential for accurate cell type identification, has been mostly treated as a standalone challenge. I will present scTIE, a unified method that integrates temporal multimodal data and infers regulatory relationships predictive of cellular state changes. scTIE uses an autoencoder to embed cells from all time points into a common space using iterative optimal transport, followed by extracting interpretable information to predict cell trajectories. Using a variety of synthetic and real temporal multimodal datasets, we demonstrate scTIE achieves effective data integration while preserving more biological signals than existing methods, particularly in the presence of batch effects and noise. Furthermore, on the exemplar multiome dataset we generated from differentiating mouse embryonic stem cells over time, we demonstrate scTIE captures regulatory elements highly predictive of cell transition probabilities, providing new potentials to understand the regulatory landscape driving developmental processes.
(TCPL 201)
11:15 - 12:15 Hongkai Ji: Discussion Session (TCPL 201)
12:20 - 12:30 Group Photo
Meet in foyer of TCPL to participate in the BIRS group photo. The photograph will be taken outdoors, so dress appropriately for the weather. Please don't be late, or you might not be in the official group photo!
(TCPL Foyer)
12:35 - 14:00 Lunch
Lunch is served daily between 11:30am and 1:30pm in the Vistas Dining Room, the top floor of the Sally Borden Building.
(Vistas Dining Room)
13:00 - 14:00 Guided Tour of The Banff Centre
Meet at the PDC front desk for a guided tour of The Banff Centre campus.
(TCPL 201)
14:20 - 15:30 Hongyu Zhao: Round table discussion integration of GWAS with scATAC-seq data (TCPL 201)
15:40 - 16:10 Coffee Break (TCPL Foyer)
16:10 - 16:25 Yuanhua Huang: Modelling of cellular dynamics on differentiation and lineage
The recent advances in single-cell RNA-seq technologies offer a promising way to dissect the cellular dynamics of both differentiation and lineages. However, various statistical and computational challenges exist in inferring these temporal latent variables or structures. In this talk, we will introduce recent methodological progress on the single-cell RNA velocity field and discuss a few potential strategies that may further improve its robustness so that it applies to a broad range of biological systems. We will also introduce how lineage reconstruction techniques may elucidate the clonal preference in cell fate decisions.
(Online)
16:35 - 16:50 Kwangmoon Park: Joint tensor modeling of single cell 3D genome and epigenetic data with Muscle
Emerging single cell technologies that simultaneously capture long-range interactions of genomic loci together with their DNA methylation levels are advancing our understanding of three-dimensional genome structure and its interplay with the epigenome at the single cell level. While methods to analyze data from single cell high throughput chromatin conformation capture (scHi-C) experiments are maturing, methods that can jointly analyze multiple single cell modalities with scHi-C data are lacking. Here, we introduce Muscle, a semi-nonnegative joint decomposition of Multiple single cell tensors, to jointly analyze 3D conformation and DNA methylation data at the single cell level. Muscle takes advantage of the inherent tensor structure of the scHi-C data, and integrates this modality with DNA methylation. We developed an alternating least squares algorithm for estimating Muscle parameters and established its optimality properties. Parameters estimated by Muscle directly align with the key components of the downstream analysis of scHi-C data in a cell type specific manner. Evaluations with data-driven experiments and simulations demonstrate the advantages of the joint modeling framework of Muscle over single modality modeling or a baseline multi modality modeling for cell type delineation and elucidating associations between modalities.
(TCPL 201)
17:00 - 17:15 Zhanying Feng: Combinatorial regulons (cRegulon): a novel optimization model for unraveling cellular identity and state transitions through single-cell multi-omics data
We propose the combinatorial regulon (cRegulon) to model the combinations among TFs, which can better characterize cell types and serve as the driving forces for cell state transitions. By leveraging rapidly accumulating single-cell multi-omics data, we develop an optimization model to systematically infer cRegulons (i.e., the representative TF modules together with the regulatory network formed by their associated regulatory elements and target genes). In our approach, a cRegulon is jointly reconstructed by i) identifying TF modules from the TF combinatorial network, ii) explaining gene expression in scRNA-seq data, and iii) explaining gene activity in scATAC-seq data. The inferred cRegulons therefore provide details of how TF combinations utilize specific regulations to characterize the identity and transition of cell types/states.
(Online)
17:15 - 17:50 Sara Mostafavi: Discussion Session (TCPL 201)
17:30 - 20:00 Dinner
A buffet dinner is served daily between 5:30pm and 7:30pm in Vistas Dining Room, top floor of the Sally Borden Building.
(Vistas Dining Room)
Tuesday, July 4
07:00 - 09:00 Breakfast
Breakfast is served daily between 7 and 9am in the Vistas Dining Room, the top floor of the Sally Borden Building.
(Vistas Dining Room)
09:10 - 09:25 Sara Mostafavi: Day 2: Advances in single-cell RNA-Seq data (TCPL 201)
09:25 - 09:40 Hongkai Ji: A statistical framework for differential pseudotime analysis with multiple single-cell RNA-seq samples
Pseudotime analysis with single-cell RNA-sequencing data has been widely used to study dynamic gene regulatory programs along continuous biological processes. While many methods have been developed to infer the pseudotemporal trajectories of cells within a biological sample, it remains a challenge to compare pseudotemporal patterns with multiple samples (or replicates) across different experimental conditions. Lamian is a comprehensive and statistically-rigorous computational framework for differential multi-sample pseudotime analysis. It can be used to identify changes in a biological process associated with sample covariates, such as different biological conditions while adjusting for batch effects, and to detect changes in gene expression, cell density, and topology of a pseudotemporal trajectory. Unlike existing methods that ignore sample variability, Lamian draws statistical inference after accounting for cross-sample variability and hence substantially reduces sample-specific false discoveries that are not generalizable to new samples. Using both real scRNA-seq and simulation data, including an analysis of differential immune response programs between COVID-19 patients with different disease severity levels, we demonstrate the advantages of Lamian in decoding cellular gene expression programs in continuous biological processes.
(TCPL 201)
09:40 - 09:55 Mengjie Chen: The curses of performing differential expression analysis using single cell data
Differential expression analysis in single-cell transcriptomics provides essential insights into cell-type-specific responses to internal and external stimuli. While many methods are available to identify differentially expressed genes from single-cell transcriptomics, recent studies raise important concerns about the performance of state-of-the-art methods. As single-cell studies are rapidly scaled up to the population level, powerful and accurate methods will be essential for obtaining meaningful results. In this context, we highlight various limitations and conceptual flaws in the current workflows for single-cell differential expression analysis. Furthermore, we present a new paradigm that offers a potential solution to these issues.
(TCPL 201)
09:55 - 10:10 Julia Salzman: SPLASH is a reference-free statistical algorithm, unifying biological discovery in single cell sequencing and beyond
Myriad mechanisms diversify the sequence content of RNA transcripts and are of great interest to single cell biology. Currently, these events are detected using tools that, first, require alignment to a necessarily incomplete reference genome; this incompleteness is especially prominent in diseases such as cancer. Second, the next step in the analysis requires a custom choice of bioinformatic procedure: for example, to detect splicing, RNA editing or V(D)J recombination, among others. I will present collaborative work based on a new statistics-first analytic method, SPLASH (Statistically Primary aLignment Agnostic Sequence Homing), that performs unified, reference-free inference directly on raw sequencing reads without a reference genome or cell metadata. SPLASH is highly efficient and simple to run. As a snapshot of SPLASH, applied to 10,326 primary human single cells in 19 tissues profiled with SmartSeq2, we discover a set of splicing and histone regulators with highly conserved intronic regions that are themselves targets of complex splicing regulation, unreported transcript diversity in the heat shock protein HSP90AA1, and diversification in centromeric RNA expression, V(D)J recombination, RNA editing, and repeat expansions missed by existing methods, as well as unpublished extensions to 10x Genomics data.
(TCPL 201)
10:10 - 10:25 Hyonho Chun: Similarity-assisted variational autoencoder for nonlinear dimension reduction with application to single-cell RNA sequencing data
Deep generative models naturally become nonlinear dimension reduction tools to visualize large-scale datasets such as single-cell RNA sequencing datasets for revealing latent grouping patterns or identifying outliers. The Variational autoencoder (VAE) is a popular deep generative method equipped with encoder/decoder structures. The encoder and decoder are useful when a new sample is mapped to the latent space and a data point is generated from a point in a latent space. However, the VAE tends not to show grouping patterns clearly without additional annotation information. On the other hand, similarity-based dimension reduction methods such as t-SNE or UMAP present clear grouping patterns even though these methods do not have encoder/decoder structures. To bridge this gap, we propose a new approach that adopts similarity information in the VAE framework. In addition, for biological applications, we extend our approach to a conditional VAE (CVAE) to account for covariate effects in the dimension reduction step. Our method is able to produce clearer grouping patterns than those of other regularized VAE methods by utilizing similarity information encoded in the data via the highly celebrated UMAP loss function.
(TCPL 201)
10:25 - 10:45 Coffee Break (TCPL Foyer)
11:15 - 11:30 Agus Salim: RUV-III-NB: A robust method for normalization of single cell RNA-seq data
Normalization of single cell RNA-seq data remains a challenging task. The performance of different methods can vary greatly between datasets when unwanted factors and biology are associated. Most normalization methods also only remove the effects of unwanted variation for the cell embedding but not from gene-level data typically used for differential expression (DE) analysis to identify marker genes. We propose RUV-III-NB, a method that can be used to remove unwanted variation from both the cell embedding and gene-level counts. Using pseudo-replicates, RUV-III-NB explicitly takes into account potential association with biology when removing unwanted variation. The method can be used for either UMI or read counts and returns adjusted counts that can be used for downstream analyses such as clustering, DE and pseudotime analyses. Using published datasets with different technological platforms, kinds of biology and levels of association between biology and unwanted variation, we show that RUV-III-NB manages to remove library size and batch effects, strengthen biological signals, improve DE analyses, and lead to results exhibiting greater concordance with independent datasets of the same kind. The performance of RUV-III-NB is consistent and is not sensitive to the number of factors assumed to contribute to the unwanted variation.
(TCPL 201)
11:30 - 11:45 Matthew Ritchie: Modelling group heteroscedasticity in single-cell RNA-seq pseudo-bulk data
Group heteroscedasticity is commonly observed in pseudo-bulk single-cell RNA-seq datasets and its presence can hamper the detection of differentially expressed genes. Since most bulk RNA-seq methods assume equal group variances, we introduce two new approaches that account for heteroscedastic groups, namely voomByGroup and voomWithQualityWeights using a blocked design (voomQWB). Compared to current gold-standard methods that do not account for group heteroscedasticity, we show results from simulations and various experiments that demonstrate the superior performance of voomByGroup and voomQWB in terms of error control and power when group variances in pseudo-bulk single-cell RNA-seq data are unequal.
(TCPL 201)
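As a rough illustration of the group heteroscedasticity described in this abstract, the sketch below sums per-cell counts into one pseudo-bulk value per sample and then compares the variance of those values within each group. It is not voomByGroup or voomQWB; the toy counts, sample names, and the helper functions `pseudo_bulk` and `group_variances` are all hypothetical and exist only for this example.

```python
from collections import defaultdict
from statistics import variance

# Hypothetical toy data: (sample_id, group, count of one gene in one cell).
cells = [
    ("s1", "control", 2), ("s1", "control", 3),
    ("s2", "control", 3), ("s2", "control", 2),
    ("s3", "control", 2), ("s3", "control", 4),
    ("s4", "treated", 0), ("s4", "treated", 1),
    ("s5", "treated", 9), ("s5", "treated", 8),
    ("s6", "treated", 4), ("s6", "treated", 3),
]


def pseudo_bulk(cells):
    """Sum single-cell counts within each sample (one value per sample)."""
    totals = defaultdict(int)
    groups = {}
    for sample, group, count in cells:
        totals[sample] += count
        groups[sample] = group
    return dict(totals), groups


def group_variances(totals, groups):
    """Sample variance of the pseudo-bulk values within each group."""
    by_group = defaultdict(list)
    for sample, total in totals.items():
        by_group[groups[sample]].append(total)
    return {g: variance(v) for g, v in by_group.items()}


totals, groups = pseudo_bulk(cells)
variances = group_variances(totals, groups)
print(variances)
```

In this toy data the treated group is far more variable than the control group, which is the situation where a method assuming equal group variances would be expected to struggle.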
11:45 - 12:00 Jessica Mar: One of these cells is not like the other - modelling variability of gene expression in single cell data
Gene expression changes underpin the regulation of almost all cellular phenotypes in nature. While we typically focus on changes in average gene expression, we know that changes in gene expression variability can also impact regulation. Single cell data has provided an incredible opportunity to study how the variability of gene expression impacts a cell population. But like any data science question, there are challenges in how to model variability in single cell data. This talk highlights studies from my group which have focused on how to model heterogeneity in single cell RNA-seq data and its role in regulating phenotypes like ageing and differentiation.
(TCPL 201)
12:00 - 12:30 David Shih: Discussion (TCPL 201)
12:15 - 14:30 Lunch
Lunch is served daily between 11:30am and 1:30pm in the Vistas Dining Room, the top floor of the Sally Borden Building.
(Vistas Dining Room)
14:30 - 15:30 Sunduz Keles: Round table discussion: Grand challenges in single cell data (TCPL 201)
15:30 - 16:10 Coffee Break (TCPL Foyer)
16:10 - 16:25 Zheng Ye: Robust normalization and integration of single-cell protein expression across CITE-seq datasets
CITE-seq technology enables the direct measurement of protein expression, known as antibody-derived tags (ADT), in addition to RNA expression. The increase in the copy number of protein molecules leads to a more robust detection of protein features compared to RNA, providing a deep definition of cell types. However, due to added discrepancies of antibodies, such as the different types or concentrations of IgG antibodies, the batch effects of the ADT component of CITE-seq can dominate over biological variations, especially for across-study integration. We present ADTnorm as a normalization and integration method designed explicitly for the ADT counts of CITE-seq data. Benchmarking with existing scaling and normalization methods, ADTnorm achieves a fast and accurate matching of the negative and positive peaks of the ADT counts across samples, efficiently removing technical variations across batches. Further quantitative evaluations confirm that ADTnorm achieves the best cell-type separation while maintaining the minimal batch effect. Therefore, ADTnorm facilitates the scalable ADT count integration of massive public CITE-seq datasets with distinct experimental designs, which are essential for creating a corpus of well-annotated single-cell data with deep and standardized annotations.
(TCPL 201)
16:25 - 16:40 Kelly Street: Improving the Resolution of Single-Cell TCR-seq
T-cell receptors (TCRs) are hypervariable protein complexes that recognize foreign antigens and play an important role in immune response. Modern sequencing technology allows for the full characterization of these complexes at single-cell resolution and has the potential to serve as a broadly applicable diagnostic tool. However, single-cell TCR sequencing data is often ambiguous, making it difficult to differentiate between cells with distinct clonotypes. Many modern analyses have focused solely on cells with complete information, discarding ambiguous cells and thereby losing data. We propose an expectation maximization (E-M) algorithm for clonotype assignment, which leverages data from ambiguous cells to provide superior repertoire characterization.
(TCPL 201)
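The abstract above does not spell out the model, so the following is only a generic sketch of how an E-M algorithm can use ambiguous cells instead of discarding them. It assumes each cell is summarized by the set of clonotypes it is compatible with (a hypothetical encoding) and estimates clonotype frequencies; the function name `em_clonotype_frequencies` and the toy data are illustrative assumptions, not the speaker's method.

```python
def em_clonotype_frequencies(cells, clonotypes, n_iter=100):
    """Estimate clonotype frequencies when some cells are ambiguous.

    Each cell is the set of clonotypes it is compatible with; an
    unambiguous cell is a set of size one.
    """
    freq = {c: 1.0 / len(clonotypes) for c in clonotypes}
    for _ in range(n_iter):
        expected = {c: 0.0 for c in clonotypes}
        for compatible in cells:
            # E-step: split the cell across its compatible clonotypes in
            # proportion to the current frequency estimates.
            norm = sum(freq[c] for c in compatible)
            for c in compatible:
                expected[c] += freq[c] / norm
        # M-step: new frequencies are the normalised expected counts.
        freq = {c: expected[c] / len(cells) for c in clonotypes}
    return freq


# Toy repertoire: 3 cells clearly A, 1 clearly B, 2 compatible with both.
cells = [{"A"}, {"A"}, {"A"}, {"B"}, {"A", "B"}, {"A", "B"}]
freq = em_clonotype_frequencies(cells, ["A", "B"])
print(freq)
```

Here the two ambiguous cells are shared between A and B according to the evidence from the unambiguous cells, so the estimates settle at 0.75 for A and 0.25 for B rather than being based on four cells only.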
16:40 - 16:55 David Shih: Integrative analysis of scRNA-seq, scTCR-seq, and TCR-seq to identify and characterize antigen-specific T cells
Investigating the phenotypic profiles of antigen-specific T cells is critical to understanding T cell responses against pathogens as well as improving the efficacy of therapeutics and vaccines. However, current methodologies for identifying antigen-reactive T cells are limited in scope, throughput, or specificity. We have therefore developed an integrative approach to identify antigen-specific T cells in blood samples and characterize their single-cell transcriptomes. Our approach involves first identifying pathogen-specific T cells by modeling the temporal expansion trajectories in longitudinal bulk TCR-seq data, followed by using TCR sequences as barcodes to label the identified antigen-specific T cells in matched scRNA-seq and scTCR-seq data. Applying our approach to a clinical study of an experimental vaccine against human cytomegalovirus, we are able to characterize the single-cell transcriptomes of vaccine-specific T cells and discover transcriptional signatures of transient and durable T cell response to cytomegalovirus. Our approach can thus facilitate the study of T cell responses to vaccines and pathogens. To develop our methodology further, we propose a new longitudinal clustering method using Bayesian nonparametrics.
(TCPL 201)
16:55 - 17:30 Keegan Korthauer: Discussion Session (TCPL 201)
17:30 - 19:00 Dinner
A buffet dinner is served daily between 5:30pm and 7:30pm in Vistas Dining Room, top floor of the Sally Borden Building.
(Vistas Dining Room)
Wednesday, July 5
07:00 - 09:00 Breakfast
Breakfast is served daily between 7 and 9am in the Vistas Dining Room, the top floor of the Sally Borden Building.
(Vistas Dining Room)
09:10 - 09:25 Hongyu Zhao: Day 3: (TCPL 201)
09:25 - 09:40 Yu Li: scNovel: a neural network framework for novel rare cell detection of single-cell transcriptome data
Since bulk RNA sequencing can only provide a holistic perspective on the differences between samples, researchers are eager to obtain single-cell resolution of cell types within diseased tissues to develop more precise therapies. Single-cell RNA-sequencing has become a powerful tool to study biologically significant characteristics at explicitly high resolution. With the unprecedented boom in cell atlases, auto-annotation tools have become more prevalent due to their speed, accuracy, and user-friendly features. However, these tools have mostly focused on general cell type annotation and have not adequately addressed the challenge of detecting novel rare cell types. In this work, we introduce scNovel, a powerful model that specifically focuses on novel rare cell detection. By testing our model on diverse datasets with different scales, protocols, and degrees of imbalance, we demonstrate that scNovel significantly outperforms previous state-of-the-art novel cell detection models, achieving the highest AUROC. Furthermore, we validate scNovel's performance on a million-scale dataset and demonstrate its ability to detect novel cell clusters for biological discovery through the analysis of clinical data. We believe that scNovel will be an important tool for high-throughput clinical data in a wide range of applications. To be more specific, scNovel can help to predict cell-type-specific gene expression profiles with biological significance, which helps biologists and medical researchers to perform downstream analysis leading to enlightening biological results and precision medical diagnoses.
(Online)
09:40 - 09:55 Yue Li: Guided-topic modelling of single-cell transcriptomes enables sub-cell-type and disease-subtype deconvolution of bulk transcriptomes
Cell-type composition is an important indicator of health. We present Guided Topic Model for deconvolution (GTM-decon) to automatically infer cell-type-specific gene topic distributions from single-cell RNA-seq data for deconvolving bulk transcriptomes. GTM-decon performs competitively on deconvolving simulated and real bulk data compared with the state-of-the-art methods. Moreover, as demonstrated in deconvolving disease transcriptomes, GTM-decon can infer multiple cell-type-specific gene topic distributions per cell type, which captures sub-cell-type variations. GTM-decon can also use phenotype labels as a guide to infer phenotype-specific gene distributions. In a nested-guided design, GTM-decon identified cell-type-specific differentially expressed genes from bulk breast cancer transcriptomes.
(TCPL 201)
09:55 - 10:10 Haiyan Huang: Towards a more reliable single-cell RNA-seq clustering - new measure to preserve global cell type relationships
Unsupervised cell clustering based on meaningful biological variation in single-cell RNA sequencing (scRNA-seq) data has received significant attention, as it assists with identifying ontological subpopulations among the data. A key step in the clustering process is to compute distances between cells using a specified distance measure. Although certain distance measures may successfully separate cells into biologically relevant clusters, they may fail to retain the global structure of the data, such as the relative similarity between cell clusters. In this talk, I will introduce a new measure that can more consistently retain the global cell type relationships than commonly used distance measures for scRNA-seq clustering. We used this measure to uncover compositional differences between annotated leukocyte cell groups in a compendium of Mus musculus scRNA-seq assays comprising 12 tissues.
(TCPL 201)
10:10 - 10:25 Shila Ghazanfar: Mosaic single cell data integration
Currently available single cell -omics technologies capture many unique features with different biological information content. Data integration aims to place cells, captured with different technologies, onto a common embedding to facilitate downstream analytical tasks. Current horizontal data integration techniques use a set of common features, thereby ignoring non-overlapping features and losing information. Here we introduce StabMap, a mosaic data integration technique that stabilises mapping of single cell data by exploiting the non-overlapping features. StabMap is a flexible approach that first infers a mosaic data topology based on shared features, then projects all cells onto supervised or unsupervised reference coordinates by traversing shortest paths along the topology. We show that StabMap performs well in various simulation contexts, facilitates “multi-hop” mosaic data integration, and enables the use of novel spatial gene expression features for mapping dissociated single cell data onto a spatial transcriptomic reference.
(TCPL 201)
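The two-stage idea in this abstract (infer a topology from shared features, then traverse shortest paths to a reference) can be sketched with a small graph search. This is not the StabMap implementation: the dataset names, feature sets, and the helpers `mosaic_topology` and `shortest_path` are hypothetical, and the sketch only covers the topology step, not the projection of cells.

```python
from collections import deque
from itertools import combinations


def mosaic_topology(features):
    """Link two datasets whenever they share at least one feature."""
    graph = {name: set() for name in features}
    for a, b in combinations(features, 2):
        if features[a] & features[b]:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def shortest_path(graph, start, goal):
    """Breadth-first search; returns a list of dataset names or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in sorted(graph[path[-1]]):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


# Hypothetical datasets: "atac" and "rna" share no features directly.
features = {
    "rna": {"g1", "g2", "g3"},
    "multiome": {"g2", "g3", "p1", "p2"},
    "atac": {"p1", "p2", "p3"},
}
graph = mosaic_topology(features)
path = shortest_path(graph, "atac", "rna")
print(path)
```

The path from "atac" to the "rna" reference passes through "multiome", which is the kind of "multi-hop" route the abstract refers to when two datasets have no features in common.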
10:30 - 11:00 Coffee Break (TCPL Foyer)
11:00 - 11:15 Emma Zhang: Cell-type-specific co-expression inference from single cell RNA-sequencing data
The advancement of single cell RNA-sequencing (scRNA-seq) technology has enabled the direct inference of co-expressions in specific cell types, facilitating our understanding of cell-type-specific biological functions. For this task, the high sequencing depth variations and measurement errors in scRNA-seq data present two significant challenges, and they have not been adequately addressed by existing methods. We propose a statistical approach, CS-CORE, for estimating and testing cell-type-specific co-expressions, that explicitly models sequencing depth variations and measurement errors in scRNA-seq data. Systematic evaluations show that most existing methods suffer from inflated false positives as well as biased co-expression estimates and clustering analysis, whereas CS-CORE gave accurate estimates in these experiments. When applied to scRNA-seq data from postmortem brain samples from Alzheimer’s disease patients/controls and blood samples from COVID-19 patients/controls, CS-CORE identified cell-type-specific co-expressions and differential co-expressions that were more reproducible and/or more enriched for relevant biological pathways than those inferred from existing methods.
(TCPL 201)
11:15 - 11:30 Jingyi Jessica Li: ClusterDE: a post-clustering differentially expressed (DE) gene identification method robust to false-positive inflation caused by double-dipping
In typical single-cell RNA-seq data analysis, first, a clustering algorithm is applied to cluster cells; then, a statistical method is used to identify the differentially expressed (DE) genes between the cell clusters. However, this common procedure uses the same data twice, an issue known as “double dipping”: the same gene expression data are used to define cell clusters and DE genes, leading to false-positive DE genes even when the cell clusters are spurious. To overcome this challenge, we propose ClusterDE, a post-clustering DE method for controlling the false discovery rate (FDR) regardless of clustering quality. The core idea of ClusterDE is to generate in silico negative control data with only one cluster, which can be used in contrast to real data for evaluating the whole clustering+DE procedure. Using comprehensive simulation and real data analysis, we show that ClusterDE not only has solid FDR control but also finds cell-type marker genes that are biologically meaningful. ClusterDE is fast, transparent, and adaptive to a wide range of clustering methods and statistical tests.
(TCPL 201)
11:30 - 11:45 Mark Robinson: Benchmarking computational methods for single cell and spatial transcriptomics data
Computational methods represent the lifeblood of modern molecular biology. However, the field is experiencing a somewhat unprecedented explosion of computational tools, especially for the analysis of single cell and spatial transcriptomics data. I will motivate the situation with a couple of examples of benchmarking tools and methods related to our own research, but also discuss the topic of benchmarking more generally by reporting on a meta-analysis of single-cell method benchmarks. I'll propose a new computational system for flexible continuous benchmarking that allows the community to be engaged at various levels.
(TCPL 201)
11:45 - 12:30 Angela Wu: Discussion Session (TCPL 201)
12:30 - 13:30 Lunch
Lunch is served daily between 11:30am and 1:30pm in the Vistas Dining Room, the top floor of the Sally Borden Building.
(Vistas Dining Room)
13:30 - 17:30 Free Afternoon (Banff National Park)
17:30 - 20:00 Dinner
A buffet dinner is served daily between 5:30pm and 7:30pm in Vistas Dining Room, top floor of the Sally Borden Building.
(Vistas Dining Room)
Thursday, July 6
07:00 - 09:00 Breakfast
Breakfast is served daily between 7 and 9am in the Vistas Dining Room, the top floor of the Sally Borden Building.
(Vistas Dining Room)
09:10 - 09:25 Rafael Irizarry: Day 4: Challenges in spatially resolved single cell data (TCPL 201)
09:25 - 09:40 Rafael Irizarry: Statistical challenges in Single-Cell RNA-Seq and spatial transcriptomics
I will start the talk by describing general statistical challenges in high-throughput genomics related to batch effects and systematic errors. Then I will describe some of our recent work related to cell-type classification and clustering with single-cell RNA-Seq (scRNA-Seq) and spatial transcriptomics.
(TCPL 201)
09:40 - 09:55 Xiting Yan: Spatial Deconvolution Method Considering Platform Effect Removal, Sparsity and Spatial Information
Spatial barcoding-based transcriptomic (ST) technologies unbiasedly measure mRNA expression of cells with physical locations in intact tissue. But the measured gene expression data lack single-cell resolution and require cell type deconvolution for cellular-level downstream analysis. We developed SDePER to deconvolve ST data using reference single-cell RNA sequencing (scRNA-seq) data from the same tissue type. A conditional variational autoencoder (CVAE) was used to remove platform effects, i.e., the systematic differences between reference scRNA-seq and ST data, and a graph Laplacian regularized model (GLRM) was developed to consider both spatial information and sparsity. Based on the estimated cell type compositions, a random walk was constructed to impute cell type compositions and gene expression at enhanced resolution. We compared the performance of SDePER and six existing methods using both simulated and real data. Results showed that SDePER was robust to platform effects and achieved the most accurate estimation. Furthermore, applications to four different real ST datasets with histological staining images demonstrated that SDePER achieved results with the highest consistency with the staining images. SDePER also had the most accurate imputed gene expression of known marker genes. In summary, SDePER achieved significantly more accurate and robust results than the existing ST data deconvolution methods.
(TCPL 201)
09:55 - 10:10 Can Yang: SpatialScope: A unified approach for integrating spatial and single-cell transcriptomics data using deep generative models
The rapid emergence of spatial transcriptomics (ST) technologies is revolutionizing our understanding of tissue spatial architecture and biology. Current ST technologies based on either next generation sequencing (seq-based approaches) or fluorescence in situ hybridization (image-based approaches), while providing hugely informative insights, remain unable to provide spatial characterization at transcriptome-wide single-cell resolution, limiting their usage in resolving detailed tissue structure and detecting cellular communications. To overcome these limitations, we developed SpatialScope, a unified approach to integrating scRNA-seq reference data and ST data that leverages deep generative models. With innovation in model and algorithm designs, SpatialScope not only enhances seq-based ST data to achieve single-cell resolution, but also accurately infers transcriptome-wide expression levels for image-based ST data. We demonstrate the utility of SpatialScope through comprehensive simulation studies and then apply it to real data from both seq-based and image-based ST approaches. SpatialScope provides a spatial characterization of tissue structures at transcriptome-wide single-cell resolution, greatly facilitating the downstream analysis of ST data, such as detection of cellular communication by identifying ligand-receptor interactions from seq-based ST data, localization of cellular subtypes, and detection of spatially differentially expressed genes.
(TCPL 201)
10:10 - 10:25 Discussion (Online)
10:30 - 11:00 Coffee Break (TCPL Foyer)
11:00 - 11:15 Jean Yee Hwa Yang: Biologically-informed self-supervised learning for segmentation of subcellular spatial transcriptomics data
TBA
(TCPL 201)
11:15 - 11:30 Melissa Davis: Rethinking assumptions in spatial molecular data analysis: the role and impact of library size normalisation
Spatial molecular technologies have revolutionised the study of disease microenvironments by providing spatial context to tissue heterogeneity. Recent spatial technologies are increasing the throughput and spatial resolution of measurements, resulting in larger datasets at single cell resolution. The added spatial dimension and volume of measurements pose an analytics challenge that has, in the short-term, been addressed by adopting methods designed for the analysis of single-cell RNA-seq data. Though these methods work well in some cases, they do not necessarily translate appropriately to spatial technologies. A common assumption is that total sequencing depth, also known as library size, represents technical variation in single-cell RNA-seq technologies, and this is often normalised out during analysis. Through analysis of several different spatial datasets, we noted that this assumption does not necessarily hold in spatial molecular data. To formally assess this, we explore the relationship between library size and independently annotated spatial regions, across 23 samples from 4 different spatial technologies with varying throughput and spatial resolution. We found that library size confounded biology across all technologies, regardless of the tissue being investigated. Statistical modelling of binned total transcripts shows that tissue region is strongly associated with library size across all technologies, even after accounting for cell density of the bins. Through a benchmarking experiment, we show that normalising out library size leads to sub-optimal spatial domain identification using common graph-based clustering algorithms. On average, better clustering was achieved when library size effects were not normalised out explicitly, especially with data from the newer sub-cellular localised technologies.
Taking these results into consideration, we recommend that spatial data should not be specifically corrected for library size prior to analysis unless strongly motivated. We also emphasise that spatial data are different to single-cell RNA-seq and care should be taken when adopting algorithms designed for single cell data.
(TCPL 201)
11:30 - 11:45 Xiang Zhou: Accurate and scalable spatial domain detection via integrated reference-informed segmentation for spatial transcriptomics
Spatially resolved transcriptomics (SRT) studies are becoming increasingly common and increasingly large, providing unprecedented opportunities for characterizing the spatial and functional organization of complex tissues. Here, we present a computational method, IRIS, that can characterize the spatial organization of complex tissues through accurate and efficient detection of spatial domains. IRIS is unique in its ability to leverage single-cell RNA-seq data for reference-informed spatial domain detection, to integrate multiple SRT tissue slices jointly while explicitly accounting for the correlation both within and across slices, and to take advantage of multiple algorithmic innovations for highly scalable computation. We demonstrate the advantages of IRIS through in-depth analysis of six SRT datasets from different technologies across distinct tissues, species, and spatial resolutions. In these applications, IRIS achieves a 51% to 97% accuracy gain over existing methods in the dataset with known ground truth. In addition, IRIS is 4.6 to 134.7 times faster than existing methods in moderate-sized datasets and is the only method applicable to large-scale SRT datasets including Stereo-seq and 10x Xenium. As a result, IRIS captures the fine-scale structures of brain regions, reveals the spatial heterogeneity of tumor microenvironments, and characterizes the structural changes of the seminiferous tubes in the testis underlying diabetes, all at a speed and accuracy unattainable by existing approaches.
(TCPL 201)
11:45 - 12:20 Haiyan Huang: Discussion (TCPL 201)
12:30 - 14:30 Lunch
Lunch is served daily between 11:30am and 1:30pm in the Vistas Dining Room, the top floor of the Sally Borden Building.
(Vistas Dining Room)
14:30 - 15:30 Can Yang: Round table discussion: gaps and opportunities in spatial omics (TCPL 201)
15:30 - 16:10 Coffee Break (TCPL Foyer)
16:10 - 16:25 Di Wu: Gene set tests and cell-cell communication in scRNA-seq data
Gene set tests and detection of cell-cell communications (CCC) in single-cell RNA-seq (scRNAseq) data are two key analysis methods for interpreting the data for biological follow-up. Two-Sigma-G is a competitive test of whether the genes in an a priori defined gene set, e.g., from a pathway or other researchers’ experiments, are more differentially expressed than randomly selected gene sets. It employs the Two-Sigma framework, which is based on the zero-inflated negative binomial distribution and allows random effects, since many cells come from one biological sample and there may be multiple samples in a sample group. Simulations have been run to assess model fitting and the control of type I and type II errors in the tests. The methods are applied to a well-designed HIV-related scRNAseq dataset for biological discovery. We have also developed a statistical method to detect CCC mediated by ligand-receptor (LR) complexes and associated with the sample groups. We simultaneously model the data distribution, which features excess zeros, and perform the statistical test for differential CCC for a pair of LR and a pair of cell types, applied to scRNAseq data of humanized mouse spleen samples with or without acute human immunodeficiency virus (HIV) infection.
(TCPL 201)
16:25 - 16:40 Gerald Quon: Inference of donor-specific co-expression networks across cohorts
Gene co-expression networks are routinely inferred to identify co-expression modules and pathways active in diverse cell types. Their construction and inference are challenging due to their high-dimensional nature and the typically small number of samples available for inference. Here I discuss a multi-task framework for inferring and comparing multiple co-expression networks across individuals in a cohort. I will demonstrate that despite low information content in single cell transcriptome data, we are still able to parse out major differences in network structure that are correlated with phenotypes such as stem cell potential.
(TCPL 201)
16:40 - 16:55 Ellis Patrick: Identifying changes in cell states related to their spatial context in tissue microenvironment
The human body comprises over 37 trillion cells with diverse forms and functions, which can exhibit dynamic changes based on their environmental context. Understanding the spatial interactions between cells and changes in their state within the tissue microenvironment is crucial to comprehending the development of human diseases. State-of-the-art technologies such as PhenoCycler, IMC, CosMx, Xenium, and others can deeply phenotype cells in their native environment, providing a high-throughput means of identifying spatially related changes in cell state. The Statial Bioconductor package offers a suite of complementary approaches for identifying changes in cell state explained by changes in cell type localization. In this presentation, we introduce new functionality in the Statial package that can 1) identify changes in cell state between distinct tissue environments, 2) uncover changes in marker expression associated with cell proximities, and 3) model spatial relationships between cells in the context of hierarchical cell lineage structures. We provide context for these approaches and explain when and why modeling spatial relationships between cells in these ways is appropriate. Finally, we demonstrate how these approaches can be used in a classification setting to predict patient prognosis or treatment response.
(TCPL 201)
16:55 - 17:30 Melissa Davis: Discussion Session (TCPL 201)
17:30 - 20:00 Dinner
A buffet dinner is served daily between 5:30pm and 7:30pm in Vistas Dining Room, top floor of the Sally Borden Building.
(Vistas Dining Room)
Friday, July 7
07:00 - 09:00 Breakfast
Breakfast is served daily between 7 and 9am in the Vistas Dining Room, the top floor of the Sally Borden Building.
(Vistas Dining Room)
09:00 - 09:01 Day 5: Scaling up for single cell data science (TCPL 201)
09:01 - 09:15 Joshua Ho: Scalable analysis methods for single cell omics data and lineage tracing
Single cell RNA-seq (scRNA-seq) analysis is gaining widespread adoption in many areas of biomedical research. A large amount of scRNA-seq data is being generated at a rapid rate. Nonetheless, it is challenging to quickly and efficiently process a large collection of scRNA-seq data. Extending our team’s pioneering experience in developing scalable bioinformatics tools, we have developed a suite of cloud-accelerated scRNA-seq analysis tools that efficiently process read-level information in a highly scalable manner. In this talk, we will showcase a number of tools that we have developed to enable large-scale single cell RNA-seq analysis. Furthermore, we will discuss new methods that enable somatic genetic variants in mitochondria to be used as endogenous lineage tracing markers in scRNA-seq data. We will showcase the use of this type of novel lineage tracing technology to study cancer using human clinical tumour samples.
(TCPL 201)
09:15 - 09:30 Zuoheng Wang: Graphical generative model for identification of disease associated perturbations to intercellular communications in single-cell RNA sequencing data
Diverse types of cells interact and communicate with each other to maintain tissue homeostasis and perform biological functions. Perturbations to these interactions can break the homeostasis of the tissue microenvironment, leading to disease. Understanding intercellular communication changes in disease is critical for therapeutic development. Cell-cell communication networks (CCCNs) inferred from single-cell RNA sequencing data are highly variable and only capture a snapshot of the dynamic intercellular communication system. We develop a graphical generative model to compare CCCNs between disease and control samples to identify disease-associated perturbations to intercellular communications. The distribution of CCCNs is learned using a variational graph autoencoder (VGAE) in the disease and control groups separately. Then a large number of graphs are generated to assess the significance of the difference between the two distributions using different graph distance measures. We demonstrate the advantage of this approach in improving the power of identifying disease-associated perturbations to intercellular communications through both simulation studies and real scRNA-seq datasets.
(TCPL 201)
09:30 - 09:45 Yingxin Lin: Atlas-scale single-cell multi-sample multi-condition data integration using scMerge2
The recent emergence of multi-sample multi-condition single-cell multi-cohort studies allows researchers to investigate different cell states. The effective integration of multiple large-cohort studies promises biological insights into cells under different conditions that individual studies cannot provide. Here, we present scMerge2, a scalable algorithm that allows data integration of atlas-scale multi-sample multi-condition single-cell studies. We have generalized scMerge2 to enable the merging of millions of cells from single-cell studies generated by various single-cell technologies. Using a large COVID-19 data collection with over five million cells from 1000+ individuals, we demonstrate that scMerge2 enables multi-sample multi-condition scRNA-seq data integration from multiple cohorts and reveals signatures derived from cell-type expression that are more accurate in discriminating disease progression. Further, we demonstrate that scMerge2 can remove dataset variability in CyTOF, imaging mass cytometry and CITE-seq experiments, demonstrating its applicability to a broad spectrum of single-cell profiling technologies.
(TCPL 201)
09:45 - 10:00 Angela Wu: Cross-species single-cell atlases: analysis and challenges
The rapid emergence of large-scale atlas-level single-cell RNA-seq (scRNA-seq) datasets presents remarkable opportunities for broad and deep biological investigations through integrative analyses. However, harmonizing such datasets requires integration approaches to be not only computationally scalable, but also capable of preserving a wide range of fine-grained cell populations. As part of the Tabula Microcebus Consortium whose mission is to create a single cell atlas of the grey mouse lemur, we faced such challenges during the integration of scRNA-seq data generated from multiple animals, tissue types, batches, and technologies. Manual annotation of each dataset, particularly the identification of rare cell-types, proved to be difficult and tedious. To address these challenges, we embarked on a detailed exploration of large-scale scRNA-seq data, uncovered underlying features of their data distributions, and created two tools for data integration: FIRM and Portal. These two algorithms were used to construct the Tabula Microcebus single cell atlas, and are suitable for scRNA-seq datasets with different characteristics. I will present the findings of the Tabula Microcebus, as well as present perspectives on our current work in cross-species analyses.
(TCPL 201)
10:00 - 10:15 Hongyu Zhao: An informatics framework for assembling human cell atlases as a digital life
Profiling molecular features of all cells is essential for understanding the human body in health and disease. Scientists are enthusiastic about building such atlases of human cells using single-cell omics technologies. With the rapid development and popularization of single-cell sequencing technologies, more and more single-cell studies are being conducted around the world, generating a tremendous amount of single-cell data in the public domain. This suggests the possibility of building cell atlases by assembling data from scattered publications. However, the information complexity and volume of cell atlas data are orders of magnitude larger than those of the human genome project. We proposed a unified information framework for assembling atlases from data of various sources and built the first prototype of the human Ensemble Cell Atlas (hECA). We argued that the ideal cell atlas should be like a “digital life” or a “virtual human body” composed of virtual cells. We developed an “in data” cell experiment scheme that allows extracting cells from the atlas using logic formulas to investigate scientific questions such as drug side effects that may involve multiple organs and cell types.
(TCPL 201)
10:15 - 10:30 Ge Gao: Delineate the regulatory map in silico
Individual human cells, as the basic biological units of our bodies, carry out their functions through rigorous regulation of gene expression and exhibit heterogeneity among each other in every human tissue. In addition to identifying individual genes, one is often interested in how multiple genes interact to form regulatory circuits and carry out cellular functions. Combining massive omics data and leading-edge statistical modeling/machine learning approaches, we have developed a set of novel bioinformatic technologies over the past years to delineate the regulatory map and characterize the functional genome in action globally. Here we will present our recent advances as well as their potential applications in clinical and translational studies.
(Online)
10:30 - 11:00 Checkout by 11AM
5-day workshop participants are welcome to use BIRS facilities (TCPL) until 3 pm on Friday, although participants are still required to check out of the guest rooms by 11AM.
(Front Desk - Professional Development Centre)
11:00 - 12:30 Flexible Session
TBA
(TCPL 201)
12:30 - 14:00 Lunch (Vistas Dining Room)