Schedule for: 24w5284 - Statistical Aspects of Trustworthy Machine Learning

Beginning on Sunday, February 11, and ending on Friday, February 16, 2024

All times in Banff, Alberta time, MST (UTC-7).

Sunday, February 11
16:00 - 17:30 Check-in begins at 16:00 on Sunday and is open 24 hours (Front Desk - Professional Development Centre)
17:30 - 19:30 Dinner
A buffet dinner is served daily between 5:30pm and 7:30pm in Vistas Dining Room, top floor of the Sally Borden Building.
(Vistas Dining Room)
19:30 - 21:00 Informal gathering
Meet and Greet at the BIRS Lounge (PDC second floor)
(Other (See Description))
Monday, February 12
07:00 - 08:45 Breakfast
Breakfast is served daily between 7 and 9am in the Vistas Dining Room, the top floor of the Sally Borden Building.
(Vistas Dining Room)
08:45 - 09:00 Introduction and Welcome by BIRS Staff
A brief introduction to BIRS with important logistical information, technology instruction, and opportunity for participants to ask questions.
(TCPL 201)
08:55 - 09:00 Theme of the day: Interpretability (TCPL 201)
09:00 - 10:00 Kris Sankaran: Interpretability and Scientific Foundation Models: A Review
Foundation models have begun to appear for scientific data that, at first glance, seem to have little in common with the internet-scale image and language data that drove their original development. For example, ESM-2 (protein sequence), HyenaDNA (DNA sequence), Prithvi (satellite hyperspectral), and scGPT (multi-omics) were all released in 2023. The data and problems might differ, but these scientific foundation models have proven to be as versatile in adaptation to novel tasks as their nonscientific counterparts. To ensure that these models are used in a wise, scientifically rigorous way, it will be worthwhile to draw from the literature on interpretability for deep learning, focusing on how complex models are understood and communicated. We will review ideas from this literature, draw connections with scientific foundation models, and highlight remaining challenges. We will explore how data visualization can clarify the reasons for a model’s successes and failures, informing more targeted data collection and algorithmic progress. Finally, we will discuss how these models are a double-edged sword from the standpoint of scientific reproducibility — creating shared, accessible resources on the one hand and obfuscating provenance on the other.
(TCPL 201)
10:00 - 10:30 Coffee Break (TCPL Foyer)
10:30 - 11:00 Cynthia Rudin: Simpler Machine Learning Models for a Complicated World
While the trend in machine learning has tended towards building more complicated (black box) models, such models have not shown any performance advantages for many real-world datasets, and they are more difficult to troubleshoot and use. For these datasets, simpler models (sometimes small enough to fit on an index card) can be just as accurate. However, the design of interpretable models for practical applications is quite challenging for at least two reasons: 1) Many people do not believe that simple models could possibly be as accurate as complex black box models. Thus, even persuading someone to try interpretable machine learning can be a challenge. 2) Transparent models have transparent flaws. In other words, when a simple and accurate model is found, it may not align with domain expertise and may need to be altered, leading to an "interaction bottleneck" where domain experts must interact with machine learning algorithms. In this talk, I will present a new paradigm for machine learning that gives us insight into the existence of simpler models for a large class of real-world problems and solves the interaction bottleneck. In this paradigm, machine learning algorithms are not focused on finding a single optimal model, but instead capture the full collection of good (i.e., low-loss) models, which we call "the Rashomon set." Finding Rashomon sets is extremely computationally difficult, but the benefits are massive. I will present the first algorithm for finding Rashomon sets for a nontrivial function class (sparse decision trees), called TreeFARMS. TreeFARMS, along with its user interface TimberTrek, mitigates the interaction bottleneck for users. TreeFARMS also allows users to incorporate constraints (such as fairness constraints) easily. I will also present a "path," that is, a mathematical explanation, for the existence of simpler yet accurate models and the circumstances under which they arise. In particular, problems where the outcome is uncertain tend to admit large Rashomon sets and simpler models. Hence, the Rashomon set can shed light on the existence of simpler models for many real-world high-stakes decisions. This conclusion has significant policy implications, as it undermines the main reason for using black box models for decisions that deeply affect people's lives. I will conclude the talk by providing an overview of applications of interpretable machine learning within my lab, including applications to neurology, materials science, mammography, visualization of genetic data, the study of how cannabis affects the immune system of HIV patients, heart monitoring with wearable devices, and music generation. This is joint work with my colleagues Margo Seltzer and Ron Parr, as well as our exceptional students Chudi Zhong, Lesia Semenova, Jiachang Liu, Rui Xin, Zhi Chen, and Harry Chen. It builds upon the work of many past students and collaborators over the last decade. Papers discussed in the talk:
Rui Xin, Chudi Zhong, Zhi Chen, Takuya Takagi, Margo Seltzer, and Cynthia Rudin. Exploring the Whole Rashomon Set of Sparse Decision Trees. NeurIPS (oral), 2022. https://arxiv.org/abs/2209.08040
Zijie J. Wang, Chudi Zhong, Rui Xin, Takuya Takagi, Zhi Chen, Duen Horng Chau, Cynthia Rudin, and Margo Seltzer. TimberTrek: Exploring and Curating Sparse Decision Trees with Interactive Visualization. IEEE VIS, 2022. https://poloclub.github.io/timbertrek/
Lesia Semenova, Cynthia Rudin, and Ron Parr. On the Existence of Simpler Machine Learning Models. ACM Conference on Fairness, Accountability, and Transparency (ACM FAccT), 2022. https://arxiv.org/abs/1908.01755
Lesia Semenova, Harry Chen, Ronald Parr, and Cynthia Rudin. A Path to Simpler Models Starts With Noise. NeurIPS, 2023. https://arxiv.org/abs/2310.19726
(Online)
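For readers new to the idea, a Rashomon set can be computed by brute force when the candidate pool is small. The sketch below is a hedged illustration of the definition only: a pool of shallow decision trees stands in for the exhaustive enumeration that TreeFARMS performs, and `rashomon_set` and its epsilon threshold are illustrative names, not the paper's API.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification

def rashomon_set(models, loss_fn, X, y, epsilon=0.05):
    """Return every candidate whose loss is within a (1 + epsilon)
    factor of the best loss -- the 'Rashomon set' of near-optimal models."""
    losses = np.array([loss_fn(m, X, y) for m in models])
    threshold = losses.min() * (1 + epsilon)
    return [m for m, l in zip(models, losses) if l <= threshold]

# Candidate pool: shallow decision trees of varying depth and seed.
X, y = make_classification(n_samples=500, random_state=0)
pool = [DecisionTreeClassifier(max_depth=d, random_state=s).fit(X, y)
        for d in (1, 2, 3) for s in range(10)]
zero_one = lambda m, X, y: np.mean(m.predict(X) != y)
good = rashomon_set(pool, zero_one, X, y)
print(f"{len(good)} of {len(pool)} candidates are in the Rashomon set")
```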
11:00 - 11:30 Hongtu Zhu: Deep non-crossing quantile (NQ) learning
In this paper, we present deep non-crossing quantile (NQ) learning, designed to estimate the conditional distribution of a specific random quantity using deep learning. This approach concurrently tackles the long-standing issue of crossing quantiles (the absence of monotonicity). The innovative NQ network structure we propose captures the mean value and quantile discrepancies of the target distribution. By utilizing non-negative activations, we ensure the monotonicity of the estimates. This deep NQ learning framework is versatile, catering to a range of challenges, including distributional reinforcement learning (RL) and causal effect estimation. We also develop a comprehensive theory for the deep NQ estimator, offering an in-depth analysis that demonstrates its superiority over existing deep learning techniques. Our experimental results further underscore the efficacy of the NQ network across diverse scenarios.
(TCPL 201)
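The monotonicity device in the abstract (non-negative activations) can be illustrated without a full deep model: predict the lowest quantile directly and build the higher ones by adding non-negative increments, so the quantile curves cannot cross. A minimal NumPy sketch of that construction, not the NQ network itself:

```python
import numpy as np

def softplus(z):
    # Numerically stable softplus: a smooth non-negative transform.
    return np.log1p(np.exp(-np.abs(z))) + np.maximum(z, 0.0)

def noncrossing_quantiles(base, raw_increments):
    """base: (n,) output for the lowest quantile level.
    raw_increments: (n, k-1) unconstrained outputs for the gaps between
    successive quantile levels. Returns (n, k) monotone quantile curves."""
    gaps = softplus(raw_increments)  # force all gaps to be non-negative
    return np.concatenate(
        [base[:, None], base[:, None] + np.cumsum(gaps, axis=1)], axis=1)

# Toy check: arbitrary network outputs still yield non-crossing quantiles.
rng = np.random.default_rng(0)
q = noncrossing_quantiles(rng.normal(size=5), rng.normal(size=(5, 3)))
assert np.all(np.diff(q, axis=1) >= 0)  # monotone by construction
```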
11:30 - 13:30 Lunch
Lunch is served daily between 11:30am and 1:30pm in the Vistas Dining Room, the top floor of the Sally Borden Building.
(Vistas Dining Room)
13:40 - 14:00 Group Photo
Meet in foyer of TCPL to participate in the BIRS group photo. The photograph will be taken outdoors, so dress appropriately for the weather. Please don't be late, or you might not be in the official group photo!
(TCPL Foyer)
14:00 - 14:30 Yuan Ji: A Class of Dependent Random Distributions Based on Atom Skipping
We propose a new class of Bayesian nonparametric models for grouped data based on an idea called "atom skipping". Atom skipping generates a new model called the Plaid Atoms Model (PAM) for grouped data. PAM belongs to a class of widely known models that induce dependent random distributions and clusters across multiple groups. Specifically, PAM defines a clustering structure where some clusters are shared across groups while others are uniquely possessed by a group. We discuss the proposed processes related to atom skipping and their theoretical properties. Minor extensions of the proposed model for multivariate or count data are presented. Simulation studies and applications using real-world datasets illustrate these new models’ performance and distinct behavior from existing models.
(TCPL 201)
14:30 - 15:00 Coffee Break (TCPL Foyer)
15:00 - 15:30 Hubert Baniecki: Interpretable machine learning for time-to-event prediction in medicine and healthcare
Time-to-event prediction, e.g., cancer survival analysis or hospital length of stay, is a highly prominent statistical learning task in medical and healthcare applications. However, only a few interpretable machine learning methods meet its challenges. In this presentation, I will review our recent research concerning time-dependent explanations of machine learning survival models: methods (doi:10.1016/j.knosys.2022.110234), statistical software (doi:10.48550/arXiv.2308.16113), and applications (doi:10.1007/978-3-031-34344-5_9). We show that post-hoc explanations allow for finding biases in machine learning systems predicting hospital length of stay using a novel multi-modal dataset created from 1235 X-ray images with textual radiology reports annotated by human experts.
(TCPL 201)
15:30 - 16:00 Debashis Mondal: Estimating the fraction of anomaly points
In the past decade, there has been much progress and growth in anomaly detection algorithms in the machine learning literature, with applications ranging from insider threat detection and bio-surveillance to computer security, data cleaning, and scientific discovery. The trustworthiness of these machine learning algorithms is of great interest. In this talk I will discuss how statistical ideas, particularly semi-parametric mixture models, can be applied to better understand the outcomes of these algorithms. The work arose in collaboration with former PhD student Si Liu and EECS professor Tom Dietterich.
(TCPL 201)
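To make the mixture-model idea concrete, here is a hedged, fully parametric simplification (the talk concerns semi-parametric mixtures): a two-component Gaussian mixture fit by EM to one-dimensional anomaly scores, whose estimated mixing weight is the fraction of anomalies. All names and settings are illustrative.

```python
import numpy as np
from scipy.stats import norm

def anomaly_fraction(scores, n_iter=200):
    """EM for a two-component Gaussian mixture on 1-d anomaly scores.
    Returns the estimated weight of the higher-mean (anomalous) component."""
    pi, mu = 0.1, np.percentile(scores, [50, 95])
    sd = np.array([scores.std(), scores.std()])
    for _ in range(n_iter):
        # E-step: responsibility of the anomalous component for each point.
        p1 = pi * norm.pdf(scores, mu[1], sd[1])
        p0 = (1 - pi) * norm.pdf(scores, mu[0], sd[0])
        r = p1 / (p0 + p1)
        # M-step: update the weight, means, and standard deviations.
        pi = r.mean()
        mu = np.array([np.average(scores, weights=1 - r),
                       np.average(scores, weights=r)])
        sd = np.array([np.sqrt(np.average((scores - mu[0])**2, weights=1 - r)),
                       np.sqrt(np.average((scores - mu[1])**2, weights=r))])
    return pi

rng = np.random.default_rng(1)
scores = np.concatenate([rng.normal(0, 1, 950), rng.normal(4, 1, 50)])
print(anomaly_fraction(scores))  # should be close to the true 0.05
```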
16:00 - 17:00 Jun Yan: Group discussion (TCPL 201)
17:30 - 19:30 Dinner
A buffet dinner is served daily between 5:30pm and 7:30pm in Vistas Dining Room, top floor of the Sally Borden Building.
(Vistas Dining Room)
19:30 - 21:00 Informal gathering (Other (See Description))
Tuesday, February 13
07:00 - 08:30 Breakfast
Breakfast is served daily between 7 and 9am in the Vistas Dining Room, the top floor of the Sally Borden Building.
(Vistas Dining Room)
08:25 - 08:30 Theme of the day: Generative AI and Fairness (TCPL 201)
08:30 - 09:30 Haoda Fu: Generative AI on Smooth Manifolds: A Tutorial
Generative AI is a rapidly evolving technology that has garnered significant interest lately. In this presentation, we’ll discuss the latest approaches, organizing them within a cohesive framework using stochastic differential equations to understand complex, high-dimensional data distributions. We’ll highlight the necessity of studying generative models beyond Euclidean spaces, considering smooth manifolds essential in areas like robotics and medical imagery, and for leveraging symmetries in the de novo design of molecular structures. Our team’s recent advancements in this blossoming field, ripe with opportunities for academic and industrial collaborations, will also be showcased.
(TCPL 201)
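As a toy illustration of the SDE framework in the Euclidean case (the manifold setting of the tutorial is beyond a few lines), the sketch below noises a Gaussian data distribution with a forward Ornstein–Uhlenbeck process and then samples by integrating the reverse-time SDE, substituting the exactly known Gaussian score for a learned one. All choices here are illustrative assumptions.

```python
import numpy as np

# Data distribution: N(m0, s0^2). Forward OU process: dX = -X dt + sqrt(2) dW,
# whose marginal at time t stays Gaussian with closed-form mean and variance.
m0, s0 = 2.0, 0.5

def marginal(t):
    return m0 * np.exp(-t), 1.0 + (s0**2 - 1.0) * np.exp(-2 * t)

def score(x, t):
    m, v = marginal(t)
    return -(x - m) / v  # gradient of log N(m, v) density

# Reverse-time SDE: dX = [f(X) - g^2 * score(X, t)] dt + g dW-bar with
# f(x) = -x and g = sqrt(2), integrated from t = T down to 0 (Euler-Maruyama).
rng = np.random.default_rng(8)
T, steps, n = 4.0, 400, 100_000
dt = T / steps
x = rng.normal(0.0, 1.0, size=n)  # start from the OU stationary prior N(0, 1)
for k in range(steps):
    t = T - k * dt
    drift = -x - 2.0 * score(x, t)
    x = x - drift * dt + np.sqrt(2 * dt) * rng.normal(size=n)
print(x.mean(), x.std())  # approx. m0 = 2.0 and s0 = 0.5
```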
09:30 - 10:00 Bin Yu: What is uncertainty in today's practice of data science?
Uncertainty quantification is central to statistics, and a cornerstone for building trust in data conclusions for any real-world data problem. The current practice of statistics formally addresses uncertainty arising from sample-to-sample variability under a generative stochastic model, which is unfortunately often not model-checked enough in today’s practice of statistics. In a data science life cycle (DSLC) that each data analysis goes through in practice, there are many other important sources of uncertainty. In this talk, we discuss uncertainty sources in a DSLC from human judgment calls through the lens of the Predictability-Computability-Stability (PCS) framework and documentation for veridical (truthful) data science. In particular, we will formally address two additional sources from data cleaning/preprocessing and model/algorithm choices so that more trustworthy or reproducible data-driven discoveries can be achieved.
(Online)
10:00 - 10:30 Coffee Break (TCPL Foyer)
10:30 - 11:30 Lightning session: Lloyd Elliott; Bei Jiang; Wenlong Mou; Deshan Perera; Qingrun Zhang
(TCPL 201)
10:30 - 10:49 Lloyd Elliott: Teaching Machine Learning using Data for Good
Interest in degrees and courses on data science, machine learning, and statistics has increased greatly over the past fifteen years. In addition to teaching technical and theoretical aspects, university events and programs support this interest through hackathons and case studies in which real datasets are examined (sometimes in collaboration with corporate, NGO, or charity co-organizers). This raises two questions about fairness: 1) In the case of corporate co-organizers, or real datasets derived from commercial data, with respect to insights and deliverables developed by students, where is the line between pedagogy and uncompensated student labour? 2) How can we make a positive impact in the world through setting case studies as teachers? For five years, I have taught "Learning from Data Science." I will discuss my thoughts on these questions and provide insights based on my own setting of case studies.
(TCPL 201)
10:49 - 11:01 Bei Jiang: Online Local Differential Private Quantile Inference via Self-normalization
Based on binary inquiries, we developed an algorithm to estimate population quantiles under Local Differential Privacy (LDP). By self-normalizing, our algorithm provides asymptotically normal estimation with valid inference, resulting in tight confidence intervals without the need to estimate nuisance parameters. Our proposed method can be conducted fully online, leading to high computational efficiency and minimal storage requirements with O(1) space. We also prove an optimality result by an elegant application of a central limit theorem of Gaussian Differential Privacy (GDP) when targeting the frequently encountered median estimation problem. Mathematical proofs and extensive numerical experiments demonstrate the validity of our algorithm both theoretically and experimentally.
(TCPL 201)
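A hedged sketch of the two ingredients named in the abstract, binary inquiries under randomized response and a fully online O(1)-space update; the Robbins–Monro form, step sizes, and function name are illustrative, and the self-normalized inference is omitted.

```python
import numpy as np

def ldp_quantile(stream, tau=0.5, eps=1.0, q0=2.0, c=2.0, rng=None):
    """Online tau-quantile estimation from eps-LDP binary responses.
    Each user answers 1{x <= current estimate}, flipped by randomized
    response; the debiased answer drives a Robbins-Monro update."""
    rng = rng or np.random.default_rng()
    p_true = np.exp(eps) / (1 + np.exp(eps))  # prob. of reporting truthfully
    Q = q0
    for i, x in enumerate(stream, start=1):
        b = float(x <= Q)
        if rng.random() > p_true:             # randomized response flip
            b = 1.0 - b
        # Debias so that E[b_debiased] = 1{x <= Q} exactly.
        b_debiased = (b - (1 - p_true)) / (2 * p_true - 1)
        Q -= (c / i) * (b_debiased - tau)     # Robbins-Monro step
    return Q

rng = np.random.default_rng(2)
data = rng.normal(size=200_000)
print(ldp_quantile(data, tau=0.5, eps=1.0, rng=rng))  # approx. 0.0 (the median)
```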
11:01 - 11:13 Wenlong Mou: A decorrelation method for general regression adjustment in randomized experiments
Randomized experiments are the gold standard for estimating the effect of an intervention, and the efficiency of estimation can be further improved using regression adjustments. Standard regression adjustment involves bias due to sample re-use; this bias leads to behavior that is sub-optimal in the sample size, and/or imposes restrictive assumptions. In this talk, I present a simple yet effective decorrelation method that circumvents these issues. Among other results, I will highlight sharp non-asymptotic guarantees satisfied by the estimator, under very mild assumptions.
(TCPL 201)
11:13 - 11:20 Deshan Perera: CATE: An accelerated and scalable solution for large-scale genomic data processing through GPU and CPU-based parallelization
The power of the statistical tests that quantify the evolution of a genome is strengthened by larger sample sizes. However, increased sample sizes create a significant demand on computational resources, resulting in longer compute times. Parallelization, especially using the Graphics Processing Unit (GPU), can alleviate this burden. NVIDIA’s CUDA GPUs are becoming commonplace in solving genetic algorithms with the aim of reducing computational time. So far, such potential for high-scale parallelization has not been realized in molecular evolution analyses. CATE (CUDA Accelerated Testing of Evolution) is such a software solution. It is a scalable program built using NVIDIA’s CUDA platform together with an exclusive file hierarchy to process six frequently used evolutionary tests, namely: Tajima’s D; Fu and Li's D, D*, F, and F*; Fay and Wu’s H and E; the McDonald–Kreitman test; the Fixation Index; and Extended Haplotype Homozygosity. CATE is composed of two main innovations: a file organization system coupled with a novel multithreaded search algorithm called Compound Interpolation Search, and large-scale parallelization of the algorithms using the GPU, CPU, and SSD. Powered by these implementations, CATE is orders of magnitude faster than standard tools. For instance, CATE processes all 54,849 human genes for all 22 autosomal chromosomes across the five super populations present in the 1000 Genomes Project in less than thirty minutes, while counterpart software took 3.62 days. This proven framework has the potential to be adapted for GPU-accelerated large-scale parallel computations of many evolutionary and genomic analyses. GitHub repository: https://github.com/theLongLab/CATE GitHub Wiki: https://github.com/theLongLab/CATE/wiki Published in Methods in Ecology and Evolution: https://doi.org/10.1111/2041-210X.14168
(TCPL 201)
11:20 - 11:33 Qingrun Zhang: eXplainable representation learning via Autoencoders revealing Critical genes
Machine learning models have been frequently used in transcriptome analyses. In particular, Representation Learning (RL) methods, e.g., autoencoders, are effective in learning critical representations in noisy data. However, learned representations, e.g., the “latent variables” in an autoencoder, are difficult to interpret, not to mention prioritizing essential genes for functional follow-up. In contrast, in traditional analyses, one may identify important genes such as Differentially Expressed (DiffEx), Differentially Co-Expressed (DiffCoEx), and Hub genes. Intuitively, complex gene-gene interactions may be beyond the capture of marginal effects (DiffEx) or correlations (DiffCoEx and Hub), indicating the need for powerful RL models. However, the lack of interpretability and individual target genes is an obstacle for RL’s broad use in practice. To facilitate interpretable analysis and gene identification using RL, we propose “Critical genes”, defined as genes that contribute highly to learned representations (e.g., latent variables in an autoencoder). As a proof-of-concept, supported by eXplainable Artificial Intelligence (XAI), we implemented the eXplainable Autoencoder for Critical genes (XA4C), which quantifies each gene’s contribution to latent variables, based on which Critical genes are prioritized. Applying XA4C to gene expression data in six cancers showed that Critical genes capture essential pathways underlying cancers. Remarkably, Critical genes have little overlap with Hub or DiffEx genes, yet show higher enrichment in a comprehensive disease gene database (DisGeNET) and a cancer-specific database (COSMIC), evidencing their potential to disclose massive unknown biology. As an example, we discovered five Critical genes sitting in the center of the Lysine degradation (hsa00310) pathway, displaying distinct interaction patterns in tumor and normal tissues. In conclusion, XA4C facilitates explainable analysis using RL, and Critical genes discovered by explainable RL empower the study of complex interactions.
(TCPL 201)
11:45 - 13:30 Lunch
Lunch is served daily between 11:30am and 1:30pm in the Vistas Dining Room, the top floor of the Sally Borden Building.
(Vistas Dining Room)
13:30 - 14:30 Sanmi Koyejo: Algorithmic Fairness: Why it’s hard and why it’s interesting (Tutorial)
In only a few years, algorithmic fairness has grown from a niche topic to a major component of machine learning and artificial intelligence research and practice. As a field, we have had some embarrassing mistakes, yet our understanding of the core issues, potential impacts, and mitigation approaches has grown. This tutorial presents a range of recent findings, discussions, questions, and partial answers in the space of algorithmic fairness. While this tutorial will not attempt a comprehensive overview of this rich area, we aim to provide the participants with some tools and insights and to explore the connections between algorithmic fairness and a broad range of ongoing research efforts in the field. We will tackle some of the hard questions that you may have about algorithmic fairness, and hopefully address some misconceptions that have become pervasive.
(TCPL 201)
14:30 - 15:00 Joshua Snoke: De-Biasing the Bias: Methods for Improving Disparity Assessments with Noisy Group Measurements
Health care decisions are increasingly informed by clinical decision support algorithms, but concern exists that these algorithms, trained using machine learning, may perpetuate or increase racial and ethnic disparities in the administration of health care resources. Clinical data often has the systemic feature that it does not contain any racial/ethnic information or contains erroneous and poor measures of race and ethnicity. This can lead to potentially misleading or insufficient assessments of algorithmic bias in clinical settings. We present novel methods to assess and mitigate potential bias in algorithmic machine learning models used to inform clinical decisions when race and ethnicity information is missing or poorly measured. We provide theoretical bounds on the statistical bias for a set of commonly used fairness metrics, and we show how these bounds can be estimated in practice and that they hold under a set of simple assumptions. Further, we provide a method for sensitivity analysis to estimate the range of potential disparities when the assumptions do not hold. We show that these methods for accurately estimating disparities can be extended to post-algorithm adjustments to enforce common definitions of fairness. We provide a case study using inferred race and ethnicity from the Bayesian Surname Information Geocoding (BISG) algorithm to estimate disparities in a clinical algorithm used to inform osteoporosis treatment decisions. With these novel methods, a policy maker can understand the range of potential disparities resulting from the use of a given algorithm, even when race and ethnicity information is missing, and make informed decisions regarding the safe implementation of machine learning for supporting clinical decisions.
(TCPL 201)
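One building block for disparity assessment with inferred race/ethnicity is the probability-weighted group estimate; its attenuation in the toy below illustrates why bounds and sensitivity analyses like the talk's are needed. This is a hedged sketch with illustrative names and numbers, not the paper's estimator.

```python
import numpy as np

def weighted_group_means(outcome, group_probs):
    """Estimate group-specific mean outcomes using inferred membership
    probabilities (e.g., BISG posteriors) as weights."""
    probs = np.asarray(group_probs)  # shape (n, n_groups)
    return (probs * outcome[:, None]).sum(0) / probs.sum(0)

rng = np.random.default_rng(9)
n = 10_000
true_group = rng.integers(0, 2, size=n)
outcome = rng.normal(loc=np.where(true_group == 1, 1.0, 0.0))
# Noisy BISG-style posteriors roughly centered on the truth.
p1 = np.clip(0.8 * true_group + 0.1 + rng.normal(0, 0.05, n), 0, 1)
means = weighted_group_means(outcome, np.column_stack([1 - p1, p1]))
# Attenuated relative to the true disparity of 1.0 -- the statistical
# bias that the talk's bounds and adjustments are designed to address.
print(means, means[1] - means[0])
```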
15:00 - 15:30 Coffee Break (TCPL Foyer)
15:30 - 16:00 Giles Hooker: A Generic Approach to Stabilized Model Distillation
Model distillation has been a popular method for producing interpretable machine learning. It uses an interpretable "student" model to mimic the predictions made by the black box "teacher" model. However, when the student model is sensitive to the variability of the data sets used for training, even when keeping the teacher fixed, the corresponding interpretation is not reliable. Existing strategies stabilize model distillation by checking whether a large enough corpus of pseudo-data is generated to reliably reproduce student models, but methods to do so have so far been developed for a specific student model. In this paper, we develop a generic approach for stable model distillation based on a central limit theorem for the average loss. We start with a collection of candidate student models and search for candidates that reasonably agree with the teacher. Then we construct a multiple testing framework to select a corpus size such that the consistent student model would be selected under different pseudo samples. We demonstrate the application of our proposed approach on three commonly used intelligible models: decision trees, falling rule lists, and symbolic regression. Finally, we conduct simulation experiments on the Mammographic Mass and Breast Cancer datasets and illustrate the testing procedure through a theoretical analysis with a Markov process.
(TCPL 201)
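A hedged sketch of the stability loop the abstract describes: label pseudo-data with a fixed teacher, refit the student on independent pseudo-samples, and grow the corpus until the fitted students agree. The plain agreement threshold below stands in for the paper's multiple-testing construction, and all names are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification

def stable_student(teacher, sample_X, n_reps=20, agree=0.9,
                   max_n=200_000, rng=None):
    """Double the pseudo-corpus size until students fit on independent
    pseudo-samples agree on a held-out set of pseudo-points."""
    rng = rng or np.random.default_rng(0)
    n, X_test = 1000, sample_X(5000, rng)
    while n <= max_n:
        preds = []
        for _ in range(n_reps):
            Xp = sample_X(n, rng)            # fresh pseudo-covariates
            s = DecisionTreeClassifier(max_depth=3).fit(Xp, teacher.predict(Xp))
            preds.append(s.predict(X_test))
        preds = np.array(preds)
        if np.mean(preds == preds[0]) >= agree:   # agreement with first student
            Xp = sample_X(n, rng)
            return DecisionTreeClassifier(max_depth=3).fit(
                Xp, teacher.predict(Xp)), n
        n *= 2                                    # corpus too small: grow it
    raise RuntimeError("no stable corpus size found")

# Usage: teacher trained on real data; pseudo-data from a fitted Gaussian.
X, y = make_classification(n_samples=2000, random_state=0)
teacher = RandomForestClassifier(random_state=0).fit(X, y)
mu, cov = X.mean(0), np.cov(X.T)
sample_X = lambda n, rng: rng.multivariate_normal(mu, cov, size=n)
student, corpus_n = stable_student(teacher, sample_X)
```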
16:00 - 16:30 Danica Sutherland: Conditional independence measures for fairer, more reliable models
Several notions of algorithmic fairness, as well as techniques for out-of-distribution generalization, amount to enforcing the independence of model outputs $\Phi(X)$ from a protected attribute, domain identifier, or similar $Z$, conditional on the true label $Y$. Much work in this area assumes discrete $Y$ and $Z$, and struggles to handle complex predictions (e.g. object localization from images) and/or complex conditioning (e.g. handling fairness with respect to the combination of many attributes). We present a kernel-based technique for measuring conditional dependence for continuous $Y$ and $Z$, called the Conditional Independence Regression CovariancE (CIRCE), that is well-suited to learning complicated $\Phi(X)$ through stochastic gradient methods, both in settings where $Y$ and $Z$ are continuous but relatively “simple,” and in settings where we must learn a structure on those variables as well. We will also discuss the use of this and related measures for statistical testing.
(TCPL 201)
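The core quantity can be previewed with linear kernels: residualize $Z$ on $Y$, then measure the cross-covariance between the model outputs and those residuals. This is a simplified linear-kernel sketch of such a conditional-independence penalty, not the kernelized CIRCE estimator; the function name is illustrative.

```python
import numpy as np

def linear_ci_penalty(phi_x, z, y):
    """Squared Frobenius norm of Cov(phi(X), Z - E[Z|Y]), with E[Z|Y]
    fit by least squares. Zero in expectation when phi(X) is independent
    of Z given Y under a linear-Gaussian structure."""
    Y = np.column_stack([np.ones(len(y)), y])     # intercept + Y
    beta, *_ = np.linalg.lstsq(Y, z, rcond=None)  # regress Z on Y
    resid = z - Y @ beta                          # Z residualized on Y
    phi_c = phi_x - phi_x.mean(0)                 # center the model outputs
    cov = phi_c.T @ resid / len(y)                # empirical cross-covariance
    return np.sum(cov ** 2)

# Toy check: Z depends on phi(X) only through Y -> penalty is near zero.
rng = np.random.default_rng(10)
y = rng.normal(size=(2000,))
phi_x = np.column_stack([y + rng.normal(size=2000), rng.normal(size=2000)])
z = 2.0 * y + rng.normal(size=2000)
print(linear_ci_penalty(phi_x, z, y))  # small: no conditional dependence
```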
16:30 - 17:30 Hao Zhang: Group discussions (TCPL 201)
17:30 - 19:30 Dinner
A buffet dinner is served daily between 5:30pm and 7:30pm in Vistas Dining Room, top floor of the Sally Borden Building.
(Vistas Dining Room)
19:30 - 21:00 Informal gathering (Other (See Description))
Wednesday, February 14
07:00 - 08:30 Breakfast
Breakfast is served daily between 7 and 9am in the Vistas Dining Room, the top floor of the Sally Borden Building.
(Vistas Dining Room)
08:25 - 08:30 Theme of the day: Privacy (TCPL 201)
08:30 - 09:00 Xiao-Li Meng: Protecting Individual Privacy against All Adversaries – Is It Possible?
Differential privacy (DP), rooted in cryptography, epitomizes a significant advancement in balancing data privacy with data utility. Yet, as DP garners attention, it unveils complex challenges and misconceptions that confound even seasoned experts. Through a statistical lens, we examine these nuances. Central to our discussion is DP's commitment to curbing the relative risk of individual data disclosure, unperturbed by an adversary's prior knowledge, via the premise that posterior-to-prior ratios are constrained by extreme likelihood ratios. A stumbling block surfaces when 'individual privacy' is delineated by counterfactually manipulating static individual data values, without considering their interdependencies. Alarmingly, this static viewpoint, flagged for its shortcomings for over a decade (Kifer and Machanavajjhala, 2011, ACM; Tschantz, Sen, and Datta, 2022, IEEE), continues to overshadow DP narratives, leading to the erroneous but widespread belief that DP is impervious to adversaries' prior knowledge. Turning to Warner's (1965, JASA) randomized response mechanism—the pioneering recorded instance of a DP mechanism—we show how DP's mathematical assurances can crumble to an arbitrary degree when adversaries grasp the interplay among individuals. Drawing a parallel, it's akin to the folly of solely quarantining symptomatic individuals to thwart an airborne disease's spread. Thus, embracing a statistical perspective on data, seeing them as accidental manifestations of underlying essential information constructs, is as vital for bolstering data privacy as it is for rigorous data analysis. (This presentation is based on joint work with James Bailie and Robin Gong.)
(Online)
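Warner's randomized response mechanism, cited in the abstract as the earliest recorded DP mechanism, is simple to state concretely: each respondent answers truthfully with probability p and flips otherwise, satisfying ε-DP with ε = log(p / (1 − p)) under the static view of individual data that the talk critiques. A minimal sketch:

```python
import numpy as np

def warner_randomized_response(truth, p=0.75, rng=None):
    """Each respondent reports their sensitive bit truthfully with
    probability p and flips it otherwise."""
    rng = rng or np.random.default_rng()
    flip = rng.random(len(truth)) > p
    return np.where(flip, 1 - truth, truth)

def debiased_prevalence(reports, p):
    # E[report] = p*pi + (1-p)*(1-pi); invert to recover pi.
    return (reports.mean() - (1 - p)) / (2 * p - 1)

rng = np.random.default_rng(3)
truth = (rng.random(100_000) < 0.3).astype(int)   # true prevalence 30%
reports = warner_randomized_response(truth, p=0.75, rng=rng)
print(debiased_prevalence(reports, p=0.75))       # approx. 0.3
eps = np.log(0.75 / 0.25)  # ~1.10, valid only under the static view
```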
09:00 - 09:30 Xiaoxiao Li: Forgettable Federated Linear Learning with Certified Data Removal
Federated learning (FL) is a trending distributed learning framework that enables collaborative model training without data sharing. In this study, we focus on the FL paradigm that grants clients the ``right to be forgotten''. A forgettable FL framework should bleach its global model weights as if it had never seen a given client, and hence reveal no information about that client. To this end, we propose the Forgettable Federated Linear Learning (F2L2) framework, featuring novel training and data removal strategies. The training pipeline employs linear approximation on the model parameter space to enable our framework to work for deep neural networks while achieving results comparable to canonical neural network training. We also introduce an efficient and effective certified removal strategy by approximating the Hessian matrix. Unlike previous uncertified and heuristic machine unlearning methods in FL, we provide theoretical guarantees by bounding the difference between the model weights produced by our method and those obtained by retraining from scratch.
(TCPL 201)
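For a purely linear model, the Hessian-based removal step that the abstract approximates has an exact closed form, which makes the "bounded difference from retraining" idea easy to see. A hedged ridge-regression sketch, not the F2L2 pipeline itself; for quadratic losses the single Newton step below matches retraining exactly.

```python
import numpy as np

def ridge_fit(X, y, lam):
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def newton_remove(w, X, y, Xc, yc, lam):
    """One Newton step that 'forgets' client shard (Xc, yc) from a ridge
    model w trained on the full data. Exact for quadratic losses."""
    d = X.shape[1]
    H_minus = X.T @ X - Xc.T @ Xc + lam * np.eye(d)  # Hessian without client
    g_minus = H_minus @ w - (X.T @ y - Xc.T @ yc)    # gradient at w without client
    return w - np.linalg.solve(H_minus, g_minus)

rng = np.random.default_rng(4)
X, y = rng.normal(size=(500, 10)), rng.normal(size=500)
Xc, yc = X[:50], y[:50]                              # one client's shard
w_full = ridge_fit(X, y, lam=1.0)
w_forget = newton_remove(w_full, X, y, Xc, yc, lam=1.0)
w_retrain = ridge_fit(X[50:], y[50:], lam=1.0)
assert np.allclose(w_forget, w_retrain)              # matches retraining
```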
09:30 - 10:00 Mathias Lecuyer: PANORAMIA: Efficient Privacy Auditing of Machine Learning Models without Retraining
Privacy auditing methods for machine learning (ML) models rely on empirical attacks to estimate a lower bound on the privacy leakage of a model or algorithm. These auditing methods usually leverage one or more of the following primitives: model retraining, changes in the training data, or modifications to the training procedure. These primitives can be costly, require access to the training data or control of the training process, and may only be able to audit training algorithms instead of trained models. These shortcomings limit the applicability of privacy audits. We introduce a practical auditing scheme, called PANORAMIA, which relies on a multi-datapoint membership hypothesis test using synthetic “non-member” data. PANORAMIA yields privacy estimates for large-scale ML models without re-training, and only requires access to a subset of the training data. To demonstrate the generality of our approach, we evaluate our auditing scheme across multiple ML domains, ranging from image classification to large-scale language models and tabular data classification.
(TCPL 201)
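The auditing primitive can be sketched in a few lines: score training members and synthetic non-members under the target model, and measure how well a loss threshold separates them, with stronger separation indicating more leakage. A hedged toy version, not the PANORAMIA estimator or its conversion to a formal privacy bound:

```python
import numpy as np

def audit_separation(member_losses, nonmember_losses):
    """Sweep a threshold over per-example losses and report the best
    membership-inference accuracy; members tend to have lower loss."""
    scores = np.concatenate([member_losses, nonmember_losses])
    labels = np.concatenate([np.ones_like(member_losses),
                             np.zeros_like(nonmember_losses)])
    best = 0.5
    for t in np.unique(scores):
        acc = np.mean((scores <= t) == labels)  # guess 'member' if loss <= t
        best = max(best, acc)
    return best

# Toy illustration: members' losses are slightly lower on average.
rng = np.random.default_rng(5)
members = rng.gamma(2.0, 0.4, size=2000)      # losses on training points
nonmembers = rng.gamma(2.0, 0.5, size=2000)   # losses on synthetic non-members
print(audit_separation(members, nonmembers))  # > 0.5 indicates leakage
```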
10:00 - 10:30 Coffee Break (TCPL Foyer)
10:30 - 11:00 Wei Pan: Some applications of large-scale trait imputation with genotyped individuals and GWAS summary data
We have recently proposed a nonparametric method called LS-imputation for large-scale trait imputation based on a GWAS summary dataset and a large set of genotyped individuals. The imputed trait values, along with the genotypes, can be treated as an individual-level dataset for downstream genetic analyses, including those that cannot be done with GWAS summary data. Due to the problem of both “a large n and a large p”, a “divide and conquer” approach has been proposed, which however causes some technical challenges in downstream analyses. We will show several challenges in some real data applications.
(TCPL 201)
11:00 - 11:30 Kasper Hansen: Large-scale genotype prediction from RNA-seq reveals new issues in policy and ethics
To enhance the utility of publicly available RNA-seq data in humans, we have developed the recount3 resource by uniformly processing more than 330,000 samples from GTEx, TCGA, and the Sequence Read Archive. However, most of these samples do not have matching genotype information, which is necessary for analyses of eQTLs and allele-specific expression. To address this issue, we have developed a model to predict genotypes for bi-allelic SNPs in coding regions. Our model achieves an overall accuracy of 99.5%, and we have deployed this model to predict genotypes for all the human data in recount3. By doing so, we have converted publicly available data into genotype data, which is typically considered restricted access. Furthermore, the RNA-seq data comes from thousands of different studies, each with its own consent. We discuss ethical and policy implications of our work.
(TCPL 201)
11:30 - 12:00 Xiaotong Shen: Group discussion (TCPL 201)
12:00 - 13:30 Lunch
Lunch is served daily between 11:30am and 1:30pm in the Vistas Dining Room, the top floor of the Sally Borden Building.
(Vistas Dining Room)
13:30 - 17:30 Free Afternoon (Banff National Park)
17:30 - 19:30 Dinner
A buffet dinner is served daily between 5:30pm and 7:30pm in Vistas Dining Room, top floor of the Sally Borden Building.
(Vistas Dining Room)
19:30 - 21:00 Informal gathering (Other (See Description))
Thursday, February 15
07:00 - 08:30 Breakfast
Breakfast is served daily between 7 and 9am in the Vistas Dining Room, the top floor of the Sally Borden Building.
(Vistas Dining Room)
08:25 - 08:30 Theme of the day: Robustness (TCPL 201)
08:30 - 09:30 Pin-Yu Chen: An Eye for AI: Towards Scientific Approaches for Evaluating and Improving Robustness and Safety of Foundation Models
Foundation models, which use deep learning pre-trained on large-scale unlabeled data and then fine-tuned with task-specific supervision, have become a prominent technique in AI technology. While foundation models have great potential to learn general representations and exhibit efficient generalization across domains and data modalities, they can pose unprecedented challenges and significant risks to robustness and safety. This talk outlines recent challenges and advances in the robustness and safety of foundation models. It also introduces the “AI model inspector” framework for comprehensive risk assessment and mitigation, and provides use cases in generative AI and large language models.
(Online)
09:30 - 10:00 Yufeng Liu: Statistical Significance of Clustering for High Dimensional Data
Clustering serves as a fundamental tool for exploratory data analysis, but a key challenge lies in determining the reliability of the clusters identified by these methods, differentiating them from artifacts resulting from natural sampling variations. In this talk, I will present statistical significance of clustering (SigClust) as a cluster evaluation tool for high dimensional data. To begin, we define a cluster as data originating from a single Gaussian distribution and frame the assessment of statistical significance of clustering as a formal testing procedure. Addressing the challenge of high-dimensional covariance estimation in SigClust, we employ a combination of invariance principles and a factor analysis model. I’ll also discuss an enhanced SigClust using multidimensional scaling (MDS) on dissimilarity matrices. SigClust for hierarchical clustering will be presented as well. Simulations and real data, including cancer subtype analysis, validate SigClust’s effectiveness in assessing clustering significance.
(TCPL 201)
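The SigClust testing logic is easy to sketch: compare the 2-means cluster index of the data against its null distribution under a single Gaussian fit to the data. The version below uses a naive diagonal covariance estimate, whereas the methods in the talk treat high-dimensional covariance estimation with far more care; names are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_index(X):
    """Within-cluster SS of a 2-means solution divided by total SS."""
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    return km.inertia_ / np.sum((X - X.mean(0)) ** 2)

def sigclust_pvalue(X, n_sim=200, rng=None):
    rng = rng or np.random.default_rng(0)
    ci_obs = cluster_index(X)
    mu, sd = X.mean(0), X.std(0)        # naive diagonal Gaussian null
    null = [cluster_index(rng.normal(mu, sd, size=X.shape))
            for _ in range(n_sim)]
    return np.mean(np.array(null) <= ci_obs)  # small index = strong clustering

rng = np.random.default_rng(6)
two_clusters = np.vstack([rng.normal(0, 1, (100, 20)),
                          rng.normal(3, 1, (100, 20))])
print(sigclust_pvalue(two_clusters, rng=rng))  # near 0: significant clustering
```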
10:00 - 10:30 Coffee Break (TCPL Foyer)
10:30 - 11:00 Linbo Wang: Sparse Causal Learning: Challenges and Opportunities
There has been a recent surge in attention towards trustworthy AI, especially as it starts playing a pivotal role in high-stakes domains such as healthcare, the justice system, and finance. Causal inference emerges as a promising path toward building AI systems that are stable, fair, and explainable. However, it often hinges on precise and strong assumptions. In this talk, I introduce sparse causal learning as a common ground between trustworthy AI and robust causal inference. Specifically, I reconsider the supervised learning problem of predicting an outcome using multiple predictors through the lens of causality. I show that it is possible to remove spurious correlations caused by unmeasured confounding by leveraging low-dimensional structures in the predictors. This new approach leads to algorithms that are theoretically justifiable, computationally feasible, and statistically sound.
(TCPL 201)
11:00 - 11:30 Ying Li: Benchmarking Machine Learning Models for Polymer Informatics: An Example of Glass Transition Temperature
In the field of polymer informatics, utilizing machine learning (ML) techniques to evaluate the glass transition temperature Tg and other properties of polymers has attracted extensive attention. This data-centric approach is much more efficient and practical than laborious experimental measurements when encountering a daunting number of polymer structures. Various ML models have been demonstrated to perform well for Tg prediction. Nevertheless, they are trained on different data sets, using different structure representations, and based on different feature engineering methods. Thus, the critical question arises of selecting a proper ML model to better handle the Tg prediction with generalization ability. To provide a fair comparison of different ML techniques and examine the key factors that affect model performance, we carry out a systematic benchmark study by compiling 79 different ML models and training them on a large and diverse data set. The three major components in setting up an ML model are structure representation, feature representation, and the ML algorithm. In terms of polymer structure representation, we consider the polymer monomer, repeat unit, and oligomer with longer chain structure. Based on that, the feature representation is calculated, including Morgan fingerprinting with or without substructure frequency, RDKit descriptors, molecular embedding, molecular graph, etc. Afterward, the obtained feature input is trained using different ML algorithms, such as deep neural networks, convolutional neural networks, random forest, support vector machine, LASSO regression, and Gaussian process regression. We evaluate the performance of these ML models using a holdout test set and an extra unlabeled data set from high-throughput molecular dynamics simulation. We focus especially on the ML models' generalization ability on the unlabeled data set, and also take into consideration the models' sensitivity to the topology and molecular weight of polymers. This benchmark study provides not only a guideline for the Tg prediction task but also a useful reference for other polymer informatics tasks.
(TCPL 201)
11:30 - 13:00 Lunch
Lunch is served daily between 11:30am and 1:30pm in the Vistas Dining Room, the top floor of the Sally Borden Building.
(Vistas Dining Room)
13:00 - 13:30 Yuanjia Wang: Towards Generative Models for Analyzing Multi-Dimensional Digital Phenotypes
Digital technologies (e.g., mobile phones) can be used to obtain objective, frequent, and real-world digital phenotypes from individuals. However, modeling these data poses substantial challenges since observational data are subject to confounding and various sources of variability. For example, signals on patients' underlying health status and treatment effects are mixed with variation due to the living environment and measurement noise. The digital phenotype data thus show extensive variability between and within patients as well as across different health domains (e.g., motor, cognitive, and speaking). Motivated by a mobile health study of Parkinson's disease (PD), we develop a mixed-response state-space (MRSS) model to jointly capture multi-dimensional, multi-modal digital phenotypes and their measurement processes by a finite number of latent state time series. These latent states reflect the dynamic health status and personalized time-varying treatment effects and can be used to adjust for informative measurements. We conduct comprehensive simulation studies and demonstrate the advantage of MRSS in modeling a mobile health study that remotely collects real-time digital phenotypes from PD patients. We discuss extensions to deep latent state-space models for generating digital phenotype time-series data to learn optimal treatment strategies.
(TCPL 201)
13:30 - 14:00 Tengyuan Liang: Randomization Inference When N = 1
N-of-1 experiments, where a unit serves as its own control and treatment in different time windows, have been used in certain medical contexts for decades. However, due to effects that accumulate over long time windows and interventions that have complex evolution, a lack of robust inference tools has limited the widespread applicability of such N-of-1 designs. This work combines techniques from experiment design in causal inference and system identification from control theory to provide such an inference framework. We derive a model of the dynamic interference effect that arises in linear time-invariant dynamical systems. We show that a family of causal estimands analogous to those studied in potential outcomes are estimable via a standard estimator derived from the method of moments. We derive formulae for higher moments of this estimator and describe conditions under which N-of-1 designs may provide faster ways to estimate the effects of interventions in dynamical systems. We also provide conditions under which our estimator is asymptotically normal and derive valid confidence intervals for this setting.
(TCPL 201)
14:00 - 14:30 Donglin Zeng: Integrating Tools from Statistical Modelling and Machine Learning to Learn Optimal Treatment Regimes from Electronic Health Records
This talk presents a general framework to integrate analytic tools from both statistical modelling and machine learning to learn optimal treatment rules for type 2 diabetes (T2D) patients from electronic health records (EHRs). We first adopt joint statistical models to characterize patients' pretreatment conditions using longitudinal markers from EHRs. The statistical estimation accounts for informative measurement times using inverse intensity weighting methods. The predicted latent processes in the joint models are used to divide patients into a finite number of subgroups, and within each subgroup, patients share similar health profiles in EHRs. Next, we learn optimal individualized treatment rules by extending a matched machine learning algorithm within each subgroup. We apply this integrative analysis to estimate optimal treatment rules for T2D patients in an EHR dataset from the Ohio State University Wexner Medical Center. We demonstrate the utility of our method to select the optimal treatments from four classes of drugs and achieve better control of glycated hemoglobin than one-size-fits-all rules.
(TCPL 201)
14:30 - 15:00 Coffee Break (TCPL Foyer)
15:15 - 15:45 Anna Neufeld: Data thinning and its applications
We propose data thinning, a new approach for splitting an observation from a known distributional family with unknown parameter(s) into two or more independent parts that sum to yield the original observation, and that follow the same distribution as the original observation, up to a (known) scaling of a parameter. This proposal is very general, and can be applied to a broad class of distributions within the natural exponential family, including the Gaussian, Poisson, negative binomial, Gamma, and binomial distributions, among others. Furthermore, we generalize data thinning to enable splitting an observation into two or more parts that can be combined to yield the original observation using an operation other than addition; this enables the application of data thinning far beyond the natural exponential family. Data thinning has a number of applications to model selection, evaluation, and inference. For instance, cross-validation via data thinning provides an attractive alternative to the “usual” approach of cross-validation via sample splitting, especially in unsupervised settings in which the latter is not applicable. We will present an application of data thinning to single-cell RNA-sequencing data, in a setting where sample splitting is not applicable. This is joint work with Daniela Witten (University of Washington), Ameer Dharamshi (University of Washington), Lucy Gao (University of British Columbia), and Jacob Bien (University of Southern California).
(TCPL 201)
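The Poisson case makes the construction concrete: if X ~ Poisson(λ) and X1 | X ~ Binomial(X, ε), then X1 ~ Poisson(ελ) and X2 = X − X1 ~ Poisson((1−ε)λ) are independent. A minimal sketch:

```python
import numpy as np

def thin_poisson(x, eps, rng=None):
    """Split X ~ Poisson(lam) into independent X1 ~ Poisson(eps * lam)
    and X2 ~ Poisson((1 - eps) * lam) with X1 + X2 = X."""
    rng = rng or np.random.default_rng()
    x1 = rng.binomial(x, eps)
    return x1, x - x1

rng = np.random.default_rng(7)
x = rng.poisson(10.0, size=200_000)
x_train, x_test = thin_poisson(x, eps=0.7, rng=rng)
print(x_train.mean(), x_test.mean())       # approx. 7 and 3
print(np.corrcoef(x_train, x_test)[0, 1])  # approx. 0 (independent parts)
```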
15:45 - 16:15 Sanmi Koyejo: Learning from Uncertain Pairwise Preferences
This talk will be in two parts. The first part proposes a cooperative inverse decision theory (CIDT) framework to formalize the metric elicitation problem, i.e., choosing metrics that best align with human preferences. Optimal policies in this framework produce active learning that leads to an exponential improvement in sample complexity over previous work. This framework can be used to efficiently deal with decision data that is sub-optimal due to noise, conflicting experts, or systematic error. The second part outlines a mechanism design that addresses a general repeated-auction setting where the utility derived from a sold good is revealed post-sale. The mechanism's novelty lies in using pairwise comparisons for eliciting information from the bidder. We prove this mechanism is asymptotically truthful, individually rational, and welfare- and revenue-maximizing. Experimental results on multi-label toxicity annotation data, an example of negative utilities, highlight how our proposed mechanism could enhance social welfare in data auctions. Together, these works highlight advances in learning from uncertain human preferences.
(TCPL 201)
16:15 - 16:45 Keegan Korthauer: Group discussion (TCPL 201)
17:30 - 19:30 Dinner
A buffet dinner is served daily between 5:30pm and 7:30pm in Vistas Dining Room, top floor of the Sally Borden Building.
(Vistas Dining Room)
19:30 - 21:00 Informal gathering (Other (See Description))
Friday, February 16
07:00 - 08:45 Breakfast
Breakfast is served daily between 7 and 9am in the Vistas Dining Room, the top floor of the Sally Borden Building.
(Vistas Dining Room)
10:00 - 10:30 Coffee Break (TCPL Foyer)
10:30 - 11:00 Checkout by 11AM
5-day workshop participants are welcome to use BIRS facilities (TCPL) until 3 pm on Friday, although participants are still required to check out of the guest rooms by 11AM.
(Front Desk - Professional Development Centre)
12:00 - 13:30 Lunch from 11:30 to 13:30 (Vistas Dining Room)