Schedule for: 18w5054 - Workshop on the Interface of Machine Learning and Statistical Inference

Beginning on Sunday, January 14 and ending on Friday, January 19, 2018

All times in Banff, Alberta time, MST (UTC-7).

Sunday, January 14
16:00 - 17:30 Check-in begins at 16:00 on Sunday and is open 24 hours (Front Desk - Professional Development Centre)
17:30 - 19:30 Dinner
A buffet dinner is served daily between 5:30pm and 7:30pm in the Vistas Dining Room, the top floor of the Sally Borden Building.
(Vistas Dining Room)
20:00 - 22:00 Informal gathering (Corbett Hall Lounge (CH 2110))
Monday, January 15
07:00 - 08:45 Breakfast
Breakfast is served daily between 7 and 9am in the Vistas Dining Room, the top floor of the Sally Borden Building.
(Vistas Dining Room)
08:45 - 09:00 Introduction and Welcome by BIRS Station Manager (TCPL 201)
09:00 - 09:45 Rich Caruana: Friends Don’t Let Friends Deploy Black-Box Models: The Importance of Transparency and Intelligibility in Machine Learning (TCPL 201)
09:45 - 10:30 Bin Yu: Stability and Iterative Random Forests (TCPL 201)
10:30 - 11:00 Coffee Break (TCPL Foyer)
11:00 - 11:45 Lucas Mentch: Inference and Variable Selection for Random Forests
Despite the success of tree-based learning algorithms (bagging, boosting, random forests), these methods are often seen as prediction-only tools whereby the interpretability and intuition of traditional statistical models are sacrificed for predictive accuracy. We present an overview of recent work that suggests this black-box perspective need not be the case. We consider a general resampling scheme in which predictions are averaged across base learners built with subsamples and demonstrate that the resulting estimator belongs to an extended class of U-statistics. As such, a corresponding central limit theorem is developed allowing for confidence intervals to accompany predictions, as well as formal hypothesis tests for variable significance and additivity. The test statistics proposed can also be extended to produce consistent measures of variable importance. In particular, we propose to extend the typical randomized node-wise feature availability to tree-wise feature availability, allowing for hold-out variable importance measures that, unlike traditional out-of-bag measures, are robust to correlation structures between features.
(TCPL 201)
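To make the subsampling scheme above concrete, here is a minimal sketch (my construction, not the speaker's code) of a tree ensemble built on subsamples together with a Monte Carlo plug-in for the U-statistic variance. The toy data, the constants, and the particular decomposition Var ≈ (k²/n)·ζ₁ + ζ_k/B are illustrative assumptions.

```python
# Sketch: subsampled tree ensemble at a query point x0, with a
# U-statistic-style confidence interval for the prediction.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

n, p = 1000, 5                              # toy regression data
X = rng.normal(size=(n, p))
y = X[:, 0] + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.5, size=n)
x0 = np.zeros((1, p))                       # prediction point

k, B = 100, 500                             # subsample size, number of trees

def subsample_tree_pred(idx):
    tree = DecisionTreeRegressor(max_depth=5).fit(X[idx], y[idx])
    return tree.predict(x0)[0]

# Ensemble prediction: average over trees built on random subsamples.
preds = np.array([subsample_tree_pred(rng.choice(n, k, replace=False))
                  for _ in range(B)])
u_hat = preds.mean()

# zeta_1: variability due to one shared observation, estimated by
# Monte Carlo over groups of trees whose subsamples share a fixed point.
n_z, n_mc = 25, 40
group_means = []
for _ in range(n_z):
    z = rng.integers(n)
    others = np.delete(np.arange(n), z)
    g = [subsample_tree_pred(np.append(rng.choice(others, k - 1, replace=False), z))
         for _ in range(n_mc)]
    group_means.append(np.mean(g))
zeta_1 = np.var(group_means, ddof=1)
zeta_k = preds.var(ddof=1)                  # between-tree variance

# Plug-in variance for the incomplete U-statistic and a normal-theory CI.
var_u = (k ** 2 / n) * zeta_1 + zeta_k / B
half = 1.96 * np.sqrt(var_u)
print(f"prediction {u_hat:.3f}, 95% CI [{u_hat - half:.3f}, {u_hat + half:.3f}]")
```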
11:50 - 12:00 Group Photo
Meet in foyer of TCPL to participate in the BIRS group photo. The photograph will be taken outdoors, so dress appropriately for the weather. Please don't be late, or you might not be in the official group photo!
(TCPL Foyer)
12:00 - 13:00 Lunch
Lunch is served daily between 11:30am and 1:30pm in the Vistas Dining Room, the top floor of the Sally Borden Building.
(Vistas Dining Room)
13:00 - 14:00 Guided Tour of The Banff Centre
Meet in the Corbett Hall Lounge for a guided tour of The Banff Centre campus.
(Corbett Hall Lounge (CH 2110))
14:00 - 15:00 Data Sets (TCPL 201)
15:00 - 15:30 Coffee Break (TCPL Foyer)
15:30 - 17:30 Introductions (TCPL 201)
17:30 - 19:30 Dinner
A buffet dinner is served daily between 5:30pm and 7:30pm in the Vistas Dining Room, the top floor of the Sally Borden Building.
(Vistas Dining Room)
Tuesday, January 16
07:00 - 09:00 Breakfast (Vistas Dining Room)
09:00 - 09:45 Whitney Newey: Inference for Functionals of Machine Learning Estimators (TCPL 201)
09:45 - 10:30 Michael Kosorok: Inferential challenges for machine learning applications in precision medicine and data-driven decision science (TCPL 201)
10:30 - 11:00 Coffee Break (TCPL Foyer)
11:00 - 11:45 Erwan Scornet: Consistency of Random Forests
The recent and ongoing expansion of the digital world now allows anyone to access a tremendous amount of information. However, collecting data is not an end in itself, and techniques must be designed to gain in-depth knowledge from these large databases. This has led to a growing interest in statistics as a tool to find patterns in complex data structures, and particularly in turnkey algorithms which do not require specific skills from the user. Such algorithms are quite often designed based on a hunch, without any theoretical guarantee. Indeed, the overlay of several simple steps (as in random forests or neural networks) makes the analysis more arduous. Nonetheless, theory is vital to give assurance about how algorithms operate, preventing their outputs from being misunderstood. Among the most basic statistical properties is consistency, which states that predictions are asymptotically accurate as the number of observations increases. In this talk, I will present a first result on the consistency of Breiman's forests and show how it sheds some light on their good performance in a sparse regression setting.
(TCPL 201)
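As an empirical companion to the consistency property described above (my illustration, not material from the talk), the sketch below fits random forests on growing samples from a sparse model in which only 2 of 20 features carry signal; the test error should shrink as n grows.

```python
# Illustration: random-forest test risk decreasing with sample size n
# in a sparse regression model (only features 0 and 1 matter).
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
p = 20

def make_data(n):
    X = rng.uniform(size=(n, p))
    y = 2 * X[:, 0] + np.sin(2 * np.pi * X[:, 1]) + rng.normal(scale=0.1, size=n)
    return X, y

X_test, y_test = make_data(5000)
for n in [100, 1000, 10000]:
    X, y = make_data(n)
    rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
    mse = np.mean((rf.predict(X_test) - y_test) ** 2)
    print(f"n = {n:6d}: test MSE = {mse:.4f}")   # should decrease with n
```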
11:45 - 14:00 Lunch (Vistas Dining Room)
14:00 - 14:45 Jelena Bradic: High dimensional inference: do we need sparsity? (TCPL 201)
15:00 - 15:30 Torsten Hothorn: Transformation Forests
Regression models for supervised learning problems with a continuous target are commonly understood as models for the conditional mean of the target given predictors. This notion is simple and therefore appealing for interpretation and visualisation. Information about the whole underlying conditional distribution is, however, not available from these models. A more general understanding of regression models as models for conditional distributions allows much broader inference from such models, for example the computation of prediction intervals. Several random forest-type algorithms aim at estimating conditional distributions, most prominently quantile regression forests (Meinshausen, 2006, JMLR). We propose a novel approach based on a parametric family of distributions characterised by their transformation function. A dedicated "transformation tree" algorithm able to detect distributional changes is developed. Based on these transformation trees, we introduce "transformation forests" as an adaptive local likelihood estimator of conditional distribution functions. The resulting models are fully parametric yet very general and allow broad inference procedures, such as the model-based bootstrap, to be applied in a straightforward way.
(TCPL 201)
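The following is a much-simplified sketch of the adaptive local likelihood idea (my construction; the actual method uses transformation families and a dedicated tree algorithm, whereas a Gaussian family stands in here): forest leaf co-membership supplies nearest-neighbour weights, and a weighted maximum likelihood fit at a query point yields a full conditional distribution rather than just a conditional mean.

```python
# Sketch: forest weights + weighted Gaussian MLE as a stand-in for an
# adaptive local likelihood estimate of a conditional distribution.
import numpy as np
from scipy.stats import norm
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(2)
n = 2000
X = rng.uniform(-2, 2, size=(n, 1))
y = np.sin(X[:, 0]) + (0.2 + 0.2 * np.abs(X[:, 0])) * rng.normal(size=n)

rf = RandomForestRegressor(n_estimators=300, min_samples_leaf=20,
                           random_state=0).fit(X, y)

def forest_weights(x0):
    """Average leaf co-membership between x0 and each training point."""
    leaves_train = rf.apply(X)                # (n, n_trees) leaf indices
    leaves_x0 = rf.apply(x0.reshape(1, -1))[0]
    w = (leaves_train == leaves_x0).astype(float)
    w /= w.sum(axis=0, keepdims=True)         # normalise within each tree
    return w.mean(axis=1)                     # average over trees

x0 = np.array([1.5])
w = forest_weights(x0)
# Weighted Gaussian MLE = local likelihood estimate of the conditional law.
mu = np.sum(w * y)
sigma = np.sqrt(np.sum(w * (y - mu) ** 2))
print(f"P(Y <= 1 | x = 1.5) ~= {norm.cdf(1.0, mu, sigma):.3f}")
print(f"90% prediction interval: {norm.ppf([0.05, 0.95], mu, sigma)}")
```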
15:30 - 16:00 Coffee Break (TCPL Foyer)
16:00 - 16:30 Jake Hofman: Identifying causal effects through auxiliary outcomes
Unobserved or unknown confounders complicate even the simplest attempts to estimate the effect of one variable on another using observational data. When cause and effect are both influenced by these confounders, methods based on exploiting natural experiments (e.g., instrumental variables) have been proposed to eliminate confounds. Unfortunately, however, good instruments are difficult to come by and often apply in very limited settings. In this talk we investigate a particular scenario in time series data that permits causal identification in the presence of unobserved confounders and present an algorithm to automatically find such scenarios at scale. Specifically, we examine a setting where the effect variable can be split up into two parts: one that is potentially affected by the cause, and another that is independent of it. We show that when both of these variables are caused by the same confounders, the problem of identification reduces to that of testing for independence among observed variables. We demonstrate the method by estimating the causal impact of Amazon's recommender system, finding thousands of examples within the dataset that satisfy the criteria for causal identification.
(TCPL 201)
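A stylized simulation of the split-outcome setting (my construction, under the added assumption that the unobserved confounder enters both outcome parts with the same loading): the auxiliary part of the outcome reveals the confounding path, and subtracting its slope recovers the causal coefficient.

```python
# Sketch: using an outcome component that is unaffected by the cause,
# but shares its confounder, to correct a confounded regression slope.
import numpy as np

rng = np.random.default_rng(3)
n, beta = 100_000, 0.7                       # beta = true causal effect

u = rng.normal(size=n)                       # unobserved confounder
x = u + rng.normal(size=n)                   # cause, driven partly by u
y_main = beta * x + u + rng.normal(size=n)   # part potentially affected by x
y_aux = u + rng.normal(size=n)               # part independent of x given u

def ols_slope(a, b):
    return np.cov(a, b)[0, 1] / np.var(a)

naive = ols_slope(x, y_main)                 # biased upward by u
bias = ols_slope(x, y_aux)                   # y_aux isolates the confounding
print(f"naive: {naive:.3f}, bias via auxiliary outcome: {bias:.3f}, "
      f"corrected: {naive - bias:.3f} (true beta = {beta})")
```

Note that a nonzero dependence between x and y_aux is exactly the observable signature of confounding mentioned in the abstract; testing it requires only observed variables.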
16:30 - 17:00 Adele Cutler (TCPL 201)
17:30 - 19:30 Dinner (Vistas Dining Room)
Wednesday, January 17
07:00 - 09:00 Breakfast (Vistas Dining Room)
09:00 - 09:45 Edward George: The Remarkable Flexibility of BART (TCPL 201)
09:45 - 10:30 Andrew Wilson: Bayesian GANs and Stochastic MCMC
Through an adversarial game, generative adversarial networks (GANs) can implicitly learn rich distributions over images, audio, and data which are hard to model with an explicit likelihood. I will present a practical Bayesian formulation for unsupervised and semi-supervised learning with GANs. Within this framework, we use stochastic gradient Hamiltonian Monte Carlo for marginalizing parameters. The resulting approach can automatically discover complementary and interpretable generative hypotheses for collections of images. Moreover, by exploring an expressive posterior over these hypotheses, we show that it is possible to achieve state-of-the-art quantitative results on image classification benchmarks, even with less than 1% of the labelled training data.
(TCPL 201)
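As a sketch of the sampler mentioned in the abstract, here is stochastic gradient Hamiltonian Monte Carlo applied to a toy one-dimensional posterior instead of GAN weights; the friction and step-size constants are illustrative, following the usual practical parameterization (Chen et al., 2014).

```python
# Sketch: SGHMC on a toy 1-D target with artificially noisy gradients,
# standing in for minibatch gradients of a network posterior.
import numpy as np

rng = np.random.default_rng(4)

# Toy target: posterior N(mu=2, sigma^2=0.5), so grad log p is known.
mu, sigma2 = 2.0, 0.5
def noisy_grad_logp(theta):
    # Minibatch noise is emulated with additive Gaussian noise.
    return -(theta - mu) / sigma2 + rng.normal(scale=0.5)

eta, alpha, n_steps = 1e-3, 0.1, 50_000      # step size, friction, iterations
theta, v = 0.0, 0.0
samples = []
for t in range(n_steps):
    v = ((1 - alpha) * v + eta * noisy_grad_logp(theta)
         + rng.normal(scale=np.sqrt(2 * alpha * eta)))
    theta += v
    if t > 5_000:                            # discard burn-in
        samples.append(theta)

print(f"posterior mean ~= {np.mean(samples):.3f} (target {mu})")
print(f"posterior var  ~= {np.var(samples):.3f} (target approx. {sigma2})")
```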
10:30 - 11:00 Coffee Break (TCPL Foyer)
11:00 - 11:45 Lucas Janson: Knockoffs: using machine learning for finite-sample controlled variable selection in nonparametric models (TCPL 201)
11:45 - 13:45 Lunch (Vistas Dining Room)
13:30 - 17:30 Free Afternoon (Banff National Park)
17:30 - 19:30 Dinner (Vistas Dining Room)
Thursday, January 18
07:00 - 09:00 Breakfast (Vistas Dining Room)
09:00 - 09:45 Susan Athey: SHOPPER: A Probabilistic Model of Consumer Choice with Substitutes and Complements
We develop SHOPPER, a sequential probabilistic model of market baskets. SHOPPER uses interpretable components to model the forces that drive how a customer chooses products; in particular, we designed SHOPPER to capture how items interact with other items. We develop an efficient posterior inference algorithm to estimate these forces from large-scale data, and we analyze a large dataset from a major chain grocery store. We are interested in answering counterfactual queries about changes in prices. We find that SHOPPER provides accurate predictions even under price interventions, and that it helps identify complementary and substitutable pairs of products.
(TCPL 201)
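A schematic paraphrase of the model's core (not the authors' code): an item's utility combines a popularity intercept, a price term, and an embedding interaction with the items already in the basket, and the next item is drawn from a softmax over utilities. All parameter names and dimensions below are illustrative.

```python
# Sketch: softmax next-item choice with basket-interaction embeddings,
# plus a counterfactual price query of the kind the abstract describes.
import numpy as np

rng = np.random.default_rng(5)
n_items, d = 50, 8

lam = rng.normal(scale=0.1, size=n_items)            # item popularity
rho = rng.normal(scale=0.1, size=(n_items, d))       # interaction embeddings
alpha = rng.normal(scale=0.1, size=(n_items, d))     # basket-context embeddings
gamma = np.abs(rng.normal(scale=0.1, size=n_items))  # price sensitivity

def next_item_probs(basket, log_prices):
    context = alpha[basket].mean(axis=0)             # average basket embedding
    utility = lam + rho @ context - gamma * log_prices
    utility[basket] = -np.inf                        # no repeats within a trip
    p = np.exp(utility - utility[np.isfinite(utility)].max())
    return p / p.sum()

basket = [3, 17]                                     # items already chosen
probs = next_item_probs(basket, log_prices=np.zeros(n_items))
# Counterfactual query: double item 8's price and compare its choice probability.
lp = np.zeros(n_items)
lp[8] = np.log(2.0)
print(probs[8], next_item_probs(basket, lp)[8])
```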
09:45 - 10:30 Jennifer Hill: Causal inference that capitalizes on machine learning and statistics: opportunities and challenges (TCPL 201)
10:30 - 11:00 Coffee Break (TCPL Foyer)
11:00 - 11:45 Mark van der Laan: Targeted Learning: Integrating the State of the Art of Machine Learning with Statistical Inference (TCPL 201)
11:45 - 14:00 Lunch (Vistas Dining Room)
14:00 - 14:45 Stefan Wager: Learning Objectives for Treatment Effect Estimation
We develop a general class of two-step algorithms for heterogeneous treatment effect estimation in observational studies. We first estimate marginal effects and treatment propensities to form an objective function that isolates the heterogeneous treatment effects, and then optimize the learned objective. This approach has several advantages over existing methods. From a practical perspective, our method is very flexible and easy to use: in both steps, we can use any method of our choice, e.g., penalized regression, a deep net, or boosting; moreover, these methods can be fine-tuned by cross-validating on the learned objective. Meanwhile, in the case of penalized kernel regression, we show that our method has a quasi-oracle property, whereby even if our pilot estimates for marginal effects and treatment propensities are not particularly accurate, we achieve the same regret bounds as an oracle who has a priori knowledge of these nuisance components. We implement variants of our method based on both penalized regression and convolutional neural networks, and find promising performance relative to existing baselines.
(TCPL 201)
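The two-step recipe in the abstract can be sketched as follows (a stand-in implementation in the spirit of the learned-objective idea, with gradient boosting as an illustrative choice for both the nuisance estimates and the final effect model). With squared loss, minimizing the learned objective reduces to a weighted regression of pseudo-outcomes on X, so any learner that accepts sample weights can be plugged in.

```python
# Sketch: cross-fit nuisances m(x) ~= E[Y|X] and e(x) ~= E[W|X], then
# minimize sum_i ((Y_i - m(X_i)) - (W_i - e(X_i)) * tau(X_i))^2
# via a weighted regression on pseudo-outcomes.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, GradientBoostingClassifier
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(6)
n, p = 5000, 5
X = rng.normal(size=(n, p))
tau_true = 1.0 + X[:, 0]                     # heterogeneous treatment effect
e_true = 1 / (1 + np.exp(-X[:, 1]))          # confounded treatment assignment
W = rng.binomial(1, e_true)
Y = X[:, 1] + tau_true * W + rng.normal(size=n)

# Step 1: cross-fitted nuisance estimates (out-of-fold predictions).
m_hat = cross_val_predict(GradientBoostingRegressor(), X, Y, cv=5)
e_hat = cross_val_predict(GradientBoostingClassifier(), X, W, cv=5,
                          method="predict_proba")[:, 1]

# Step 2: optimize the learned objective as a weighted regression.
resid_w = W - e_hat
pseudo = (Y - m_hat) / resid_w
weights = resid_w ** 2
tau_model = GradientBoostingRegressor().fit(X, pseudo, sample_weight=weights)

print("corr(tau_hat, tau_true) =",
      np.corrcoef(tau_model.predict(X), tau_true)[0, 1])
```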
15:00 - 15:30 Coffee Break (TCPL Foyer)
15:30 - 16:00 Nathan Kallus: Generalized Optimal Matching for Inference and Policy Learning (TCPL 201)
16:00 - 16:30 Michal Kolesar: Optimal inference in linear models under constrained parameter spaces (TCPL 201)
16:30 - 17:00 Ashkan Ertefaie: A Greedy Gradient Q-learning Approach for Constructing Optimal Policies in Infinite Time Horizon Settings (TCPL 201)
17:00 - 17:30 Alexandra Chouldechova: "Algorithmic bias": Practical and technical challenges (TCPL 201)
17:30 - 19:30 Dinner (Vistas Dining Room)
Friday, January 19
07:00 - 09:00 Breakfast (Vistas Dining Room)
09:00 - 10:30 Round Table Discussion (TCPL 201)
10:30 - 11:30 Coffee Break (TCPL Foyer)
11:30 - 12:00 Checkout by Noon
5-day workshop participants are welcome to use BIRS facilities (BIRS Coffee Lounge, TCPL and Reading Room) until 3 pm on Friday, although participants are still required to check out of the guest rooms by 12 noon.
(Front Desk - Professional Development Centre)
12:00 - 13:30 Lunch from 11:30 to 13:30 (Vistas Dining Room)