Schedule for: 18w5089 - Recent Developments in Statistical Theory and Methods Based on Distributed Computing

Beginning on Sunday, May 20 and ending on Friday, May 25, 2018

All times in Oaxaca, Mexico time, CDT (UTC-5).

Sunday, May 20
14:00 - 23:59 Check-in begins (Front desk at your assigned hotel)
19:30 - 22:00 Dinner (Restaurant Hotel Hacienda Los Laureles)
20:30 - 21:30 Informal gathering (Hotel Hacienda Los Laureles)
Monday, May 21
07:30 - 08:45 Breakfast (Restaurant at your assigned hotel)
08:45 - 09:00 Introduction and Welcome (Conference Room San Felipe)
09:00 - 09:45 Xiaoming Huo: Computationally and Statistically Efficient Distributed Inference with Theoretical Guarantees
In many contemporary data-analysis settings, it is expensive and/or infeasible to assume that the entire data set is available at a central location. In recent work in computational mathematics and machine learning, great strides have been made in distributed optimization and distributed learning (i.e., machine learning). On the other hand, classical statistical methodology, theory, and computation are typically based on the assumption that the entire data set is available at a central location - this is a significant shortcoming in classical statistical knowledge. The statistical methodology and theory for distributed inference have been actively developed. This talk will discuss one distributed statistical method that is computationally efficient, requires minimal communication, and has comparable statistical properties. Theoretical guarantees for this distributed statistical estimator are presented.
(Conference Room San Felipe)
09:45 - 10:30 Stanislav Minsker: Distributed Statistical Estimation and Rates of Convergence in Normal Approximation
In this talk, we will present algorithms for distributed statistical estimation that can take advantage of the divide-and-conquer approach. We show that one of the key benefits attained by an appropriate divide-and-conquer strategy is robustness, an important characteristic of large distributed systems. Moreover, we introduce a class of algorithms that are based on the properties of the spatial median, establish connections between the performance of these distributed algorithms and rates of convergence in normal approximation, and provide tight deviation guarantees for the resulting estimators in the form of exponential concentration inequalities. Techniques are illustrated with several examples; in particular, we obtain new results for the median-of-means estimator, as well as provide performance guarantees for robust distributed maximum likelihood estimation. The talk is based on joint work with Nate Strawn.
(Conference Room San Felipe)
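As an illustration of the median-of-means estimator named in the abstract above, here is a minimal sketch for scalar data, assuming equal-sized disjoint blocks with any leftover observations dropped. The function name `median_of_means` and its parameters are hypothetical and chosen for this example; the talk itself works with the spatial median and more general settings that this sketch does not cover.

```python
import statistics


def median_of_means(samples, num_blocks):
    """Scalar median-of-means: average each block, then take the median.

    Assumption for this sketch: samples are split into `num_blocks`
    contiguous blocks of equal size, and leftover samples are dropped.
    """
    n = len(samples)
    if not 1 <= num_blocks <= n:
        raise ValueError("num_blocks must be between 1 and len(samples)")
    block_size = n // num_blocks
    block_means = [
        statistics.fmean(samples[i * block_size:(i + 1) * block_size])
        for i in range(num_blocks)
    ]
    # The median of the block means is unaffected by a minority of
    # corrupted blocks, which is the source of robustness.
    return statistics.median(block_means)


if __name__ == "__main__":
    data = [1.0] * 9 + [1000.0]  # one gross outlier
    print(median_of_means(data, 5), statistics.fmean(data))
```

With nine observations equal to 1.0 and one outlier at 1000.0 split into five blocks, the sketch returns 1.0 while the ordinary mean is 100.9. In a distributed setting each block mean could be computed on a separate machine, so only one number per machine needs to be communicated.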
10:30 - 11:00 Coffee Break (Conference Room San Felipe)
11:00 - 11:45 Peter Song: Meta Estimation of Normal Mean Parameter: Seven Perspectives of Data Integration
Data integration has recently drawn considerable attention in the statistical literature. In this talk we will present a synergistic treatment of the estimation of the mean parameter of a normal distribution from seven different schools of statistics, which sheds light on the future development of data integration analytics. They include best linear unbiased estimation (BLUE), maximum likelihood estimation (MLE), Bayesian estimation, empirical Bayesian estimation (EBE), Fisher's fiducial estimation, generalized method of moments (GMM) estimation, and empirical likelihood estimation (ELE). Their properties of scalability and distributed inference will be discussed and compared analytically and numerically.
(Conference Room San Felipe)
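To make the BLUE perspective in the abstract above concrete, here is a minimal sketch that combines per-site estimates of a common mean by inverse-variance weighting. It assumes the site estimates are independent and unbiased with known variances. The function name `combine_site_means` and the `(mean, variance_of_mean)` summary format are hypothetical, and this is not claimed to be the speaker's formulation.

```python
def combine_site_means(site_summaries):
    """Combine independent unbiased estimates of a common mean.

    `site_summaries` is a list of (mean, variance_of_mean) pairs, one per
    site (an assumed format for this sketch). Each estimate is weighted by
    its precision (1 / variance), which gives the best linear unbiased
    estimator when the variances are known.
    """
    if not site_summaries:
        raise ValueError("need at least one site summary")
    weights = [1.0 / variance for _, variance in site_summaries]
    total_precision = sum(weights)
    combined_mean = sum(
        w * mean for w, (mean, _) in zip(weights, site_summaries)
    ) / total_precision
    # Precisions add, so the combined variance is the inverse of their sum.
    combined_variance = 1.0 / total_precision
    return combined_mean, combined_variance


if __name__ == "__main__":
    print(combine_site_means([(1.0, 1.0), (3.0, 1.0)]))
```

Each site only needs to send two numbers, its estimate and the variance of that estimate, so the combination step scales with the number of sites rather than the number of observations.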
11:45 - 12:30 Ding-Xuan Zhou: Theory of Deep Convolutional Neural Networks and Distributed Learning
Deep learning has been widely applied and has brought breakthroughs in speech recognition, computer vision, and many other domains. The deep neural network architectures involved and the associated computational issues have been well studied in machine learning. However, a theoretical foundation is lacking for understanding the approximation or generalization ability of deep learning methods with network architectures such as deep convolutional neural networks with convolutional structures. This talk describes a mathematical theory of deep convolutional neural networks (CNNs). In particular, we show the universality of a deep CNN, meaning that it can be used to approximate any continuous function to an arbitrary accuracy when the depth of the neural network is large enough. Our quantitative estimate, given tightly in terms of the number of free parameters to be computed, verifies the efficiency of deep CNNs in dealing with high-dimensional data. Some related distributed learning algorithms will also be discussed.
(Conference Room San Felipe)
12:30 - 13:15 Bochao Jia: Double-Parallel Monte Carlo for Bayesian Analysis of Big Data
This paper proposes a simple, practical and efficient MCMC algorithm for Bayesian analysis of big data. The proposed algorithm divides the big dataset into smaller subsets and provides a simple method to aggregate the subset posteriors to approximate the full data posterior. To further speed up computation, the proposed algorithm employs the population stochastic approximation Monte Carlo (Pop-SAMC) algorithm, a parallel MCMC algorithm, to simulate from each subset posterior. Since this algorithm consists of two levels of parallelism, data-parallel and simulation-parallel, it is named “Double Parallel Monte Carlo”. The validity of the proposed algorithm is justified mathematically and numerically.
(Conference Room San Felipe)
13:20 - 13:30 Group Photo (Hotel Hacienda Los Laureles)
13:30 - 15:00 Lunch (Restaurant Hotel Hacienda Los Laureles)
15:15 - 16:00 Jin Zhou: Variance Component Testing and Selection for a Longitudinal Microbiome Study
High-throughput sequencing technology has enabled population-based studies of the role of the human microbiome in disease etiology and exposure response. Due to the high cost of sequencing technology, such studies usually have limited sample sizes. We study the association of microbiome composition and clinical phenotypes by testing the nullity of variance components. When the null model has more than one variance parameter and sample sizes are limited, such as in longitudinal metagenomics studies, testing zero variance components remains an open challenge. In this talk, I first introduce a series of efficient exact tests (score test, likelihood ratio test, and restricted likelihood ratio test) for testing zero variance components in the presence of multiple variance components. Our approach does not rely on asymptotic theory and thus significantly boosts the power in small samples. Furthermore, to address the limited sample size and high-dimensional features of metagenomics data, we introduce a variance component selection scheme with lasso penalization. We propose a minorization-maximization (MM) algorithm for the difficult optimization problem. Extensive simulations demonstrate the superiority of our methods over existing methods. Finally, we apply our method to a longitudinal microbiome study of HIV-infected patients.
(Conference Room San Felipe)
16:00 - 16:30 Coffee Break (Conference Room San Felipe)
16:30 - 17:15 Working group time (Conference Room San Felipe)
17:15 - 19:00 Working group time (Conference Room San Felipe)
19:00 - 21:00 Dinner (Restaurant Hotel Hacienda Los Laureles)
Tuesday, May 22
07:30 - 09:00 Breakfast (Restaurant at your assigned hotel)
09:00 - 09:45 Martin Lysy: Applications of a Distributed Computational Method for Microparticle Tracking in Biological Fluids
State-of-the-art techniques in passive particle-tracking microscopy provide high-resolution path trajectories of diverse foreign particles in biological fluids. In order to analyze experiments often tracking thousands of particles at once, scientists must account for many sources of unwanted variability, such as heterogeneity of the fluid environment and measurement error. To this end, this talk presents a versatile family of hierarchical stochastic process models, along with a scalable split-merge distributed computing strategy for parameter inference. Also presented are several applications to quantifying subdiffusive mobility of tracer particles in human lung mucus.
(Conference Room San Felipe)
09:45 - 10:30 Rajarshi Guhaniyogi: DISK: Divide and Conquer Spatial Kriging for Massive Sea Surface Database (Conference Room San Felipe)
10:30 - 11:00 Coffee Break (Conference Room San Felipe)
11:00 - 11:45 Joong-Ho Won: A Continuum of Optimal Primal-Dual Algorithms for Convex Composite Minimization Problems with Applications to Structured Sparsity
Many statistical learning problems can be posed as minimization of a sum of two convex functions, one typically a composition of non-smooth and linear functions. Examples include regression under structured sparsity assumptions. Popular algorithms for solving such problems, e.g., ADMM, often involve non-trivial optimization subproblems or smoothing approximations. We consider two classes of primal-dual algorithms that do not incur these difficulties, and unify them from the perspective of monotone operator theory. From this unification we propose a continuum of preconditioned forward-backward operator splitting algorithms amenable to parallel and distributed computing. We establish rates of convergence over the entire region of convergence of the whole continuum of algorithms. For some known instances of this continuum, our analysis closes the gap in theory. We further exploit the unification to propose a continuum of accelerated algorithms. We show that the whole continuum attains the theoretically optimal rate of convergence. The scalability of the proposed algorithms, as well as their convergence behavior, is demonstrated up to 1.2 million variables with a distributed implementation.
(Conference Room San Felipe)
11:45 - 12:30 Emily Hector: A Distributed and Integrated Method of Moments for High-Dimensional Correlated Data Analysis
We present a divide-and-conquer procedure implemented in a distributed and parallelized scheme for statistical estimation and inference of regression parameters with high-dimensional correlated responses with multi-level nested correlations. Despite significant efforts in the literature, the computational bottleneck associated with high-dimensional likelihoods prevents the scalability of existing methods. The proposed method addresses this challenge by dividing subjects into independent groups and responses into correlated subvectors to be analyzed separately and in parallel on a distributed platform. Theoretical challenges related to combining results from dependent data are overcome in a statistically efficient way using a meta-estimator derived from Hansen’s Generalized Method of Moments. We provide a rigorous theoretical framework for efficient estimation, inference, and goodness-of-fit tests. We develop an R package for ease of implementation. We illustrate our method’s performance with simulations and the analysis of a motivating complex neuroimaging dataset from an association study of the effects of iron deficiency on auditory recognition memory.
(Conference Room San Felipe)
12:30 - 13:15 Sharmistha Guha: Bayesian Regression with Undirected Network Predictors with an Application to Brain Connectome Data
We propose a Bayesian approach to regression with a continuous scalar response and an undirected network predictor. Undirected network predictors are often expressed in terms of symmetric adjacency matrices, with rows and columns of the matrix representing the nodes, and zero entries signifying no association between two corresponding nodes. Network predictor matrices are typically vectorized prior to any analysis, thus failing to account for the important structural information in the network. This results in poor inferential and predictive performance in the presence of small sample sizes. We propose a novel class of network shrinkage priors for the coefficient corresponding to the undirected network predictor. The proposed framework is devised to detect both nodes and edges in the network predictive of the response. Our framework is implemented using an efficient Markov Chain Monte Carlo algorithm. Empirical results in simulation studies illustrate strikingly superior inferential and predictive gains of the proposed framework in comparison with ordinary high-dimensional Bayesian shrinkage priors and penalized optimization schemes. We apply our method to a brain connectome dataset that contains information on brain networks along with a measure of creativity for multiple individuals. Here, interest lies in building a regression model of the creativity measure on the network predictor to identify important regions and connections in the brain strongly associated with creativity. To the best of our knowledge, our approach is the first principled Bayesian method that is able to detect scientifically interpretable regions and connections in the brain actively impacting the continuous response (creativity) in the presence of a small sample size.
(Conference Room San Felipe)
13:30 - 15:00 Lunch (Restaurant Hotel Hacienda Los Laureles)
15:15 - 16:00 Luc Villandré: Challenges in the prediction of motor vehicle traffic collisions with GPS travel data
In the field of road safety, crashes involving physical injuries typically occur on roadways, which constrain the events to lie along a linear network. Substantial research efforts have been devoted to the development of methods for point patterns on linear networks. In one such model, we assume that crash coordinates are produced by a Poisson point process whose domain corresponds to edges in the road network. This talk focuses on the analysis of geolocalised accident data in the context of a smart city initiative launched by the City of Quebec (Canada) aiming to identify crash hotspots on the road network based on covariates derived from GPS data. Data originate from three sources: i) a geolocalised traffic accident database whose entries are based on police reports, ii) GPS trajectories obtained from a study on 4,000 drivers involving 55,000 trips, and iii) the structure of the road network obtained from the OpenStreetMap (OSM) database. We highlight challenges, both methodological and computational, with the use of those three data sources in producing sensible inference for the covariate effects.
(Conference Room San Felipe)
16:00 - 16:30 Coffee Break (Conference Room San Felipe)
16:30 - 18:00 Josh Day: Distributed Computing Using JuliaDB and OnlineStats (Conference Room San Felipe)
19:00 - 21:00 Dinner (Restaurant Hotel Hacienda Los Laureles)
Wednesday, May 23
07:30 - 09:00 Breakfast (Restaurant at your assigned hotel)
09:00 - 09:45 Min-ge Xie: On Combination of Inferences After Split-and-Conquer
This talk will give an overview of scientific methods that are used for combining estimations or inferences from different data (or subsets of data). It will discuss pros and cons of different methodologies that are commonly used in scientific research. It will stress the use of 'distribution estimators' to combine inferences and provide a 'unified' angle to view both Bayesian and frequentist approaches. It will also provide examples to illustrate some pitfalls of some well-known approaches.
(Conference Room San Felipe)
09:45 - 10:30 Josh Day: Online Algorithms for Statistics (Conference Room San Felipe)
10:30 - 11:00 Coffee Break (Conference Room San Felipe)
11:00 - 11:45 Sergio Adrián Lagunas Pinacho: Good Practices to Write Good, Clean, and Collaborative Code (Conference Room San Felipe)
11:45 - 12:30 Edgar Jimenez: Parallel Forecasts For Demand Planning of Perishable Processed Foods
Forecasting is a key activity in demand planning, which is a pillar of supply chain management. In the present work we present the results achieved by a system developed for a producer of processed foods located in Mexico. This system required a custom-made architecture because of two key requirements: high speed of forecast generation and minimum possible error. The models were based on weekly data for all clients and products. The forecast procedure was based on parallel processing, which in the latest iteration of the system also used hierarchical properties of the forecasts.
(Conference Room San Felipe)
12:30 - 13:30 Lunch (Restaurant Hotel Hacienda Los Laureles)
13:30 - 19:00 Free Afternoon (Oaxaca)
19:00 - 21:00 Dinner (Restaurant Hotel Hacienda Los Laureles)
Thursday, May 24
07:30 - 09:00 Breakfast (Restaurant at your assigned hotel)
09:00 - 09:45 Zhengwu Zhang: Optimization Problems in Brain Connectome Analysis (Conference Room San Felipe)
09:45 - 10:30 Ernesto Álvarez González: Advances in computing parameters for JK69 triplets with fixed topology (Conference Room San Felipe)
10:30 - 11:00 Coffee Break (Conference Room San Felipe)
11:00 - 11:45 Shanshan Cao (Conference Room San Felipe)
11:45 - 12:30 Min-ge Xie: Confidence Distribution - Part II (Conference Room San Felipe)
12:30 - 13:30 Lunch (Restaurant Hotel Hacienda Los Laureles)
13:30 - 16:00 Lightning talks or working group time (Conference Room San Felipe)
16:00 - 16:30 Coffee Break (Conference Room San Felipe)
16:30 - 19:00 Lightning talks or working group time (Conference Room San Felipe)
19:00 - 21:00 Dinner (Restaurant Hotel Hacienda Los Laureles)
Friday, May 25
07:30 - 09:00 Breakfast (Restaurant at your assigned hotel)
09:00 - 10:30 Lightning talks or working group time (Conference Room San Felipe)
10:30 - 11:00 Coffee Break (Conference Room San Felipe)
11:00 - 12:00 Lightning talks or working group time (Conference Room San Felipe)
12:00 - 14:00 Lunch (Restaurant Hotel Hacienda Los Laureles)