Schedule for: 15w5003 - Current and Future Challenges in Robust Statistics

Beginning on Sunday, November 15 and ending Friday November 20, 2015

All times in Banff, Alberta time, MST (UTC-7).

Sunday, November 15
16:00 - 17:30 Check-in begins at 16:00 on Sunday and is open 24 hours (Front Desk - Professional Development Centre)
17:30 - 19:30 Dinner
A buffet dinner is served daily between 5:30pm and 7:30pm in the Vistas Dining Room, the top floor of the Sally Borden Building.
(Vistas Dining Room)
20:00 - 22:00 Informal gathering (Corbett Hall Lounge (CH 2110))
Monday, November 16
07:00 - 08:45 Breakfast
Breakfast is served daily between 7 and 9am in the Vistas Dining Room, the top floor of the Sally Borden Building.
(Vistas Dining Room)
08:45 - 09:00 Introduction and Welcome by BIRS Station Manager (TCPL 201)
09:00 - 09:45 Steven Marron: Robustness Against Heterogeneity in Big Data
A major challenge in the world of Big Data is heterogeneity. This often results from the aggregation of smaller data sets into larger ones. Such aggregation creates heterogeneity because different experimenters typically make different design choices. Even when attempts are made at common designs, environmental or operator effects still often create heterogeneity. This motivates moving away from the classical conceptual model of Gaussian distributed data, in the direction of Gaussian mixtures. But classical mixture estimation methods are usually useless in Big Data contexts, because there are far too many parameters to efficiently estimate. Thus there is a strong need for statistical procedures which are robust against mixture distributions without the need for explicit estimation. Some early ideas in this important new direction are discussed.
(TCPL 201)
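As a toy illustration of the heterogeneity point above (not taken from the talk), the sketch below contrasts the sample mean and median under a two-component Gaussian mixture; the mixture weight and shift are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
# 90% of the observations from N(0, 1), 10% from a shifted component N(5, 1):
# an (assumed) two-component mixture standing in for aggregated heterogeneous data.
from_minority = rng.random(n) < 0.1
x = np.where(from_minority, rng.normal(5.0, 1.0, n), rng.normal(0.0, 1.0, n))

# The sample mean is pulled toward the minority component; the median, a
# simple mixture-robust summary, stays near the dominant component.
print(f"mean   = {x.mean():.3f}")      # roughly 0.5
print(f"median = {np.median(x):.3f}")  # close to 0
```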
09:45 - 10:30 Stephan Morgenthaler: Bias and robustness (TCPL 201)
10:30 - 11:00 Coffee Break (TCPL Foyer)
11:00 - 11:45 Werner Stahel: What is a robust prediction interval? (TCPL 201)
11:45 - 13:00 Lunch (Vistas Dining Room)
13:00 - 13:45 Guided Tour of The Banff Centre
Meet in the Corbett Hall Lounge for a guided tour of The Banff Centre campus.
(Corbett Hall Lounge (CH 2110))
13:45 - 14:15 Group Photo
Meet in foyer of TCPL to participate in the BIRS group photo. Please don't be late, or you will not be in the official group photo! The photograph will be taken outdoors so a jacket might be required.
(TCPL Foyer)
14:15 - 15:00 Douglas Wiens: Model robust scenarios for active learning
What we in Statistics call experimental design is very much like what those in Machine Learning call active learning. In both cases, the idea is that independent variables are chosen in some optimal manner, and at these values a response variable is measured. In design, the regressors are determined by a design measure, obtained by the designer according to some optimality principle such as minimum mean squared error of the predicted values. In 'passive learning' these regressors are randomly sampled from 'the environment'; in active learning they are randomly sampled from a subpopulation according to a probability density derived by the designer in some optimal manner. So a major difference between active learning and experimental design is in the random, rather than deterministic, sampling of the regressors from the learning density or design measure. When the parametric model being fitted is exactly correct, the corresponding loss functions are asymptotically equivalent and the methods of experimental design apply, with only minor modifications, to active learning. When however this model is in doubt, some significant differences between robust design and robust learning emerge, and with them interesting, new optimality problems.
(TCPL 201)
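The passive-versus-active sampling contrast described above can be made concrete with a minimal sketch (not from the talk): 'passive' draws of the regressor from a uniform environment versus 'active' draws from a designer-chosen density favouring the endpoints; the density f(x) = (3/2)x^2 is an arbitrary illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200

def passive_sample(n):
    # Passive learning: x drawn from "the environment", here Uniform(-1, 1).
    return rng.uniform(-1.0, 1.0, n)

def active_sample(n):
    # Active learning: x drawn from a designer-chosen density, here
    # f(x) = (3/2) x^2 on [-1, 1], which favours the informative endpoints.
    # Inverse-CDF sampling: F(x) = (x^3 + 1)/2, so x = (2u - 1)^(1/3).
    u = rng.random(n)
    return np.cbrt(2.0 * u - 1.0)

for name, sampler in [("passive", passive_sample), ("active", active_sample)]:
    x = sampler(n)
    y = 1.0 + 2.0 * x + rng.normal(0.0, 0.5, n)  # exactly correct linear model
    slope = np.polyfit(x, y, 1)[0]
    # Var(slope) is roughly sigma^2 / (n * Var(x)): a larger spread of x helps.
    print(f"{name:8s} Var(x) = {x.var():.3f}  fitted slope = {slope:.3f}")
```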
15:00 - 15:30 Coffee Break (TCPL Foyer)
15:30 - 16:15 David Tyler: Regularized M-Estimators of Multivariate Scatter (TCPL 201)
16:15 - 17:00 Marc Genton: Tukey g-and-h Random Fields (TCPL 201)
17:30 - 19:30 Dinner
A buffet dinner is served daily between 5:30pm and 7:30pm in the Vistas Dining Room, the top floor of the Sally Borden Building.
(Vistas Dining Room)
Tuesday, November 17
09:00 - 09:45 Elvezio Ronchetti: Robust Filtering
Filtering methods are powerful tools to estimate the hidden state of a state-space model from observations available in real time. However, they are known to be highly sensitive to the presence of small misspecifications of the underlying model and to outliers in the observation process. In this paper, we show that the methodology of robust statistics can be adapted to sequential filtering. We define a filter as being robust if the relative error in the state distribution caused by misspecifications is uniformly bounded by a linear function of the perturbation size. Since standard filters are nonrobust even in the simplest cases, we propose robustified filters which provide accurate state inference in the presence of model misspecifications. The robust particle filter naturally mitigates the degeneracy problems that plague the bootstrap particle filter and its many extensions. We illustrate the good properties of robust filters in linear and nonlinear state-space examples.
(TCPL 201)
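A minimal sketch of the robustification idea above, assuming a toy AR(1) state model and a Huberized Gaussian observation log-likelihood (these choices are illustrative; the talk's filters are more general).

```python
import numpy as np

rng = np.random.default_rng(2)

def huber_rho(r, c=1.345):
    """Huber loss on standardized residuals: quadratic near 0, linear in the tails."""
    a = np.abs(r)
    return np.where(a <= c, 0.5 * r ** 2, c * a - 0.5 * c ** 2)

def robust_particle_filter(y, n_part=1000, phi=0.9, q=1.0, r=1.0):
    """Bootstrap particle filter whose Gaussian observation log-likelihood
    is replaced by -huber_rho(residual), bounding each observation's influence."""
    x = rng.normal(0.0, 1.0, n_part)                       # initial particle cloud
    means = []
    for yt in y:
        x = phi * x + rng.normal(0.0, np.sqrt(q), n_part)  # propagate AR(1) state
        logw = -huber_rho((yt - x) / np.sqrt(r))           # robustified log-weights
        w = np.exp(logw - logw.max())
        w /= w.sum()
        means.append(np.sum(w * x))
        x = x[rng.choice(n_part, n_part, p=w)]             # multinomial resampling
    return np.array(means)

y = np.sin(np.linspace(0.0, 6.0, 50))
y[25] += 15.0                                    # a gross observation outlier
print(robust_particle_filter(y)[23:28].round(2)) # barely perturbed at t = 25
```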
09:45 - 10:30 William Aeberhard: A Proposal for Robust Estimation of Fixed Parameters in General State-Space Models
State-space models (SSMs) encompass a wide range of popular models encountered in various fields such as mathematical finance, control engineering and ecology. SSMs are essentially characterized by a hierarchical structure, with latent variables governed by Markovian dynamics. Fixed parameters in these models are traditionally estimated by maximum likelihood and generally include regression and autoregression coefficients as well as correlations and scale parameters. Standard robust estimation techniques from generalized linear and time series models cannot be directly adapted to SSMs, mainly for two reasons: first, integrating high-dimensional latent variables out of a joint likelihood inevitably requires some approximation (except in very special cases); second, the approximated maximum likelihood scores are typically exceedingly complicated, if not intractable. We propose a robust estimating method based on an unpublished 2001 paper by Shinto Eguchi and Yutaka Kano: instead of introducing weights at the estimating equations level, we downweight observations on the log-likelihood scale. A Laplace approximation of the marginal log-likelihood allows us to formulate a computable estimator for which we derive the influence functional for different scenarios of additive outliers. We resort to indirect inference for the computation of Fisher consistency correction terms.
(TCPL 201)
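The downweighting-on-the-log-likelihood-scale idea can be sketched in its simplest form for a normal location model; the floor-type transform psi below is one crude choice for illustration, not the Eguchi-Kano transform itself.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

rng = np.random.default_rng(3)
y = np.concatenate([rng.normal(0.0, 1.0, 95), np.full(5, 12.0)])  # 5% gross outliers

def psi(l, tau=-4.0):
    # Floor each per-observation log-likelihood at tau: observations whose
    # log-likelihood falls below tau stop pulling on the estimate.
    return np.maximum(l, tau)

def neg_objective(mu, transform):
    return -np.sum(transform(norm.logpdf(y, loc=mu, scale=1.0)))

mle = minimize_scalar(lambda m: neg_objective(m, lambda l: l),
                      bounds=(-5.0, 15.0), method="bounded").x
rob = minimize_scalar(lambda m: neg_objective(m, psi),
                      bounds=(-5.0, 15.0), method="bounded").x
print(f"MLE mu = {mle:.3f}   robustified mu = {rob:.3f}")  # MLE dragged toward 12
```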
10:30 - 11:00 Coffee Break (TCPL Foyer)
11:00 - 11:45 Stefan Van Aelst: Robust functional principal components by least trimmed squares
Classical functional principal component analysis can yield erroneous approximations in the presence of outliers. To reduce the influence of atypical data we propose two methods based on trimming: a multivariate least trimmed squares (LTS) estimator and a componentwise variant. The multivariate LTS minimizes the least squares criterion over subsets of curves. The componentwise version minimizes the sum of univariate LTS scale estimators in each of the components. In general the curves can be considered as realizations of a random element on a separable Hilbert space. For a fixed dimension q, we then aim to robustly estimate the q-dimensional linear subspace that gives the best approximation to the functional data. Following Boente and Salibian-Barrera (2014), our estimators use smoothing to first represent irregularly spaced curves in a high-dimensional space and then calculate the LTS solution on these multivariate data. The multivariate solution is subsequently mapped back onto the Hilbert space. Poorly fitted observations can then be flagged as outliers. A simulation study and real data applications show that our estimators yield competitive results, both in identifying outliers and approximating regular data, when compared to other existing methods.
(TCPL 201)
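A rough sketch of the multivariate LTS idea for a fixed q-dimensional subspace, using a simple concentration-step loop on simulated curves (an illustrative simplification, not the authors' algorithm).

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, q, h = 100, 50, 2, 75           # n curves on a p-point grid; retain h of n

t = np.linspace(0.0, 1.0, p)
X = (np.outer(rng.normal(1.0, 0.3, n), np.sin(2 * np.pi * t))
     + np.outer(rng.normal(0.0, 0.3, n), np.cos(2 * np.pi * t))
     + rng.normal(0.0, 0.05, (n, p)))
X[:10] += 3.0                         # ten outlying curves

subset = rng.choice(n, h, replace=False)
for _ in range(20):                   # concentration steps
    center = X[subset].mean(axis=0)
    _, _, Vt = np.linalg.svd(X[subset] - center, full_matrices=False)
    V = Vt[:q].T                      # current q-dimensional subspace
    resid = X - center
    dist = np.sum((resid - resid @ V @ V.T) ** 2, axis=1)
    new = np.argsort(dist)[:h]        # concentrate on the best-fitted curves
    if set(new) == set(subset):
        break
    subset = new

flagged = np.setdiff1d(np.arange(n), subset)
print("flagged curves:", flagged)     # should include the first ten
```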
11:45 - 13:30 Lunch (Vistas Dining Room)
13:30 - 14:15 Daniel Peña: Robust Generalized Dynamic Principal Components
We define generalized dynamic principal components (GDPC) as the time series that provide an optimal reconstruction of the original series under an interpolation criterion. We first used the mean squared error as criterion and obtained a solution that can be applied under more general conditions than the one used by Brillinger, including the case of non-stationary series and relatively short series. Then we used a robust criterion to obtain a robust version of the generalized dynamic principal components that will work when the series have outlier contamination. Our non-robust and robust procedures will be illustrated with real datasets.
(TCPL 201)
14:15 - 15:00 Ana Bianco: Robust estimation in partially linear measurement error models
In many applications of regression analysis, there are covariates that are measured with errors. Measurement error models are a useful tool for the analysis of this kind of situation. Among semiparametric models, partially linear models have been extensively used due to their flexibility to model linear components in conjunction with non-parametric ones. In this talk, we focus on partially linear models where the covariates of the linear component are measured with additive errors. We consider a robust family of estimators of the parametric and nonparametric components that combines robust local smoothers with robust parametric techniques. The resulting estimators are based on a three-step procedure. We prove that, under regularity conditions, they are consistent. We study their robustness by means of the empirical influence function. A simulation study allows us to compare the behaviour of the robust estimators with their classical relatives, and a real data example is analysed to illustrate the performance of the proposal.
(TCPL 201)
15:00 - 15:30 Coffee Break (TCPL Foyer)
15:30 - 16:10 Marianthi Markatou: Distances and their role in robustness
Statistical distances, divergences and similar quantities have a long history and play a fundamental role in statistics, machine learning and associated scientific disciplines. In this talk, we first examine aspects of robustness met in biomedical applications; we then discuss the role of statistical distances in the solution of the robustness challenge. Inferential aspects facilitating scientific understanding are discussed, and we conclude with the need to broaden the conceptual and practical base of robustness.
(TCPL 201)
16:10 - 16:42 Peter Filzmoser: Pairwise Mahalanobis distances in the context of local outlier detection
The Mahalanobis distance between pairs of multivariate observations is used as a measure of similarity between the observations. The theoretical distribution is derived, and the result is used for judging the degree of isolation of an observation. In the case of spatially dependent data where spatial coordinates are available, different exploratory tools are introduced for studying the degree of isolation of an observation from a fraction of its neighbors, and thus for identifying local multivariate outliers.
(TCPL 201)
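A minimal illustration of the approach, with scikit-learn's MCD scatter standing in for the robust covariance (an assumption, not necessarily the estimator used in the talk): each observation is scored by its average Mahalanobis distance to its k nearest spatial neighbours.

```python
import numpy as np
from sklearn.covariance import MinCovDet

rng = np.random.default_rng(5)
n, k = 200, 10
coords = rng.uniform(0.0, 10.0, (n, 2))                 # spatial coordinates
X = rng.multivariate_normal(np.zeros(3), np.eye(3), n)  # multivariate values
X[0] += np.array([4.0, -4.0, 4.0])                      # plant one outlier

# Robust scatter (MCD) used as the metric for pairwise Mahalanobis distances.
Sigma_inv = np.linalg.inv(MinCovDet(random_state=0).fit(X).covariance_)

scores = np.empty(n)
for i in range(n):
    # k nearest neighbours in geographic space, excluding the point itself.
    nn = np.argsort(np.linalg.norm(coords - coords[i], axis=1))[1:k + 1]
    d = X[nn] - X[i]
    scores[i] = np.mean(np.sqrt(np.einsum("ij,jk,ik->i", d, Sigma_inv, d)))

print("most locally isolated observation:", int(np.argmax(scores)))  # expect 0
```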
17:30 - 19:30 Dinner (Vistas Dining Room)
Wednesday, November 18
09:00 - 09:38 Doug Martin (TCPL 201)
09:38 - 10:30 Peter Rousseeuw: Detecting cellwise outliers
A multivariate dataset consists of n observations in p dimensions, and is often stored in an n by p matrix X. Robust statistics has mostly focused on identifying and downweighting outlying rows of X, called rowwise or casewise outliers. However, downweighting an entire row if only one (or a few) of its cells are deviating entails a huge loss of information. Also, in high-dimensional data the majority of the rows may contain a few contaminated cells, which yields a loss of robustness as well. Recently new robust methods have been developed for datasets with missing values and with cellwise outliers, also called elementwise outliers. We will explore the detection of cellwise outliers, and compare the misclassification rates between methods by means of simulations in which the data contain cellwise outliers, rowwise outliers, or both simultaneously. The result of a cellwise outlier detection rule can also be used in a second step which robustly estimates the underlying location and scatter matrix. We will compare the accuracy of a few such two-step procedures.
(TCPL 201)
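A deliberately simple cellwise detection rule, for illustration only (the talk compares far more refined methods): robust per-column standardization with median and MAD, and a fixed cutoff.

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 100, 5
X = rng.normal(0.0, 1.0, (n, p))
X[3, 1] = 9.0                                      # contaminate individual cells,
X[42, 4] = -8.0                                    # not whole rows

med = np.median(X, axis=0)
mad = 1.4826 * np.median(np.abs(X - med), axis=0)  # MAD, consistent at the normal
z = (X - med) / mad                                # robust cellwise z-scores

cutoff = 3.5                                       # an arbitrary conservative cutoff
for r, c in zip(*np.where(np.abs(z) > cutoff)):
    print(f"flagged cell ({r}, {c}) with value {X[r, c]:.1f}")
```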
10:30 - 11:00 Coffee Break (TCPL Foyer)
11:00 - 11:45 Alfio Marazzi: Session in Honour of Ricardo Maronna, Doug Martin and Victor Yohai (TCPL 201)
11:45 - 13:30 Lunch (Vistas Dining Room)
14:15 - 17:30 Free Afternoon (Banff National Park)
17:30 - 19:30 Dinner (Vistas Dining Room)
Thursday, November 19
09:00 - 09:45 Christophe Croux: Robust and sparse regression in high dimensions
Robust regression estimators such as the Least Trimmed Squares, S- and MM-estimators cannot be computed when we have more variables than observations. Recently, sparse versions of these estimators have been proposed, where many of the estimated regression coefficients will be zero. Moreover, these estimators can be computed in the high dimension, low sample size setting. They are defined by adding an L1-penalty to their objective function, and maintain the breakdown point of the non-sparse counterparts. Feasible algorithms are available. In this talk we will review existing results, and highlight some remaining issues, such as the selection of the sparsity level, the choice of the starting value of the algorithm, and the danger of cellwise outliers. Furthermore, the estimation of the residual scale, important for outlier detection, involves some difficulties in these high-dimensional settings.
(TCPL 201)
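The sparse LTS idea can be sketched by alternating a concentration step with a Lasso fit on the retained observations; the subset size h and penalty level alpha below are arbitrary choices, and the starting-value and scale issues raised in the talk are ignored.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(7)
n, p, h = 60, 100, 45                  # p > n: high dimension, low sample size
beta = np.zeros(p)
beta[:3] = [2.0, -1.5, 1.0]            # sparse true coefficient vector
X = rng.normal(0.0, 1.0, (n, p))
y = X @ beta + rng.normal(0.0, 0.5, n)
y[:6] += 20.0                          # vertical outliers

subset = rng.choice(n, h, replace=False)
lasso = Lasso(alpha=0.1)
for _ in range(25):
    lasso.fit(X[subset], y[subset])    # L1-penalized fit on the retained subset
    resid2 = (y - lasso.predict(X)) ** 2
    new = np.argsort(resid2)[:h]       # keep the h best-fitted observations
    if set(new) == set(subset):
        break
    subset = new

print("selected variables:", np.flatnonzero(lasso.coef_))  # ideally 0, 1, 2
```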
09:45 - 10:30 Ezequiel Smucler: Robust and sparse estimators for linear regression models
Penalized regression estimators are a popular tool for the analysis of sparse and high-dimensional data sets. However, most of the proposals of penalized regression estimators are defined using unbounded loss functions, and therefore are very sensitive to the presence of outlying observations, especially high leverage outliers. Thus, robust estimators for sparse and high-dimensional linear regression models are needed. In this talk, we introduce Bridge and adaptive Bridge versions of MM-estimators: ℓq-penalized MM-estimators of regression and MM-estimators with an adaptive ℓt penalty. We discuss their asymptotic properties and outline an algorithm to calculate them for the special case of q = t = 1. The advantages of our proposed estimators are demonstrated through a simulation study and the analysis of a real high-dimensional data set.
(TCPL 201)
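The adaptive-penalty idea in its simplest form, via the usual feature-rescaling trick for the adaptive Lasso, with scikit-learn's HuberRegressor standing in for a robust initial fit (the talk's MM-estimators and ℓq penalties are not reproduced here).

```python
import numpy as np
from sklearn.linear_model import HuberRegressor, Lasso

rng = np.random.default_rng(8)
n, p = 100, 10
beta = np.zeros(p)
beta[:2] = [3.0, -2.0]
X = rng.normal(0.0, 1.0, (n, p))
y = X @ beta + rng.normal(0.0, 0.5, n)
y[:5] += 15.0                               # response outliers

init = HuberRegressor().fit(X, y).coef_     # robust-ish initial estimate
w = np.abs(init) + 1e-6                     # adaptive weights from the initial fit
lasso = Lasso(alpha=0.1).fit(X * w, y)      # penalty acts like sum |beta_j| / w_j
coef = lasso.coef_ * w                      # map back to the original scale

# Note: the final Lasso step itself is not robust; this only illustrates the
# adaptive-penalty mechanics, not a full robust-and-sparse estimator.
print("selected variables:", np.flatnonzero(np.abs(coef) > 1e-8))
```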
10:30 - 11:00 Coffee Break (TCPL Foyer)
11:00 - 11:45 Marco Avella-Medina: Robust penalized M-estimators
Data sets where the number of variables p is comparable to or larger than the number of observations n arise frequently nowadays in a large variety of fields. High dimensional statistics has played a key role in the analysis of such data and much progress has been achieved over the last two decades in this domain. Most of the existing procedures are likelihood based and therefore quite sensitive to deviations from the stochastic assumptions. We study robust penalized M-estimators and discuss some of their formal robustness properties. In the context of high dimensional generalized linear models we provide oracle properties for our proposals. We discuss some strategies for the selection of the tuning parameter and extensions to generalized additive models. We illustrate the behavior of our estimators in a simulation study.
(TCPL 201)
11:45 - 13:30 Lunch (Vistas Dining Room)
14:00 - 15:00 Free Afternoon (TCPL 201)
15:00 - 15:30 Coffee Break (TCPL Foyer)
15:30 - 16:15 Po-Ling Loh: High-dimensional precision matrix estimation: Cellwise corruption under epsilon-contamination
We analyze the statistical consistency of robust estimators for precision matrices in high dimensions. Such estimators, formed by plugging robust covariance matrix estimators into the graphical Lasso or CLIME optimization programs, were recently proposed in the robust statistics literature, but only analyzed from the point of view of breakdown behavior. As a complementary result, we provide bounds on the statistical error incurred by the precision matrix estimators based on cellwise epsilon-contamination, thus revealing the interplay between the problem dimensions and the degree of contamination permitted in the observed distribution. We discuss implications of our work for problems involving graphical model estimation when the uncontaminated data follow a multivariate normal distribution.
(TCPL 201)
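A sketch of the plug-in pipeline the abstract describes, pairing a Kendall's-tau-based robust correlation matrix with the graphical Lasso (this particular pairing is illustrative; the talk's estimators and error bounds are more specific).

```python
import numpy as np
from scipy.stats import kendalltau
from sklearn.covariance import graphical_lasso

rng = np.random.default_rng(9)
n, p = 200, 8
X = rng.multivariate_normal(np.zeros(p), np.eye(p), n)
mask = rng.random((n, p)) < 0.05             # cellwise epsilon-contamination
X[mask] = rng.normal(0.0, 20.0, mask.sum())

# Robust correlation: Kendall's tau per pair, mapped by the sine transform,
# which is consistent for the correlation under the normal model.
R = np.eye(p)
for i in range(p):
    for j in range(i + 1, p):
        tau, _ = kendalltau(X[:, i], X[:, j])
        R[i, j] = R[j, i] = np.sin(np.pi * tau / 2.0)

cov, prec = graphical_lasso(R, alpha=0.2)    # plug the robust estimate into glasso
print("estimated precision matrix (rounded):")
print(prec.round(2))
```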
16:15 - 17:00 Ricardo Maronna: Robust and efficient estimation of multivariate scatter and location
We deal with the equivariant estimation of scatter and location for p-dimensional data, giving emphasis to scatter. It is important that the estimators possess both a high efficiency for normal data and a high resistance to outliers, that is, a low bias under contamination. The most frequently employed estimators are not quite satisfactory in this respect. The Minimum Volume Ellipsoid (MVE) and Minimum Covariance Determinant (MCD) estimators are known to have a very low efficiency. S-estimators with a monotonic weight function like the bisquare have a low efficiency for small p, and their efficiency tends to one with increasing p. Unfortunately, this advantage is paid for by a serious loss of robustness for large p. We consider four families of estimators with controllable efficiencies whose performance for moderate to large p has not been explored to date: S-estimators with a non-monotonic weight function (Rocke 1996), MM-estimators, τ-estimators, and the Stahel-Donoho estimator. Two types of starting estimators are employed: the MVE computed through subsampling, and a semi-deterministic procedure proposed by Peña and Prieto (2007) for outlier detection. A simulation study shows that the Rocke estimator, starting from the Peña-Prieto estimator and with an adequate tuning, can simultaneously attain high efficiency and high robustness for large p, and the MM estimator can be recommended for p < 15.
(TCPL 201)
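As a small illustration of the estimator families discussed, a fixed-point iteration for a generic Huber-type M-estimator of multivariate scatter and location (the tuning constant c below is an arbitrary choice, not one of the talk's calibrated estimators).

```python
import numpy as np

rng = np.random.default_rng(10)
n, p = 300, 4
X = rng.multivariate_normal(np.zeros(p), np.diag([1.0, 2.0, 3.0, 4.0]), n)
X[:20] = rng.multivariate_normal(np.full(p, 8.0), np.eye(p), 20)  # 20 outliers

mu = np.median(X, axis=0)
Sigma = np.eye(p)
c = p + 2.0                                 # crude tuning constant (an assumption)
for _ in range(200):
    Xc = X - mu
    d2 = np.einsum("ij,jk,ik->i", Xc, np.linalg.inv(Sigma), Xc)
    w = np.minimum(1.0, c / d2)             # Huber-type weights: cap far points
    mu_new = (w[:, None] * X).sum(axis=0) / w.sum()
    Xc = X - mu_new
    Sigma_new = (w[:, None] * Xc).T @ Xc / n
    if np.allclose(Sigma_new, Sigma, atol=1e-9) and np.allclose(mu_new, mu):
        break
    mu, Sigma = mu_new, Sigma_new

# The result is roughly proportional to diag(1, 2, 3, 4), despite the outliers.
print("robust scatter diagonal:", np.diag(Sigma).round(2))
```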
17:30 - 19:30 Dinner (Vistas Dining Room)
19:30 - 20:15 Hannu Oja: Subspace estimation in linear dimension reduction
In linear dimension reduction for a p-variate random vector x, the general idea is to find an orthogonal projection (matrix) P of rank k, k < p, such that Px carries all or most of the information. In unsupervised dimension reduction this means that the conditional distribution of x given Px presents just (uninteresting) noise. In supervised dimension reduction for an interesting response variable y, x and y are conditionally independent given Px, that is, the dependence of y on x is only through Px. In this talk we consider the problem of estimating the minimal subspace, that is, the corresponding unknown projection P with known or unknown dimension. Most of the linear (supervised and unsupervised) dimension reduction methods such as principal component analysis (PCA), fourth order blind identification (FOBI), Fisher's linear discrimination subspace or sliced inverse regression (SIR) are based on a simultaneous diagonalization of two matrices S1 and S2. Asymptotic and robustness properties of the estimates of P can then be derived from those of the estimates of S1 and S2. We also discuss the tools for robustness studies as well as the possibility to robustify these approaches by replacing these matrices by their robust counterparts. The talk is based on cooperation with several people.
(TCPL 201)
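FOBI is a concrete instance of the two-scatter-matrices recipe above: whiten with the covariance S1, form a fourth-moment scatter S2 on the whitened data, and eigen-decompose. A minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(11)
n, p = 2000, 3
S = rng.standard_t(5, (n, p)) * np.array([1.0, 2.0, 0.5])  # independent sources
A = rng.normal(0.0, 1.0, (p, p))
X = S @ A.T                                   # mixed observations

Xc = X - X.mean(axis=0)
S1 = Xc.T @ Xc / n                            # scatter 1: covariance
evals, evecs = np.linalg.eigh(S1)
W = evecs @ np.diag(evals ** -0.5) @ evecs.T  # whitening matrix S1^(-1/2)
Z = Xc @ W

r2 = np.sum(Z ** 2, axis=1)
S2 = (Z * r2[:, None]).T @ Z / n              # scatter 2: fourth-order (FOBI)
_, U = np.linalg.eigh(S2)
unmixed = Z @ U                               # recovered sources, up to order/sign

print("excess kurtosis of recovered components:",
      np.round((unmixed ** 4).mean(0) / (unmixed ** 2).mean(0) ** 2 - 3.0, 2))
```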
20:15 - 21:00 Luis Angel Garcia-Escudero: Adaptive choice of parameters in robust model-based clustering
Outliers can be extremely harmful when applying well-known Cluster Analysis methods. Moreover, clustered outliers can also be troublesome for traditional robust techniques. Therefore, the development of appropriate robust clustering methods can be useful for addressing both types of problems simultaneously. The TCLUST method is a flexible way of doing robust cluster analysis by resorting to trimming. This methodology can be implemented by using the tclust package available at the CRAN repository. This high flexibility allows one to deal with non-necessarily spherical clusters and to cope with different amounts/types of contamination. However, due to this high flexibility, the use of this methodology in real data applications is not completely straightforward and requires the specification of some tuning parameters (the number of clusters, the trimming proportion and a constant constraining the relative cluster shapes and sizes). A fully automatic way to choose all these parameters simultaneously is not feasible, given their dependence on the desired type of cluster partition. That is, the user of any clustering method must always play an active role by specifying the type of clusters that he/she is particularly interested in. We will present some new graphical and automatized procedures which may make this specification easier when applying TCLUST. These procedures allow TCLUST to be initialized with less risky parameter configurations which can later be adapted to the data set at hand.
(TCPL 201)
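Trimmed k-means, the simplest relative of TCLUST, makes the trimming idea concrete; TCLUST itself adds cluster scatter matrices and an eigenvalue constraint, and for real use the abstract points to the tclust package. A minimal sketch with arbitrary k and trimming proportion alpha:

```python
import numpy as np

rng = np.random.default_rng(12)
X = np.vstack([rng.normal(0.0, 1.0, (100, 2)),
               rng.normal(6.0, 1.0, (100, 2)),
               rng.uniform(-10.0, 16.0, (20, 2))])   # 20 scattered outliers

k, alpha = 2, 0.1                                    # assumed tuning parameters
h = int((1 - alpha) * len(X))                        # observations retained
centers = X[rng.choice(len(X), k, replace=False)]
for _ in range(50):
    d = np.linalg.norm(X[:, None] - centers[None], axis=2)  # (n, k) distances
    keep = np.argsort(d.min(axis=1))[:h]             # trim the alpha worst-fitted
    labels = d[keep].argmin(axis=1)
    # Recompute each center from its retained members (assumes none goes empty).
    centers_new = np.array([X[keep][labels == j].mean(axis=0) for j in range(k)])
    if np.allclose(centers_new, centers):
        break
    centers = centers_new

print("cluster centers:")
print(centers.round(2))                              # near (0, 0) and (6, 6)
```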
Friday, November 20
07:00 - 09:00 Breakfast (Vistas Dining Room)
09:00 - 09:45 Werner Stahel: Robust Prediction Intervals: Problem, Possible Approaches (TCPL 201)
09:45 - 10:30 Wrap-up Session (TCPL)
10:30 - 11:00 Coffee Break (TCPL Foyer)
11:00 - 11:30 Werner Stahel: Basic Statistical Issues for Reproducibility: Models, Variability, Extensions (TCPL 201)
11:30 - 12:00 Checkout by Noon
5-day workshop participants are welcome to use BIRS facilities (BIRS Coffee Lounge, TCPL and Reading Room) until 3 pm on Friday, although participants are still required to checkout of the guest rooms by 12 noon.
(Front Desk - Professional Development Centre)
11:30 - 13:30 Lunch (Vistas Dining Room)