Understanding the New Statistics: Expanding Core Statistical Theory (08w5071)
Organizers
Ivan Mizera (University of Alberta)
Rudolf Beran (University of California, Davis)
Iain Johnstone (Stanford University)
Sara van de Geer (Eidgenössische Technische Hochschule Zürich)
Objectives
Statistical data analysis is generally considered one of the major scientific achievements of the twentieth century. In hard sciences such as seismology or meteorology, a theory based on physical law is not considered successful unless it predicts better than purely statistical arguments. In other disciplines, such as econometrics, statistical finance, and genomics, statistical analysis is essential to identifying phenomena. It is perhaps much less recognized that the constitution of Statistics as an independent scientific discipline in the 20th century is one of the reasons for this intellectual success as well as a consequence of it. The names of some pioneers of statistical theory, such as R. von Mises, R. A. Fisher, W. Hoeffding, C. R. Rao, J. Neyman, A. Kolmogorov, L. Le Cam, C. Stein, J. Tukey, J. Hajek, and B. Efron, to name only a few, may not be familiar to every user of statistical software. Nevertheless, without the penetrating mathematical presence of these personalities in current statistical theory, this influential software would not exist.
Probably the best demonstration of the vitality of statistical ideas is the continuing effort to appropriate them. For example, a recent Nobel Prize in economics essentially recognized research in statistical time series analysis that had application to econometrics. The past century saw the emergence of what are called the mathematical sciences. These include theoretical physics, quantum chemistry, mathematical statistics, and more. We are now witnessing the emergence of the statistical sciences: disciplines that rely on statistical theory to express their specific ideas. Examples of statistical sciences include biostatistics, econometrics, statistical finance, genomics, signal processing, machine learning, and more. Supporting the growth of the statistical sciences has been the technological revolution in computing.
Until the late 1950's, writers on competing statistical theories thought in terms of virtual data governed by probability models involving relatively few parameters. That same decade found logical paradoxes in the statistical theory of the time which obliged more careful rethinking. By 1970, sophisticated rigorous development of statistical theory had resolved these paradoxes. From the 1960's onward, computing technology and refined concepts of algorithm provided a new environment in which to extend and reconsider statistical ideas developed with probability technology. Case studies and experiments with artificial data increasingly offered non-probabilistic ways of understanding the performance of statistical procedures. The fundamental distinctions among data, probability model, pseudo-random numbers, and algorithm returned to prominence. It became clear once again that data is often not certifiably random. The bonds that had for many decades linked statistical theory closely to probability models began to loosen in favor of broader views. Use of heuristic data-analytic algorithms not supported by theoretical analysis grew rapidly. Our phrase ``the new statistics'' refers to the present, much enlarged, concept of statistical methodology.
The ``core of statistics'' is the subset of statistical activity that is focused inward, on the subject itself, rather than outward, towards the needs of statistics in particular scientific domains. Of necessity, statistical research that seeks to serve the statistical sciences draws on core knowledge for tools as well as for an understanding of the limitations of the tools. Work in the statistical sciences can be a collaboration between a statistician and a scientist in the substantive field. However, with the explosion of data needing attention, such collaborations alone cannot possibly fill the need. The rapidly growing needs of the statistical sciences provide raw material for future core research in statistics and motivate the development of trustworthy, user-friendly statistical methodology. Indeed, statistics fluctuates between import and export mode: importing raw data-analytic ideas inspired by the technology and problems of the moment and exporting refined data-analytic procedures, whose characteristics are understood theoretically and experimentally, to the community of quantitative scientists. The phrase ``core of statistics'' refers precisely to the intellectual basis for the export mode.
The authors of the 2002/2004 Report on the Future of Statistics mentioned in the Overview, Bruce Lindsay, Jon Kettenring, and David Siegmund, described the current lag in developing core statistical theory to meet the data-analytic needs of the statistical sciences. The previous paragraph draws on ideas in this report. A few examples where statistical theory needs major development include: accepted standards for experimental exploration of new statistical methods; methods for data-mining; automatic methods for discovering new patterns in extremely large data-sets; automatic methods for classifying and recognizing patterns; methods for verifying data integrity; regularization methods for difficult ill-posed problems such as mass tomography based on noisy data.
Our focus in the proposed Banff workshop will be core statistical theory. In the past, we might have called the topic ``mathematical statistics'', were this name not so closely associated with the 1960's vision of statistics before the computer revolution changed our discipline. We put emphasis on ``the new statistics'' to indicate that we are not interested in variations on old themes, but rather in relevant theory for the data-analytic circumstances of the present, outlined above. The call for this type of theory - rethinking the old and seeking new approaches - has recently been expressed in several places. We point not only to the NSF report already cited but also to a recent article in the IMS Bulletin by Terry Speed, who does research in statistical genomics at Berkeley and Melbourne and is a former president of the Institute of Mathematical Statistics (IMS). Let us also stress the synergy with the more specialized ``Statistical Theory and Methods for Complex, High-Dimensional Data'' program at the Isaac Newton Institute for Mathematical Sciences of the University of Cambridge.
The objective of the workshop is to bring together a number of leading-edge researchers in mathematical statistics, together with key people from the statistical sciences and other communities practicing data analysis. Several of these groups have been listed above. The accent will be less on the exhibition of achievements and more on identifying major challenges and promising intellectual approaches for developing the core of statistics. To avoid an amorphous outcome, we would like to concentrate on a few topics that we believe are particularly important, as summarized in the Overview of the subject area. Given the breadth of the task, we do not expect to complete it at one workshop; rather, we see the workshop as an opportunity to take the first steps on what is likely to be a long road.





