# Statistical issues relevant to significance of discovery claims (10w5068)

Arriving in Banff, Alberta Sunday, July 11 and departing Friday July 16, 2010

## Organizers

James Linnemann (Michigan State University)

Richard Lockhart (Simon Fraser University)

Louis Lyons (University of Oxford)

## Objectives

The outcome of the 2006 Workshop was so encouraging that we have proposed another BIRS workshop for 2010. By that time new facilities in Particle Physics and Astrophysics (e.g. the Large Hadron Collider and the GLAST telescope for gamma rays) should be producing a large amount of data. There is a strong hope that these will result in exciting new discoveries. There are interesting statistical issues relating to discovery claims, and it is important to be able to give reliable, widely accepted statistical assessments of the evidence that the result is not due just to a statistical fluctuation.

A potentially disturbing example arises from an experimental Particle Physics collaboration who analyzed their data in 2003 and found that, at greater than a 5 sigma level, their data were inconsistent with the null hypothesis and instead gave evidence for a new type of particle, the penta-quark. However, a subsequent calculation of the Bayes factor comparing the null hypothesis with the alternative of a new particle was said to favor mildly the null hypothesis. This apparent sensitivity of an important conclusion to the statistical technique employed is worrying, and needs to be understood. The conflicting papers from the same authors analyzing the same data can be seen at:

http://arxiv.org/abs/hep-ex/0307018 and http://arxiv.org/abs/0709.3154
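A toy calculation can show how a p-value and a Bayes factor may point in different directions. The sketch below is not the analysis in either paper. It assumes a textbook setting: n measurements with unit variance, a point null hypothesis (mean 0), and an alternative whose mean has a normal prior of width `tau`. The function names and the choice of prior are assumptions made for illustration only.

```python
import math

def two_sided_p_value(z):
    """Probability that a standard normal variable exceeds |z| in absolute value."""
    return math.erfc(abs(z) / math.sqrt(2.0))

def bayes_factor_null_vs_alt(z, n, tau):
    """Bayes factor B01 (null over alternative) for a normal mean.

    Assumed toy model: the sample mean of n unit-variance observations
    sits z standard errors from 0; under the alternative the true mean
    has a N(0, tau^2) prior. Values above 1 favor the null.
    """
    ratio = n * tau ** 2
    return math.sqrt(1.0 + ratio) * math.exp(-0.5 * z * z * ratio / (1.0 + ratio))

if __name__ == "__main__":
    z = 3.0  # the same observed deviation, in standard errors, at every sample size
    print("two-sided p-value:", two_sided_p_value(z))
    for n in (10, 1000, 10 ** 6):
        print(n, bayes_factor_null_vs_alt(z, n, tau=1.0))
```

With the deviation held fixed in units of the standard error, the p-value does not change with n, while the Bayes factor drifts toward the null as n grows. The size of the effect depends strongly on the assumed prior width, which is one reason the choice of prior is itself a discussion topic.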

It would be extremely valuable to have in-depth discussions between scientists and statisticians concerning the issues involved. Some of these are:

1) Why Particle Physicists like 5 sigma as a discovery criterion; for Statisticians, requiring a 5 standard error deviation from the null, which corresponds to a significance level on the order of 1 in a million, is extraordinarily stringent.

2) Allowing for multiple tests; research groups carry out many tests on the same data.

3) Blind analysis techniques; classical frequency theory analysis relies on probabilities computed before the data are collected or analyzed. These probabilities are not relevant if the frequency theory technique is adjusted after seeing the data. In clinical trials such adjustment is traditionally prevented by pre-specifying a protocol for data analysis, but the proposal here -- rare in statisticians' experience -- is to build randomness into the fitting software which hides the fitted values of parameters from experimenters as models are tuned.

4) Goodness of fit tests for comparing sparse multi-dimensional data with theory.

5) Comparison of different techniques for comparing 2 hypotheses, for example:

i) p-values (including methods for combining p-values for different tests);

ii) The so-called $CL_s$ (ratio of p-values for null hypothesis and alternative), an approach to setting upper confidence limits which is little known in the statistical community;

iii) Likelihood ratio tests, even when null and alternative hypotheses are composite;

iv) Difference in chi-squared of 2 separate fits to the same data;

v) Model selection techniques such as AIC or BIC;

vi) Bayesian techniques such as posterior odds or Bayes factors (including the issue of choice of prior).

6) Adjusting for nuisance parameters in p-value and likelihood calculations.

7) Definitions of sensitivity of searches for new phenomena.
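Some of the quantities in the list above can be made concrete with a short sketch. It shows the one-sided tail probability behind a "5 sigma" claim (item 1), the dilution of a local p-value over many tests (item 2), and $CL_s$ for a single-bin counting experiment (item 5 ii). The assumptions are ours, not the proposal's: a Gaussian test statistic, fully independent tests, and Poisson counts with exactly known signal and background means. All function names are hypothetical.

```python
import math
from statistics import NormalDist

def one_sided_p_value(z):
    """Upper-tail probability of a standard normal beyond z standard deviations."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def significance_from_p(p):
    """Inverse of one_sided_p_value: the z whose upper tail has probability p."""
    return -NormalDist().inv_cdf(p)

def global_p_value(p_local, n_tests):
    """Chance that at least one of n_tests independent tests reaches p_local
    under the null (assumes independence; real searches are often correlated)."""
    return -math.expm1(n_tests * math.log1p(-p_local))

def poisson_cdf(n_obs, mean):
    """P(N <= n_obs) for a Poisson variable with the given mean."""
    term = math.exp(-mean)
    total = term
    for k in range(1, n_obs + 1):
        term *= mean / k
        total += term
    return total

def cls(n_obs, signal, background):
    """CL_s for a counting experiment: P(N <= n_obs | s + b) / P(N <= n_obs | b)."""
    return poisson_cdf(n_obs, signal + background) / poisson_cdf(n_obs, background)

if __name__ == "__main__":
    p_local = one_sided_p_value(5.0)
    p_global = global_p_value(p_local, 100)
    print("local p-value at 5 sigma:", p_local)
    print("global p-value over 100 tests:", p_global)
    print("equivalent significance:", significance_from_p(p_global))
    print("CL_s for 0 observed, s=3, b=1.5:", cls(0, 3.0, 1.5))
```

Under these assumptions a local 5 sigma effect found in one of 100 independent tests corresponds to a global significance a little over 4 sigma. With zero observed events, $CL_s$ reduces to exp(-s) whatever the background, so a signal of 3 expected events falls just below 0.05.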

The goal is to invite about 25 physicists and astronomers, and 15 statisticians with expertise in theoretical statistics, to develop solutions for these problems. While the specific problems have developed from particular types of experiments or data collection efforts, they have common features that are amenable to statistical analyses; the solutions developed will thus be more broadly useful in physics and astronomy.

We want to bring the latest methods to the attention of the scientific community, and to develop statistical theory further by considering special aspects that arise in these scientific contexts.

The topics considered will also fertilize the statistical community by providing other scientific contexts in which to evaluate the statistical ideas arising, for instance, from bioinformatics, from remote sensing, and from climate modeling. In these areas and others the issues of multiple testing, model assessment and validation (including goodness-of-fit with very sparse data), appropriate use of Bayes factors, choice of prior (including sensitivity analysis for this choice), and appropriate elimination of nuisance parameters have stimulated a great deal of statistical research. Evaluation of these ideas in the particle and astrophysics contexts should have multi-way benefits: better analysis of physics and astronomy data; better understanding by statisticians of their data analysis suggestions in practical contexts; and perhaps new data analysis ideas for application back to bioinformatics, climate research and so on. At the same time the issues surrounding comparison of hypothesis testing techniques will cause statisticians to reflect on the classical controversies that are at the foundations of their discipline, but this time they will be informed by experiments with solid data.

We will also keep open the possibility of devoting some of the Workshop to any burning issues that may arise from measurements obtained with the new detectors.

By assembling Physicists and Statisticians with direct interests in discovery questions, we consider that this Workshop would be extremely useful in clarifying the statistical issues involved. With high-profile results becoming available, the Workshop will be both timely and important.
