Background Material

Topic A1: Upper limits with nuisance parameters

Convenors: Joel Heinrich and David van Dyk

Below we define a problem typical of High Energy physics, whose answer would be stated in the form of an interval in the parameter of interest at some specified confidence or credibility level.

The charge to group A1 is to select or develop one or more methods to solve problems of this type.  Methods should be ranked on criteria to be determined by the group.  The "Commonly proposed methods" and "Typical properties for comparison" listed below are only for illustration; the group must decide for itself what methods and criteria to select.

The problem:

We are performing a counting experiment in which the (non-negative integer) number of observed events n is Poisson distributed with mean mu = epsilon * s + b.  Here s (the "cross section") is the parameter of interest, which in principle can have any real value 0 <= s < infinity.  We wish to set an upper limit on s (0 <= s < s_u), or a 2-sided interval (s_l < s < s_u).

The nuisance parameter epsilon is a factor which converts between n and s in some sense. It must be >= 0, and could be greater than 1. It is either precisely known, or can have an uncertainty (see below).

The nuisance parameter b is the background rate. It must be >= 0. It is either precisely known, or can have an uncertainty (see below).

When epsilon and b have uncertainties, we may regard them as having been determined in subsidiary counting experiments. The observed numbers of events in these subsidiary experiments are set to give the required uncertainties on epsilon or on b.  Another variation of the problem is that we just have Bayesian priors for epsilon and b that are derived from a combination of objective information and personal belief.

{It would be helpful if everyone used the above notation}

Typical values: epsilon = 1.0 +- 0.1, b = 3.0 +- 0.3, with n taking values 0, 1, 2, ..., 20
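
As an illustration of the simplest version of this problem, here is a minimal Python sketch of a Bayesian upper limit with a uniform prior on s, assuming epsilon and b are known exactly (so the +- 0.1 and +- 0.3 uncertainties are ignored).  For that prior the posterior probability that s exceeds s_u is P(N <= n | epsilon * s_u + b) / P(N <= n | b), and the limit is found by bisection.  The function names and the 90% default level are choices made for this sketch, not part of the problem statement.

```python
import math

def poisson_cdf(n, mu):
    """P(N <= n) for N ~ Poisson(mu), summed term by term."""
    term = math.exp(-mu)          # P(N = 0)
    total = term
    for k in range(1, n + 1):
        term *= mu / k            # P(N = k) from P(N = k - 1)
        total += term
    return total

def bayes_upper_limit(n, eps=1.0, b=3.0, cl=0.90):
    """Upper limit s_u on s for a flat prior on s >= 0.

    Assumption of this sketch: eps and b are known exactly.
    """
    def tail(s_u):
        # Posterior probability that s > s_u, given n observed events.
        return poisson_cdf(n, eps * s_u + b) / poisson_cdf(n, b)

    lo, hi = 0.0, 1.0
    while tail(hi) > 1.0 - cl:    # expand until the root is bracketed
        hi *= 2.0
    for _ in range(200):          # bisection; tail() decreases with s_u
        mid = 0.5 * (lo + hi)
        if tail(mid) > 1.0 - cl:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

if __name__ == "__main__":
    for n in (0, 3, 10, 20):
        print(n, round(bayes_upper_limit(n), 3))
```

For n = 0 the limit reduces to -ln(1 - CL) / epsilon whatever the value of b, which gives a quick check of any implementation.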

Possible extension:

A 2-channel version of the above, with n, epsilon and b (and the errors on epsilon and b) each having two values, one for each channel, while s is common to the 2 channels, i.e. n_1 and n_2 are independent Poisson observables with means epsilon_i * s + b_i (i = 1, 2).
Again it is required to determine an interval for s.

Typical values: divide the one-channel values equally between the channels, e.g.
 epsilon_1 = 0.5 +- 0.1/sqrt(2)        epsilon_2 = 0.5 +- 0.1/sqrt(2)
 b_1 = 1.5 +- 0.3/sqrt(2)              b_2 = 1.5 +- 0.3/sqrt(2)
where the subsidiary measurements are also divided between the 2 channels.
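
The joint likelihood for the 2-channel version can be sketched as follows in Python, again assuming that the epsilon_i and b_i are known exactly and that every b_i > 0.  The sketch only finds the maximum-likelihood estimate of s (constrained to s >= 0) by solving d(ln L)/ds = 0 with bisection; it is not an interval method.  All function names are hypothetical.

```python
import math

def log_likelihood(s, n, eps, b):
    """Joint Poisson log-likelihood for independent channels, nuisance parameters fixed."""
    total = 0.0
    for n_i, eps_i, b_i in zip(n, eps, b):
        mu_i = eps_i * s + b_i
        total += n_i * math.log(mu_i) - mu_i - math.lgamma(n_i + 1)
    return total

def score(s, n, eps, b):
    """Derivative of the log-likelihood with respect to s."""
    return sum(n_i * e_i / (e_i * s + b_i) - e_i
               for n_i, e_i, b_i in zip(n, eps, b))

def mle_s(n, eps, b):
    """Maximum-likelihood estimate of s, constrained to s >= 0.

    Assumption of this sketch: every b_i > 0 and every eps_i > 0.
    """
    if score(0.0, n, eps, b) <= 0.0:
        return 0.0                      # maximum sits on the boundary s = 0
    lo, hi = 0.0, 1.0
    while score(hi, n, eps, b) > 0.0:   # expand until the root is bracketed
        hi *= 2.0
    for _ in range(200):                # bisection; score() decreases with s
        mid = 0.5 * (lo + hi)
        if score(mid, n, eps, b) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

if __name__ == "__main__":
    # Two channels with the typical values split equally between them.
    print(mle_s([4, 4], [0.5, 0.5], [1.5, 1.5]))
```

With the channels split equally and n_1 = n_2 = 4, the estimate agrees with the one-channel case n = 8, epsilon = 1, b = 3, as it should.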

Commonly proposed methods in High Energy physics:
Bayes: Prior for s = uniform; 1/sqrt(s); 1/s
       Prior for epsilon and for b subsidiary measurements = uniform
Profile likelihood
Modified profile likelihood
Feldman-Cousins, with some fix for nuisance parameters
Fully frequentist, with some ordering rule

Typical properties for comparison:
Coverage versus s at (epsilon,b) = (1,3), (1.1,3), (0.9,3), (1,3.3), (1,2.7)
Bayesian credibility for intervals.
Interval length over the distribution of n (median and quartiles)
Behavior as a function of b, given n = 0 and n=3
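
To make the coverage criterion concrete, here is a minimal Python sketch that computes the frequentist coverage of a 90% flat-prior Bayesian upper limit as a function of the true s, at (epsilon,b) = (1,3) with both nuisance parameters assumed exactly known.  It uses the fact that, for this prior, the true s lies below the quoted limit exactly when P(N <= n | epsilon * s + b) / P(N <= n | b) >= 1 - CL, so no root finding is needed.  The method and the function names are chosen for illustration only; the group may well prefer other methods.

```python
import math

def poisson_cdf(n, mu):
    """P(N <= n) for N ~ Poisson(mu)."""
    term = math.exp(-mu)
    total = term
    for k in range(1, n + 1):
        term *= mu / k
        total += term
    return total

def is_covered(n, s_true, eps, b, cl):
    """True if the flat-prior Bayesian upper limit for this n is >= s_true."""
    # The limit s_u solves tail(s_u) = 1 - cl and tail() falls with s,
    # so s_true <= s_u exactly when tail(s_true) >= 1 - cl.
    tail = poisson_cdf(n, eps * s_true + b) / poisson_cdf(n, b)
    return tail >= 1.0 - cl

def coverage(s_true, eps=1.0, b=3.0, cl=0.90):
    """Probability over repeated experiments that the limit contains s_true.

    Assumption of this sketch: eps and b are known exactly and b > 0.
    """
    mu = eps * s_true + b
    n_max = int(mu + 10.0 * math.sqrt(mu)) + 20   # negligible Poisson tail beyond
    covered = 0.0
    term = math.exp(-mu)                           # P(N = 0)
    for n in range(n_max + 1):
        if is_covered(n, s_true, eps, b, cl):
            covered += term
        term *= mu / (n + 1)                       # P(N = n + 1)
    return covered

if __name__ == "__main__":
    for s in (0.0, 1.0, 2.0, 5.0, 10.0):
        print(s, round(coverage(s), 4))
```

Because n is discrete, the coverage is a sawtooth function of s rather than a constant, which is one reason why it is usually examined as a curve rather than at a single point.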

Recent papers on this topic.

Prepared for BIRS workshop:

See the e-mail from Gunter Zech (26 April) and a recent note by  Jan Conrad:



Profile:  (Rolke et al)

F-C + Bayes: (Conrad and Tegenfeldt)

Fully Frequentist: (Cranmer)

Bayesian reference analysis  (section 3.2 describes the reference analysis solution to the one-channel case):



Topic B: Significance

Convenors: Luc Demortier and David van Dyk


The aim of this group is to have a general discussion of significance testing, and hopefully to come to a recommendation on how to deal with this subject in practice, especially in the presence of nuisance parameters.  The meetings will start with a talk by Luc Demortier on the issue of dealing with nuisance parameters, followed by open discussion. The meeting will be largely informal, but if you have relevant comments that you would like to talk about, we could consider scheduling some time for you. Please let the co-convenors Luc and David know as soon as possible.  Also let us know before the meeting if you have any suggestions concerning the content of this session.

Very often in Particle Physics or Astrophysics, we need to test whether data are consistent with a particular model, which may contain parameters that are completely known, known with some uncertainty, or free, i.e. the null hypothesis H0 can be simple or composite.  The test involves constructing a test statistic which is either designed for a specific type of alternative model, or aims to be useful in identifying a wide range of deviations from the null model (e.g. the test statistic could be the likelihood ratio for H0 and H1, or it could be chi-squared). Again, the alternative hypothesis H1 can be simple or composite. We wish to quote a number quantifying the possible discrepancy between the data and the null hypothesis.

There is a wide variety of situations where we use significance in the hope of discovering interesting effects. We give just two examples:

1) Counting experiment
    We simply count how many observations n are obtained satisfying a set of selection criteria.  These criteria are chosen to have good efficiency for observing some interesting effect, assuming it exists, while reducing the background from standard uninteresting effects.  We expect b counts from these background sources. There may be an error sigma_b on the estimate b (e.g. b may have been estimated in a subsidiary experiment, or from a Monte Carlo simulation, or a combination of both, and may also incorporate a theoretician's best guess about various effects such as the magnitude of higher order corrections to one or more background processes.)  We need to assess whether n is significantly larger than b (+/-sigma_b).  All counts can be assumed to be Poisson distributed.

2) Looking for a peak in a distribution.
    We are interested in testing whether a histogram is consistent with a smooth distribution (H0), or whether it also contains a peak (H1). There are several variants of this:
    a) The possible peak could have unknown position, width, amplitude and/or shape.
    b) The smooth background could be exactly specified, have unknown normalisation, unknown other parameters, or unknown shape.
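
For example 1), the simplest case can be sketched in Python as below: the p-value is the Poisson probability of observing n or more counts when b are expected, and it is converted to an equivalent number of Gaussian standard deviations using the one-sided convention.  The sketch assumes b is known exactly and b > 0; it deliberately ignores sigma_b, since how to incorporate that uncertainty is one of the questions for discussion.  Function names are hypothetical.

```python
import math
from statistics import NormalDist

def poisson_p_value(n, b):
    """P(N >= n) for N ~ Poisson(b), summing the upper tail directly.

    Assumption of this sketch: b is known exactly and b > 0.
    """
    if n <= 0:
        return 1.0
    term = math.exp(n * math.log(b) - b - math.lgamma(n + 1))   # P(N = n)
    total = 0.0
    k = n
    # Terms shrink once k > b; stop when they no longer change the sum.
    while term > 1e-17 * total or k <= b:
        total += term
        k += 1
        term *= b / k
    return total

def z_value(p):
    """Number of one-sided Gaussian standard deviations equivalent to p."""
    if p <= 0.0:
        return math.inf
    if p >= 1.0:
        return -math.inf
    return -NormalDist().inv_cdf(p)

if __name__ == "__main__":
    for n in (3, 6, 9, 12, 15):
        p = poisson_p_value(n, 3.0)
        print(n, p, round(z_value(p), 2))
```

Summing the upper tail directly, rather than computing 1 - P(N <= n - 1), avoids loss of precision for the very small p-values that matter in discovery claims.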

Topics for discussion
The issues we would like to discuss at Banff include:
1) Are p-values the best way of testing H0?
2) Are there any general rules for choosing an optimal test statistic?
3) What statistic would work for sparse multi-dimensional data? (e.g. chi-squared needs lots of data per bin, and KS is simple only in 1 dimension, so neither is very appropriate here)

4) We may use the likelihood ratio R for testing for the presence of a peak of unknown position and/or width (i.e. H0 = smooth background; H1 = smooth background with superimposed peak).  This unfortunately does not possess the property that -2ln(R) has a chi-squared distribution.  Can circumstances be specified when it approximately does, or when it follows some other known distribution?
5) There are several methods of incorporating nuisance parameters in p-value calculations (see Demortier in references).  What criteria should a good method satisfy?  Can we recommend a method as being generally useful?
6) Is the standard "5 sigma" requirement in Particle Physics for claiming the discovery of a new effect reasonable?
7) Data in Particle Physics or in Astrophysics often consist of many multivariate observations.  It is then possible to construct an almost infinite number of data distributions by imposing a series of (semi-) arbitrary selections on the full data set.  Sometimes we find a significant-looking discrepancy in one of these distributions with respect to expectations.  Is it possible to assess the significance of such an observation, given that fluctuations could occur in any of the infinite (albeit correlated) number of locations?  Are there useful data-mining techniques that we could borrow from other fields?
8) What can we learn from anomaly detection in other fields?
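
As a numerical reference for items 6) and 7), the following Python sketch converts a number of standard deviations into a one-sided Gaussian tail probability, and shows how a local p-value is diluted when the same fluctuation could have appeared in any of N places.  The dilution formula 1 - (1 - p)^N assumes the N places are independent, which item 7) explicitly says is not the case in practice, so it is only a rough guide.  Function names are hypothetical.

```python
import math

def p_from_z(z):
    """One-sided tail probability of a standard normal beyond z sigma."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def global_p(p_local, n_trials):
    """Chance of at least one fluctuation this extreme in n_trials places.

    Assumption of this sketch: the places are statistically independent.
    """
    # 1 - (1 - p)^N, written with log1p/expm1 to stay accurate for tiny p.
    return -math.expm1(n_trials * math.log1p(-p_local))

if __name__ == "__main__":
    for z in (3.0, 4.0, 5.0):
        p = p_from_z(z)
        print(z, p, global_p(p, 1000))
```

With this convention 5 sigma corresponds to a p-value of about 2.9e-7, and the global p-value for small p is close to N times the local one.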

Reading Material
1) Steffen Lauritzen, "Goodness of Fit" PHYSTAT05 Conference
    Deals with issues related to chi-squared, etc.
2) Luc Demortier, "P values: what they are and how to use them"
    Discusses the interpretation of p values, some alternatives to p values, methods for eliminating nuisance parameters from p-values, and an application to a peak search.
3) B. Aslan and G. Zech, "Comparison of Different Goodness-of-Fit Tests", Durham 2002;
    and G. Zech, "A Multivariate Two-Sample Test Based on the Concept of Minimum Energy", PHYSTAT2003
    Describe a test for goodness of fit with sparse multi-dimensional data.

See also the e-mails circulated by Gunter Zech (18 May)  and Wolfgang Rolke (2 June)


Topic C:  Multivariate problems

Convenors: Byron Roe and Nancy Reid

This is intended as a (very) brief introduction, especially for statisticians, into some of the multivariate problems in which physicists are interested. The problems are, broadly speaking, classification (1.), variable selection (2.), and goodness of fit (3.).

1. In many experiments, physicists are interested in separating candidate signal events from background. To do this, they have a number of "particle identification (PID)" variables which have different distributions for signal and various classes of background.

Some problems have a large number of PID variables, up to several hundred, others only a few.

One data set already used for the comparisons cited below is available at the end of Byron Roe's homepage:

See also Ilya Narsky's email of May 28 and the data at,
as well as a write-up at Fig. 2).

A number of statistical classification methods have been employed and the use and comparisons of some of them have been described at previous meetings:

(Several talks [especially those by Prosper, Bock and Vaiciulis] in the 'Multivariate Methods' section of the Durham IPPP meeting in 2002)

(Talk in Session 10 of PHYSTAT2003 by Jerry Friedman)

(The invited talk by Jerry Friedman in PHYSTAT05 and the response by Harrison Prosper; also the contributed talks by Bhat, Narsky, and Roe)

Comparisons have been made between the use of various types of boosted decision trees, random forests, bagging, and neural nets among others.

Although each problem is different, it would be useful to extend the comparison to techniques such as SVM and Bayesian neural nets.

Physicists have found that the use of some of these techniques for their problems requires different choices of parameters than the ones commonly used in statistics publications. For example, for the use of boosted decision trees in the MiniBooNE experiment, we have found the use of a large number of leaves (around 50) to be more efficient than the small number quoted in most statistical articles.

Although the comparisons are interesting, it would perhaps be even more useful to contrast the methods statisticians use in approaching data, including graphical methods and the use of R. Physicists have tended to use C++, C, or even FORTRAN because of unfamiliarity with R and, for some groups, worries about its speed.

2. It is also of considerable interest to examine ways of selecting the best PID (classification) variables to use. In the MiniBooNE experiment, we have over 300 potential PID variables and have worked out various ad hoc methods of winnowing these down to about 100.

3. Methods of dealing with systematics for chi-squared and for log likelihood goodness of fit statistics, in several dimensions, are another current problem. In many physics measurements there are some nuisance parameters which are not well estimated and whose uncertainty contributes significantly to the errors in the results for parameters of interest. One method for chi-squared fits is described in:

D. Stump et al., Phys. Rev. D65, 014012. This method has several good features, but also some significant problems in practice. Systematics also arise in multivariate event classification; systematic errors in the training sample (model or scale errors) are now being examined.

Graphical methods to examine these systematics, such as 1-D projections on random axes, are being pursued. The idea is that if the model fits badly, it may be because it needs more (nuisance) parameters to adequately describe the physical phenomenon.
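
The idea of 1-D projections on random axes can be sketched in Python as follows: draw a random direction, project both the data and the model (e.g. simulated) events onto it, and compare the two 1-D distributions.  Here the comparison uses the two-sample Kolmogorov-Smirnov distance in place of a plot; that choice, the toy Gaussian samples and all function names are assumptions of this sketch rather than a description of what any experiment actually does.

```python
import math
import random

def random_unit_vector(dim, rng):
    """Direction drawn uniformly on the unit sphere (normalised Gaussian components)."""
    while True:
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(x * x for x in v))
        if norm > 0.0:
            return [x / norm for x in v]

def project(points, axis):
    """1-D projection of each multi-dimensional point onto the axis."""
    return [sum(x * a for x, a in zip(p, axis)) for p in points]

def ks_distance(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov distance: largest gap between empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        while i < len(a) and a[i] <= x:   # step past all values <= x
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d

if __name__ == "__main__":
    rng = random.Random(1)
    dim = 5
    # Toy samples: the "model" is slightly shifted in one variable.
    data = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(500)]
    model = [[rng.gauss(0.3 if k == 0 else 0.0, 1.0) for k in range(dim)]
             for _ in range(500)]
    for _ in range(5):
        axis = random_unit_vector(dim, rng)
        print(round(ks_distance(project(data, axis), project(model, axis)), 3))
```

Axes along which the distance is consistently large point to the combinations of variables that the model describes worst.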

As a general comment, applying to all three topics, there is some reluctance among a number of physicists to use modern classification techniques as they seem "non-intuitive", and because the physicists believe it to be very difficult to accurately model data in many dimensions. Suggestions from statisticians concerning these issues would be welcome.

Radford Neal has written up some thoughts of his on the BIRS problem, along with some experimental results. The writeup, along with the division of Byron Roe's data into training and test sets that he used for the experiments, can be found at the following URL:

See also the e-mails circulated by Radford Neal (19 May) and Byron Roe's response (19 May), by Ilya Narsky (21 May), and the one below from Jan Conrad.