# Information theory and statistics for large alphabets (11w5127)

Arriving in Banff, Alberta Sunday, October 23 and departing Friday October 28, 2011

## Organizers

Mokshay Madiman (Yale University)

Alon Orlitsky (University of California, San Diego)

Narayana Prasad Santhanam (University of Hawaii, Manoa)

Balazs Szegedy (Alfred Renyi Institute)

Krishnamurthy Viswanathan (HP Labs)

Aaron Wagner (Cornell University)

## Objectives

The goal of the proposed workshop is to provide a space for sharing recent developments from a number of different perspectives, and for catalyzing novel developments in the study of the following broadly posed question: how does one estimate a probability distribution $p$ (or some functional of it) over a countable set (called the "alphabet") if one is allowed to sample with replacement from the distribution only a few times?

In such situations, one does not expect to see all outcomes, and maximum likelihood estimation is a poor choice. In the context of many applications, one can do much better; indeed, several different approaches have been developed (sometimes by different communities of researchers) to answer this question. Furthermore, the tools that have emerged to study this question have also independently emerged as central tools in other areas of mathematics and computer science, although these connections are only now being appreciated and remain largely untapped.
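To make the shortfall of maximum likelihood concrete, here is a minimal sketch (an illustration, not taken from the proposal) of the classical Good-Turing estimate of the "missing mass" — the total probability of symbols not yet observed. Maximum likelihood always estimates this as zero, while Good-Turing estimates it from the number of symbols seen exactly once.

```python
import collections

def ml_and_good_turing_missing_mass(sample):
    """Compare the maximum-likelihood estimate of the total probability
    of unseen symbols (always 0) with the Good-Turing estimate N1/n,
    where N1 is the number of symbols appearing exactly once in the
    sample of size n."""
    n = len(sample)
    counts = collections.Counter(sample)
    singletons = sum(1 for c in counts.values() if c == 1)
    ml_missing = 0.0             # ML assigns zero mass to unseen outcomes
    gt_missing = singletons / n  # Good-Turing "missing mass" estimate
    return ml_missing, gt_missing

# A small sample from a large alphabet: most observed symbols are singletons.
ml, gt = ml_and_good_turing_missing_mass(
    ["cat", "dog", "cat", "eel", "fox", "gnu"]
)
# ML says unseen symbols have probability 0; Good-Turing assigns them 4/6.
```

When the sample is tiny relative to the alphabet, nearly every symbol is a singleton, so the Good-Turing estimate correctly concludes that most of the probability mass sits on symbols never yet seen.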

### Relevance and importance

There is a vast body of work in statistics aimed at estimating probability distributions on finite sets, under the assumption that "large" quantities of data are available (specifically, large compared to the size of the alphabet). However, as has been recognized since early-twentieth-century statisticians studied the "sampling of species" problem, this assumption fails in several important application domains.

One example is speech recognition. Language models for speech recognition estimate distributions over English words using text corpora much smaller than the vocabulary. This is especially the case because the most appropriate models are Markov models of some order, so that the data available for each conditioning context is small compared to the number of words in the language. Similar situations arise in population biology and modern medicine.
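The data fragmentation caused by Markov conditioning can be seen in a few lines; the following sketch (an illustration, not from the proposal) counts how many training samples each length-$k$ context of a $k$-th order Markov model receives in a toy corpus.

```python
import collections

def samples_per_context(words, k):
    """For a k-th order Markov language model, each word is predicted
    from the previous k words. Count how many training samples each
    length-k context receives: as k grows, the same data is split across
    many more contexts, so each context sees far fewer samples."""
    return collections.Counter(
        tuple(words[i - k:i]) for i in range(k, len(words))
    )

corpus = "the cat sat on the mat and the cat ran".split()
# With k=1 the context ("the",) receives 3 samples; with k=2 almost
# every context receives only a single sample.
unigram_ctx = samples_per_context(corpus, 1)
bigram_ctx = samples_per_context(corpus, 2)
```

Even in this ten-word corpus, moving from order 1 to order 2 leaves nearly every conditional distribution to be estimated from one observation — exactly the small-sample, large-alphabet regime the workshop addresses.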

Tweaking old results and applying them to large alphabets typically yields unsatisfactory results, and one is therefore forced to rework these problems altogether.

Furthermore, some of the approaches that have been considered for large-alphabet probability estimation are connected at a deep level with work on exchangeable random partitions, graph theory, information theory, and combinatorics. The workshop will aim to bring together researchers who have tackled this problem, or related ideas, from a variety of perspectives.

### Timeliness

In August 2009, a successful workshop was held at the American Institute of Mathematics (AIM) in Palo Alto, California, on the topic of large-alphabet probability estimation, which emphasized its computational aspects and connections to combinatorics and complexity theory. While the AIM workshop helped uncover several new and intriguing connections between this problem and other areas of probability theory, a second venue is needed to develop those connections and apply them to open problems. Furthermore, since several teams are actively working in this area, we anticipate that new developments will occur before the summer of 2011. These could be disseminated at the BIRS workshop and could also provide starting points for collaborative problem-solving at the workshop.
