# Information theory and statistics for large alphabets (11w5127)

Arriving in Banff, Alberta Sunday, October 23 and departing Friday October 28, 2011

## Organizers

(Yale University)

(University of California, San Diego)

(University of Hawaii, Manoa)

Balazs Szegedy (Alfred Renyi Institute)

Krishnamurthy Viswanathan (HP Labs)

Aaron Wagner (Cornell University)

## Objectives

The goal of the proposed workshop is to provide a space for sharing
recent developments from a number of different perspectives, and for
catalyzing novel development in the study of the
following broadly posed question: How does one estimate a probability
distribution $p$ (or some functional of it) over a countable set
(called the alphabet'') if we are allowed to sample with
replacement from the distribution very few times?

In such situations, one does not expect to see all outcomes, and maximum
likelihood does not seem like a very good idea. Indeed, in the context of many applications, one
can do much better;
indeed, several different approaches have been developed (sometimes by
different communities of researchers) to answer this
question. Furthermore, the tools that have emerged to study this
question have also independently emerged as central tools in other
areas of mathematics and computer science, although their equivalence
is only now being appreciated and has been barely tapped.

Relevance and importance

There is a vast body of work in statistics aimed at estimating
probability distributions on finite sets, under the assumption that
large'' quantities of data are available (specifically, large
compared to the size of the alphabet). However, as has been recognized
since early 20th century statisticians looked at the sampling of
species'' problem, this assumption fails to be valid in several
important application domains.

One example is speech recognition. Language models for speech
recognition estimate distributions over English words using text
examples much smaller than the vocabulary. This is especially the case
because the most appropriate models are Markov models of some order,
so that the data available for each context that is conditioned on is
rather small compared to the number of words in the language. Similar situations arise
in population biology and modern medicine.
Tweaking old results and using them for
large alphabets typically yields unsatisfactory results, and one is
therefore forced to rework these problems altogether.

Furthermore, some of the approaches that have been considered for large alphabet
probability estimation are connected at a deep level with work done on
exchangeable random partitions, graph theory, information theory and
combinatorics. The workshop will aim to bring together researchers
who have tackled this problem, or related ideas, from a variety of
perspectives.

Timeliness

In August 2009, a successful workshop was held at the American Institute of
Mathematics (AIM) in Palo Alto, California, on the topic of large
alphabet probability estimation, which emphasized its computational
aspects and connections to combinatorics and complexity theory.
While the AIM workshop helped uncover several new and intriguing
connections between this problem and other areas of probability theory,
a second venue is needed to develop those connections and apply them
to open problems. Furthermore, since several teams are actively working
in this area, we anticipate that new developments will occur before the summer
of 2011. These could be disseminated at the BIRS workshop and could
also provide starting points for collaborative problem-solving at the
workshop.