Mathematical and computational approaches to linguistic phylogeny (06frg044)

Arriving in Banff, Alberta, Saturday, May 27 and departing Saturday, June 3, 2006


(University of California, Berkeley)


Quantitative approaches to linguistic phylogeny have been the subject of considerable recent activity. For example, all of the participants were invitees to a meeting on "Phylogenetic Methods and the Prehistory of Languages" that was held in July 2004 at the McDonald Institute for Archaeological Research, Cambridge, U.K. Ringe, Warnow, Evans, Nichols and Nicholls are invitees to a meeting on a similar theme in March 2005 at the Program for Evolutionary Dynamics at Harvard.

Mathematically, sensible stochastic models for the evolution of lexical, phonological and morphological linguistic characters have been lacking. "Off-the-shelf" models from biological sequence evolution are clearly not appropriate, and new models are required that address issues of effectively infinite state spaces, lack of reversibility, and differing degrees of homoplasy (that is, back-mutation or parallel evolution). Recent work by Warnow, Evans and Ringe has begun to address this issue. Links to work by the group of Evans, Nakhleh, Ringe, Warnow and their collaborators can be found at their Computational Phylogenetics in Historical Linguistics web site.
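
To make these modelling requirements concrete, the following is a minimal sketch of a homoplasy-free character evolving on a tree. It is an illustration only and not the model developed by the participants: the toy tree `TREE`, the function `evolve_character` and the per-edge change probability `change_prob` are all hypothetical. Every change produces a state that has never occurred before, so the state space is effectively infinite, the process is not reversible, and neither back-mutation nor parallel evolution can arise.

```python
import itertools
import random

# Hypothetical toy tree: each internal node maps to its children.
# Nodes that do not appear as keys are leaves (languages).
TREE = {
    "root": ["A", "BCD"],
    "BCD": ["B", "CD"],
    "CD": ["C", "D"],
}


def evolve_character(tree, root, change_prob, rng):
    """Evolve one character down the tree with no homoplasy.

    On each edge the character changes with probability change_prob
    (an assumed, illustrative parameter). Every change yields a state
    label that has never been used before, so states are never revisited.
    Returns a dict mapping each leaf to its state.
    """
    fresh_states = itertools.count(1)  # unbounded supply of new state labels
    leaf_states = {}
    stack = [(root, 0)]  # the root starts in state 0
    while stack:
        node, state = stack.pop()
        children = tree.get(node, [])
        if not children:
            leaf_states[node] = state
            continue
        for child in children:
            if rng.random() < change_prob:
                child_state = next(fresh_states)  # innovation: brand-new state
            else:
                child_state = state  # state inherited unchanged
            stack.append((child, child_state))
    return leaf_states


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    print(evolve_character(TREE, "root", 0.3, rng))
```

Allowing a change to reuse an existing state label, with some small probability, would be one simple way to introduce homoplasy into such a sketch.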

Statistically, linguistic phylogeny raises problems with both the quality of the data and its heterogeneity. The data has gone through considerable human pre-processing, so there are important sampling questions to be addressed. The evolutionary processes for different linguistic characters are probably quite different, so it is difficult to accommodate such variation without introducing models that are too parameter-rich for adequate inference. The Evans, Nakhleh, Ringe and Warnow group has also made some progress in this direction.

Computationally, the number of taxa (that is, languages) involved in most data sets of interest is sufficiently great that naive approaches to model fitting and inference by exact maximum likelihood or Bayesian methods are infeasible. This is the case even for non-statistical reconstruction procedures such as maximum parsimony and maximum compatibility. There is thus a need for clever heuristic divide-and-conquer strategies for the optimizations inherent in maximum parsimony, maximum compatibility and maximum likelihood, and for appropriate Markov chain Monte Carlo (MCMC) techniques in Bayesian analysis. Warnow has been the main developer of the family of disk-covering methods (DCMs), the most competitive divide-and-conquer algorithms. Nicholls is a major figure in the field of MCMC applied to Bayesian inference, particularly with respect to dating problems.
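
As a small illustration of the parsimony and compatibility criteria mentioned above, the sketch below scores a single character on a fixed tree. It assumes a rooted binary tree and unordered character states, and uses Fitch's standard small-parsimony algorithm; it is not a disk-covering method and does not search over trees, which is where the computational difficulty lies. The toy tree `TREE`, the example characters and the helper names `fitch_score` and `is_compatible` are hypothetical.

```python
# Hypothetical toy tree: each internal node maps to exactly two children.
# Nodes that do not appear as keys are leaves (languages).
TREE = {
    "root": ["A", "BCD"],
    "BCD": ["B", "CD"],
    "CD": ["C", "D"],
}


def fitch_score(tree, node, leaf_states):
    """Fitch's algorithm for one character on a rooted binary tree.

    Returns (candidate state set at node, minimum number of state
    changes needed within the subtree rooted at node).
    """
    children = tree.get(node, [])
    if not children:
        return {leaf_states[node]}, 0
    (left_set, left_cost), (right_set, right_cost) = (
        fitch_score(tree, child, leaf_states) for child in children
    )
    common = left_set & right_set
    if common:
        # The children can agree on a state: no extra change is needed here.
        return common, left_cost + right_cost
    # The children disagree: at least one additional change is required.
    return left_set | right_set, left_cost + right_cost + 1


def is_compatible(tree, root, leaf_states):
    """A character is compatible with the tree if it can evolve on it
    without homoplasy, i.e. with exactly (number of states - 1) changes."""
    _, changes = fitch_score(tree, root, leaf_states)
    return changes == len(set(leaf_states.values())) - 1


if __name__ == "__main__":
    # Toy characters: state labels stand for cognate classes (made-up data).
    char_1 = {"A": 0, "B": 0, "C": 1, "D": 1}
    char_2 = {"A": 0, "B": 1, "C": 0, "D": 1}
    for char in (char_1, char_2):
        print(fitch_score(TREE, "root", char)[1], is_compatible(TREE, "root", char))
```

Under these definitions, maximum parsimony seeks the tree that minimizes the total number of changes summed over all characters, while maximum compatibility seeks the tree on which the largest number of characters are compatible. Scoring one tree is fast, but the number of candidate trees grows super-exponentially with the number of languages, which motivates the heuristic strategies described above.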

All of the above quantitative work needs to be performed in close collaboration with linguists who are not only familiar with the primary data but are also sufficiently mathematically literate that they can participate in the development of models and inferential strategies. Moreover, there need to be several such linguists with different perspectives, whether on different language families (for example, Ringe works on Indo-European languages whereas Poser mainly studies North American languages) or on "deep time" relationships between different language families (an interest of Nichols). Poser, Ringe, Embleton and Nichols have all done major work on the applications of statistical methodology to linguistic questions and are extremely well-placed to play such a role. Having a group balanced between four mathematicians/statisticians/computer scientists and four quantitatively inclined linguists is the right mix to make serious inroads into the large number of difficult outstanding problems in this field.