# Mathematical Approaches to Evolutionary Trees and Networks (17w5104)

Arriving in Banff, Alberta Sunday, February 12 and departing Friday February 17, 2017

## Organizers

Vincent Moulton (University of East Anglia)

Caroline Colijn (Imperial College London)

Leonid Chindelevitch (Simon Fraser University)

Tandy Warnow (University of Illinois)

Amaury Lambert (UPMC Univ Paris 06)

Marta Luksza (Institute of Advanced Study, Princeton University)

## Objectives

The objectives of the workshop are to bring mathematicians working in three key areas together to make progress in these problems. We will also invite several biologists who are keen to engage with mathematicians on the challenges posed by new data on evolutionary processes. Key challenges in the field at the moment are focused around the following emerging inter-related areas, each of which is raising mathematically interesting problems:

1. Inference with evolutionary trees and networks: Ultimately it is necessary not just to obtain evolutionary trees from data using standard methods, but to infer aspects of an underlying biological process. This requires understanding the likelihood of an evolutionary tree or network, or at least some of its informative features, using some stochastic process as the underlying ecological model. In principle, this approach allows simultaneous inference of both evolutionary trees and parameters of the ecological model. Coalescent theory has made considerable progress, for example, in obtaining tree likelihoods for sparsely sampled populations with geographical structure or with known past demographics (see for just one example [5]). In some simplified cases, epidemiological inference methods can estimate transmission trees [2], branching rates through time [5] and other aspects of epidemic spread [7]. However, none of these approaches is currently applicable if there is non-tree-like evolution, or where datasets are large. Furthermore, the range of models for which we can write down a tree likelihood is very limited. This raising nice new problems in probability, statistical inference and ecological modelling. Recently, more general processes (e.g. Lambda-coalescents, which allow multiple rather than strictly pairwise coalescent events) are beginning to be used to model populations with large offspring variance, or even to model selection in a non-parametric fashion [3]. This is potentially a powerful tool particularly for bacteria, which may acquire resistance to antibiotics and spread rapidly as a consequence, yielding both highly variable effective offspring numbers and a need to model selection carefully.

2. Understanding spaces of evolutionary trees: There are a large number of possible labelled, rooted binary trees for a given set of $n$ tips (ie for a given set of sequence data): $(2n-3)!! = (2n-3)(2n-5) ... (3)(1) $. This works out to $10^{184}$ trees on 100 tips; in contrast, current datasets for evolving bacteria contain thousands of tips. Not even the tools of Bayesian inference, the natural approach in such situations, can systematically explore spaces this big. This motivates the development of mathematical approaches for the exploration of tree space. These include new approaches to continuous tree spaces, including those from tropical geometry [8], and the use of tree metrics [1]. These in turn can lead to tools for averaging trees , and for navigating tree space in efficient ways [6] -- with profound applications in statistical inference from sequence data. Generalizing metrics to the case of evolutionary networks (for example tree-based networks) is another natural and important question.

3. Summarising trees and networks using combinatorial tools: Uncovering shape features, spectral features and other ways to describe trees using quantities that are mathematically tractable will be of considerable interest [4]. As one example, where likelihoods are truly intractable, rapid tools for likelihood-free inference can be used to infer evolutionary processes from sequence data, but only where there are informative ways to summarize key features of the data. Trees are natural combinatorial structures with connections to data; for example, a binary tree is a sequence of partitions of the set of tips (sequences in a dataset), where each partition is one block smaller than the previous one, moving back through time from the partition with each tip on its own to the partition with all tips in one block as we move from the tips of the tree to the root. If the tree is not binary (ie it allows multifurcations), more than two blocks can combine at a branching event. Because of the natural link to partitions, the study of tree shapes links to the enumeration of partitions and to lattice path combinatorics. These in turn allow the characterization and enumeration of possible tree shapes. Meanwhile the study of motifs in other biological networks has been fruitful, and could be extended to tree and evolutionary network shapes. Trees and evolutionary networks are of course also graphs (with an added time dimension); the tools of algebraic graph theory are now finding application in this area of mathematical biology.

The community's response to the idea for this workshop has been very positive. A * beside a participant's name indicates that they have expressed enthusiasm for the workshop, and plan to attend.

[1] Louis J Billera, Susan P Holmes, and Karen Vogtmann. Geometry of the space of phylogenetic trees. Adv. Appl. Math., 27(4):733–767, November 2001.

[2] Xavier Didelot, Jennifer Gardy, and Caroline Colijn. Bayesian inference of infectious disease transmission from whole-genome sequence data. Mol. Biol. Evol., 31(7):1869–1879, July 2014.

[3] Alison M Etheridge, Robert C Griffiths, and Jesse E Taylor. A coalescent dual process in a moran model with genic selection, and the lambda coalescent limit. Theor. Popul. Biol., 78(2):77–92, September 2010.

[4] Fanny Gascuel, Regis Ferriere, Robin Aguilee, and Amaury Lambert. How ecology and landscape dynamics shape phylogenetic trees. Syst. Biol., 64(4):590–607, July 2015.

[5] Amaury Lambert and Tanja Stadler. Birth–death models and coalescent point processes: The shape and probability of reconstructed phylogenies. Theor. Popul. Biol., 90(0):113–128, December 2013.

[6] Tom M W Nye. An algorithm for constructing principal geodesics in phylogenetic treespace. IEEE/ACM Trans. Comput. Biol. Bioinform., 11(2):304–315, March 2014.

[7] David A Rasmussen, Erik M Volz, and Katia Koelle. Phylodynamic inference for structured epidemiological models. PLoS Comput. Biol., 10(4):e1003570, April 2014. 2

[8] David Speyer and Bernd Sturmfels. The tropical grassmannian. Adv. Geom., 4(3):389–411, 2004.

1. Inference with evolutionary trees and networks: Ultimately it is necessary not just to obtain evolutionary trees from data using standard methods, but to infer aspects of an underlying biological process. This requires understanding the likelihood of an evolutionary tree or network, or at least some of its informative features, using some stochastic process as the underlying ecological model. In principle, this approach allows simultaneous inference of both evolutionary trees and parameters of the ecological model. Coalescent theory has made considerable progress, for example, in obtaining tree likelihoods for sparsely sampled populations with geographical structure or with known past demographics (see for just one example [5]). In some simplified cases, epidemiological inference methods can estimate transmission trees [2], branching rates through time [5] and other aspects of epidemic spread [7]. However, none of these approaches is currently applicable if there is non-tree-like evolution, or where datasets are large. Furthermore, the range of models for which we can write down a tree likelihood is very limited. This raising nice new problems in probability, statistical inference and ecological modelling. Recently, more general processes (e.g. Lambda-coalescents, which allow multiple rather than strictly pairwise coalescent events) are beginning to be used to model populations with large offspring variance, or even to model selection in a non-parametric fashion [3]. This is potentially a powerful tool particularly for bacteria, which may acquire resistance to antibiotics and spread rapidly as a consequence, yielding both highly variable effective offspring numbers and a need to model selection carefully.

2. Understanding spaces of evolutionary trees: There are a large number of possible labelled, rooted binary trees for a given set of $n$ tips (ie for a given set of sequence data): $(2n-3)!! = (2n-3)(2n-5) ... (3)(1) $. This works out to $10^{184}$ trees on 100 tips; in contrast, current datasets for evolving bacteria contain thousands of tips. Not even the tools of Bayesian inference, the natural approach in such situations, can systematically explore spaces this big. This motivates the development of mathematical approaches for the exploration of tree space. These include new approaches to continuous tree spaces, including those from tropical geometry [8], and the use of tree metrics [1]. These in turn can lead to tools for averaging trees , and for navigating tree space in efficient ways [6] -- with profound applications in statistical inference from sequence data. Generalizing metrics to the case of evolutionary networks (for example tree-based networks) is another natural and important question.

3. Summarising trees and networks using combinatorial tools: Uncovering shape features, spectral features and other ways to describe trees using quantities that are mathematically tractable will be of considerable interest [4]. As one example, where likelihoods are truly intractable, rapid tools for likelihood-free inference can be used to infer evolutionary processes from sequence data, but only where there are informative ways to summarize key features of the data. Trees are natural combinatorial structures with connections to data; for example, a binary tree is a sequence of partitions of the set of tips (sequences in a dataset), where each partition is one block smaller than the previous one, moving back through time from the partition with each tip on its own to the partition with all tips in one block as we move from the tips of the tree to the root. If the tree is not binary (ie it allows multifurcations), more than two blocks can combine at a branching event. Because of the natural link to partitions, the study of tree shapes links to the enumeration of partitions and to lattice path combinatorics. These in turn allow the characterization and enumeration of possible tree shapes. Meanwhile the study of motifs in other biological networks has been fruitful, and could be extended to tree and evolutionary network shapes. Trees and evolutionary networks are of course also graphs (with an added time dimension); the tools of algebraic graph theory are now finding application in this area of mathematical biology.

The community's response to the idea for this workshop has been very positive. A * beside a participant's name indicates that they have expressed enthusiasm for the workshop, and plan to attend.

### References

[1] Louis J Billera, Susan P Holmes, and Karen Vogtmann. Geometry of the space of phylogenetic trees. Adv. Appl. Math., 27(4):733–767, November 2001.

[2] Xavier Didelot, Jennifer Gardy, and Caroline Colijn. Bayesian inference of infectious disease transmission from whole-genome sequence data. Mol. Biol. Evol., 31(7):1869–1879, July 2014.

[3] Alison M Etheridge, Robert C Griffiths, and Jesse E Taylor. A coalescent dual process in a moran model with genic selection, and the lambda coalescent limit. Theor. Popul. Biol., 78(2):77–92, September 2010.

[4] Fanny Gascuel, Regis Ferriere, Robin Aguilee, and Amaury Lambert. How ecology and landscape dynamics shape phylogenetic trees. Syst. Biol., 64(4):590–607, July 2015.

[5] Amaury Lambert and Tanja Stadler. Birth–death models and coalescent point processes: The shape and probability of reconstructed phylogenies. Theor. Popul. Biol., 90(0):113–128, December 2013.

[6] Tom M W Nye. An algorithm for constructing principal geodesics in phylogenetic treespace. IEEE/ACM Trans. Comput. Biol. Bioinform., 11(2):304–315, March 2014.

[7] David A Rasmussen, Erik M Volz, and Katia Koelle. Phylodynamic inference for structured epidemiological models. PLoS Comput. Biol., 10(4):e1003570, April 2014. 2

[8] David Speyer and Bernd Sturmfels. The tropical grassmannian. Adv. Geom., 4(3):389–411, 2004.