Personally, based on my recent fling with coronavirus genomes, I would never think of inferring a ML tree as an estimate of virus phylogeny (see also this related thread on the RAxML Google group, regarding another possible coronavirus matrix – SARS virus genomes have ~30 kb). ML trees, like any other tree, are one-dimensional, dichotomous graphs. Their internodes represent splits of the taxon set into two parts; hence, a tree gives us a summary of taxon bipartitions.
|A 4-taxon tree has two nodes but one internode representing the inferred taxon bipartition: A,B | C,D. The split is nearly unambiguous: bootstrap (BS) support = 99*. All other splits are trivial (hence, cannot be tested)|
* The bootstrap (BS) values count how often a taxon bipartition (split/internode) occurred in the BS pseudoreplicate tree sample; they are thus branch supports, not "nodal" or "node" supports, as one can still often read in the systematic biology literature.
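To make the footnote concrete, here is a minimal sketch of how split support can be counted. It is not RAxML's code: trees are written as nested tuples instead of Newick strings, the function names are made up for this post, and the 100 "pseudoreplicate trees" are toy data chosen to mirror the 4-taxon figure.

```python
from collections import Counter

def leaves(tree):
    """Return the set of tip labels below a nested-tuple (sub)tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))

def nontrivial_splits(tree):
    """Collect the non-trivial taxon bipartitions (internodes) of one tree."""
    all_taxa = leaves(tree)
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return
        side = leaves(node)
        other = all_taxa - side
        if len(side) >= 2 and len(other) >= 2:
            # A split is an unordered pair of taxon sets, so rooting is irrelevant.
            splits.add(frozenset([side, other]))
        for child in node:
            visit(child)

    visit(tree)
    return splits

def split_support(trees):
    """Percentage of trees in the sample that contain each split."""
    counts = Counter()
    for tree in trees:
        counts.update(nontrivial_splits(tree))
    return {split: 100.0 * n / len(trees) for split, n in counts.items()}

# Toy BS sample (assumed, not real data): 99 trees with A,B | C,D, one with A,C | B,D.
bs_trees = [(("A", "B"), ("C", "D"))] * 99 + [(("A", "C"), ("B", "D"))]
support = split_support(bs_trees)
ab_cd = frozenset([frozenset("AB"), frozenset("CD")])
print(support[ab_cd])
```

The value belongs to the split A,B | C,D, i.e. to the internode, and not to either of the two nodes it connects.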
ML trees are inferred unrooted; we typically root them (mostly using a hand-picked outgroup) to be able to interpret the tree as a phylogenetic tree in which each ancestor vanishes and is replaced by exactly two offspring.
Hence, such trees must be pretty useless for inferring virus phylogenies. A single virus strain with a characteristic genome can have a huge number of offspring.
Furthermore, since virus genomes are naked RNA, they love to get cosy when meeting in the same host, and recombine. Recombination is possibly the worst of all reticulate evolutionary processes when we then use the data to infer a dichotomous tree: bits of the genome may have completely different evolutionary histories.
|A real-world example of recombination from the SARS group. Members of 2a and 3b, respectively, form well-supported clades in a ML tree but can differ by up to 100% from each other in highly divergent regions excluded from the analysis (full story: Trees and viruses: the SARS group).|
The problem
Tree inferences cannot handle ancestor-descendant situations in the data well because, like any other tree-inference programme, RAxML considers each sequence to represent a tip in the tree and not an internal node.
Let's say we have the five virus lineages A, B, C, D, and E plus their common ancestor, the ancestral strain X, in our data. Any tree inference will have to optimise X as a sort-of-sister to all five offspring lineages A to E, although it should be placed at their junction point. We also generate a flat tree space with many alternative trees – any dichotomous permutation of X, A, B, C, D, E – with essentially the same likelihood. If bootstrapping (BS) gets it right, it will need to produce split BS support of 20 each for (X,A)|rest vs (X,B)|rest vs (X,C)|rest vs (X,D)|rest vs (X,E)|rest. For this simple reason, and contrary to what is stated in probably hundreds of papers, using a ML (or any other) tree inference on a 12000 (or fewer) virus genome dataset cannot provide you with an actual phylogenetic tree; it provides you with a tree that reflects genome similarity.
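Why 20? The five candidate splits exclude each other, so any single BS tree can contain at most one of them, and their supports have to share the 100. The following sketch checks this with the standard compatibility criterion for splits (two splits fit on the same tree only if at least one of the four intersections of their sides is empty). The even 100/5 share is an assumption: it presumes the data carry no signal favouring any of the five placements of X.

```python
def compatible(split_a, split_b):
    """Two splits can sit on the same tree only if at least one of the
    four pairwise intersections of their sides is empty."""
    (a1, a2), (b1, b2) = split_a, split_b
    return any(not (x & y) for x in (a1, a2) for y in (b1, b2))

taxa = frozenset("XABCDE")

# The five competing placements of the ancestor X: (X,A)|rest ... (X,E)|rest
candidates = []
for tip in "ABCDE":
    side = frozenset(["X", tip])
    candidates.append((side, taxa - side))

# Every pair of candidates conflicts, so no tree can show two of them at once.
pairwise_conflict = all(
    not compatible(s, t)
    for i, s in enumerate(candidates)
    for t in candidates[i + 1:]
)

# Assumption: no signal prefers any placement, so the support is shared evenly.
expected_bs = 100 / len(candidates)
print(pairwise_conflict, expected_bs)
```

In a real BS sample the five values will scatter around that share, and none of them will ever look like a "supported" branch.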
Nextstrain (homepage/code on GitHub) tries to ameliorate these shortcomings of tree inferences by trying to find evolutionary trajectories that consider time and provenance of the samples (so, near-identical sequences can end up in different clades). But it still gives a tree to depict the virus phylogeny, and what one would usually need for depicting the evolution of viruses is a network. But that would take much more time than inferring a tree, plus the tools we have for identifying recombination – which makes inferring a single tree impossible – are pretty crude.
On the Genealogical World of Phylogenetic Networks, we recently had some posts dealing with coronavirus phylogenies and their signal issues.
- Problems with the phylogeny of coronaviruses, posted 16/3/2020
- Trees and viruses: the SARS group, posted 30/3/2020
- Using Median networks to study SARS-CoV-2, posted 20/4/2020
- Finding the CoV-2 root, posted 4/5/2020
- A new SARS-CoV-2 variant?, posted 11/5/2020
Related data can be found @ figshare
Grimm GW, Morrison D. 2020. Harvest and phylogenetic network analysis of SARS virus genomes (CoV-1 and CoV-2). figshare. Dataset. https://doi.org/10.6084/m9.figshare.12046581
How to infer a tree even though virus phylogenies are not tree-like
Big Data has made it impossible to focus on getting good virus phylogenies (see e.g. the Nextstrain-generated CoV-2 "phylogeny"). Fast implementations of ML tree inference – the fastest probably being the new RAxML-NG (GitHub; make sure to cite the original paper) – at least allow us to analyse a data set with 12000 (or more) "taxa". Better an imperfect tree than nothing (also, no reviewer seems to complain about using dichotomous trees as estimates of reticulate evolution: D. Morrison, 2020, A new SARS-CoV-2 variant?).
Being a probabilistic method, ML needs a certain amount of signal to make a decision. Because it is a (dichotomous) tree inference, any direct ancestor-descendant relationship will
- make it hard to make a call;
- increase the computation time;
- possibly lead to several equally optimal topological alternatives (including split BS support).
- Step: Remove duplicates — duplicates increase the computational load while providing no additional information for the tree optimisation. The flatter the likelihood surface of the tree space, the longer the bootstrapping and the final optimisation take.
- Step: Remove satellite genotypes — same reason. Network, for instance, includes the pre-analysis 'star contraction' (Forster et al. 2001, Mol. Biol. Evol. 18: 1864–1881), which collapses satellite (haplo-)types into their ancestral type ('median'); SplitsTree includes a function to group haplotypes as well. One may even think about removing intermediate types, too, by just filtering all tips that differ by fewer than a fixed number of mutations.
- Step: Test for and remove rogues — try RogueNaRok (GitHub, blackbox).
- Step: Infer a backbone tree for the pruned-down set — as framework for next steps.
- Step: Explore topological ambiguity — in case branches have BS < 100, check the BS consensus networks: are there alternatives with similar support (split BS support) or is there no discriminating signal (all alternatives with very low BS support)? BS consensus networks can be generated and viewed with SplitsTree by reading in RAxML's BS tree sample file, or with the Phangorn library for R (Schliep et al. 2017, Meth. Ecol. Evol. 8: 1212–1220), which uses the same Splits-NEXUS format to export networks (SplitsTree's graphic machine is superior; Phangorn provides more functionality to transfer information between trees and networks, or extract it, if you know R).
- Step: Optimise position of filtered rogues using the evolutionary placement algorithm — recombinants may have more than one optimal position (and these can be very far apart); ancestors will also have more than one, but these remain in the same subtree (phylogenetic neighbourhood).
- Step: Hack-and-slash — Optimise poorly resolved subtrees using population genetic methods such as minimum-spanning, median-joining, and statistical parsimony. Check the subalignments visually.
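The first two pruning steps in the list are easy to sketch. This is a minimal illustration, assuming the alignment is already in memory as a dictionary of equal-length, upper-case sequences; the function names are made up for this post, and the greedy threshold filter is only a crude stand-in for the idea of dropping near-identical tips, not an implementation of star contraction.

```python
def remove_duplicates(alignment):
    """Keep the first sequence of every distinct genotype.
    Returns the pruned alignment and, per kept name, the dropped duplicates."""
    kept, first_seen, dropped = {}, {}, {}
    for name, seq in alignment.items():
        if seq in first_seen:
            dropped[first_seen[seq]].append(name)
        else:
            first_seen[seq] = name
            kept[name] = seq
            dropped[name] = []
    return kept, dropped

def n_differences(seq_a, seq_b):
    """Count differing alignment columns, ignoring gaps and undetermined sites."""
    skip = set("-N?")
    return sum(
        1 for x, y in zip(seq_a, seq_b)
        if x != y and x not in skip and y not in skip
    )

def thin_by_distance(alignment, min_diff):
    """Greedy filter: keep a tip only if it differs from every tip kept so far
    by at least min_diff mutations. The result depends on the input order."""
    kept = {}
    for name, seq in alignment.items():
        if all(n_differences(seq, other) >= min_diff for other in kept.values()):
            kept[name] = seq
    return kept

# Toy alignment (assumed): X, an exact copy, a one-mutation satellite, a distinct type.
toy = {
    "X": "ACGTACGT",
    "X_copy": "ACGTACGT",
    "A": "ACGTACGA",
    "B": "TCGAACTT",
}
unique, duplicates = remove_duplicates(toy)
backbone_set = thin_by_distance(unique, min_diff=2)
print(list(unique), duplicates["X"], list(backbone_set))
```

Keeping the list of dropped names matters: the removed duplicates and satellites are the sequences one later wants to put back next to their retained representative, once the backbone tree is in place.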
By the way, assembling a big tree from a backbone and many smaller trees is also how the largest real Christmas tree in the world, Skeppsbrons Julstämning, is put together each year – it's a super-tree comprising a lot of normal-size trees.
|Apologies for the lack of focus, but it was cold and the tree is really large. The famous Skeppsbron tree in Stockholm, December 2011.|