Inferring a ML tree with 12000 (or more) virus genomes

Somebody on the RAxML group posted this as a bold question, asking for tips how to speed up the analysis. Since I recently looked at virus genomes (infamous SARS group) myself, I have some ideas for this.

Personally, based on my recent fling with coronavirus genomes, I would never think of inferring a ML tree as estimate of virus phylogeny (see also this related thread on the RAxML Google group, regarding another possible coronavirus matrix – SARS virus genomes have ~30 kpb). ML trees, like any other tree, are one-dimensional, dichotomous graphs. Their internodes represent splits of the taxon set into two parts; hence, a tree gives us a summary of taxon bipartitions.

A 4-taxon tree has two nodes but one internode representing the inferred taxon bipartition: A,B | C,D. The split is nearly unambiguous: bootstrap (BS) support = 99*. All other splits are trivial (hence, cannot be tested)

* The bootstrap (BS) values count how often a taxon bipartition (split/ internode) occurred in the BS pseudoreplicate tree sample, thus, are branch supports not "nodal" or "node" supports as one can still read a lot in systematic biological literature.

Inferred unrooted, we typically root them (mostly using a hand-picked outgroup) to be able to interpret the tree as a phylogenetic tree in which each ancestor vanishes and is replaced by exactly two offspring.

A rooted interpretation of the above tree; rooted assuming that D represents the outgroup (sistertaxon of an ABC clade). Node 2 represents the hypothetical most-recent common ancestor (MRCA) of the ABC clade splitting into the precursors of C and AB (vanished), node 1 the equally vanished MRCA of the sistertaxa A and B.

Hence, they must be pretty useless for inferring virus phylogenies. A single virus strain with a characteristic genome can have a huge number of offspring.

A virus phylogeny: X represents the genotype of the virus that infected Patient Zero, which quickly spread within the human population and eventually split into five main strains characterised by strain-specific mutations. Because the virus spread unmodified, we have both the ancestor and its offspring in our sample. Outgroup-rooting is difficult, since we typically only have anecdotal samples.

Furthermore, since virus genomes are naked RNA (or DNA), they love to get cosy when meeting in the same host; and recombine. Recombination is possibly the worst of all reticulate evolutionary processes when we then use the data and infer a dichotomous tree: bits of the genome may have completely different evolutionary histories.

A real world example of recombination from the SARS group. Members of 2a and 3b, respectively, form well-supported clades in a ML tree but can differ up to 100% from each other in high-divergent regions excluded from the analysis (full story: Trees and viruses: the SARS group).

The problem

Tree inferences cannot handle well ancestor-descendant situations in the data, because, like any other tree-inference programme, RAxML considers each sequence to represent a tip in the tree and not a internal node.

Let's say, we have the five virus lineages A, B, C, D, and E plus their common ancestor, the ancestral strain X, in our data. Any tree-inference will have to optimise X as sort-of-sister to all five offspring lineages A to E although it should be placed on their junction point. We also generate a flat tree space with many alternative trees – any dichotomous permutation of X, A, B, C, D, E – with essentially the same likelihood. If bootstrapping (BS) gets it right, it will need to produce split BS support of 20 for (X,A)|rest vs (X,B)|rest vs (X,C)|rest vs (X,D)|rest vs (X,E)|rest. For this simple reason, and unlike stated in probably hundreds of papers, using a ML (or any other) tree inference on a 12000 (or less) virus genome dataset cannot provide you with an actual phylogenetic tree, it provides you with a tree that reflects genome similarity.

Nextstrain (homepage/code on GitHub) tries to ameliorate these shortcomings of tree inferences by trying to find evolutionary trajectories that consider time and provenance of the samples (so, near-identical sequence can end up in different clades). But it still gives a tree to depict the virus phylogeny, and what one would need for depicting evolution of viruses, is, usually, a network. But that would take much more time than inferring a tree, plus the tools we have for identifying recombination – which makes inferring a single tree impossible – are pretty crude.

On the Genealogical World of Phylogenetic Networks, we recently had some posts dealing with coronavirus phylogenies and their signal issues.

Problems with the phylogeny of coronaviruses, posted 16/3/2020
Trees and viruses: the SARS group, posted 30/3/2020
Using Median networks to study SARS-CoV-2, posted 20/4/2020
Finding the CoV-2 root, posted 4/5/2020
A new SARS-CoV-2 variant?, posted 11/5/2020

Related data can be found @ figshare
Grimm GW, Morrison D. 2020. Harvest and phylogenetic network analysis of SARS virus genomes (CoV-1 and CoV-2). figshare. Dataset. https://doi.org/10.6084/m9.figshare.12046581

How to infer a tree even though virus phylogenies are not tree-like

Big Data has made it impossible to focus on getting good virus phylogenies (see eg. the Nextstrain-generated CoV-2 "phylogeny"). Fast implementations of ML tree inference – fastest probably being the new RAxML-NG (GitHub; make sure to cite the original paper) – at least allows to analyse a data set with 12000 (or more) "taxa". Better an imperfect tree than having nothing (also, no reviewer seems to complain about using dichotomous trees as estimate for reticulate evolution: D. Morrison, 2020, A new SARS-CoV-2 variant?).

Being a probabilistic method, ML needs a certain amount of signal to make a decision. Being a (dichotomous) tree inference, any direct ancestor-descendant relationship will make it

hard to make a call;
increase the computation time;
may lead to several equally optimal topological alternatives (including split BS support).

Thus, one needs to prune down a 12000 virus genome data set to a set that allows for in-depth tree-inference and provides data suitable for ML tree inference.

Step: Remove duplicates — duplicates increase the computational load while providing no additional information for the tree optimisation: The flatter the likelihood surface of the tree space, the longer takes the bootstrapping and the final optimisation.
Step: Remove satellite genotypes — same reason. Network, for instance, includes the pre-analysis 'star contraction' (Forster et al. 2001, Mol. Biol. Evol. 18: 1864–1881) which collapses satellite (haplo-)types into their ancestral type ('median'), SplitsTree includes a function to group haplotypes as well. One may even think about removing intermediate types, too, by just filtering all tips that differ less than a fixed threshold of mutations.
Step: Test for and remove rogues — try RogueNaRok (GitHub, blackbox).
Step: Infer a backbone tree for the pruned-down set — as framework for next steps.
Step: Explore topological ambiguity — in case branches have BS < 100, check the BS consensus networks: are there alternatives with similar support (split BS support) or no discriminate signal (all alternatives with very low BS support). BS consensus networks can be generated and viewed with SplitsTree by reading in RAxML's BS tree sample file; or the Phangorn library for R (Schliep et al. 2017, Meth. Ecol. Evol. 8: 1212–1220), which includes further functions to transfer information between trees and networks and uses the same Splits-NEXUS format to export networks (SplitsTree's graphic machine is superior, Phangorn provides more functionality to transfer information between trees and networks, or extract it, when you can 'R').
Step: Optimise position of filtered rogues using evolutionary placement algorithm — Recombinants may have more than one optimal position (can be very far apart), ancestors will have more than one (but remain in the same subtree, phylogenetic neighbourhood).
Step: Hack-and-slash — Optimise poorly resolved subtrees using population genetic methods such as minimum-spanning, median-joining, and statistical parsimony. Check the subalignments visually.

A last general tip — More often than not, super-matrices can benefit from super-network/consensus network approaches. Parts of the above protocol (especially Steps 1 to 3, 7 using ML trees instead of population genetics methods) may help to analyse extremely taxon-rich matrices, too, especially when having no access to a large CPU cluster. There are very few reasons for optimising, e.g., a 10,000-tip or more phylogeny regarding most evolutionary questions. Why not first establish genetically coherent groups – low ingroup divergence, substantial intergroup difference; one neat progamme for this is optsil by M. Göker – using quick, distance-based approaches (tip: heat maps have no limit) and then infer a backbone phylogeny using placeholders for each group and stuff it with trees optimising only members of one group. Or population genetic methods, if the groups show very little divergence and tree-like signals.

By the way, this is also how the largest real Christmas tree, Skeppsbrons Julstämning, in the world is put together each year – it's a super-tree comprising a lot of normal-size trees.