Now that RAxML includes all models – a practical tip

The new, faster and (meanwhile) very option-rich version of RAxML, RAxML-NG provides the full plethora of nucleotide substitution models, which can be confusing to the normal user. Hence, a practical tip based on my experiences with very different sets of nucleotide data (and from very different organisms).

A general rule in phylogenetics is: the simpler, the better.

In the case of substitution models "simple" means to put as few assumptions as possible into the model. In contrast to RAxML-III to RAxML 8, which had only one model for nucleotide substitution, the assumption-free general-time-reversible, GTR, model (Rodriguez et al. 1990), RAxML-NG (Kozlov et al. 2019 – open access) implements now the full wealth of substitution models due to popular demand — mainly (anonymous) reviewers asking authors why they didn't do ModelTest (Posada & Crandall 1998; Darriba et al. 2012, 2020) or used PartitionFinder to establish the "best" model (per partition) for their data, or didn't follow those programmes' recommendation. This plethora of models can be confusing (and, to my experience in the low-impact biological sciences, can lead to odd, surely not helpful reviewer comments), hence, a practical tip for the common user.

The reason earlier RAxML versions only had GTR in stock for nucleotide data (and I got frequently into conflict with my usually anonymous "expert" reviewers) was the realisation that when our data follows e.g. a Jukes-Cantor model (JC) – aaaaaa: all rates equal, i.e. highly constrained model – RAxML will optimise a model under GTR that approaches the JC model. (And this is, what I told every reviewer, and the occassional editor, complaining about our model choice, and got always through with it.)

On the other hand, if we apply a too constrained model, we will bias the inferred topology and increase the chance to infer wrong branches. So, unless one works with nucleotide data that knowingly evolved following a certain model, the safest choice is always GTR.

More important than constraining the model can be the proper partitioning of our data. The most obvious case are protein-coding nucleotide sequences. The 3rd codon base can be easily saturated by transitions and the probability for transitions can be much higher than those for transversions, while the 1st and 2nd codon bases follow strongly different models (excess rate of transitions vs. transversions is usually lower). To accommodate such situations, one tells RAxML(-NG) to treat 1st and 2nd codon positions as one GTR partition (or two, in case you have enough data) and the 3rd as another GTR using the partition file. An example for a dataset that included plastid genes and intergenic spacers and nuclear gene regions (noncoding)

DNA, cp_noncoding = 1-237,1753-3307,4736-5533,5569-6113,6164-6360
DNA, cp_3rdcodon = 238-1752\3,3310-4735\3
DNA, cp_1st2ndcodon = 239-1752\3,240-1752\3,3308-4735\3,3309-4735\3
DNA, nr_ITS = 6361-6662,6807-7045
DNA, high_cons = 5534-5568,6114-6163,6663-6806
DNA, nr_LFY = 7046-7757

When you then look at the substitution matrices RAxML optimised for each partition, you will directly see the difference in the models. RAxML always was fast, RAxML-NG is even faster. To explore the "partition-space", one just runs an unpartitioned and then a fully partitioned analysis and compare the result using a tanglegram (read in RAxML's branch labelled result files into Dendroscope) or support consensus networks (e.g. Schliep et al. 2017 – open access) to see if there is any highly supported conflict. If there isn't, partitioning is not an issue for your data.
  • Unpartitioned analysis will under-optimise, fully partitioned analysis will over-optimise the substitution model(s). By using both, we hence set the least and most optimised borders for our data and inference. 
  • The more partitions we use, the more computational effort RAxML has to put into the analysis finding a fitting model. Hence, running a fully partitioned analysis may take too long for datasets with many tips (taxa) and a lot of partitions. In that case, just take the partitioning scheme suggested by PartitionFinder as the one probable extreme boundary of the "partition-space" or use a biologically-genetically sensible partitioning scheme. For instance, when analysing complete chloroplast genomes (plastomes) the maximum partitioning would be to treat each coding gene as three partitions (1st, 2nd and 3rd codon position), four for those including introns, and assign a partition to each intergenic spacer, tRNA and rRNA gene. With the consequence that many partitions would have little discriminate data – a low number of "distinct alignment patterns" – making it hard for RAxML to converge to a model. A genetically sensible partitioning scheme would have six partitions: 
    1. 1st+2nd codon position, 
    2. 3rd codon position, 
    3. introns, [could include low divergent intergenic spacers such as classic "barcodes" atpB-rbcL, trnL-trnF]
    4. (remaining) intergenic spacers (for instance, the trnH-psbA spacer is usually much more divergent than the classic intergenic spacers traditionally used as "barcodes"), 
    5. tRNA genes,
    6. rRNA genes.
In fact, I stopped using ModelTest more than a decade ago and just relied on RAxML to check whether parts of our data followed different models or not.

Note on amino-acid data

It's different for amino-acid (AA) data because the translation of codons into AA chains (= proteins) is strongly constrained, and differently in the three main genomes: the nucleome, the plastome and the mitochondriome. Plus, there are additional functional and structural constrains on the AA chain itself — using a less constrained model has more severe consequences than for nucleotides. In case of doubt, RAxML's speed makes it possible to just infer AA trees using the possible (probable) models (e.g. those getting the best ModelTest-NG scores) and check whether the result is different or not, e.g. by comparing the trees using Dendroscope's tanglegram option or SplitsTree's consensus networks/super-networks.

Important links
  • Input data for RAxML-NG on GitHub (pages compiled by Alexey Kozlov, the main programmer of RAxML-NG).
  • RAxML google group: any question will be answered, and timely. When it's something, you could have picked up from the manual, they will kindly tell you so, too.
  • Modeltest-NG, the latest implementation of the old classic (some do go forth, unlike e.g. surviving parsimonists).
  • Robert Lanfaer's PartitionFinder entry page, also has – like RAxML-NG – a google group, tutorial, exhaustive manual and FAQ section.
  • Phylogenetics in the Genomic Era – non-profit, open access (via the French hal publication server) eBook covering all phylo-Big Data and (tree) inference aspects, a lot of theory but also some introductions for novices.
Cited papers
  • Darriba D, Taboada GL, Doallo R, Posada D. 2012. jModelTest 2: more models, new heuristics and parallel computing. Nature Methods 9:772–772.
  • Darriba D, Posada D, Kozlov AM, Stamatakis A, Morel B, Flouri T. 2019. ModelTest-NG: A New and Scalable Tool for the Selection of DNA and Protein Evolutionary Models. Molecular Biology and Evolution 37:291–294.
  • Kozlov AM, Darriba D, Flouri T, Morel B, Stamatakis A. 2019. RAxML-NG: a fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference. BMC Bioinformatics 35:4453–4455.
  • Posada D, Crandall KA. 1998. Modeltest: testing the model of DNA substitution. Bioinformatics 14:817–818.
  • Rodriguez F, Oliver JL, Marin A, Medina JR. 1990. The general stochastic model of nucleotide substitution. Journal of Theoretical Biology 142:485–501.
  • Schliep K, Potts AJ, Morrison DA, Grimm GW. 2017. Intertwining phylogenetic trees and networks. Methods in Ecology and Evolution 8:1212–1220.

No comments:

Post a Comment

Enter your comment ...