Res.I.P. – an unprofessional science (and other things) blog: Not that it gets lost – my comment on a (nice) dating of Dravidian languages

Twitter pointed me to a recent paper dating the Dravidian languages by Kolipakam et al. (2018), published in Royal Society Open Science. I'm not a linguist, but I found it very interesting. But while other comments are waved through, mine has now been on hold for over 24 h. Here it is.

A short introduction to the study
The authors put a lot of effort into generating a new linguistic matrix for the Dravidian languages. They show a distance-based neighbour-net, which gives a good impression of the signal structure in the data. Then they ran a number of Bayesian dating analyses, including Bayes factor testing to find the optimal model, and compared the age of origin estimated under different models. Their conclusion is that the Dravidian language family is about 4500 years old, and they correlate this result with archeological findings.

Signals from linguistic data are typically complex and non-treelike (see also the many linguistic posts on Genealogical World of Phylogenetic Networks). Hence, many branches in the Bayesian trees inevitably lack the support that we usually see in molecular dating studies: the posterior probabilities are much smaller than 1 (an unambiguous branch) for most of the unconstrained branches. The authors were aware of this and dated a few alternative, constrained topologies in addition to an unconstrained one, which shows that deeper shuffles of branches have little effect on the age estimates. Nonetheless, low, ambiguous support can have two reasons: a lack of discriminating signal, or one or more equally or better supported alternatives. Consensus networks based on the Bayesian tree sample make it easy to visualise these alternatives. So I wrote a comment to that effect:

My comment (yet to be approved by Royal Society Open Science)

I'm not a linguist, but I am familiar with matrices providing non-trivial signals and hence have recently become more and more intrigued by lingo-phylogenetics. This study has a nice data set and quite a careful dating approach, testing the various models and topologies (Fig. 6 is just beautiful in demonstrating the possible age range).
However, you are using an inference method (Bayesian dating) that assumes there is a single tree and optimises towards it, on a dataset that doesn't appear to have a lot of tree-like signal (Fig. 2). Thus, most of the branches in your chronograms inevitably have dramatically low support. This is a problem shared with total evidence dating (i.e. dating using matrices combining molecular and morphological data covering fossil taxa). Nonetheless, a PP << 1.0 directly leads to the question: what is the next most probable, or even more probable, topological alternative? The trees you selected for the dating experiments are per se just some of many possibilities.
Given the nature of your data, and to deal with the problem of topological ambiguity, you may in the future want to consider using consensus networks (Holland B, Moulton V. 2003. Consensus networks: A method for visualising incompatibilities in collections of trees. In: Benson G, and Page R, eds. Algorithms in Bioinformatics: Third International Workshop, WABI, Budapest, Hungary Proceedings. Berlin, Heidelberg, Stuttgart: Springer Verlag, p. 165–176) to further investigate the topological alternatives supported by your Bayesian inferences. The method is implemented in SplitsTree, which you used for generating the neighbour-net. To generate the consensus network, you simply read in the sample of topologies saved by the Bayesian analysis, choose "Consensus Network" and select "COUNT" in the pop-up menu. This will give you a consensus network in which the edge lengths are proportional to the posterior probability of the corresponding split (= topological alternative), giving you a comprehensive overview of the main competing topological alternatives, just to make sure you do not overlook a more likely topological alternative (in case there is one). See this January post on our Genealogical World of Phylogenetic Networks blog for an example using simulated binary and total evidence data by O'Reilly & Donoghue 2017 (see further note below):
http://phylonetworks.blogspot.fr/2018/01/summarizing-non-trivial-bayesian-tree.html
-------------------
Some more technical notes:
It is commonly seen, but one should not report a PP = 1.0 for branches that have been constrained to be monophyletic; the PP is literally the probability of the branch, which is unknown if the branch is not free to be modified. You can mark a constrained branch, e.g. with an asterisk, or write ":=1.00" instead of "1.00".
It has recently been shown for simulated binary and real-world total evidence data, which have signal issues similar to those seen here (Bayesian chronograms with many PP << 1.0 branches), that MCC (maximum clade credibility) topologies can be problematic (O'Reilly JE, Donoghue PCJ. 2017. The efficacy of consensus tree methods for summarising phylogenetic relationships from a posterior sample of trees estimated from morphological data. Systematic Biology https://academic.oup.com/sy...). The authors recommend using an MRC (majority-rule consensus) tree for dating. Your MRC tree probably approaches a comb. Nonetheless, an interesting experiment would be to test whether the dating estimates for the unresolved MRC tree are substantially different. My guess is, not so much (noting the quite large but consistent HPD intervals shown in Figs 3–5).

Just some tips for the future on how to deal with topological uncertainty. Utterly harmless, right? I have no idea why they are reluctant to put this out. Maybe it's because I'm not a linguist and my comments need more scrutiny than the other four accepted so far (all within a few hours or less, two of them submitted hours after I pressed "send"). Maybe it's the link to our network blog (which is not one making any money).
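For the curious: the counting behind such a consensus network can be sketched in a few lines of Python. This is not SplitsTree's code, only a minimal illustration of the idea under simple assumptions: the tree sample is given as plain Newick strings, taxon names contain no Newick punctuation, and the five taxa A to E and the four trees are made up. The function names (`clades_of`, `splits_of`, `split_frequencies`, `compatible`) are my own for this sketch.

```python
import re
from collections import Counter
from itertools import combinations


def clades_of(newick):
    """Return the leaf sets of all internal nodes of one Newick tree.

    Branch lengths and internal node labels are skipped. Taxon names are
    assumed to contain no Newick punctuation or whitespace.
    """
    tokens = re.findall(r"[(),;]|:[^(),;]*|[^(),;:\s]+", newick)
    stack, clades, prev = [], [], None
    for tok in tokens:
        if tok == "(":
            stack.append(set())
        elif tok == ")":
            done = stack.pop()
            clades.append(frozenset(done))
            if stack:
                stack[-1] |= done  # pass the leaves up to the parent node
        elif tok not in (",", ";") and not tok.startswith(":") and prev != ")":
            stack[-1].add(tok)  # a leaf name
        prev = tok
    return clades


def splits_of(newick):
    """Non-trivial splits of one tree, each stored as the side that does
    not contain the alphabetically first taxon (so rooting does not matter)."""
    clades = clades_of(newick)
    taxa = clades[-1]  # the last clade closed is the root: all taxa
    ref = min(taxa)
    splits = set()
    for clade in clades:
        side = taxa - clade if ref in clade else clade
        if 2 <= len(side) <= len(taxa) - 2:
            splits.add(side)
    return splits


def split_frequencies(newicks):
    """Fraction of trees in the sample that contain each split."""
    counts = Counter()
    for nwk in newicks:
        counts.update(splits_of(nwk))
    return {split: n / len(newicks) for split, n in counts.items()}


def compatible(s1, s2):
    """Two splits fit on one tree if their sides are disjoint or nested."""
    return not (s1 & s2) or s1 <= s2 or s2 <= s1


# Made-up sample of four trees on the hypothetical taxa A-E.
sample = [
    "((A,B),(C,(D,E)));",
    "((A,B),(C,(D,E)));",
    "((A,C),(B,(D,E)));",
    "((A,B),((C,D),E));",
]
freqs = split_frequencies(sample)

cutoff = 0.2  # assumed threshold: keep splits found in at least 20% of trees
kept = {s: f for s, f in freqs.items() if f >= cutoff}
for split, f in sorted(kept.items(), key=lambda kv: -kv[1]):
    print("".join(sorted(split)), f)

# Pairs of kept splits that cannot sit on the same tree: the competing
# alternatives that a consensus network draws as boxes.
conflicts = [(a, b) for a, b in combinations(kept, 2) if not compatible(a, b)]
```

In this toy sample, the split DE | ABC turns up in three of the four trees (0.75) and its competitor CD | ABE in one (0.25). A tree summary shows only the former; the network keeps both, with edge lengths reflecting how often each occurs in the sample.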

Update (23/03 17:30)
The comment is still not online, but thanks to this post and Twitter (modern times: social media beats open science discussion platforms), I already got a reply: the Bayesian consensus network for the data, as a tweet.

better quality, hopefully pic.twitter.com/KLI9qfvegW
– Simon J Greenhill (@SimonJGreenhill) 23 March 2018

Simon tweeted that it's not very interesting, but I find it very interesting, especially since it's a good fit with an ML bootstrap support network based on the same data but a simple binary substitution model.

A follow up question, you used 0.33 as cut-off? I really find this very interesting from a signal/method point of view. Picture below shows a (character-naive) ML bootstrap network (0.2 cut-off) of your data; splits shared with your BayesNet in green (trivial splits collapsed) pic.twitter.com/QNTdoBcAlJ
– Guido Grimm (@Grimmiges) 23 March 2018
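The comparison described in the tweet above (which splits of the Bayesian consensus network also show up in the ML bootstrap network, each with its own cut-off) boils down to set operations on split supports. A minimal Python sketch follows. All support values are invented for illustration and are not taken from the Dravidian data; whether the cut-offs are applied as "at least" or "more than" is an assumption, and `keep` and `compare` are hypothetical helper names.

```python
def keep(supports, cutoff):
    """Splits whose support reaches the cut-off (>= is an assumption)."""
    return {split for split, value in supports.items() if value >= cutoff}


def compare(supports_a, cutoff_a, supports_b, cutoff_b):
    """Sort the kept splits into shared ones and those unique to each network."""
    a, b = keep(supports_a, cutoff_a), keep(supports_b, cutoff_b)
    return {"shared": a & b, "only_a": a - b, "only_b": b - a}


# Invented supports for splits of the hypothetical taxa A-E. Each split is
# stored as the side that does not contain taxon A.
bayes = {frozenset("DE"): 0.75, frozenset("CDE"): 0.60, frozenset("CD"): 0.25}
boot = {frozenset("DE"): 0.80, frozenset("CDE"): 0.30, frozenset("BC"): 0.22}

# Cut-offs as mentioned in the tweet: 0.33 (Bayesian), 0.2 (bootstrap).
result = compare(bayes, 0.33, boot, 0.2)
for group, splits in result.items():
    print(group, sorted("".join(sorted(s)) for s in splits))
```

Both split sets must use the same taxon names and the same convention for which side of a split is stored, otherwise identical splits will not match.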