MonoPhy: A simple R package to find and visualize monophyly issues

Background. The monophyly of taxa is an important attribute of a phylogenetic tree. A lack of it may hint at shortcomings of either the tree or the current taxonomy, or can indicate cases of incomplete lineage sorting or horizontal gene transfer. Whatever the reason, a lack of monophyly can mislead subsequent analyses. While monophyly is conceptually simple, assessing it manually on modern phylogenies of hundreds to thousands of species is tedious and time-consuming. Results. The R package MonoPhy allows assessment and exploration of the monophyly of taxa in a phylogeny. It can assess the monophyly of genera using the phylogeny alone; with an additional input file, any other higher-order taxa or unranked groups can be checked as well. Conclusion. Summary tables, easily subsettable results, and several visualization options allow quick and convenient exploration of monophyly issues, making MonoPhy a valuable tool for any researcher working with phylogenies.


Introduction
Phylogenetic trees are undoubtedly crucial for most research in ecology or evolutionary biology. Whether one is studying trait evolution (e.g. Coddington 1988; Donoghue 1989), diversification (e.g. Gilinsky & Good 1991; Hey 1992), phylogeography (Avise et al. 1987), or simply relatedness within a group (e.g. Czelusniak et al. 1982; Shochat & Dessauer 1981; Sibley & Ahlquist 1981), bifurcating trees representing hierarchically nested relationships are central to the analysis. Exactly because phylogenies are so fundamental to the inferences we make, we need tools that enable us to examine how reconstructed relationships compare with existing assumptions, particularly taxonomy. We have computational approaches to estimate confidence for parts of a phylogeny (Felsenstein 1985; Larget & Simon 1999) or to measure the distance between two phylogenies (Robinson 1971), but assessing agreement of a new phylogeny with existing taxonomy is often done manually. This does not scale to modern phylogenies of hundreds to thousands of taxa. Modern taxonomy seeks to name clades: an ancestor and all of its descendants (the descendants thus form a monophyletic group). Discrepancies between the new phylogenetic hypothesis and the current taxonomic classification may indicate that the phylogeny is wrong or poorly resolved. Alternatively, a well-supported phylogeny that conflicts with currently recognized groups might suggest that the taxonomy should be reformed. To identify such discrepancies, one can simply assess whether the established taxa are monophyletic. A lack of group monophyly, however, can also indicate conflict between gene trees and the species tree, which may be a result of incomplete lineage sorting or horizontal gene transfer. In any case, monophyly issues in a phylogeny suggest a potential error that can affect downstream analysis and inference.
For example, it will mislead ancestral trait or area reconstruction or introduce false signals when assigning unsampled diversity for diversification analyses (e.g. in diversitree (FitzJohn 2012) or BAMM (Rabosky 2014)). In general, a lack of monophyly can blur patterns we might see in the data otherwise.

As this problem is by no means new, approaches to solve it have been developed earlier, particularly for large scale sequencing projects in bacteria and archaea, for which taxonomic issues are notoriously challenging. The program GRUNT (Dalevi et al. 2007) uses a tip to root walk approach to group, regroup, and name clades according to certain user defined criteria. The subsequently developed 'taxonomy to tree' approach (McDonald et al. 2012) matches existing taxonomic levels onto newly generated trees, allowing classification of unidentified sequences and proposal of changes to the taxonomic nomenclature based on tree topology. Finally, Matsen & Gallagher (2012) have developed algorithms that find mismatches between taxonomy and phylogeny using a convex subcoloring approach.

The new tool presented here, the R package MonoPhy, is a quick and user-friendly method for assessing monophyly of taxa in a given phylogeny. While the R package ape (Paradis et al.

Function name      Description
AssessMonophyly    Runs the main analysis to assess monophyly of groups on a tree.
GetAncNodes        Returns MRCA nodes for taxa.
GetIntruderTaxa    Returns lists of taxa that cause monophyly issues for another taxon.
GetIntruderTips    Returns lists of tips that cause monophyly issues for a taxon.
GetOutlierTaxa     Returns lists of taxa that have monophyly issues due to outliers.
GetOutlierTips     Returns lists of tips that cause monophyly issues for their taxon by being outliers.
PlotMonophyly      Allows several visualizations of the result.

Biologically, identifying a few intruders may suggest that the definition of a group should be expanded; observing some group members in very different parts of the tree than the rest of their taxon may instead suggest that these individuals were misidentified, or that their placement results from contaminated sequences or from horizontal gene transfer between members of two remote clades. Moreover, the approach as described above would imply that the clades intruded by the outlier tips are in turn intruders to the taxon the outliers belong to, which would not make intuitive sense. We thus implemented an option to specify a cutoff value, which defines the minimal proportion of tips among the descendants of a taxon's MRCA that must be labeled as members of that taxon. If a given group falls below this value, the function will find the 'core clade' (a subclade for which the proportion matches or exceeds the cutoff value) by moving tipward, always following the descendant node with the greater number of tips in the focal taxon (absolute number, or relative proportion if tied), and at each step evaluating the subtree rooted at that node to see whether it meets the cutoff value. Once such a subtree is found, it is called the 'core clade', and taxon members outside this clade are called 'outliers'. As there is no objective criterion to decide at what point individuals should be considered outliers, a reasonable cutoff value must be chosen by the user.
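
The tipward search for a core clade can be illustrated with a minimal sketch. It is written in Python rather than R so that it stays self-contained, and it is not the package's implementation: the nested-tuple tree, the function names (tips, mrca_subtree, member_stats, find_core_clade) and the nine-tip example are all assumptions made for illustration. The sketch treats a subtree as acceptable when its member proportion matches or exceeds the cutoff.

```python
def tips(node):
    """Return the tip labels below a node; a tip is a plain string."""
    if isinstance(node, str):
        return [node]
    return [label for child in node for label in tips(child)]


def mrca_subtree(node, members):
    """Return the smallest subtree that contains every tip in `members`."""
    if not isinstance(node, str):
        for child in node:
            if members <= set(tips(child)):
                return mrca_subtree(child, members)
    return node


def member_stats(node, members):
    """Return (count, proportion) of focal-taxon tips below a node."""
    below = tips(node)
    count = sum(label in members for label in below)
    return count, count / len(below)


def find_core_clade(tree, members, cutoff):
    """Walk tipward from the MRCA until the member proportion meets `cutoff`."""
    members = set(members)
    node = mrca_subtree(tree, members)
    while not isinstance(node, str) and member_stats(node, members)[1] < cutoff:
        # Follow the child with the most focal tips; ties go to the child
        # with the higher proportion, because tuples compare element-wise.
        node = max(node, key=lambda child: member_stats(child, members))
    outliers = sorted(members - set(tips(node)))
    return node, outliers


# Hypothetical nine-tip tree; the prefix before "_" marks the taxon.
example_tree = ((("A_1", "A_2"), ("A_3", "B_1")),
                (("C_1", "C_2"), (("C_3", "A_4"), "D_1")))
taxon_a = ["A_1", "A_2", "A_3", "A_4"]

core, outliers = find_core_clade(example_tree, taxon_a, cutoff=0.7)
intruders = [label for label in tips(core) if label not in taxon_a]
print(tips(core), outliers, intruders)
# prints ['A_1', 'A_2', 'A_3', 'B_1'] ['A_4'] ['B_1']
```

In this toy case the MRCA of taxon A is the root, where only 4 of 9 tips belong to A. With a cutoff of 0.7 the walk moves to the left-hand clade (3 of 4 tips), which becomes the core clade; A_4 is reported as an outlier and B_1 as an intruder, instead of every C and D tip being counted as an intruder.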

If the tree's tip labels are in the format 'Genus_speciesepithet', the genus names will be extracted and used as taxon assignments for the tips. If the tip labels are in another format, or other taxonomic levels should be tested, taxon names can be assigned to the tips using an input file. To avoid having to manually compose a taxonomy file for a taxon-rich phylogeny, MonoPhy can automatically download desired taxonomic levels from ITIS or NCBI using taxize (Chamberlain & Szocs 2013).
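
As a rough illustration of the default genus assignment, the following Python sketch groups tip labels by the text before the first underscore. How the package itself parses labels is not specified beyond the 'Genus_speciesepithet' format, so the splitting rule, the function names (genus_from_label, assign_genera) and the example labels are assumptions.

```python
def genus_from_label(label):
    """Return the text before the first underscore, taken as the genus name."""
    return label.split("_", 1)[0]


def assign_genera(tip_labels):
    """Group tip labels by the genus name extracted from each label."""
    taxa = {}
    for label in tip_labels:
        taxa.setdefault(genus_from_label(label), []).append(label)
    return taxa


# Hypothetical tip labels in the 'Genus_speciesepithet' format.
labels = ["Iris_setosa", "Iris_versicolor", "Acacia_dealbata"]
genera = assign_genera(labels)
print(genera)
# prints {'Iris': ['Iris_setosa', 'Iris_versicolor'], 'Acacia': ['Acacia_dealbata']}
```

Labels that do not follow this format would be grouped incorrectly by such a rule, which is the situation in which an explicit taxonomy file is needed.
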
For the second example, we demonstrate the package's performance on a tree of 31,749 species of Embryophyta (Zanne et al. 2014; for data see Zanne et al. 2013), this time using an outlier cutoff of 0.9. Checking monophyly for genera alone took 1.78 hours, and revealed that 22% of genera on the tree are not monophyletic, while around half of all genera are represented by only one species each. Furthermore, we can see that the largest monophyletic genus is Iris (139 tips), that Justicia had the most intruders (13 tips), and that Acacia produced the most outliers (99 tips). Finally, with 2337 other tips as descendants of their MRCA, the 3 species of Aldina are the most widely spread throughout the tree.