The Prevalence and Impact of Model Violations in Phylogenetics Analysis

In phylogenetic inference we commonly use models which assume that sequence evolution is stationary, reversible and homogeneous (SRH). Although such assumptions are often criticized, the extent of SRH violations and their effects on phylogenetic inference are not well understood. Here, we extend the matched-pairs test of symmetry to assess the scale and impact of SRH model violations on 3,572 partitions from 35 published phylogenetic datasets. We show that many partitions (39.5%) reject the SRH assumptions, and that in most datasets, phylogenies inferred from all partitions differ significantly from those inferred using the subset of partitions that do not reject the SRH assumptions. These results suggest that the extent and effects of model violation in phylogenetics may be substantial. They highlight the importance of testing for model violations and possibly excluding partitions that violate models prior to tree reconstruction. They also suggest that further effort in developing models that do not require SRH assumptions could lead to large improvements in the accuracy of phylogenomic inference. The scripts necessary to perform the analysis are available in https://github.com/roblanf/SRHtests, and the new tests we describe are available as a new option in IQ-TREE (http://www.iqtree.org).

Most phylogenetic studies use parametric models of molecular evolution which assume that the evolutionary process follows stationary, reversible and homogeneous (SRH) conditions. Stationarity implies that the marginal frequencies of the nucleotides or amino-acids are constant over time, reversibility implies that the evolutionary process is stationary and undirected, and homogeneity implies that the instantaneous substitution rates are constant along the tree or over an edge (Felsenstein 2004;Yang and Rannala 2012;). However, these simplifying assumptions are often violated by real data (Foster and Hickey 1999;Tarrío, et al. 2001;Paton, et al. 2002;Goremykin and Hellwig 2005;Murray, et al. 2005;Bourlat, et al. 2006;Hyman, et al. 2007;Sheffield, et al. 2009;Nesnidal, et al. 2010;Nabholz, et al. 2011;Martijn, et al. 2018). Such model violation may lead to systematic error that, unlike stochastic error, cannot be solved simply by increasing the size of a dataset (Felsenstein 2004;Ho and Jermiin 2004;Sullivan and Joyce 2005;Kumar, et al. 2012;Brown and Thomson 2017;Duchene, et al. 2017). As phylogenetic datasets are steadily growing, it is vital that we develop and employ methods to measure and understand the extent to which systematic error affects phylogenetic inference (systematic bias), and explore ways of mitigating systematic bias in empirical studies.
One approach to accounting for data that have evolved under non-SRH conditions is to employ models that relax the SRH assumptions. A number of non-SRH models have been implemented in a variety of software packages (Foster 2004;Lartillot and Philippe 2004;Blanquart and Lartillot 2006;Boussau and Gouy 2006;Knight, et al. 2007;Dutheil and Boussau 2008;Sumner, et al. 2012;Zou, et al. 2012;Groussin, et al. 2013;Jayaswal, et al. 2014;Nguyen, et al. 2015). But despite the availability of these models they remain infrequently used, as searching for optimal phylogenetic trees under these models is computationally demanding (Betancur-r, et al. 2013) and the implementations are often not easy to use. As a result, the vast majority of empirical phylogenetic inferences rely on models which assume that sequences have evolved under SRH conditions, such as the general time reversible (GTR) family of models implemented in the most widely-used phylogenetics software packages (Swofford 2001;Drummond and Rambaut 2007;Guindon, et al. 2010;Ronquist, et al. 2012;Bazinet, et al. 2014;Bouckaert, et al. 2014;Stamatakis 2014;Nguyen, et al. 2015;Höhna, et al. 2016).
Another approach to accounting for data that may have evolved under non-SRH conditions is to test for model violations prior to tree reconstruction. Here, one first screens datasets or parts of datasets, and reconstruct trees exclusively from data that do not reject SRH conditions. A number of methods have been proposed to test for violation of SRH conditions in aligned sequences prior to estimating trees (Bowker 1948;Stuart 1955;Rzhetsky and Nei 1995;Kumar and Gadagkar 2001;Weiss and von Haeseler 2003;Ababneh, et al. 2006;Jermiin, et al. 2009), and there are also a posteriori tests for absolute model adequacy which are employed after trees have been estimated (Goldman 1993;Brown and ElDabaje 2009;Brown 2014;Duchene, et al. 2017;Brown and Thomson 2018).
Allowing the data to reject the model when the assumptions of the model are violated is an important approach to reducing systematic bias in phylogenetic inference Brown 2014). Knowing in advance the sequences and loci that are inconsistent with the SRH assumptions will alert us to choose more complex models for phylogeny reconstruction or potentially to omit these loci from downstream analyses (Kumar and Gadagkar 2001). The need for methods that assess the evolutionary process prior to phylogenetic inference becomes more important as the number of sequences and sites per dataset increases, because systematic bias has an increasing effect on inferences from larger phylogenetic datasets Phillips, et al. 2004;.
In this paper we evaluate the extent and effect of model violation due to non-SRH evolution using 35 empirical datasets with a total of 3,572 partitions. We determine if the SRH assumptions are violated by extending and applying the matched-pairs tests of symmetry  to each partition. We then compare the phylogenetic trees for each dataset estimated from all of the partitions, the partitions that reject the SRH assumptions, and the partitions that do not reject the SRH assumptions, in order to evaluate the effect violating SRH conditions on phylogenetic inference.

Empirical datasets
In order to assess the impact of model violation in phylogenetics, we first gathered a representative sample of 35 partitioned empirical datasets that had been used for phylogenetic analysis in recent studies (Table 1). Within the constraints of selecting data that were publicly available and suitably annotated, we selected the datasets to provide as representative a sample as possible of the data types, taxa, and genomic regions most commonly used to infer bifurcating phylogenetic trees from concatenated alignments. These datasets include nucleotide sequences from nuclear, mitochondrial, plastid and virus genomes, and include protein-coding DNA, introns, intergenic spacers, tRNA, rRNA and ultra-conserved elements.
The number of taxa and sites in these datasets range from 27 to 355 and from 699 to 1,079,052 respectively. The clades represented in these datasets include animals, plants and viruses. All the datasets files can be found on Github in nexus format https://github.com/roblanf/SRHtests/tree/master/datasets, and the full details of every dataset are provided in Table 1.  (Bergsten, et al. 2013a) (Bergsten, et al. 2013b) Dytiscidae 38 2111 Broughton_2013 (Broughton, et al. 2013b) (Broughton, et al. 2013a) Osteichthyes 61 19997 Brown_2012 ) ) Ptychozoon 41 1665 Cannon_2016a (Cannon, et al. 2016a) (Cannon, et al. 2016b Faircloth_2013 ) ) Actinopterygii 27 149366 Fong_2012 (Fong, et al. 2012b) (Fong, et al. 2012a) Vertebrata 110 25919 Horn_2014 (Horn, et al. 2014b) (Horn, et al. 2014a Workflow summary Figure 1 outlines the workflow. For each partition in each dataset, we used three matchedpairs tests of symmetry to ask whether the evolution of the aligned sequences rejects the SRH assumptions. The three tests, described in more detail below, comprise: the binomial matched-pairs test of symmetry (BiSymTest); the binomial matched-pairs test of marginal symmetry (BiSymTestmar); and the binomial matched-pairs test of internal symmetry (BiSymTestint). These approaches test three slightly different assumptions about the historical process that generated the aligned sequences in a given partition, and a significant result from any test suggests that the nature of the evolutionary process required to explain the aligned sequences violates at least one of the three SRH conditions (Jermiin, et al. 2017, and below).
For each test, we classify each partition as pass if the result of the test is non-significant or fail if the result of the test is significant. We then denote the original dataset as Dall, while the concatenation of pass partitions is denoted Dpass and the concatenation of fail partitions as Dfail ( fig. 1).
To investigate the impact of model violation on phylogenetic inference, we infer and compare three phylogenetic trees, Tall, Tpass and Tfail, estimated from Dall, Dpass and Dfail, respectively.

Fig. 1| Flow chart of methodology.
After application of the matched-pairs tests of symmetry on each possible pair of sequences for each partition, we apply a binomial test on the p-values of all possible pairs of sequences of each partition to derive a P-value for that partition.  (Ababneh, et al. 2006). The statistics are computed on divergence matrices derived from a site-by-site comparison of two homologous sequences. Accordingly, using the chi-square test provides an indication of whether or not it is reasonable to assume that the pair of sequences has evolved under SRH conditions ).

Matched-pairs tests of symmetry
The MPTS is a comprehensive and sufficient test to determine whether the data complies with the SRH assumptions ), but it cannot provide any information about the source of this violation. Some information on the underlying source of model violation may be obtained by performing two further tests of symmetry, the MPTMS and the MPTIS.
If the violation of the SRH assumptions stems from differences in base composition between the sequences, this should affect the marginal symmetry of the sequence pair, which can in principle be detected by the MPTMS. While if the violation of the SRH assumptions stems from differences in substitution rates over time, this should affect the internal symmetry of the sequence pair, which can in principle be detected by the MPTIS. However, even after performing all three tests, it is difficult to ascertain which of the three SRH assumptions is violated during the evolutionary process because the relationships between the SRH conditions and the three matched-pair tests is neither bijective nor injective, i.e. there is no one-to-one correspondence between the three tests and violation of the three SRH conditions .

Binomial matched-pairs tests of symmetry
The three matched-pairs tests of symmetry were designed to ask whether any single pair of sequences rejects the SRH conditions . To ask whether a given partition rejects SRH conditions, we extend the matched-pairs tests of symmetry to accommodate datasets with more than two sequences. To do so, we apply the matched-pair tests of symmetry to every pair of sequences in in a partition, resulting in ( 2 ) chi-squared p-values for each partition, where is the number of sequences in the partition ( fig. 1). Under the null hypothesis of SRH evolution, the marginal distribution of each p-value should be uniform on the interval [0,1], suggesting a 5% chance that any one of them falls below 0.05. We use this logic to assess when too many p-values were too small, in which case a partition is deemed to fail the SRH conditions. Specifically, we count how many chi-square p-values are less than 0.05 and compare the result to a Binomial distribution with ( 2 ) trials and success probability 0.05. If the corresponding Binomial p-value is smaller than 0.05, we classify the partition as fail; otherwise, the partition is labelled as pass. While this approach ignores the dependencies among p-values as obtained from pairwise sequence comparisons, a full accounting of the dependencies among p-values would require knowledge of the underlying phylogeny. Given that these tests aim to determine a priori whether it is feasible to build a reliable phylogeny, it would be inappropriate to use an estimated phylogeny as part of the test. Furthermore, even ignoring dependencies among p-values, if the evolutionary process underlying an alignment was generated by an evolutionary process that violates SRH conditions on many branches, we would still expect that alignment to fail the binomial test. To avoid confusion between the binomial and pairwise tests, we denote the binomial extensions of the MPTS, MPTMS, and the MTPIS as BiSymTest, BiSymTestmar, and BiSymTestint respectively.

Phylogenetic inference
We used IQ-TREE (Nguyen, et al. 2015) to infer up to seven phylogenetic trees for every dataset: Tall (all partitions from the original dataset; Dall); and Tpass and Tfail based on the Dpass and Dfail datasets from each of the three tests (BiSymTest, BiSymTestmar, BiSymTestint), provided that there was at least one partition in each category. We ran IQ-TREE using the default settings with the best-fit fully-partitioned model (Chernomor, et al. 2016) with edgelinked rates determined by ModelFinder (Kalyaanamoorthy, et al. 2017) followed 1000 ultrafast bootstrap replicates (Hoang, et al. 2018).

Distance between trees
For each of the three tests (MPTS, MPTMS, MPTIS) we calculated the Normalised Path-Difference (NPD) between all three possible pairs of trees (Tall vs. Tpass; Tall vs. Tfail; and Tpass vs. Tfail), as long as Dpass and Dfail were non-empty and so Tpass and Tfail had been estimated.
The path-difference metric (PD) is defined as the Euclidean distance between pairs of taxa (Steel and Penny 1993;Mir and Russello 2010). In order to compare path distances between trees with different number of taxa, we normalised PD (to obtain NPD) by the mean of a null distribution of PDs generated from 10K random pairs of trees with the same number of taxa (Bogdanowicz, et al. 2012). Thus, an NPD of zero indicates an identical pair of trees, an NPD of 1 indicates that a pair of trees is as similar as a pair of randomly-selected trees with the same number of taxa; and an NPD greater than 1 indicates a pair of trees that are less similar than a randomly-selected pair of trees with the same number of taxa. Since path differences are always non-negative, the NPD is also guaranteed to be non-negative.

Tree topology tests
The NPD gives us a measure of the differences between pairs of trees, but it does not tell us whether the differences are phylogenetically significant differences underpinned by strong differences in signal in the three datatsets (Dpass, Dall, and Dfail) derived from a given test. For example, trees that differ due to stochastic error associated with small datasets may be very different, but such differences may not be statistically significant. To assess the significance of the differences between trees, we used the weighted Shimodaira-Hasegawa (wSH) test (Shimodaira and Hasegawa 1999;Shimodaira 2002) implemented in IQ-TREE with 1000 RELL replicates (Kishino, et al. 1990), between all three pairs of trees for each test (Tall vs. Tpass; Tall vs. Tfail; and Tpass vs. Tfail), as long as Dpass and Dfail were non-empty. The wSH test returns a p-value for a given set of trees and an alignment, and assesses whether the more likely tree given a particular alignment is a significantly better description of the data than the less likely trees. Thus, for each set of trees (e.g. Tall, Tpass and Tfail) we performed three wSH tests, one for each underlying alignment (Dall, Dpass and Dfail). In total, this resulted in up to 9 wSH tests for every partition and each of the three tests (MPTS, MPTMS, and MPTIS).

Comparing symmetry tests to random subsets of partitions
It is conceivable that simply dividing the partitions of any dataset into two groups would produce two different trees. In order to assess whether the division of the datasets into Dpass and Dfail based on the matched-pairs tests of symmetry led to differences between trees that are more extreme than those we might expect by chance, we generated a null distribution of these 20 replicates, we inferred a tree for both subsets of the data and compared these two trees using the wSH test as above. We then calculate a single wSH test using the replicate dataset with the same number of partitions as Dfail as the focal dataset. This gives us a null distribution of 20 wSH p-values for each of the 24 datasets with sufficient partitions to employ this approach. For each dataset, we then compared the observed wSH p-value from comparing Tpass and Tfail using to this null distribution. This allows us to ask whether dividing datasets into two groups based on the results of matched-pairs tests of symmetry leads to more extreme differences between phylogenetic trees than might be expected by chance.

Correlation between number of substitutions and model violation
We hypothesised that partitions with more substitutions may be more likely to violate the SRH assumptions, since substitutions form the raw data for the matched-pairs tests of symmetry. To assess this, we fitted a linear mixed-effects model for each of the three tests using the glmer function from the lme4 package in R (Bates, et al. 2015). In this model, we treat each partition as a datapoint, the number of substitutions measured for that partition as a fixed effect, and the dataset from which that partition was taken as a random effect. This allows us to estimate the extent to which the number of substitutions in a partition correlates with whether a partition fails a given test of symmetry, after accounting for differences between the datasets. To calculate the R squared value we use the r.squaredGLMM function from the MuMIn package in R (Barton 2009;Nakagawa and Schielzeth 2013).

Software implementation
We implemented a new option --bisymtest in IQ-TREE to perform the three matched pairs tests of symmetry and their associated binomial extensions. In addition, the option -bisymtest-remove-bad allows users to remove from the final analysis partitions that fail the BiSymTest. One can change the removal criterion to BySymTestmar or BySemTestint via the --bisymtest-type MAR|INT option. In addition, the cut-off p-value can be changed using the -bisymtest-pval NUM option, where the default value is 0.05.
The GitHub repository https://github.com/roblanf/SRHtests contains the raw data and Python and R scripts necessary to perform all analyses reported in this study.

Violation of SRH conditions is common across 35 empirical datasets
Across all   The propensity of partitions to fail the matched-pairs tests of symmetry also varied with the type of genome (e.g. mitochondrial, chloroplast, or nuclear) and the type of locus (e.g. protein-coding, UCE, tRNA) from which the partition was sequenced (Table 2, Extended Data fig. 7), although we note that a substantial proportion of the partitions from almost every category failed at least one of the tests (Table 2).

Model violation has a strong influence on tree topologies
For the BiSymTest and the BiSymTestmar, the tree estimated from the whole dataset, Tall, tended to be more similar to the tree estimated from the partitions that failed each test, Tfail, than to the tree estimated from the partitions that passed each test, Tpass (Table 3). Whereas Tfail and Tpass tended to be the most different pair of trees. The results of the wSH tests (Table   4) confirm that this pattern is driven by statistically significant differences between the three trees. For example, when using the BiSymTest, Tfail rejects Tall in ~25% of the datasets, and   Differences in tree topologies are frequently more severe than expected by chance In total we were able to perform the randomisation test on 15 datasets. In 10 of these 15 datasets, the observed p-value was smaller than any of the replicate p-values suggesting that the difference between Tpass and Tfail measured by the wSH test was larger than would be expected by randomly dividing a dataset into two subsets of partitions.

The number of substitutions explains up to half of the variance in passing or failing the tests of symmetry
The number of substitutions in a partition explained 52% of the variation in whether or not a partition passed or failed the BiSymTest. This proportion is very similar for the BiSymTestmar (53%), but is dramatically lower for the BiSymTestint (0.5%). Accordingly, the number of substitutions in a partition is a highlight significant (p<2e-16) predictor of passing or failing the BiSymTest and the BiSymTestmar, but is not a significant predictor of passing or failing the BiSymTestint.

Model violation affects the internal relationships of Spiralia and the position of Xenacoelomorpha
To examine the effects of model violation in more detail, we selected a single dataset for more detailed consideration. Conflicting support for the placement of Xenacoelomorpha in the tree of life across different analyses led to various hypotheses regarding to the evolution of Bilateria (Cannon, et al. 2016a). It has been suggested that such inferences might be strongly affected by model violation and systematic error (Philippe, et al. 2011). To assess whether data that pass or fail the BiSymTest show different signals regarding the evolution of the Bilateria, we examined in more detail the Tall, Tpass, and Tfail trees from the dataset of Cannon et al. (2016a). This dataset comprises 76 metazoan taxa, 2 choanoflagellate outgroups, 212 genes and 424 partitions representing the first and second codon positions of the 212 genes (Cannon, et al. 2016b

Discussion
In this paper, we show that model violation is prevalent and has a strong impact on tree reconstruction in many phylogenetic datasets. This impact varies a lot between different datasets and different types of partitions. The trees inferred from different groups of partitions from the same dataset often have topologies that are biologically and statistically significantly different.
Our results show great variation in the extent of model violation among different datasets and partitions. This is demonstrated by the different proportion of partitions that failed the matched-pairs tests of symmetry in each dataset and in each genomic context (codon position, rRNA, tRNA, UCE or other) and type of genome (nuclear, mitochondrial, plastid and virus).
Model violations are most frequently observed in the third codon positions for viral, mitochondrial and nuclear genomes and intergenic spacers in plastid sequences. Yet, our results affirm that non-SRH evolution is not constrained to these genomic regions. For example, in a dataset of first and second codon positions from 212 genes (Cannon, et al. 2016b), 21% of partitions showed significant evidence of violating the SRH assumptions.
The tree inferred from the partitions that show significant model violation differs a great deal in its topology from the tree inferred from the partitions that do not show significant model violation, particularly with respect to the placement of the focal taxon Xenoturbella (fig. 3).
The results of our study also provide some insights into the likely cause of model violation in the datasets we examined. Figure 2 shows that violation of marginal symmetry (assessed with BiSymTestmar) was much more common that violation of internal symmetry (assessed with BiSymTestint). This suggests that non-stationarity, which is associated with marginal symmetry, is a more common cause of systematic bias than non-homogeneity in the datasets that we examined (see also Jayaswal, et al. 2005;Ababneh, et al. 2006;Song, et al. 2010).
This suggests that the development and application of non-stationary models (e.g. Yang and Roberts 1995;Yap and Speed 2005) may be an important avenue towards reducing systematic bias in future analyses.
One limitation of the binomial tests that we propose here is that their power will be limited if there are few differences between the sequences being examined. Indeed, our analyses show that in our representative sample of more than 3500 partitions from published datasets, roughly ~50% of the variance in whether a partition passes or fails a given test can be attributed to the number of observed differences between the sequences. Nevertheless, this suggests that the remaining ~50% of the variance in whether a partition passes or fails a test could be attributable to variation in the extent of model violation among partitions (the number is much higher for BiSymTestint). This suggests that we should be cautiously optimistic: although a lack of power on small or slowly-evolving partitions may induce some false negatives (i.e. failures to identify partitions that have evolved under non-SRH conditions), removing the partitions that show the strongest evidence of model violation may still improve phylogenetic analyses by reducing the overall burden of model violation on the inference of the tree topology. We hope that our implementation of these tests in userfriendly software IQ-TREE will allow empirical phylogeneticists to continue to explore whether this is the case.