Enrichment in conservative amino acid changes among fixed and standing missense variations in slow evolving proteins

Proteins were first used in the early 1960s to discover the molecular clock dating method and remain in common usage today in phylogenetic inferences based on neutral variations. To avoid substitution saturation, it is necessary to use slow evolving genes. However, it remains unclear whether fixed and standing missense changes in such genes may qualify as neutral. Here, based on the evolutionary rates as inferred from identity scores between orthologs in human and Macaca monkey, we found that the fraction of conservative amino acid mismatches between species was significantly higher in slow evolving proteins. We also examined the single nucleotide polymorphisms (SNPs) by using the 1000 genomes project data and found that missense SNPs in slow evolving proteins also had higher fraction of conservative changes, especially for common SNPs, consistent with more natural selection for SNPs, particularly rare ones, in fast evolving proteins. These results suggest that fixed and standing missense variations in slow evolving proteins are more likely to be neutral and hence better qualified for use in phylogenetic inferences.

theory further provides an explanation for such a relationship that the substitution rate under selective neutrality is expected to be equal to the mutation rate (Kimura 1983).
If however, mutations/substitutions are not neutral or under natural selection, the substitution rate would be affected by the population size and the selection coefficient, which are unlikely to be constant among the lineages. Thus, the distance matrix methods are sound provided that one uses neutral variants that accumulate to increase genetic distances in a nearly linear fashion common to the species concerned.
In phylogenetic inferences, one well known factor that may increase the uncertainty of the analysis is substitution saturation (Steel, Lockhart, Penny 1993;Philippe, Forterre 1999;Huang 2010;Brandley et al. 2011;Huang 2012;Soubrier et al. 2012). In saturation, the amount of substitutions or genetic distances would cease to be linearly related to time. Intuitively, the issue of saturation can be easily solved by using slow evolving genes as defined by high identity among species. However, such genes are well known to be under stronger purifying selection as defined by dN/dS ratio, which means that a new mutation has lower probability of being fixed (Cai, Petrov 2010). However, purifying selection as detected by the dN/dS method is largely concerned with non-observed mutations and says little about the fixed or observed variations. And phylogenetic inferences are all concerned with observed variants.
Nonetheless, it remains to be determined whether fixed and standing missense substitutions in slow evolving genes may qualify as neutral.
We here found that the proportion of conservative mismatches between species was higher in the slowest evolving set of proteins than in fast evolving proteins. Using datasets from the 1000 genomes (1KG) project phase 3 dataset (Auton et al. 2015), we also found that missense single nucleotide polymorphisms (SNPs) from the slowest evolving set of proteins, especially those with high minor allele frequency (MAF), were enriched with conservative amino acid changes, consistent with such changes being under less natural selection. We suggest that the distance matrix methods, together with the new method here for selecting slow evolving sequences or neutral variants, may be best qualified for inferring realistic evolutionary trees.

Methods
Classification of proteins as slow and fast evolving. The identification of slow evolving proteins and their associated SNPs were as previously described ). Briefly, we collected the whole genome protein data of Homo sapiens (version 36.3) and Macaca mulatta (version 1) from the NCBI ftp site and then compared the human protein to the monkey protein using local BLASTP program at a cut-off of 1E-10. We only retained one human protein with multiple isoforms and chose the monkey protein with the most significant E-value as the orthologous counterpart of each human protein. The aligned proteins were ranked by percentage identities.
Proteins that show the highest identity between human and monkey were included in the set of slow or the slowest evolving (including 423 genes > 304 amino acid in length with 100% identity and 178 genes > 1102 amino acid in length with 99% identity between monkey and human). The rest are all considered as fast evolving proteins. The cut off criterion was based on the empirical observation of low substitution saturation, and the finding that missense SNPs from the slow set of proteins produced genetic diversity patterns that were distinct from those from the fast set .

SNP selection.
We downloaded the 1KG phase 3 data and assigned SNP categories using ANNOVAR (Auton et al. 2015). We then picked out the missense SNPs located in the slow evolving set of genes from the downloaded VCF files . MAF was derived from AF (alternative allele frequency) values from the VCF files. Missense SNPs in fast evolving genes included all those from 1KG that are not from the slow evolving set.
Scoring conservative amino acid replacements. For fixed substitutions as revealed by BLASTP, conservative changes were scored as those that are counted as "positive" mismatches in BLASTP alignment, which was largely based on having positive values in the BLOSUM62 matrix (Pearson 2013). The degree of physical/chemical change in an amino acid missense mutation was ranked by a scoring series, -3,-2, -1, 0, 1, 2, 3, in the BLOSUM62 matrix with more positive values representing more conservative changes. For missense SNPs, we assigned each amino acid mutation a score based on the BLOSUM62 matrix.

Fixed amino acid mismatches and evolutionary rates of proteins
We determined the evolutionary rates of proteins in the human proteome by the percent mismatches between the human proteins and their orthologs in Macaca monkey. As described previously, the slow set contains the slowest evolving 423 genes (> 304 amino acid in length) with 100% identity and 178 genes (> 1102 amino acid in length) with 99% identity between monkey and human ). This group was identified based on the observation that missense SNPs located in these proteins produced distinct genetic diversity patterns from those by missense SNPs in faster evolving proteins or those by a randomly selected set consisting of mostly noncoding SNPs . We then divided the proteins into several groups of different evolutionary rates and compared the proportion of conservative amino acid mismatches in each group. Conservative changes were scored as those that are counted as "positive" mismatches in BLASTP alignment (>0 based on scale found in the BLOSUM62 matrix) (Pearson 2013).
The mismatches between two species would have one of the two residues or alleles as ancestral in the case of slow evolving proteins yet to reach saturaton (no independent mutations occurring at the same site among species) and so a mismatch due conservative changes would involve a conservative mutation during evolution from the ancestor to extant species. But at saturation for fast evolving proteins where a site may encounter multiple mutations, while a drastic mismatch would necessarily involve a drastic mutation, it is possible for a conservative mismatch to result from at least two independent drastic mutations (If the common ancestor has Arg at some site, a drastic mutation event at this site occurring in each of the two species, Arg to Leu in one and Arg to Ile in the other, may lead to a conservative mismatch of Leu and Ile).
Thus, a conservative mismatch at saturation just means less physical/chemical differences between the two concerned species and says little about the actual mutation events. Lower fraction of conservative mismatches at saturation for fast evolving proteins would mean more physical/chemical differences between the two species, which may more easily translate into functional differences for natural selection to see and act upon.  To verify that the slowest evolving proteins with length >1102 amino acids included in the slow set is distinct from the fast set, we first compared proteins with length >1102 amino acids with no gaps in alignment (Table 1, Figure 1A) or with gaps (Table 1, Figure 1B). There was a general correlation between slower evolutionary rates and higher fraction of conservative changes, with a significant drop in the fraction of conservative changes between the slowest evolving, which was included in the slow set that has monkey-human identity > 99%, and the next set ( Figure 1).
Proteins with alignment gaps showed slightly lower fraction of conservative changes than those without gaps, consistent with lower sequence conservation (hence faster evolving) in proteins with alignment gaps. We further studied the remaining proteins

Standing amino acid variants and evolutionary rates of proteins
We next studied missense SNPs found in proteins of different evolutionary rates by using 1KG dataset (Auton et al. 2015). There were 15271 missense SNPs from the slow evolving set of proteins (>1102aa with 99% identity and > 304 aa with 100% identity) and 546297 missense SNPs in fast set (all proteins that remain after excluding the slow set). We assigned each amino acid change found in a missense SNP a conservation score (-3, -2, -1, 0, 1, 2, or 3) as described in the BLOSUM62 matrix with higher values representing more conservative changes in the physical/chemical properties of the amino acids (Pearson 2013). The amount of SNPs in each score category was then enumerated. We found that missense SNPs in the slow evolving set of proteins had lower fraction of drastic mutations (with score -3, and -2) and higher fraction of conservative mutations (with scores 3, 2, and 1) relative to those in the fast evolving set of proteins (P<0.001 for all score categories except for -1 and 0, Figure 3). The fraction of conservative mutations (with scores >0) in the slow evolving set was significantly higher than that of fast set (P<0.001, Figure 3).

Discussion:
Our results here showed that, contrary to naï ve expectations, fixed and standing changes in slow evolving proteins were enriched with conservative amino acid mismatches. Furthermore, such conservative changes are also more neutral or under less natural selection. It is easy to tell optimum/maximum saturation genetic distances from linear distances as described previously (Huang 2010). Imagine a 100 amino acid protein with only 1 neutral site. In a multispecies alignment involving at least 3 species, if one finds only one species with a mutation at this neutral site while all other species have the same non mutated residue, there is no saturation. However, if one finds that nearly every species has a unique amino acid, one would conclude mutation saturation as there are multiple independent substitution events among different species at the same site and repeated mutations at the same site do not increase distance. We have termed those sites with repeated mutations "overlap" sites (Huang 2010). So, a diagnostic criterion for saturated maximum distance is the proportion of overlap sites among mismatched sites between two species. Saturation would typically have 50-60% overlapped sites that are 2-3 fold higher than that expected under no-saturation (Huang 2010;Luo, Huang 2016). It is not expected to have near 100% overlapped sites because certain sites may only accommodate 2 or very few amino acid residues at saturation equilibrium, which would prevent them from presenting as overlapped sites even though they are in fact saturated sites. This overlap ratio method is free of uncertain assumptions and hence more realistic than other methods of testing for saturation, such as comparing to inferred number of mutations based on uncertain phylogenetic trees derived from maximum parsimony or maximum likelihood methods or assuming saturation to mean fully random composition when in fact saturation (as a result of selection) could just as well mean a fraction of all possible choices at a site (Steel, Lockhart, Penny 1993;Philippe et al. 1994;Xia et al. 2003). Our findings here show that saturation is maintained by selection as fast evolving proteins have lower fraction of conservative changes, which corrects the presently popular view that variant sites at saturation are fully neutral.
By using the overlap ratio method, we have verified that the vast majority of proteins show maximum distances and only a small proportion, the slowest evolving, are still at the linear phase of changes (Huang 2010;Luo, Huang 2016;Yuan et al. 2017). Variations at most genomic sites within human populations are also at optimum equilibrium as evidenced by the observation that a slight increase above the present genetic diversity level in normal subjects is associated with patient populations suffering from complex diseases (Yuan et al. 2012;Yuan et al. 2014;Zhu et al. 2015;He et al. 2017;Lei et al. 2018), as well as the observation that sharing of SNPs among different racial groups is an evolutionary rate dependent phenomenon with more sharing in fast evolving sequences ).
It turns out that the original hemoglobin and cytochrome C alignment results of Zuckerkandl, Pauling, and Margoliash (Zuckerkandl, Pauling 1962;Margoliash 1963) also have high proportions of overlap sites and so represent in fact saturation distances (Huang 2010). Thus, the interpretation of those results as molecular clock or a linear distance phenomenon (mutations always increase distances and distances always correlate with time) was mistaken.
If fast evolving proteins are at saturation, they would not be expected to reveal authentic phylogenetic relationships. However, most past studies have used them and some meaningful results do emerge including the results of Zuckerkandl, Pauling, and Margoliash (Zuckerkandl, Pauling 1962;Margoliash 1963 and ignored time, one would find a strong correlation of sequence identity with species complexity. One also finds that simple species is equidistant to all more complex species. So, the distance hierarchy with humans as measured by fast evolving proteins at saturation distance is a result of lower and lower complexity of species in more ancient times and hence higher and higher within species maximum genetic diversity (MGD). Lower complexity means wider tolerable error range in DNA building parts. The saturation distance to human for a lower complexity species is equal to the MGD of the lower species (Huang 2008a;Huang 2008b;Hu et al. 2013;Huang 2016).
The correlation of sequence identity with complexity makes sense as closer physiology should mean closer gene identity and less complex species should be able to tolerate more mutation variations ( Therefore, even though fast evolving proteins only represent maximum saturation distances, they may in some cases reveal proper phylogenetic topology due to the fact that complex species with lower MGD evolved more recently. Nonetheless, a proper method of phylogenetics must measure only ancestry-linked relationships and excludes sequence variations involved in physiology, which can only be accomplished by using slow evolving genes with fixed and standing neutral variations. The results here provided the key evidence for the neutrality of the fixed and standing variations in the slow evolving genes and should help with the popularization of using such genes in phylogenetic inferences. Among the existing methods of phylogeny inferences, most, such as the maximum likelihood methods and Bayesian methods, require the assumption of certain evolutionary models of amino acid or nucleotide changes, which may be unrealistic (Felsenstein 1981;Rannala, Yang 1996). Distance matrix methods do not rely on such models but requires the molecular clock and hence the neutral variants.
While maximum likelihood and related methods do not require the molecular clock, they do assume substitutions to be stochastic, probabilistic, and independent events (Felsenstein 1981), which only neutral variants can satisfy. As slow evolving proteins vary less dramatically in rates among species, they are inherently better suited for the distance matrix methods. We expect far more consistent future results of evolutionary trees to emerge once researchers have adopted a method that is largely free of uncertain assumptions. Declarations: