Calculating site-specific evolutionary rates at the amino-acid or codon level yields similar rate estimates

Site-specific evolutionary rates can be estimated from codon sequences or from amino-acid sequences. For codon sequences, the most popular methods use some variation of the dN∕dS ratio. For amino-acid sequences, one widely-used method is called Rate4Site, and it assigns a relative conservation score to each site in an alignment. How site-wise dN∕dS values relate to Rate4Site scores is not known. Here we elucidate the relationship between these two rate measurements. We simulate sequences with known dN∕dS, using either dN∕dS models or mutation–selection models for simulation. We then infer Rate4Site scores on the simulated alignments, and we compare those scores to either true or inferred dN∕dS values on the same alignments. We find that Rate4Site scores generally correlate well with true dN∕dS, and the correlation strengths increase in alignments with greater sequence divergence and more taxa. Moreover, Rate4Site scores correlate very well with inferred (as opposed to true) dN∕dS values, even for small alignments with little divergence. Finally, we verify this relationship between Rate4Site and dN∕dS in a variety of empirical datasets. We conclude that codon-level and amino-acid-level analysis frameworks are directly comparable and yield very similar inferences.


148
Both Rate4Site scores and per-site dN/dS values are measures of the extent to which selection acts 149 on individual protein sites. The Rate4Site model decomposes evolutionary distances in amino-acid 150 alignments into a site-specific component r k and a branch-specific component t i , such that the total 151 divergence at site k along branch i can be written as r k t i . Here, r k is the Rate4Site score at site k and t i is 152 the branch length of branch i in the phylogenetic tree. Importantly, r k is the same at all branches in the 153 tree and t i is the same at all sites for each branch i. Because the rate decomposition is invariant under a 154 rescaling of r ′ k = Cr k and t ′ i = t i /C, Rate4Site scores are not unique unless an additional normalization 155 condition is specified as well. The Rate4Site software solves this uniqueness problem by turning the r k 156 into z-scores. However, the more natural normalization is to divide all r k by their mean, r ′ k = r k / ∑ j r j , 157 where the sum runs over all sites in the protein. These normalized r ′ k scores have the simple interpretation 158 of providing the relative increase or decrease in substitution rate at site k compared to the average rate of 159 substitution in the rest of the protein. 160 In contrast to Rate4Site scores, which are calculated from amino-acid alignments, dN/dS ratios are 161 calculated on nucleotide alignments. They estimate the rate of non-synonymous divergence relative to 162 the rate of synonymous divergence. However, just like in Rate4Site, in a site-specific dN/dS model 163 evolutionary divergence is decomposed into a site-specific dN/dS value and a site-independent branch 164 length. Thus, Rate4Site and site-specific dN/dS measure fundamentally the same quantity. The main 165 difference is the input data (amino-acid sequences vs. codon sequences) and the normalization (relative to 166 mean across sites vs. relative to the synonymous divergence rate dS).

167
Relationship between Rate4Site scores and true dN/dS 168 To determine the relationship between Rate4Site and dN/dS models, we began by simulating sequence 169 evolution with known, site-specific dN/dS values and then comparing these true dN/dS values to 170 Rate4Site scores inferred from the simulated alignments ( Figure 1). We first considered the case of 171 constant dS among all sites. Thus, for each site in each alignment, we randomly drew a dN from a uniform 172 distribution ranging from 0.1 to 1.6. We set dS = 1 for all sites, such that the dN/dS ratios similarly 173 varied from 0.1 to 1.6. We ran simulations along a set of 25 balanced trees with different branch lengths 174 and numbers of taxa, as used previously in a study of dN/dS inference (Spielman et al., 2016). Simulated 175 sequences were 100 codon sites long, and we generated 50 replicate simulations for each simulation 176 condition. 177 We calculated the correlation between each site's true dN/dS and its inferred Rate4Site score to assess 178 how well the Rate4Site scores agreed with the simulated rates. We then plotted the mean correlation 179 strengths in replicate simulations against the simulations' branch lengths and number of taxa. We found 180 that correlation strengths systematically increased with both increasing branch lengths and number of 181 taxa ( Figure 2A). While correlations were low to moderate for the least-diverged and smallest alignments, 182 for larger and/or more diverged alignments correlations approached values ranging from 0.8 to 1.0. 183 We also performed a comparison of the magnitude of Rate4Site scores and dN/dS scores, by calculat-184 ing root-mean-square deviations (RMSD) between these scores. Because these two types of scores are not 185 measured in the same units, this comparison may not seem meaningful. However, we can convert both 186 types of scores into normalized, relative scores by dividing them by their mean score. These normalized 187 scores have comparable interpretations and RMSDs between them are meaningful quantities. 188 We found that RMSD values were generally moderate, between 0.1 and 0.6 ( Figure 2B). They 189 declined with both increasing number of taxa and increasing sequence divergence. However, overall 190 RMSD depended more strongly on branch length than on the number of taxa. Visual inspection of 191 normalized Rate4Site scores plotted against normalized dN/dS scores revealed no major systematic 192 differences between these scores ( Figure 3). Differences seemed to be driven primarily by the sampling 193 noise inherent in estimating site-specific evolutionary rates. 194 We repeated the same analysis but now using simulations in which dS was allowed to vary among 195 sites as well. The dN/dS range was kept the same as before (0.1 to 1.6), but now each site had its own 196 unique dS, randomly chosen from a uniform distribution ranging from 0.5 to 2. Overall, we found similar The above simulations utilized rates that were uniformly distributed across sites. In empirical datasets, 255 however, the majority of sites evolves slowly and only very few sites evolve rapidly, with a resulting 256 distribution that is approximately gamma (see e.g., Meyer and Wilke 2015b). Therefore, we wanted 257 to assess if the observed relationships between Rate4Site scores and dN/dS hold for simulations with 258 gamma-distributed true rates. We simulated sequences with gamma-distributed true dN/dS values, using 259 the shape and rate parameters obtained by Meyer and Wilke (2015b) for six different HIV-1 proteins. For 260 these simulations, we also included a range of smaller numbers of taxa (16, 32, and 64) than before, to 261 increase realism in the simulations.

262
We found a wide range of correlations between Rate4Site and true dN/dS ( Figure 8A and Rate4Site scores and true dN/dS were higher than in our previous analysis (compare Figure 8B to Figure   269 2B).

270
We also inferred dN/dS from the simulated sequences and compared them to the Rate4Site scores.

271
Unlike in our previous analysis on inferred dN/dS ( Figure 6), the correlations between Rate4Site and 272 inferred dN/dS now spanned a wide range of values ( Figure 8C and Parts C of Figures  that Rate4Site scores and dN/dS were highly correlated, with correlation coefficients exceeding 0.8 in all 287 cases ( Figure 9). Thus, we can conclude that our simulation results carry over to empirical datasets, and 288 that Rate4Site scores are generally comparable to inferred dN/dS values.

289
Note that we plotted raw (non-normalized) dN/dS in Figure 9, and as a result points are not expected 290 to fall onto the x = y line. We plotted the data in this way because dN/dS values are not typically 291 normalized to their mean, and we wanted to assess how similar in magnitude raw dN/dS scores are to 292 Rate4Site scores. We found that for proteins under strong positive selection (such as HIV-1 gp120), raw 293 dN/dS is virtually identical to Rate4Site. However, for more typical proteins that experience little positive 294 selection, raw dN/dS values tend to be up to a factor 10 smaller than the corresponding Rate4Site scores. Manuscript to be reviewed of these assumptions and limitations fundamentally invalidate our main findings. Amino-acid level and 363 codon-level analyses of sequence data will generally yield comparable estimates of site-specific rates of 364 evolution.