Abstract
In many applications of evolutionary inference, a model of protein evolution needs to be fitted to the amino acid variation at individual sites in a multiple sequence alignment. Most existing models fall into one of two extremes: Either they provide a coarse-grained description that lacks biophysical realism (e.g., dN/dS models), or they require a large number of parameters to be fitted (e.g., mutation–selection models). Here, we ask whether a middle ground is possible: Can we obtain a realistic description of site-specific amino acid frequencies while severely restricting the number of free parameters in the model? We show that a distribution with a single free parameter can accurately capture the variation in amino acid frequency at most sites in an alignment, as long as we are willing to restrict our analysis to predicting amino acid frequencies by rank rather than by amino acid identity. This result holds equally well both in alignments of empirical protein sequences and of sequences evolved under a biophysically realistic all-atom force field. Our analysis reveals a near universal shape of the frequency distributions of amino acids. This insight has the potential to lead to new models of evolution that have both increased realism and a limited number of free parameters.
Similar content being viewed by others
References
Arenas M (2015) Trends in substitution models of molecular evolution. Front Genet 6:319
Arenas M, Posada D (2014) Simulation of genome-wide evolution under heterogeneous substitution models and complex multispecies coalescent histories. Mol Biol Evol 31:1295–1301
Arenas M, Sánchez-Cobos A, Bastolla U (2015) Maximum-likelihood phylogenetic inference with selection on protein folding stability. Mol Biol Evol 32:2195–2207
Ashkenazy H, Abadi S, Martz E, Chay O, Mayrose I, Pupko T, Ben-Tal N (2016) ConSurf 2016: an improved methodology to estimate and visualize evolutionary conservation in macromolecules. Nucl Acids Res 44(W1):W344–W350
Bastolla U, Arenas M (2019) The influence of protein stability on sequence evolution: applications to phylogenetic inference. In: Sikosek T (ed) Computational methods in protein evolution. Springer, New York, pp 215–231
Benjamini Y, Hochberg Y (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc Ser B (Methodol) 57:289–300
Bruno WJ (1996) Modeling residue usage in aligned protein sequences via maximum likelihood. Mol Biol Evol 13:1368–1374
Conant GC, Stadler PF (2009) Solvent exposure imparts similar selective pressures across a range of yeast proteins. Mol Biol Evol 26:1155–1161
Dokholyan NV, Mirny LA, Shakhnovich EI (2002) Understanding conserved amino acids in proteins. Physica A 314:600–606
Dokholyan NV, Shakhnovich EI (2001) Understanding hierarchical protein evolution from first principles. J Mol Biol 312:289–307
Echave J, Jackson EL, Wilke CO (2015) Relationship between protein thermodynamic constraints and variation of evolutionary rates among sites. Phys Biol 12:025002
Echave J, Wilke CO (2017) Biophysical models of protein evolution: understanding the patterns of evolutionary sequence divergence. Annu Rev Biophys 46:85–103
Friedman J, Hastie T, Tibshirani R (2010) Regularization paths for generalized linear models via coordinate descent. J Stat Softw 33:1
Goldman N, Yang Z (1994) A codon-based model of nucleotide substitution for protein-coding DNA sequences. Mol Biol Evol 11:725–736
Goldstein RA, Pollock DD (2016) The tangled bank of amino acids. Protein Sci 25:1354–1362
Halpern AL, Bruno WJ (1998) Evolutionary distances for protein-coding sequences: modeling site-specific residue frequencies. Mol Biol Evol 15:910–917
Jackson EL, Ollikainen N, Covert AW, Kortemme IT, Wilke CO (2013) Amino-acid site variability among natural and designed proteins. PeerJ 1:211
Jiang Q, Teufel AI, Jackson EL, Wilke CO (2018) Beyond thermodynamic constraints: evolutionary sampling generates realistic protein sequence variation. Genetics 208:1387–1395
Jimenez MJ, Arenas M, Bastolla U (2018) Substitution rates predicted by stability-constrained models of protein evolution are not consistent with empirical data. Mol Biol Evol 35:743–755
Jukes TH, Cantor CR (1969) Evolution of protein molecules. In: Munro HN (ed) Mammalian protein metabolism III. Academic Press, New York, pp 21–132
Kimura M (1977) Preponderance of synonymous changes as evidence for the neutral theory of molecular evolution. Nature 267:275–276
Kimura M (1980) A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J Mol Evol 16:111–120
Kosakovsky Pond SL, Frost SDW (2005) Not so different after all: a comparison of methods for detecting amino acid sites under selection. Mol Biol Evol 22:1208–1222
Koshi JM, Goldstein RA (1998) Models of natural mutations including site heterogeneity. Proteins 32:289–295
Kryazhimskiy S, Plotkin JB (2008) The population genetics of \(dN/dS\). PLoS Genet 4:e1000304
Meyer AG, Wilke CO (2013) Integrating sequence variation and protein structure to identify sites under selection. Mol Biol Evol 30:36–44
Porto M, Roman HE, Vendruscolo M, Bastolla U (2005) Prediction of site-specific amino acid distributions and limits of divergent evolutionary changes in protein sequences. Mol Biol Evol 22:630–638
Puller V, Sagulenko P, Neher R. A (2020). Efficient inference, potential, and limitations of site-specific substitution models. bioRxiv
Pupko T, Bell RE, Mayrose I, Glaser F, Ben-Tal N (2002) Rate4Site: an algorithmic tool for the identification of functional regions in proteins by surface mapping of evolutionary determinants within their homologues. Bioinformatics 18(Suppl 1):S71–S77
R Core Team (2019) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna
Ramsey DC, Scherrer MP, Zhou T, Wilke CO (2011) The relationship between relative solvent accessibility and evolutionary rate in protein evolution. Genetics 188:479–488
Rodrigue N (2013) On the statistical interpretation of site-specific variables in phylogeny-based substitution models. Genetics 193:557–564
Rodrigue N, Lartillot N (2014) Site-heterogeneous mutation-selection models within the PhyloBayes-MPI package. Bioinformatics 30:1020–1021
Rodrigue N, Philippe H, Lartillot N (2010) Mutation-selection models of coding sequence evolution with site-heterogeneous amino acid fitness profiles. Proc Natl Acad Sci USA 107:4629–4634
Spielman SJ, Kosakovsky Pond SL (2018) Relative evolutionary rate inference in HyPhy with LEISR. PeerJ 6:e4339
Spielman SJ, Wilke CO (2015) The relationship between \(dN/dS\) and scaled selection coefficients. Mol Biol Evol 32:1097–1108
Spielman SJ, Wilke CO (2016) Extensively parameterized mutation-selection models reliably capture site-specific selective constraint. Mol Biol Evol 33:2990–3002
Strait BJ, Dewey TG (1996) The Shannon information entropy of protein sequences. Biophys J 71:148–155
Strauß ME, Reid JE, Wernisch L (2019) GPseudoRank: a permutation sampler for single cell orderings. Bioinformatics 35:611–618
Tamuri AU, dos Reis M, Goldstein RA (2012) Estimating the distribution of selection coefficients from phylogenetic data using sitewise mutation-selection models. Genetics 190:1101–1115
Tamuri AU, Goldman N, dos Reis M (2014) A penalized-likelihood method to estimate the distribution of selection coefficients from phylogenetic data. Genetics 197:257–271
Teufel AI, Wilke CO (2017) Accelerated simulation of evolutionary trajectories in origin-fixation models. J R Soc Interface 14:20160906
Wickham H, Averick M, Bryan J, Chang W, D’Agostino McGowan L, François R, Grolemund G, Hayes A, Henry L, Hester J, Kuhn M, Lin Pedersen T, Miller E, Milton Bache S, Müller K, Ooms J, Robinson D, Paige Seidel D, Spinu V, Takahashi K, Vaughan D, Wilke C, Woo K, Yutani H (2019) Welcome to the tidyverse. J Open Source Softw 4:1686
Wilson DJ, McVean G (2006) Estimating diversifying selection and functional constraint in the presence of recombination. Genetics 172:1411–1425
Yang Z, Bielawski JP (2000) Statistical methods for detecting molecular adaptation. Trends Ecol Evol 15:496–503
Yang Z, Nielsen R (2008) Mutation-selection models of codon substitution and their use to estimate selective strengths on codon usage. Mol Biol Evol 25:568–579
Acknowledgements
This work was supported by National Institutes of Health (NIH) Grant R01 GM088344. M.M.J. acknowledges support from NIH training Grant T32 LM012414-01A1.
Author information
Authors and Affiliations
Corresponding author
Additional information
Handling editor: David Liberles
Electronic supplementary material
Below is the link to the electronic supplementary material.
Rights and permissions
About this article
Cite this article
Johnson, M.M., Wilke, C.O. Site-Specific Amino Acid Distributions Follow a Universal Shape. J Mol Evol 88, 731–741 (2020). https://doi.org/10.1007/s00239-020-09976-8
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s00239-020-09976-8