Site-Specific Amino Acid Distributions Follow a Universal Shape

Johnson, Mackenzie M.; Wilke, Claus O.

doi:10.1007/s00239-020-09976-8

Original Article
Published: 24 November 2020

Journal of Molecular Evolution, Volume 88, pages 731–741 (2020)

Abstract

In many applications of evolutionary inference, a model of protein evolution needs to be fitted to the amino acid variation at individual sites in a multiple sequence alignment. Most existing models fall into one of two extremes: Either they provide a coarse-grained description that lacks biophysical realism (e.g., dN/dS models), or they require a large number of parameters to be fitted (e.g., mutation–selection models). Here, we ask whether a middle ground is possible: Can we obtain a realistic description of site-specific amino acid frequencies while severely restricting the number of free parameters in the model? We show that a distribution with a single free parameter can accurately capture the variation in amino acid frequency at most sites in an alignment, as long as we are willing to restrict our analysis to predicting amino acid frequencies by rank rather than by amino acid identity. This result holds equally well both in alignments of empirical protein sequences and of sequences evolved under a biophysically realistic all-atom force field. Our analysis reveals a near universal shape of the frequency distributions of amino acids. This insight has the potential to lead to new models of evolution that have both increased realism and a limited number of free parameters.
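The rank-based idea in the abstract can be illustrated with a minimal sketch. The abstract does not state the functional form of the single-parameter distribution, so the code below assumes, purely for illustration, that frequencies decay exponentially with rank, p_k ∝ exp(-λk). The function names, the parameter name `lam`, and the example column are all hypothetical and are not taken from the article.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def ranked_frequencies(column):
    """Frequencies of the 20 amino acids in one alignment column,
    sorted from most to least frequent (amino acid identity is discarded)."""
    counts = Counter(c for c in column.upper() if c in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("column contains no standard amino acids")
    return sorted((counts[a] / total for a in AMINO_ACIDS), reverse=True)


def rank_distribution(lam, n=20):
    """Assumed one-parameter model: p_k proportional to exp(-lam * k), k = 0..n-1."""
    weights = [math.exp(-lam * k) for k in range(n)]
    z = sum(weights)
    return [w / z for w in weights]


def mean_rank(freqs):
    """Average rank (0-based) under a ranked frequency vector."""
    return sum(k * f for k, f in enumerate(freqs))


def fit_lambda(ranked, lo=0.0, hi=20.0, iterations=60):
    """Maximum-likelihood lam for the assumed model.

    For this exponential form the likelihood is maximized when the model's
    mean rank equals the observed mean rank, so we solve that equation by
    bisection. Fully conserved columns hit the upper bound `hi`.
    """
    target = mean_rank(ranked)
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if mean_rank(rank_distribution(mid, len(ranked))) > target:
            lo = mid  # model is too flat, so lam must be larger
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Hypothetical alignment column (one residue per sequence).
column = "AAAAAAAALLLLVVIG"
observed = ranked_frequencies(column)
lam = fit_lambda(observed)
predicted = rank_distribution(lam)
```

Because only ranks are modeled, the fitted `predicted` vector says how frequent the most common, second most common, and subsequent residues are at the site, but not which amino acids they are. That is the restriction the abstract describes.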

Acknowledgements

This work was supported by National Institutes of Health (NIH) Grant R01 GM088344. M.M.J. acknowledges support from NIH training Grant T32 LM012414-01A1.

Author information

Authors and Affiliations

Department of Integrative Biology, The University of Texas at Austin, Austin, TX, 78712, USA
Mackenzie M. Johnson & Claus O. Wilke
Institute for Cellular and Molecular Biology, The University of Texas at Austin, Austin, TX, 78712, USA
Mackenzie M. Johnson

Corresponding author

Correspondence to Claus O. Wilke.

Additional information

Handling editor: David Liberles

Electronic supplementary material

Below is the link to the electronic supplementary material.

Electronic supplementary material 1 (PDF 1830 kb)

About this article

Cite this article

Johnson, M.M., Wilke, C.O. Site-Specific Amino Acid Distributions Follow a Universal Shape. J Mol Evol 88, 731–741 (2020). https://doi.org/10.1007/s00239-020-09976-8

Received: 05 August 2020
Accepted: 17 November 2020
Published: 24 November 2020
Issue Date: December 2020
DOI: https://doi.org/10.1007/s00239-020-09976-8
