Skip to main content
Log in

Site-Specific Amino Acid Distributions Follow a Universal Shape

  • Original Article
  • Published:
Journal of Molecular Evolution Aims and scope Submit manuscript

Abstract

In many applications of evolutionary inference, a model of protein evolution needs to be fitted to the amino acid variation at individual sites in a multiple sequence alignment. Most existing models fall into one of two extremes: Either they provide a coarse-grained description that lacks biophysical realism (e.g., dN/dS models), or they require a large number of parameters to be fitted (e.g., mutation–selection models). Here, we ask whether a middle ground is possible: Can we obtain a realistic description of site-specific amino acid frequencies while severely restricting the number of free parameters in the model? We show that a distribution with a single free parameter can accurately capture the variation in amino acid frequency at most sites in an alignment, as long as we are willing to restrict our analysis to predicting amino acid frequencies by rank rather than by amino acid identity. This result holds equally well both in alignments of empirical protein sequences and of sequences evolved under a biophysically realistic all-atom force field. Our analysis reveals a near universal shape of the frequency distributions of amino acids. This insight has the potential to lead to new models of evolution that have both increased realism and a limited number of free parameters.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6

Similar content being viewed by others

References

  • Arenas M (2015) Trends in substitution models of molecular evolution. Front Genet 6:319

    PubMed  PubMed Central  Google Scholar 

  • Arenas M, Posada D (2014) Simulation of genome-wide evolution under heterogeneous substitution models and complex multispecies coalescent histories. Mol Biol Evol 31:1295–1301

    CAS  PubMed  PubMed Central  Google Scholar 

  • Arenas M, Sánchez-Cobos A, Bastolla U (2015) Maximum-likelihood phylogenetic inference with selection on protein folding stability. Mol Biol Evol 32:2195–2207

    CAS  PubMed  PubMed Central  Google Scholar 

  • Ashkenazy H, Abadi S, Martz E, Chay O, Mayrose I, Pupko T, Ben-Tal N (2016) ConSurf 2016: an improved methodology to estimate and visualize evolutionary conservation in macromolecules. Nucl Acids Res 44(W1):W344–W350

    CAS  PubMed  Google Scholar 

  • Bastolla U, Arenas M (2019) The influence of protein stability on sequence evolution: applications to phylogenetic inference. In: Sikosek T (ed) Computational methods in protein evolution. Springer, New York, pp 215–231

    Google Scholar 

  • Benjamini Y, Hochberg Y (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc Ser B (Methodol) 57:289–300

    Google Scholar 

  • Bruno WJ (1996) Modeling residue usage in aligned protein sequences via maximum likelihood. Mol Biol Evol 13:1368–1374

    CAS  PubMed  Google Scholar 

  • Conant GC, Stadler PF (2009) Solvent exposure imparts similar selective pressures across a range of yeast proteins. Mol Biol Evol 26:1155–1161

    CAS  PubMed  Google Scholar 

  • Dokholyan NV, Mirny LA, Shakhnovich EI (2002) Understanding conserved amino acids in proteins. Physica A 314:600–606

    CAS  Google Scholar 

  • Dokholyan NV, Shakhnovich EI (2001) Understanding hierarchical protein evolution from first principles. J Mol Biol 312:289–307

    CAS  PubMed  Google Scholar 

  • Echave J, Jackson EL, Wilke CO (2015) Relationship between protein thermodynamic constraints and variation of evolutionary rates among sites. Phys Biol 12:025002

    PubMed  PubMed Central  Google Scholar 

  • Echave J, Wilke CO (2017) Biophysical models of protein evolution: understanding the patterns of evolutionary sequence divergence. Annu Rev Biophys 46:85–103

    CAS  PubMed  PubMed Central  Google Scholar 

  • Friedman J, Hastie T, Tibshirani R (2010) Regularization paths for generalized linear models via coordinate descent. J Stat Softw 33:1

    PubMed  PubMed Central  Google Scholar 

  • Goldman N, Yang Z (1994) A codon-based model of nucleotide substitution for protein-coding DNA sequences. Mol Biol Evol 11:725–736

    CAS  PubMed  Google Scholar 

  • Goldstein RA, Pollock DD (2016) The tangled bank of amino acids. Protein Sci 25:1354–1362

    CAS  PubMed  PubMed Central  Google Scholar 

  • Halpern AL, Bruno WJ (1998) Evolutionary distances for protein-coding sequences: modeling site-specific residue frequencies. Mol Biol Evol 15:910–917

    CAS  PubMed  Google Scholar 

  • Jackson EL, Ollikainen N, Covert AW, Kortemme IT, Wilke CO (2013) Amino-acid site variability among natural and designed proteins. PeerJ 1:211

    Google Scholar 

  • Jiang Q, Teufel AI, Jackson EL, Wilke CO (2018) Beyond thermodynamic constraints: evolutionary sampling generates realistic protein sequence variation. Genetics 208:1387–1395

    CAS  PubMed  PubMed Central  Google Scholar 

  • Jimenez MJ, Arenas M, Bastolla U (2018) Substitution rates predicted by stability-constrained models of protein evolution are not consistent with empirical data. Mol Biol Evol 35:743–755

    CAS  PubMed  Google Scholar 

  • Jukes TH, Cantor CR (1969) Evolution of protein molecules. In: Munro HN (ed) Mammalian protein metabolism III. Academic Press, New York, pp 21–132

    Google Scholar 

  • Kimura M (1977) Preponderance of synonymous changes as evidence for the neutral theory of molecular evolution. Nature 267:275–276

    CAS  PubMed  Google Scholar 

  • Kimura M (1980) A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J Mol Evol 16:111–120

    CAS  PubMed  Google Scholar 

  • Kosakovsky Pond SL, Frost SDW (2005) Not so different after all: a comparison of methods for detecting amino acid sites under selection. Mol Biol Evol 22:1208–1222

    PubMed  Google Scholar 

  • Koshi JM, Goldstein RA (1998) Models of natural mutations including site heterogeneity. Proteins 32:289–295

    CAS  PubMed  Google Scholar 

  • Kryazhimskiy S, Plotkin JB (2008) The population genetics of \(dN/dS\). PLoS Genet 4:e1000304

    PubMed  PubMed Central  Google Scholar 

  • Meyer AG, Wilke CO (2013) Integrating sequence variation and protein structure to identify sites under selection. Mol Biol Evol 30:36–44

    CAS  PubMed  Google Scholar 

  • Porto M, Roman HE, Vendruscolo M, Bastolla U (2005) Prediction of site-specific amino acid distributions and limits of divergent evolutionary changes in protein sequences. Mol Biol Evol 22:630–638

    CAS  PubMed  Google Scholar 

  • Puller V, Sagulenko P, Neher R. A (2020). Efficient inference, potential, and limitations of site-specific substitution models. bioRxiv

  • Pupko T, Bell RE, Mayrose I, Glaser F, Ben-Tal N (2002) Rate4Site: an algorithmic tool for the identification of functional regions in proteins by surface mapping of evolutionary determinants within their homologues. Bioinformatics 18(Suppl 1):S71–S77

    PubMed  Google Scholar 

  • R Core Team (2019) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna

    Google Scholar 

  • Ramsey DC, Scherrer MP, Zhou T, Wilke CO (2011) The relationship between relative solvent accessibility and evolutionary rate in protein evolution. Genetics 188:479–488

    CAS  PubMed  PubMed Central  Google Scholar 

  • Rodrigue N (2013) On the statistical interpretation of site-specific variables in phylogeny-based substitution models. Genetics 193:557–564

    PubMed  PubMed Central  Google Scholar 

  • Rodrigue N, Lartillot N (2014) Site-heterogeneous mutation-selection models within the PhyloBayes-MPI package. Bioinformatics 30:1020–1021

    CAS  PubMed  Google Scholar 

  • Rodrigue N, Philippe H, Lartillot N (2010) Mutation-selection models of coding sequence evolution with site-heterogeneous amino acid fitness profiles. Proc Natl Acad Sci USA 107:4629–4634

    CAS  PubMed  Google Scholar 

  • Spielman SJ, Kosakovsky Pond SL (2018) Relative evolutionary rate inference in HyPhy with LEISR. PeerJ 6:e4339

    PubMed  PubMed Central  Google Scholar 

  • Spielman SJ, Wilke CO (2015) The relationship between \(dN/dS\) and scaled selection coefficients. Mol Biol Evol 32:1097–1108

    CAS  PubMed  PubMed Central  Google Scholar 

  • Spielman SJ, Wilke CO (2016) Extensively parameterized mutation-selection models reliably capture site-specific selective constraint. Mol Biol Evol 33:2990–3002

    CAS  PubMed  PubMed Central  Google Scholar 

  • Strait BJ, Dewey TG (1996) The Shannon information entropy of protein sequences. Biophys J 71:148–155

    CAS  PubMed  PubMed Central  Google Scholar 

  • Strauß ME, Reid JE, Wernisch L (2019) GPseudoRank: a permutation sampler for single cell orderings. Bioinformatics 35:611–618

    PubMed  Google Scholar 

  • Tamuri AU, dos Reis M, Goldstein RA (2012) Estimating the distribution of selection coefficients from phylogenetic data using sitewise mutation-selection models. Genetics 190:1101–1115

    PubMed  PubMed Central  Google Scholar 

  • Tamuri AU, Goldman N, dos Reis M (2014) A penalized-likelihood method to estimate the distribution of selection coefficients from phylogenetic data. Genetics 197:257–271

    PubMed  PubMed Central  Google Scholar 

  • Teufel AI, Wilke CO (2017) Accelerated simulation of evolutionary trajectories in origin-fixation models. J R Soc Interface 14:20160906

    PubMed  PubMed Central  Google Scholar 

  • Wickham H, Averick M, Bryan J, Chang W, D’Agostino McGowan L, François R, Grolemund G, Hayes A, Henry L, Hester J, Kuhn M, Lin Pedersen T, Miller E, Milton Bache S, Müller K, Ooms J, Robinson D, Paige Seidel D, Spinu V, Takahashi K, Vaughan D, Wilke C, Woo K, Yutani H (2019) Welcome to the tidyverse. J Open Source Softw 4:1686

    Google Scholar 

  • Wilson DJ, McVean G (2006) Estimating diversifying selection and functional constraint in the presence of recombination. Genetics 172:1411–1425

    CAS  PubMed  PubMed Central  Google Scholar 

  • Yang Z, Bielawski JP (2000) Statistical methods for detecting molecular adaptation. Trends Ecol Evol 15:496–503

    CAS  PubMed  PubMed Central  Google Scholar 

  • Yang Z, Nielsen R (2008) Mutation-selection models of codon substitution and their use to estimate selective strengths on codon usage. Mol Biol Evol 25:568–579

    CAS  PubMed  Google Scholar 

Download references

Acknowledgements

This work was supported by National Institutes of Health (NIH) Grant R01 GM088344. M.M.J. acknowledges support from NIH training Grant T32 LM012414-01A1.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Claus O. Wilke.

Additional information

Handling editor: David Liberles

Electronic supplementary material

Below is the link to the electronic supplementary material.

Electronic supplementary material 1 (PDF 1830 kb)

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Johnson, M.M., Wilke, C.O. Site-Specific Amino Acid Distributions Follow a Universal Shape. J Mol Evol 88, 731–741 (2020). https://doi.org/10.1007/s00239-020-09976-8

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s00239-020-09976-8

Keywords

Navigation