Abstract
Gene duplication is a fundamental process that has the potential to drive phenotypic differences between populations and species. While evolutionarily neutral changes have the potential to affect phenotypes, detecting selection acting on gene duplicates can uncover cases of adaptive diversification. Existing methods to detect selection on duplicates work mostly inter-specifically and are based upon selection on coding sequence changes. Here we present a method to detect selection directly on a copy number variant segregating in a population. The method relies upon the expected relationship between the age of an allele (here, a new duplication) and its frequency in the population, which depends upon the effective population size. The neutral baseline for copy number variants is established using a Moran model for both haploid and diploid populations under several population sizes. The ability of the method to reject neutrality for duplicates with known age (measured as a pairwise dS value) and known frequency in the population is established through mathematical analysis and through simulations. As expected, power is particularly good in the diploid case and with larger effective population sizes. With extension to larger population sizes, the method provides a tool to analyze selection on copy number variants in any natural or experimentally evolving population. We have made an R package available at https://github.com/peterbchi/CNVSelectR/ which implements the method introduced here.
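The age-frequency logic described above can be illustrated with a small Monte Carlo sketch. This is not the authors' implementation and not the CNVSelectR API; it is a minimal haploid Moran simulation under stated assumptions. All names (`moran_step`, `simulate_count`, `neutral_tail_probability`) are hypothetical, age is expressed in birth-death events rather than in dS, the duplicate is assumed to start as a single copy, and the tail probability is estimated by brute-force simulation.

```python
import random


def moran_step(count, n, s, rng):
    """One birth-death event in a haploid Moran model; returns the new duplicate count."""
    # Assumption: the parent is chosen proportionally to fitness,
    # where carriers of the duplicate have relative fitness 1 + s.
    weight_dup = count * (1.0 + s)
    p_birth_dup = weight_dup / (weight_dup + (n - count))
    # Assumption: the individual that dies is chosen uniformly at random.
    p_death_dup = count / n
    birth_dup = rng.random() < p_birth_dup
    death_dup = rng.random() < p_death_dup
    return count + int(birth_dup) - int(death_dup)


def simulate_count(n, age_steps, s=0.0, rng=None):
    """Duplicate count after age_steps events, starting from a single copy."""
    if rng is None:
        rng = random.Random()
    count = 1
    for _ in range(age_steps):
        if count == 0 or count == n:
            break  # the duplicate has been lost or fixed
        count = moran_step(count, n, s, rng)
    return count


def neutral_tail_probability(observed_count, n, age_steps, replicates=2000, seed=1):
    """Estimate P(count >= observed_count | still segregating) under neutrality (s = 0)."""
    rng = random.Random(seed)
    counts = [simulate_count(n, age_steps, 0.0, rng) for _ in range(replicates)]
    segregating = [c for c in counts if 0 < c < n]
    if not segregating:
        raise ValueError("no replicate is still segregating; increase replicates")
    return sum(c >= observed_count for c in segregating) / len(segregating)


if __name__ == "__main__":
    # Hypothetical example: 15 of 20 individuals carry a duplicate that is 40 events old.
    p = neutral_tail_probability(observed_count=15, n=20, age_steps=40)
    print(f"Estimated neutral tail probability: {p:.4f}")
```

A small tail probability means the duplicate is more frequent than a neutral variant of that age would usually be, which is the kind of evidence the method uses to reject neutrality. The sketch conditions on the duplicate still segregating, because only such variants can be observed as copy number variants in a sample.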
Data Availability
No original research data were presented in this paper. Code used to perform the analysis is available at https://github.com/TristanLStark/DetectingSelection. An R script to run the full analysis has been made available at https://github.com/peterbchi/CNVSelectR/blob/master/R/CNVSelect_test.R.
Acknowledgements
We would like to thank the Australian Research Council for partially funding this research through Discovery Project DP180100352. We would also like to thank Ryan Houser for careful reading of an early version of the manuscript and for helpful discussions, Gene Maltepes for computational support, and Catherine Browne for technical assistance in the preparation of the manuscript.
Author information
Contributions
This study was conceived by DAL and TLS. Modeling and theoretical results were generated by TLS and RSK. Computer code for simulations was written and run by TLS, RSK, MAM, and PBC. The manuscript was written by DAL, TLS, RSK, and MAM.
Additional information
Handling editor: Liang Liu.
About this article
Cite this article
Stark, T.L., Kaufman, R.S., Maltepes, M.A. et al. Detecting Selection on Segregating Gene Duplicates in a Population. J Mol Evol 89, 554–564 (2021). https://doi.org/10.1007/s00239-021-10024-2
DOI: https://doi.org/10.1007/s00239-021-10024-2