Published by De Gruyter February 19, 2019

Variational Inference for Coupled Hidden Markov Models Applied to the Joint Detection of Copy Number Variations

  • Xiaoqiang Wang, Emilie Lebarbier, Julie Aubert and Stéphane Robin

Abstract

Hidden Markov models provide a natural statistical framework for the detection of copy number variations (CNVs) in genomics. In this context, we define a hidden Markov process that underlies all individuals jointly, in order to detect and classify genomic regions into different states (typically deletion, normal or amplification). Structural variations from different individuals may be dependent. This is the case in agronomy, where varietal selection programs exist and species share a common phylogenetic past. We propose to take these dependencies into account in the HMM. When dealing with a large number of series, maximum likelihood inference (classically performed using the EM algorithm) becomes intractable. We thus propose an approximate inference algorithm based on a variational approach (VEM), implemented in the CHMM R package. A simulation study is performed to assess the performance of the proposed method, and an application to the detection of structural variations in plant genomes is presented.
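To make the intractability concrete, the following minimal sketch counts the joint hidden states that arise when several series are modelled by a single hidden process. It assumes, purely for illustration, K = 3 states per series (deletion, normal, amplification) and a generic HMM in which the exact forward-backward recursion over S joint states costs on the order of S² operations per position. The function names are hypothetical, and the sketch does not reproduce the paper's variational algorithm or the CHMM package.

```python
def joint_state_count(n_states: int, n_series: int) -> int:
    """Number of joint hidden states when n_series chains,
    each with n_states states, are modelled as one process."""
    return n_states ** n_series


def joint_transition_entries(n_states: int, n_series: int) -> int:
    """Entries of the joint transition matrix, which also gives the
    order of the per-position cost of exact forward-backward."""
    return joint_state_count(n_states, n_series) ** 2


# Illustrative assumption: 3 states per series.
for n_series in (1, 2, 5, 10, 20):
    states = joint_state_count(3, n_series)
    entries = joint_transition_entries(3, n_series)
    print(f"series={n_series:2d}  joint states={states}  transition entries={entries}")
```

The count grows exponentially with the number of series, which is why exact EM on the joint chain stops being practical beyond a handful of individuals.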

Acknowledgements

This work was supported by the CNV-Maize program funded by the French National Research Agency (ANR-10-GENM-104) and France Agrimer (11000415). Xiaoqiang Wang was financed by the CNV-Maize project and the National Natural Science Foundation of China (11601286). We are grateful to Stéphane Nicolas for providing the maize dataset.


Supplementary Material

The online version of this article offers supplementary material (DOI: https://doi.org/10.1515/ijb-2018-0023).


Received: 2018-02-22
Revised: 2018-11-15
Accepted: 2018-11-21
Published Online: 2019-02-19

© 2019 Walter de Gruyter GmbH, Berlin/Boston

Downloaded on 26.4.2024 from https://www.degruyter.com/document/doi/10.1515/ijb-2018-0023/html