Statistical Considerations on NGS Data for Inferring Copy Number Variations

  • Protocol
Deep Sequencing Data Analysis

Part of the book series: Methods in Molecular Biology ((MIMB,volume 2243))

Abstract

Next-generation sequencing (NGS) technology has revolutionized research in genetics and genomics, generating massive amounts of data and opening new fronts for answering unresolved questions in genetics. NGS data are usually stored at three levels: image files, sequence tags, and aligned reads. The sizes of these data types typically range from several hundred gigabytes to several terabytes. Biostatisticians and bioinformaticians typically work with the aligned NGS read count data (the last of these three levels) for data modeling and interpretation.

One application of NGS technology is to profile the whole genome in order to study DNA copy number variations (CNVs) in an individual subject (or patient) as well as in groups of subjects (or patients). The resulting aligned NGS read count data are then modeled with appropriate mathematical and statistical approaches so that the loci of CNVs can be accurately detected. This book chapter summarizes the most widely used statistical methods for detecting CNVs from NGS data. The goal is to provide readers with a comprehensive resource on the available statistical approaches for inferring DNA copy number variations from NGS data.
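
As a rough illustration of what modeling aligned read count data can look like, the sketch below bins read start positions into fixed-width windows and then looks for a single shift in mean read count using a CUSUM-type statistic. This is not a method taken from the chapter: the function names, the fixed bin width, and the unscaled statistic (the noise standard deviation is not estimated) are all assumptions made for illustration only.

```python
from collections import Counter
from math import sqrt


def bin_read_counts(read_starts, bin_size, n_bins):
    """Count aligned reads whose start position falls in each fixed-width bin.

    Hypothetical helper: bin i covers positions [i * bin_size, (i + 1) * bin_size).
    """
    counts = Counter(pos // bin_size for pos in read_starts)
    return [counts.get(i, 0) for i in range(n_bins)]


def single_change_point(values):
    """Return (k, statistic) for the split values[:k] | values[k:] that
    maximizes a CUSUM-type statistic for a single shift in mean.

    The statistic is |mean_left - mean_right| * sqrt(k * (n - k) / n); it is
    not divided by an estimate of the noise standard deviation, so it is only
    meaningful for comparing candidate splits within one profile.
    """
    n = len(values)
    total = sum(values)
    best_k, best_stat = None, 0.0
    left_sum = 0.0
    for k in range(1, n):
        left_sum += values[k - 1]
        mean_left = left_sum / k
        mean_right = (total - left_sum) / (n - k)
        stat = abs(mean_left - mean_right) * sqrt(k * (n - k) / n)
        if stat > best_stat:
            best_k, best_stat = k, stat
    return best_k, best_stat
```

In practice the binned counts would first be corrected for biases such as GC content and usually compared against a reference sample, and a real profile contains multiple change points rather than one; the sketch omits all of that.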

Corresponding author

Correspondence to Jie Chen.

Copyright information

© 2021 Springer Science+Business Media, LLC, part of Springer Nature

About this protocol

Cite this protocol

Chen, J. (2021). Statistical Considerations on NGS Data for Inferring Copy Number Variations. In: Shomron, N. (eds) Deep Sequencing Data Analysis. Methods in Molecular Biology, vol 2243. Humana, New York, NY. https://doi.org/10.1007/978-1-0716-1103-6_2

  • DOI: https://doi.org/10.1007/978-1-0716-1103-6_2

  • Publisher Name: Humana, New York, NY

  • Print ISBN: 978-1-0716-1102-9

  • Online ISBN: 978-1-0716-1103-6

  • eBook Packages: Springer Protocols
