Modeling the next generation sequencing read count data for DNA copy number variant study

Tieming Ji; Jie Chen

doi:10.1515/sagmb-2014-0054

Published by De Gruyter July 3, 2015

From the journal Statistical Applications in Genetics and Molecular Biology

Abstract

As one of the most recent advanced technologies developed for biomedical research, next generation sequencing (NGS) has opened new opportunities for the scientific discovery of genetic information. NGS is particularly useful in elucidating a genome for the analysis of DNA copy number variants (CNVs). The study of CNVs is important because many genetic studies have concluded that cancer development, genetic disorders, and other diseases are usually associated with CNVs on the genome. One way to analyze NGS data for detecting the boundaries of CNV regions on a chromosome or a genome is to formulate the task as a statistical change point detection problem in the read count data. We therefore provide a statistical change point model to help detect CNVs using NGS read count data. We use a Bayesian approach to incorporate possible parameter changes in the underlying distribution of the NGS read count data, and we derive posterior probabilities for change point inference. Extensive simulation studies show the advantages of the proposed methods. The proposed methods are also applied to a publicly available lung cancer cell line NGS dataset, and CNV regions on this cell line are successfully identified.

Keywords: Bayesian analysis; change point analysis; copy number variation; moving window algorithm; next generation sequencing reads

Corresponding author: Jie Chen, Department of Biostatistics and Epidemiology, Medical College of Georgia, Georgia Regents University, Augusta, GA 30912, USA, e-mail: jiechen@gru.edu

Appendix

Derivation of the posterior for the constant prior

Given the priors in (5) and (6), the joint posterior is found to be:

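$$
\pi(\lambda_1, \lambda_2, k) \propto L(\lambda_1, \lambda_2, k \mid Y_i\text{'s})\,\pi_0(\lambda_1, \lambda_2 \mid k)\,\pi_0(k), \tag{15}
$$

where $L(\lambda_1, \lambda_2, k \mid Y_i\text{'s})$ is the joint likelihood of the parameters given the observations $Y_i$'s. Thus, the posterior probability that position $k$ is the change point given the observations can be derived as

$$
\begin{aligned}
\pi_1(k \mid Y_i\text{'s}) &\propto \int_0^\infty\!\!\int_0^\infty \exp\Big\{-2\sum_{i=1}^{k}\big(Y_i-\sqrt{\lambda_1+1/8}\big)^2 - 2\sum_{i=k+1}^{n}\big(Y_i-\sqrt{\lambda_2+1/8}\big)^2\Big\}\,d\lambda_1\,d\lambda_2 \\
&= \int_0^\infty \exp\Big\{-2\sum_{i=1}^{k}\big(Y_i-\sqrt{\lambda_1+1/8}\big)^2\Big\}\,d\lambda_1 \times \int_0^\infty \exp\Big\{-2\sum_{i=k+1}^{n}\big(Y_i-\sqrt{\lambda_2+1/8}\big)^2\Big\}\,d\lambda_2 \\
&= I_1 \times I_2,
\end{aligned}
\tag{16}
$$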

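where

$$
I_1 = \int_0^\infty \exp\Big\{-2\sum_{i=1}^{k}\big(Y_i-\sqrt{\lambda_1+1/8}\big)^2\Big\}\,d\lambda_1 \quad \text{and} \tag{17}
$$

$$
I_2 = \int_0^\infty \exp\Big\{-2\sum_{i=k+1}^{n}\big(Y_i-\sqrt{\lambda_2+1/8}\big)^2\Big\}\,d\lambda_2. \tag{18}
$$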

Hence, if we can derive $I_1$ and $I_2$, we can multiply them to obtain $\pi_1(k \mid Y_i\text{'s})$ up to a normalizing constant. The derivations of $I_1$ and $I_2$ are similar, so we derive $I_1$ below and state the result for $I_2$ directly.

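Let $z=\sqrt{\lambda_1+1/8}$, so that $d\lambda_1 = 2z\,dz$ and the lower limit $\lambda_1=0$ becomes $z=\sqrt{1/8}$. Then

$$
\begin{aligned}
I_1 &= \int_{\sqrt{1/8}}^{\infty} \exp\Big\{-2\Big(\sum_{i=1}^{k} Y_i^2 + k(z-\bar Y_1)^2 - k\bar Y_1^2\Big)\Big\}\,2z\,dz \\
&= \exp\Big\{-2\Big(\sum_{i=1}^{k} Y_i^2 - k\bar Y_1^2\Big)\Big\} \int_{\sqrt{1/8}}^{\infty} 2\exp\big\{-2k(z-\bar Y_1)^2\big\}\,z\,dz.
\end{aligned}
\tag{19}
$$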

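Let $t=2\sqrt{k}\,(z-\bar Y_1)$, so that $2k(z-\bar Y_1)^2 = t^2/2$ and $dz = dt/(2\sqrt{k})$. Then

$$
\begin{aligned}
I_1 &= \exp\Big\{-2\Big(\sum_{i=1}^{k} Y_i^2 - k\bar Y_1^2\Big)\Big\} \int_{2\sqrt{k}(\sqrt{1/8}-\bar Y_1)}^{\infty} \exp\Big(-\frac{t^2}{2}\Big)\Big(\frac{t}{2\sqrt{k}}+\bar Y_1\Big)\frac{1}{\sqrt{k}}\,dt \\
&= \exp\Big\{-2\Big(\sum_{i=1}^{k} Y_i^2 - k\bar Y_1^2\Big)\Big\} \Big(\frac{1}{2k}\int_{2\sqrt{k}(\sqrt{1/8}-\bar Y_1)}^{\infty} t\exp\Big(-\frac{t^2}{2}\Big)\,dt + \frac{\bar Y_1}{\sqrt{k}}\int_{2\sqrt{k}(\sqrt{1/8}-\bar Y_1)}^{\infty} \exp\Big(-\frac{t^2}{2}\Big)\,dt\Big) \\
&= \exp\{-2SS_1\}\Big\{\frac{\exp\big(-2k(\sqrt{1/8}-\bar Y_1)^2\big)}{2k} + \sqrt{\frac{2\pi}{k}}\,\bar Y_1\Big(1-\Phi\Big(\sqrt{\frac{k}{2}}-2\sqrt{k}\,\bar Y_1\Big)\Big)\Big\}.
\end{aligned}
\tag{20}
$$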

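Similarly, we have

$$
I_2 = \exp\{-2SS_2\}\Big\{\frac{\exp\big(-2(n-k)(\sqrt{1/8}-\bar Y_2)^2\big)}{2(n-k)} + \sqrt{\frac{2\pi}{n-k}}\,\bar Y_2\Big(1-\Phi\Big(\sqrt{\frac{n-k}{2}}-2\sqrt{n-k}\,\bar Y_2\Big)\Big)\Big\}, \tag{21}
$$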

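where $\bar Y_1$, $SS_1$, $\bar Y_2$ and $SS_2$ are defined as in (short-symbol); in particular, $SS_1=\sum_{i=1}^{k}Y_i^2-k\bar Y_1^2$ and $SS_2=\sum_{i=k+1}^{n}Y_i^2-(n-k)\bar Y_2^2$, as used in (20) and (21).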

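Let $\pi^*(k \mid Y_i\text{'s}) = I_1 \times I_2$. Writing $SS_1=(k-1)S_1^2$ and $SS_2=(n-k-1)S_2^2$, we have

$$
\begin{aligned}
\pi^*(k \mid Y_i\text{'s}) = I_1 \times I_2 &= \exp\big(-2(k-1)S_1^2\big)\Big\{\frac{\exp\big(-2k(\sqrt{1/8}-\bar Y_1)^2\big)}{2k} + \sqrt{\frac{2\pi}{k}}\,\bar Y_1\Big(1-\Phi\Big(\sqrt{\frac{k}{2}}-2\sqrt{k}\,\bar Y_1\Big)\Big)\Big\} \\
&\quad\times \exp\big(-2(n-k-1)S_2^2\big)\Big\{\frac{\exp\big(-2(n-k)(\sqrt{1/8}-\bar Y_2)^2\big)}{2(n-k)} + \sqrt{\frac{2\pi}{n-k}}\,\bar Y_2\Big(1-\Phi\Big(\sqrt{\frac{n-k}{2}}-2\sqrt{n-k}\,\bar Y_2\Big)\Big)\Big\},
\end{aligned}
$$

and $\pi_1(k \mid Y_i\text{'s}) = \pi^*(k \mid Y_i\text{'s}) \big/ \sum_{t=2}^{n-2}\pi^*(t \mid Y_i\text{'s})$ for $k=2, \ldots, n-2$.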
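As a minimal sketch of how the constant-prior posterior could be evaluated, the Python below computes each segment integral from its closed form (working on the log scale to limit underflow) and normalizes over k = 2, …, n−2. The function names, the example counts, and the square-root transform `sqrt(count + 3/8)` used to produce the Y values are assumptions made for illustration; they are not taken from this appendix. A midpoint-rule helper integrates the original integrand in z and can be used to check the closed form numerically.

```python
import math

SQRT_EIGHTH = math.sqrt(1.0 / 8.0)


def norm_sf(x):
    """Upper tail 1 - Phi(x) of the standard normal distribution."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def log_segment_integral(y):
    """Log of int_0^inf exp{-2 sum_i (y_i - sqrt(lam + 1/8))^2} d lam, closed form."""
    m = len(y)
    ybar = sum(y) / m
    ss = sum((v - ybar) ** 2 for v in y)           # SS = sum y_i^2 - m * ybar^2
    a = 2.0 * math.sqrt(m) * (SQRT_EIGHTH - ybar)  # lower limit after the t-substitution
    bracket = (math.exp(-0.5 * a * a) / (2.0 * m)
               + math.sqrt(2.0 * math.pi / m) * ybar * norm_sf(a))
    return -2.0 * ss + math.log(bracket)


def segment_integral_numeric(y, steps=100_000):
    """Midpoint-rule value of the same integral in z = sqrt(lam + 1/8), d lam = 2 z dz."""
    upper = max(y) + 6.0                           # integrand is negligible beyond this point
    dz = (upper - SQRT_EIGHTH) / steps
    total = 0.0
    for j in range(steps):
        z = SQRT_EIGHTH + (j + 0.5) * dz
        total += 2.0 * z * math.exp(-2.0 * sum((v - z) ** 2 for v in y))
    return total * dz


def posterior_constant_prior(y):
    """Posterior probability of each candidate change point k = 2, ..., n-2."""
    n = len(y)
    log_w = {k: log_segment_integral(y[:k]) + log_segment_integral(y[k:])
             for k in range(2, n - 1)}
    shift = max(log_w.values())                    # stabilize before exponentiating
    w = {k: math.exp(v - shift) for k, v in log_w.items()}
    total = sum(w.values())
    return {k: v / total for k, v in w.items()}


# Hypothetical read counts with a mean shift after position 6 (illustration only).
counts = [98, 105, 101, 96, 103, 99, 152, 148, 155, 150, 147, 151]
y = [math.sqrt(c + 3.0 / 8.0) for c in counts]     # assumed square-root transform
posterior = posterior_constant_prior(y)
```

Calling `max(posterior, key=posterior.get)` returns the most probable change point location for the example data. The log-scale weights matter in practice because `exp(-2 * SS)` underflows quickly when a candidate segment mixes two copy number states.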

Derivation of the posterior distribution for the Jeffreys prior

With the Jeffreys prior given in (prior2-lambda) and the change point location prior of (5), the joint posterior probability is given by

$$
\pi(\lambda_1, \lambda_2, k) \propto L(\lambda_1, \lambda_2, k \mid Y_i\text{'s})\,\pi_{J0}(\lambda_1, \lambda_2 \mid k)\,\pi_0(k), \tag{22}
$$

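and then the posterior probability of the position $k$ is

$$
\begin{aligned}
\pi_{J1}(k \mid Y_i\text{'s}) &\propto \int_0^\infty\!\!\int_0^\infty L(\lambda_1, \lambda_2, k \mid Y_i\text{'s})\,\pi_{J0}(\lambda_1, \lambda_2 \mid k)\,\pi_0(k)\,d\lambda_1\,d\lambda_2 \\
&\propto \frac{1}{n-2}\int_0^\infty\!\!\int_0^\infty \exp\Big\{-2\sum_{i=1}^{k}\big(Y_i-\sqrt{\lambda_1+1/8}\big)^2 - 2\sum_{i=k+1}^{n}\big(Y_i-\sqrt{\lambda_2+1/8}\big)^2\Big\} \\
&\qquad\times \sqrt{k(n-k)}\,\big(\lambda_1+1/8\big)^{-1/2}\big(\lambda_2+1/8\big)^{-1/2}\,d\lambda_1\,d\lambda_2 \\
&\propto \sqrt{k}\int_0^\infty \exp\Big\{-2\sum_{i=1}^{k}\big(Y_i-\sqrt{\lambda_1+1/8}\big)^2\Big\}\big(\lambda_1+1/8\big)^{-1/2}\,d\lambda_1 \\
&\qquad\times \sqrt{n-k}\int_0^\infty \exp\Big\{-2\sum_{i=k+1}^{n}\big(Y_i-\sqrt{\lambda_2+1/8}\big)^2\Big\}\big(\lambda_2+1/8\big)^{-1/2}\,d\lambda_2 \\
&= I_3 \times I_4,
\end{aligned}
$$

where

$$
\begin{aligned}
I_3 &= \sqrt{k}\int_0^\infty \exp\Big\{-2\sum_{i=1}^{k}\big(Y_i-\sqrt{\lambda_1+1/8}\big)^2\Big\}\big(\lambda_1+1/8\big)^{-1/2}\,d\lambda_1, \quad \text{and} \\
I_4 &= \sqrt{n-k}\int_0^\infty \exp\Big\{-2\sum_{i=k+1}^{n}\big(Y_i-\sqrt{\lambda_2+1/8}\big)^2\Big\}\big(\lambda_2+1/8\big)^{-1/2}\,d\lambda_2.
\end{aligned}
$$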

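After integration and algebraic simplification, and dropping the constant factor $\sqrt{2\pi}$, which does not depend on $k$, we obtain:

$$
\begin{aligned}
I_3 &\propto \exp(-2SS_1)\times\Big(1-\Phi\Big[\sqrt{4k}\,\big(\sqrt{1/8}-\bar Y_1\big)\Big]\Big), \quad \text{and} \\
I_4 &\propto \exp(-2SS_2)\times\Big(1-\Phi\Big[\sqrt{4(n-k)}\,\big(\sqrt{1/8}-\bar Y_2\big)\Big]\Big).
\end{aligned}
$$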
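The omitted integration can be sketched with the same substitutions used for $I_1$. This is a reconstruction, assuming $z=\sqrt{\lambda_1+1/8}$, so that $(\lambda_1+1/8)^{-1/2}\,d\lambda_1 = 2\,dz$, followed by $t=2\sqrt{k}\,(z-\bar Y_1)$. The resulting constant $\sqrt{2\pi}$ does not depend on $k$ and cancels when the posterior is normalized.

```latex
\begin{aligned}
I_3 &= \sqrt{k}\int_{\sqrt{1/8}}^{\infty}\exp\Big\{-2\sum_{i=1}^{k}(Y_i-z)^2\Big\}\,2\,dz \\
    &= 2\sqrt{k}\,\exp\{-2SS_1\}\int_{\sqrt{1/8}}^{\infty}\exp\big\{-2k(z-\bar Y_1)^2\big\}\,dz \\
    &= \exp\{-2SS_1\}\int_{\sqrt{4k}\,(\sqrt{1/8}-\bar Y_1)}^{\infty}\exp\Big(-\frac{t^2}{2}\Big)\,dt \\
    &= \sqrt{2\pi}\,\exp\{-2SS_1\}\Big(1-\Phi\Big[\sqrt{4k}\,\big(\sqrt{1/8}-\bar Y_1\big)\Big]\Big).
\end{aligned}
```

The same steps with $n-k$, $\bar Y_2$ and $SS_2$ in place of $k$, $\bar Y_1$ and $SS_1$ give the corresponding expression for $I_4$.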

Therefore, $\pi_J^*(k \mid Y_i\text{'s}) = I_3 \times I_4$ as presented in (13). Hence the posterior $\pi_{J1}(k \mid Y_i\text{'s})$, based on the Jeffreys prior, is obtained as claimed in (12).
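The Jeffreys-prior posterior can be evaluated in the same way. The sketch below is illustrative only: the function names, the example counts, and the `sqrt(count + 3/8)` transform are assumptions, not part of this appendix. Each segment contributes `exp(-2 * SS) * (1 - Phi(sqrt(4m) * (sqrt(1/8) - mean)))`, where m is the segment length, and the product over the two segments is normalized over k = 2, …, n−2.

```python
import math

SQRT_EIGHTH = math.sqrt(1.0 / 8.0)


def log_norm_sf(x):
    """Log of the standard normal upper tail 1 - Phi(x).

    Uses erfc directly, so it assumes x is not so large that the tail underflows
    to zero; for square-root-transformed counts the argument is typically negative.
    """
    return math.log(0.5 * math.erfc(x / math.sqrt(2.0)))


def log_segment_weight(y):
    """Log of exp(-2 SS) * (1 - Phi[sqrt(4m) * (sqrt(1/8) - ybar)]) for one segment."""
    m = len(y)
    ybar = sum(y) / m
    ss = sum((v - ybar) ** 2 for v in y)
    return -2.0 * ss + log_norm_sf(math.sqrt(4.0 * m) * (SQRT_EIGHTH - ybar))


def posterior_jeffreys(y):
    """Posterior probability of each candidate change point k = 2, ..., n-2."""
    n = len(y)
    log_w = {k: log_segment_weight(y[:k]) + log_segment_weight(y[k:])
             for k in range(2, n - 1)}
    shift = max(log_w.values())                # stabilize before exponentiating
    w = {k: math.exp(v - shift) for k, v in log_w.items()}
    total = sum(w.values())
    return {k: v / total for k, v in w.items()}


# Hypothetical read counts with a mean shift after position 6 (illustration only).
counts_j = [98, 105, 101, 96, 103, 99, 152, 148, 155, 150, 147, 151]
y_j = [math.sqrt(c + 3.0 / 8.0) for c in counts_j]   # assumed square-root transform
posterior_j = posterior_jeffreys(y_j)
```

When the segment means are well above $\sqrt{1/8}$, the tail term is close to 1 and the weights are driven almost entirely by the within-segment sums of squares.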

Published Online: 2015-7-3

Published in Print: 2015-8-1
