A randomized approach to speed up the analysis of large-scale read-count data in the application of CNV detection

Background The application of high-throughput sequencing in a broad range of quantitative genomic assays (e.g., DNA-seq, ChIP-seq) has created a high demand for the analysis of large-scale read-count data. Typically, the genome is divided into tiling windows and windowed read-count data is generated for the entire genome, from which genomic signals are detected (e.g., copy number changes in DNA-seq, enrichment peaks in ChIP-seq). For accurate analysis of read-count data, many state-of-the-art statistical methods use generalized linear models (GLM) coupled with the negative-binomial (NB) distribution, leveraging its ability for simultaneous bias correction and signal detection. However, although statistically powerful, the GLM+NB method has a quadratic computational complexity and therefore suffers from slow running time when applied to large-scale windowed read-count data. In this study, we aimed to substantially speed up the GLM+NB method using a randomized algorithm, and we demonstrate the utility of our approach in detecting copy number variants (CNVs) using a real example. Results We propose an efficient estimator, the randomized GLM+NB coefficients estimator (RGE), for speeding up the GLM+NB method. RGE samples the read-count data and solves the estimation problem on a smaller scale. We first theoretically validated the consistency and the variance properties of RGE. We then applied RGE to GENSENG, a GLM+NB based method for detecting CNVs. We named the resulting method "R-GENSENG". Based on extensive evaluation using both simulated and empirical data, we concluded that R-GENSENG is ten times faster than the original GENSENG while maintaining GENSENG's accuracy in CNV detection. Conclusions Our results suggest that the RGE strategy developed here could be applied to other GLM+NB based read-count analyses, e.g., ChIP-seq data analysis, to substantially improve their computational efficiency while preserving the analytic power.
Electronic supplementary material The online version of this article (10.1186/s12859-018-2077-6) contains supplementary material, which is available to authorized users.

1. The Proof of Consistency of RGE

1.1. Using IRLS to estimate GLM-NB model parameters. We first demonstrate how to use IRLS to estimate the GLM-NB model parameters.
The log-likelihood for a negative binomial model is

ℓ(β, φ) = Σ_{i=1}^{n} [ log Γ(y_i + 1/φ) − log Γ(1/φ) − log Γ(y_i + 1) + y_i log(φµ_i) − (y_i + 1/φ) log(1 + φµ_i) ],    (1)

where y is the response vector of length n, β is the coefficient vector of length p, φ is the negative binomial over-dispersion parameter, and µ_i = E(y_i). Consider the generic form of a GLM,

f(y_i; θ_i, ϕ) = exp{ [y_i θ_i − b(θ_i)]/ϕ + c(y_i, ϕ) },

where y is the response vector of length n, β is the coefficient vector of length p, and ϕ is the GLM dispersion parameter. Assuming the over-dispersion parameter φ is fixed, the negative binomial distribution belongs to the exponential family. Matching it with the generic form of a GLM, we have ϕ = 1, θ_i = log(φµ_i/(1 + φµ_i)), and b(θ_i) = (1/φ) log(1 + φµ_i), with the linear predictor η_i = g(µ_i), where g is a link function. In our model, η_i = g(µ_i) = log(µ_i).
To derive the MLE of β_j, we start with the score function and Fisher's information matrix. The score function is

S_j(β) = ∂ℓ/∂β_j = Σ_{i=1}^{n} (y_i − µ_i) x_{ij} / [V(µ_i) g′(µ_i)],    (2)

where V(µ_i) is the variance function. Let I_jk be the (j, k)-th element of the Fisher's information matrix; then

I_jk = −E[∂²ℓ/∂β_j ∂β_k] = Σ_{i=1}^{n} x_{ij} x_{ik} / [V(µ_i) g′(µ_i)²],    (3)

where the last equation is due to the fact that E[(y_i − µ_i)²] = V(µ_i).
Let I^{(t−1)} = I(β)|_{β=β^{(t−1)}} and S^{(t−1)} = ∂ℓ/∂β|_{β=β^{(t−1)}}. By Fisher scoring, the update of β from the (t − 1)-th iteration to the t-th iteration is

β^{(t)} = β^{(t−1)} + (I^{(t−1)})^{−1} S^{(t−1)}.

Define the weight matrix W = diag(w_1, ..., w_n) with w_i = 1/[V(µ_i) g′(µ_i)²], i = 1, ..., n. Then, based on equations (2) and (3), the score function and information matrix can be written as

S = X^T W ζ,   I = X^T W X,

where X is the design matrix and ζ is a vector of length n with ζ_i = (y_i − µ_i) g′(µ_i). When W is evaluated based on β^{(t−1)}, we write it as W^{(t−1)}. Then the Fisher scoring equation can be written as

X^T W^{(t−1)} X β^{(t)} = X^T W^{(t−1)} (X β^{(t−1)} + ζ^{(t−1)}).

Therefore, β^{(t)} is the solution of a weighted least squares problem with working response z = η + ζ:

β^{(t)} = (X^T W^{(t−1)} X)^{−1} X^T W^{(t−1)} z^{(t−1)}.

Here we use the log link function g(µ_i) = log(µ_i), and thus w_i = µ_i/(1 + φµ_i) and ζ_i = (y_i − µ_i)/µ_i.

1.2. Proof of Theorem 1. Here we provide the proof details of Theorem 1. Consider the estimating equation

X^T (m ∘ (y − Xβ)) = 0,

where X ∈ R^{n×p} is the original design matrix times the square root of the weight matrix W, n is the number of rows, p is the number of columns, and y ∈ R^n is a known n-dimensional vector. Let m_i be the sampling indicator for the i-th entry, i = 1, ..., n; m_i = 1 means that the i-th entry is sampled, and m_i = 0 means otherwise. We denote by ‖X‖_∞ the L_∞ norm of a matrix X, which is the maximum absolute row sum of the matrix, and by ‖v‖_∞ the L_∞ norm of a vector v, which is the maximum absolute value of its elements. Theorem 1 states that there exists a solution of this estimating equation inside a hypercube N centered at the true coefficients β_0. Let

f(δ) = X^T (m ∘ (y − Xδ)).

Thus proving Theorem 1 is equivalent to proving that there exists a solution inside the hypercube N satisfying f(δ) = 0.
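The IRLS update above can be sketched in code. The following is an illustrative, dependency-free Python sketch for a single covariate plus an intercept with φ held fixed; it is not the GENSENG implementation, and the function name is an assumption of this sketch.

```python
import math

def irls_nb(y, x, phi, iters=25):
    """Fisher scoring / IRLS for a negative binomial GLM with log link,
    log(mu_i) = b0 + b1 * x_i, holding the over-dispersion phi fixed.
    Weights w_i = mu_i / (1 + phi * mu_i); working response
    z_i = eta_i + (y_i - mu_i) / mu_i, as derived above."""
    b0, b1 = math.log(sum(y) / len(y) + 1e-9), 0.0   # crude starting values
    for _ in range(iters):
        eta = [b0 + b1 * xi for xi in x]
        mu = [math.exp(e) for e in eta]
        w = [m / (1.0 + phi * m) for m in mu]
        z = [e + (yi - m) / m for e, m, yi in zip(eta, mu, y)]
        # Solve the 2x2 weighted normal equations (X^T W X) beta = X^T W z.
        s_w = sum(w)
        s_wx = sum(wi * xi for wi, xi in zip(w, x))
        s_wxx = sum(wi * xi * xi for wi, xi in zip(w, x))
        s_wz = sum(wi * zi for wi, zi in zip(w, z))
        s_wxz = sum(wi * xi * zi for wi, xi, zi in zip(w, x, z))
        det = s_w * s_wxx - s_wx * s_wx
        b0 = (s_wz * s_wxx - s_wx * s_wxz) / det
        b1 = (s_w * s_wxz - s_wx * s_wz) / det
    return b0, b1
```

When the responses lie exactly on the mean curve, the fitted coefficients recover the generating values, since the score is zero there.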
After expanding X^T (m ∘ Xδ) around β_0 by a second-order Taylor expansion and inserting it into f(δ), we obtain a decomposition of f(δ) into a stochastic term ξ, a term u, and the Lagrange remainders r = (r_1, ..., r_p)^T, one for each j = 1, ..., p. The range of ‖u‖_∞ is determined by ‖ξ‖_∞ (we already know that ‖r‖_∞ = 0).
Condition 1.1 ensures that the sampled matrix X^T diag(m) X is not singular. The CNV problem setting (where copy number and intercept are the two covariates) serves as an illustrative example of why Condition 1.1 is reasonable. Without loss of generality, we assume the j-th column (j = 1, 2) of the sampled data has been standardized such that Σ_i m_i x_{ij} = 0 and ‖m ∘ x_j‖_2 = √n_0, where n_0 is the size of the sampled data. If the copy number and the intercept have not been standardized, the conclusion still holds with ‖m ∘ x_j‖_2 assumed to be of the order √n_0. We study ‖ξ‖_∞ from a probability perspective. We first define the event E = {‖ξ‖_∞ ≤ c^{−1/2} √(n_0 log n_0)}, where √n_0 is the L_2 norm of the vector m ∘ x_j. [5] proves a concentration proposition with a ∈ R^n, ε ∈ (0, ‖a‖_2/‖a‖_∞], c = 1/(2v_0 + 2M) for some M, v_0 ∈ (0, ∞) such that Y, M, v_0, and b″(θ_0) satisfy the moment condition (20) in [5], and ψ(ε) = 2 exp(−cε²). Using the proposition, the probability that the event E happens can be bounded from below, and this probability goes to 1 as n_0 → ∞. Thus the event E holds asymptotically as n_0 → ∞.
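The computation that RGE accelerates can be sketched concretely: keep each entry with probability equal to the sampling rate (the indicators m_i) and solve the weighted least-squares step on the subsample only. The following Python sketch is illustrative; the single-covariate setup and the function name are assumptions, not the R-GENSENG code.

```python
import random

def rge_wls_step(x, z, w, rate, seed=0):
    """One randomized weighted least-squares step (one covariate plus intercept):
    draw sampling indicators m_i ~ Bernoulli(rate) and solve
    (X^T diag(m) W X) beta = X^T diag(m) W z on the sampled entries only."""
    rng = random.Random(seed)
    m = [1 if rng.random() < rate else 0 for _ in x]
    s_w = s_wx = s_wxx = s_wz = s_wxz = 0.0
    for mi, xi, zi, wi in zip(m, x, z, w):
        if not mi:
            continue                      # entry not sampled
        s_w += wi; s_wx += wi * xi; s_wxx += wi * xi * xi
        s_wz += wi * zi; s_wxz += wi * xi * zi
    det = s_w * s_wxx - s_wx * s_wx
    b0 = (s_wz * s_wxx - s_wx * s_wxz) / det
    b1 = (s_w * s_wxz - s_wx * s_wz) / det
    return b0, b1
```

Because only the sampled entries enter the normal equations, the cost of the step scales with the subsample size n_0 rather than n.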

2. GENSENG CNV Detection Framework
GENSENG's analytic protocol comprises three steps: 1. Input data preparation (including read quality control and computation of read-depth and covariate values); 2. HMM inference of copy number while correcting for biases; 3. Post-segmentation processing. We introduce each step in the following sections.
2.1. Data Preparation. We first applied the following steps to control the quality of the raw sequencing reads. 1. Remove any read that fails platform/vendor quality checks, or is either a PCR duplicate or an optical duplicate. 2. Extract all single-end reads and properly paired paired-end reads. 3. Extract confidently aligned reads with MAPQ ≥ a specified threshold. In this study, we use MAPQ ≥ 10, which was empirically determined.
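The three QC steps above can be expressed as a single predicate over the standard SAM FLAG bits (0x200 = fails platform/vendor checks, 0x400 = PCR/optical duplicate, 0x1 = paired, 0x2 = properly paired). This is a minimal sketch, not the actual pipeline code.

```python
def passes_qc(flag, mapq, mapq_min=10):
    """Return True if a read survives the QC steps above, given its SAM FLAG
    and MAPQ. Single-end reads are kept; paired-end reads must be properly
    paired; vendor-failed and duplicate reads are dropped."""
    if flag & 0x200 or flag & 0x400:
        return False                      # QC-fail or duplicate
    paired = bool(flag & 0x1)
    if paired and not (flag & 0x2):
        return False                      # paired but not properly paired
    return mapq >= mapq_min               # confidence threshold on alignment
```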
We then divided the genome into consecutive, non-overlapping windows of 200 bp. The size of the window was empirically determined.
Finally, we obtained the read-depth in each window by counting the number of reads falling in each window. Each read (e.g., a 36-mer or 51-mer from the 1000 Genomes Project data [1,7]) is represented by its middle base pair. Each fragment is counted according to the mapping information available for its ends, using the following rules:
(1) If the two ends of a pair fall in two different windows, assign 1/2 to each window where an end falls; (2) If both ends of a pair fall in the same window, assign 1 to that window; (3) If the read is paired-end but only one end is present, assign 1/2 to the window where that end falls; (4) If the read is single-end, always assign 1 to the window where the end falls.
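The four counting rules above can be sketched as follows, with window indices computed from the middle-base positions of the ends. `add_fragment` is a hypothetical helper for illustration, not GENSENG code.

```python
from collections import defaultdict

def add_fragment(counts, end_positions, paired, size=200):
    """Increment per-window read-depth counts for one fragment, following the
    rules above: both ends in different windows -> 1/2 each; both ends in the
    same window -> 1; paired-end with one end mapped -> 1/2; single-end -> 1.
    Positions are 0-based middle-base coordinates; windows are `size` bp."""
    wins = [p // size for p in end_positions]
    if paired:
        if len(wins) == 2:
            if wins[0] == wins[1]:
                counts[wins[0]] += 1.0        # rule (2)
            else:
                counts[wins[0]] += 0.5        # rule (1)
                counts[wins[1]] += 0.5
        else:
            counts[wins[0]] += 0.5            # rule (3): only one end present
    else:
        counts[wins[0]] += 1.0                # rule (4): single-end
    return counts
```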
Covariates were calculated as a quantitative measurement of bias at each window. In this study, the set of covariates includes GC content and mappability score. GC content is computed in the following steps. (1) Calculate the proportion of G or C bases in each window from a given reference genome. (2) Apply a cubic spline smoothing and then transform the GC proportion based on the fitted curve so that the transformed GC proportion and the logarithm of the read-depth are linearly correlated. (3) The transformed GC proportion is median-centered and is referred to as GC content hereafter. Mappability score is computed in the following steps. (1) Align the K-mers starting at each base position back to the reference genome using a desired aligner, e.g., BWA [6]. (2) Identify base positions where the corresponding K-mers are correctly aligned (i.e., there is a unique best hit and it is the true position of the K-mer). (3) Compute the mappability score as the proportion of correctly-aligned bases (a.k.a. mappable bases) in a given window.
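Step (1) of the GC computation can be sketched as below; the cubic spline smoothing and median-centering of steps (2)-(3) are omitted from this illustrative snippet.

```python
def gc_proportion(seq, size=200):
    """Raw per-window GC proportion (step 1 above): for each full window of
    `size` bases, the fraction of G or C bases."""
    props = []
    for start in range(0, len(seq) - size + 1, size):
        win = seq[start:start + size].upper()
        gc = sum(1 for b in win if b in "GC")
        props.append(gc / size)
    return props
```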
In summary, the input data is a triplet for each window, represented by (o_t, g_t, l_t), t = 1, ..., T, where T is the total number of windows of a chromosome, o_t denotes the read-depth, g_t denotes the GC content, and l_t denotes the mappability score of the t-th window.
2.2. HMM setup. We use a time-homogeneous discrete hidden Markov model (HMM) to segment the genome into regions of the same copy number. In our HMM, time represents the sliding windows tiled along a chromosome, denoted by t.
• The state represents the underlying copy number (CN). The state variable q_t = CN_t is hidden and discrete with N possible values, (0, 1, ..., N − 1), where N is derived from the data. A particular sequence of the states is described by q = (q_1, ..., q_T), where T is the total number of sliding windows of a chromosome. Let π_j be the initial state probability, the probability that the state of the first window is state j. The underlying hidden Markov chain is defined by state transitions P(q_t|q_{t−1}) and is represented by a time-independent stochastic transition matrix A = {a_jz} = P(q_t = z|q_{t−1} = j).
• Each copy number state emits an observation, the read-depth. The observation variable, O_t, is a discrete count variable. A particular sequence of the observations is described by o = (o_1, ..., o_T). The emission probability of a particular observation at a particular time t for state j is described by e(t, j) = P(O_t = o_t|q_t = j). For a detailed description of the emission probability, see Section 2.3.
• We use the Baum-Welch algorithm [2] to find the maximum likelihood estimates (MLE) of the HMM parameters. Following Bilmes [3], we define the complete-data likelihood and solve the Q function in order to find the MLE of the HMM parameters.
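As a concrete illustration of the HMM just defined, a log-space forward pass computing the observed-data likelihood might look as follows. This is a generic sketch, not GENSENG's implementation.

```python
import math

def forward_loglik(pi, a, emis):
    """Log-space forward algorithm for the HMM above: pi[j] are initial state
    probabilities, a[j][z] is the transition matrix, and emis[t][j] = e(t, j)
    are precomputed emission probabilities. Returns log P(o_1, ..., o_T)."""
    def logsumexp(v):
        m = max(v)
        return m + math.log(sum(math.exp(x - m) for x in v))
    n = len(pi)
    f = [math.log(pi[j] * emis[0][j]) for j in range(n)]   # forward at t = 1
    for t in range(1, len(emis)):
        f = [logsumexp([f[j] + math.log(a[j][z]) for j in range(n)])
             + math.log(emis[t][z]) for z in range(n)]
    return logsumexp(f)
```

Working in log space avoids the numerical underflow that plagues long chromosomes with thousands of windows.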
2.3. Emission probability. The emission probability of the read-depth, e(t, j) = P (O t = o t |q t = j), is modeled as a mixture of a uniform distribution and a negative binomial distribution.
e(t, j) = c · (1/R_m) + (1 − c) · e_NB(t, j),

where c is the proportion of the random uniform component and is fixed as a constant for each state, and R_m is the largest read-depth among all windows, so that 1/R_m is the uniform density.
To describe the negative binomially distributed component, e_NB(t, j), we first explain the relationship between the Poisson and the negative binomial distributions. The Poisson distribution imposes that the variance equals the mean, whereas the negative binomial distribution allows overdispersion. Specifically, if O follows a Poisson distribution with mean µ, and µ itself follows a gamma distribution, the resulting marginal distribution for O is a negative binomial distribution. The variance of the negative binomial distribution is µ_t + φµ_t², where φµ_t² is the overdispersion part of the variance. As φ → 0, f_NB(o_t; µ_t, φ) reduces to a Poisson distribution with mean µ_t and variance µ_t.
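The mean/over-dispersion parameterization described above can be sketched as follows, using r = 1/φ; the `emission` helper combines it with the uniform component as in the mixture of Section 2.3. Names are illustrative.

```python
import math

def nb_pmf(o, mu, phi):
    """Negative binomial pmf in the (mean, over-dispersion) parameterization:
    mean mu, variance mu + phi * mu**2; reduces to Poisson as phi -> 0.
    Internally uses r = 1/phi and success probability r / (r + mu)."""
    r = 1.0 / phi
    log_p = (math.lgamma(o + r) - math.lgamma(r) - math.lgamma(o + 1)
             + r * math.log(r / (r + mu)) + o * math.log(mu / (r + mu)))
    return math.exp(log_p)

def emission(o, mu, phi, c, r_max):
    """Mixture emission e(t, j) = c * (1/R_m) + (1 - c) * f_NB(o; mu, phi)."""
    return c * (1.0 / r_max) + (1.0 - c) * nb_pmf(o, mu, phi)
```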
Next, the mean value of the negative binomially distributed component is expressed as a function of a set of covariates to account for confounders:

µ_tj = α_0 · exp(β_1 CN_t + β_2 l_t + β_3 g_t),

where t denotes the t-th window, j is the index of the copy number state and emphasizes the dependency of the mean µ_tj on the copy number CN_t, l_t is the mappability score, and g_t is the GC content. For computational convenience, we set CN_t = 0.5 when j = 0, and set CN_t = j when j > 0.

We then employ a log link function to acknowledge the fact that µ_tj > 0 and obtain:

log(µ_tj) = β_0 + β_1 CN_t + β_2 l_t + β_3 g_t,

where β_0, β_1, β_2, β_3 are the regression coefficients. Specifically, β_0 = log(α_0) is the intercept parameter and is interpreted as the average level of the read-depth signal when all covariates are equal to zero. β_1 is the amount of increase in read-depth for every unit increase in copy number, CN. β_2 is the amount of increase in read-depth for every unit increase in the mappability score, l. β_3 is the amount of increase in read-depth for every unit increase in the GC content, g.
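Assuming the log-linear mean model described above (with the CN_t = 0.5 convention for state 0 taken from the text), the expected read-depth for a window can be sketched as:

```python
import math

def mean_read_depth(cn_state, l, g, beta):
    """Expected read-depth mu_tj under the model
    log(mu_tj) = b0 + b1*CN_t + b2*l_t + b3*g_t,
    with CN_t = 0.5 for state 0 and CN_t = j for states j > 0."""
    b0, b1, b2, b3 = beta
    cn = 0.5 if cn_state == 0 else float(cn_state)
    return math.exp(b0 + b1 * cn + b2 * l + b3 * g)
```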
Thus, given the above regression model for the mean, the negative binomial probability distribution function is expressed as the following:

f_NB(o_t; µ_tj, φ) = [Γ(o_t + 1/φ) / (Γ(1/φ) o_t!)] · (1/(1 + φµ_tj))^{1/φ} · (φµ_tj/(1 + φµ_tj))^{o_t}.

The complete emission probability is then expressed as the following:

e(t, j) = c · (1/R_m) + (1 − c) · f_NB(o_t; µ_tj, φ).

The regression coefficients are initialized as follows. Assume the normal copy number (CN = 2) for all windows. Set β_3 = 0.5 (β_3 is the coefficient for GC content; 0.5 is an empirically determined value). Set intercept = log(median(O)) − log(2) − median(log(L)) − β_3 · median(G). For j = 0, the offset is log(0.5) + log(l_t).
(f) Initial mixing probability, c: c = 0.01. The mixing probability is the same for the normal state and the other states.
(g) Initial parameter for the uniform distribution, R_m: R_m is set to the largest read-depth among all windows (see Section 2.3).

The details of the E-step of the EM procedure are as follows: given the current parameter estimates Λ_0, we efficiently compute the desired quantities.
The details of the M-step of the EM procedure are given below.

2.4.5. Estimate the initial state probability π_j. The initial probability π_j is simply the posterior probability of being in state j at position 1; therefore the new estimate of π_j, denoted by π̂_j, is computed as

π̂_j = γ(1, j).

2.4.6. Estimate the transition probability a_jz. The estimated a_jz is denoted by â_jz and, for j ≠ z, is computed as

â_jz = Σ_{t=1}^{T−1} ζ(t, j, z) / Σ_{t=1}^{T−1} γ(t, j),

where

log(ζ(t, j, z)) = log(f(t, j)) + log(a_jz) + log(e(t + 1, z)) + log(b(t + 1, z)),    (28)

and f(t, j) and b(t, j) denote the forward and backward probabilities.

2.4.7. Estimate the emission parameters, an overview. Because we fix c and R_m as constants, parameter estimation only concerns the negative binomially distributed component.
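The two M-step updates above can be sketched as follows, assuming the posteriors γ(t, j) and ζ(t, j, z) have already been computed in the E-step (0-based indices here; illustrative only, not the GENSENG implementation):

```python
def m_step_transitions(gamma, zeta):
    """Baum-Welch M-step sketched from the section above:
    pi_j = gamma[0][j], and a_jz = sum_t zeta[t][j][z] / sum_t gamma[t][j],
    with t running over the first T-1 positions.
    gamma[t][j]: posterior state probabilities; zeta[t][j][z]: posterior
    transition probabilities."""
    n = len(gamma[0])
    pi = list(gamma[0])                       # posterior at the first window
    a = [[0.0] * n for _ in range(n)]
    for j in range(n):
        denom = sum(gamma[t][j] for t in range(len(zeta)))
        for z in range(n):
            num = sum(zeta[t][j][z] for t in range(len(zeta)))
            a[j][z] = num / denom
    return pi, a
```

Because each ζ(t, j, ·) sums to γ(t, j), every row of the re-estimated transition matrix sums to 1.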
To estimate the negative binomial parameters, a weighted GLM function is implemented in C++. The arguments of this function include "family", "observation", "covariate", "offset", and "prior". The argument "family" selects either the Poisson or the negative binomial family. The argument "prior" is the probability that each observation belongs to the negative binomially distributed component; for the t-th window and state j, this prior is denoted by p_{t,j}. Following the implementation of the function MASS/glm.nb in R [9], we use an alternating iterative estimation procedure to obtain the new estimate of log(µ_tj) and the new estimate of φ.
• First, we fix φ and compute log(µ_tj) by fitting a weighted GLM using the iteratively reweighted least squares (IRLS) method. For details, see Section 2.4.8.
• Then, we fix log(µ_tj) and compute φ using the weighted Newton-Raphson method. For details, see Section 2.4.9.
• The above two steps alternate until convergence.

2.4.8. Fit the weighted GLM. Step 1. Prepare the stacked inputs across the N states: "observation" = {o_1, ..., o_T, ..., o_1, ..., o_T}, "covariate" = {g_1, ..., g_T, ..., g_1, ..., g_T} (g_t repeats N times), and "offset" = {offset_1, ..., offset_T, ...}, and form the weighted log-likelihood function.

Step 2. Fit a weighted Poisson regression model using the IRLS procedure.
Step 3. Perform a score test. The score test [4] is used to test whether the overdispersion parameter, φ, is significantly greater than 0. If the score test is significant, we estimate φ using the Newton-Raphson method as described in Section 2.4.9.
Step 4. Fit a weighted negative binomial regression model using the IRLS method.
2.4.9. Estimation of overdispersion using the Newton-Raphson method. Given log(µ_tj), we use the Newton-Raphson method to estimate the overdispersion parameter, φ. In this study, we estimate one overdispersion parameter φ jointly for all states, and set φ_j = φ for j = 0, ..., N − 1.
The following weighted log-likelihood and its first and second derivatives are used in the Newton-Raphson method to estimate φ. The weighted log-likelihood is the same as in Equation 31.
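A guarded version of this Newton-Raphson update can be sketched as follows. This is illustrative only: the analytic first and second derivatives used in the text are replaced here by finite differences, and the step is damped so that φ stays positive.

```python
import math

def nb_loglik(phi, y, mu, w):
    """Weighted NB log-likelihood in phi with the means mu held fixed."""
    r = 1.0 / phi
    return sum(wi * (math.lgamma(yi + r) - math.lgamma(r) - math.lgamma(yi + 1)
                     + r * math.log(r / (r + mi)) + yi * math.log(mi / (r + mi)))
               for yi, mi, wi in zip(y, mu, w))

def estimate_phi(y, mu, w, phi0=1.0, iters=60, h=1e-4):
    """Newton-Raphson for the over-dispersion phi: phi <- phi + Score/Info,
    with score and curvature approximated by central finite differences."""
    phi = phi0
    for _ in range(iters):
        ll = lambda p: nb_loglik(p, y, mu, w)
        score = (ll(phi + h) - ll(phi - h)) / (2 * h)
        curv = (ll(phi + h) - 2 * ll(phi) + ll(phi - h)) / (h * h)
        if curv < 0:
            step = -score / curv              # Newton step toward the maximum
        else:
            step = 0.2 if score > 0 else -0.2  # guarded uphill step
        step = max(-0.5, min(0.5, step))      # damp overly large steps
        phi_new = max(phi + step, phi / 2.0)  # keep phi strictly positive
        if abs(phi_new - phi) < 1e-8:
            return phi_new
        phi = phi_new
    return phi
```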
We use the Newton-Raphson method given the score function and the Fisher information:

Initialize φ
Repeat until convergence {
    Dev = Score(φ) / Info(φ)
    φ = φ + Dev
}

2.5. Post-segmentation Processing. Following the findings in [8], it is crucial to merge CNV calls and filter false positives. We applied the same procedure as in [8] to merge CNV calls. In addition, we used a combination of the read-depth accessible (RDA) filter and the confidence score used in [8] to remove low-confidence calls.