A Stochastic Segmentation Model for Recurrent Copy Number Alteration Analysis

Recurrent DNA copy number alterations (CNAs) are key genetic events in the study of human genetics and disease. Analysis of recurrent DNA CNA data often involves the inference of individual samples’ true signal levels and the crosssample recurrent regions at each location. We propose for the analysis of multiple samples CNA data a new stochastic segmentation model and an associated inference procedure that has attractive statistical and computational properties. An important feature of our model is that it yields explicit formulas for posterior probabilities of recurrence at each location, which can be used to estimate the recurrent regions directly. We propose an approximation method whose computational complexity is only linear in sequence length, which makes our model applicable to data of higher density. Simulation studies and analysis of an ovarian cancer dataset with 15 samples and a lung cancer dataset with 10 samples are conducted to illustrate the advantage of the proposed model. Journal of Biometrics & Biostatistics J o u rn al of Bio metrics & Bistatis t i c s


Introduction
Copy number alterations (CNAs) are key genetic events in the development and progression of numerous human diseases. Recent advances in high density microarray technologies enable high-throughput genome-wide profiling of DNA copy number; see [1,2]. Using the array-based comparative genomic hybridization (array-CGH) technology, the average genomic DNA copy number at thousands of locations linearly ordered along the chromosomes can be quantitatively measured [3]. Since cancer genes are more likely to be found in common or recurrent regions in the sequence of CNAs across patients of the same cancer [4], one is more interested in finding recurrent CNA regions that consist of continuous probes and show evidence of alteration in some samples [5].
During the past years, a large number of computational and statistical methods have been developed to locate the recurrent CNA regions across samples, see reviews and comparisons of these methods in [5,6]. Most of these methods involve a two-step procedure, in which the first step is to identify the gain and loss regions in individual samples and the second step is to make inference on recurrent regions based on a threshold for occurrence frequencies. Examples of these approaches can be found in [7][8][9][10][11][12][13][14]. As the first steps of these approaches require segmentation of individual samples, they may strengthen or weaken some important information in recurrent regions. In contrast to twostep methods, one-stage approaches analyze raw data directly and avoid the information change in the two-step process. Recently, several statistical approaches have been proposed along this line, including score-based approach [15,16] hierarchical hidden Markov model [17], Bayesian hidden Markov model [18] kernel smoothing methods [19,20] analysis of variance approach, and likelihood-based test for simultaneous change-points [21]. Most of these methods involve a hard segmentation procedure. However, for complex alteration profiles across samples, identified recurrent CNA regions vary greatly. This indicates that hard segmentation procedures may be difficult for identification of recurrent regions, and instead, an inference procedure on the probability of recurrent regions might be necessary.
In this paper, we propose for the analysis of recurrent CNAs a stochastic segmentation model and associated inference framework.
The proposed model has a hierarchical hidden Markov structure which make the inference framework associated to our model possess attractive statistical and computational properties. The hierarchical hidden Markov structure in our model is similar to that in Shah et al. [7], but our model allows different "quantitative" states conditional on a given "categorical" state, while the model in Shah et al. [7], assumes all "quantitative" states are same for a "categorical" state. Specifically, we assume a finite state hidden Markov chain for (categorical) states of recurrent regions across samples, and then conditional on the categorical state, signal levels (or "quantitative states") of CNAs in each sample follow a sample-specific continuous state hidden Markov chain. As a working model, although these assumptions seem to complicate for obtaining an inference procedure, they actually provide us more flexibility to model the non-simultaneity feature of break points across samples and yields explicit recursive formulas for posterior distributions of hidden "categorical" states (i.e., the recurrent CNA region) and sample-specific "quantitative" states (i.e., the signal levels of CNAs in individual samples) at each probe, whereas the model in Shah et al. [7], has to rely on Monte Carlo simulations. Our stochastic segmentation model assumes that the log fluorescence ratios y lt for sample l∈ {1, . . . , L} measured through the array-CGH technology follow that y lt =θ lt + σ l ∈ lt (t=1, . . . , T), in which θ lt are independent standard normal random variables, and θ lt are piecewise constant whose values at location t follow a prior distribution that depends on a hidden Markov chain s t with three categorical states (gain, baseline 0 or loss). In this specification, θ it and s t represent the true signal levels of CNAs in sample l and the gain-loss states across the L samples at location t. When s t shifts from one categorical state to another, signal levels (or quantitative states) θ lt in sample l jump to a new state, whose prior distribution depends on s t , hence θ lt may be different from the quantitative states whose corresponding categorical states are same as s t . Making use of this specific hierarchical model structure, we compute the posterior distributions of θ lt and s l based on explicit recursive formulas that are derived using forward and backward filters of the hidden Markov model. The forward-backward filters in our model can be considered as a non-trivial extension of the Baum-Welch algorithm and similar to those developed by [22][23][24]. The difference of our forward-backward filters from previous work is that the hidden categorical and quantitative states in our model have a hierarchical structure and the top layer hidden states become a finitestate Markov chain. As the problem of locating recurrent CNA regions intrinsically involves much computation, to reduce the computational complexity of our inference procedure, we further consider a bounded complexity mixture approximation scheme so that the computational complexity becomes linear. Another discussion we have made is that, since all model hyper parameters are unknown in real applications, we propose to estimate all hyper parameters by an expectationmaximization (EM) algorithm.
The rest of the paper is organized as follows. Section 2 provides the model details and develops an inference procedure. It also discusses some computational issues and proposes a bounded complexity mixture approximation scheme and a hyper parameter estimation algorithm. Section 3 shows the performance of our model and associated inference procedure via intensive simulation studies. Section 4 analyzes two groups of CNA data, one is on ovarian cancer based on 15 samples and the other is on lung cancer based on 10 patients. We identify the recurrent CNA regions related to those cancers and demonstrate that our result is consistent with that in current medical studies. Section 5 provides some conclusive remarks.

Model specification
We consider the problem of analyzing DNA copy number profiles from multiple distinct biological samples {y lt : 1 ≤ l ≤ L, 1 ≤ t ≤ T }, where y lt is the observed log fluorescence ratio at location t of sample l, T is the number of probes, and L is the number of samples. To estimate recurrent signals, we assume the following model for y lt : in which ǫ lt are independent Gaussian random errors with mean 0 and variance 1, σ l 2 are sample-specific error variances and {θ lt } is the true signal level of CNA of sample l at location t. Since we want to find recurrent regions across all L samples, we assume that recurrent regions can be represented by a "master" sequence of categorical states {s t }, where s t ∈ {G, O, S} (gain, baseline 0, loss) is an irreducible hidden Markov chain with probability transition matrix P=(p ij ) and stationary distribution π. Then given the master sequence {s t }, the dynamics of θ lt in sample l is expressed as s s s s z (2) in which z lt are independent normal variables with mean z (l,st) and covariance v (l,st) .
In the above model assumption, the existence of stationary distribution π could define us a reversed chain for {s t }, and it further implies that the Markov chain {θ lt } has a stationary distribution. Moreover, if we further assume that θl0 is initialized at the stationary distribution, {θ lt } become a reversible Markov chain, which provides substantial simplification for the smoothing estimates of {θ lt } and s t . We should note that this assumption is to simplify the estimation procedure. It may not reflect the real situation since the probability of amplification or deletion might be different across the whole chromosome.
The assumption that the master sequence of states is common across all the samples may not be necessarily true in practice, and it is only an approximation for the fact that most samples share a unique profile signature. This assumption is used in some models to obtain an estimation procedure with reasonable computational complexity for identifying recurrent CNAs; see Shah et al. The assumption also implies that the model is not suitable for a class of samples that consists of several disease subclasses with each subclass having a unique profile signature. For sample with disease subclass, we need to know the information about disease subclasses before applying the above model. Furthermore, assumption (2) indicates that signal levels θ lt with same categorical states could be different.

Filtering estimate
s k be the most recent location where s t switches to state k from other states prior or equal to . Then given Y 1,t and the most conditional distribution of θ lt given Y 1,t becomes a mixture of normal distributions: , ,

Smoothing estimate
Our model assumptions imply that the stationary distribution of θ lt exists and is given by . This indicates that, if θ lt is initialized at its stationary distribution, its time-reversed Markov chain can be defined. This substantially simplifies the smoothing estimates of θ lt. Actually, it further implies that {θ lt } is a reversible Markov chain, so we can reverse time and obtain a backward filter that is analogous to (4): Where the mixture weight is the transition probability of the reversed chain of {s t }. Since Next, we shall use Bayes' theorem to combine the forward filter (4) with its backward variant (9) to estimate θ lt given T , which is expressed as the following mixture of normal distributions In which the mixture weights Therefore, from (10), it follows that

Bounded complexity mixture approximations and hyper parameter estimation
The number of mixture weights in the above discussion increases dramatically with t (or n), resulting in rapidly increasing computational complexity and memory requirements in estimating θ lt as t keeps increasing. To address the issue of computational efficiency, we follow Lai and Xing and use a bounded complexity mixture (BCMIX) approximation procedure with linear computational complexity; see Web Appendix B for details of the algorithm.
The inference procedures involve the hyper parameters p, probability transition matrix P, and In practical applications, we should also notice that the three categorical states in the above model are exchangeable; hence the categorical states s t could be very difficult to identify. A remedy for this is to replace the normal priors for θ lt by truncated priors, then the filtering and smoothing formulas in Sections 2.2 and 2.3 needs to be modified somewhat. Specifically, the normal distribution in conditional distribution (10) needs to be replaced by corresponding truncated normal distributions. Another way to mitigate the identification issue is to put constraints on hyper parameters. For example, a prior normal distribution with smaller variances could limit the estimated quantitative signals staying around its prior mean, so the distinction between the categorical states becomes clearer.

Simulation Studies
We now perform intensive simulations to evaluate the performance of the proposed model and inference procedure from frequentist and Bayesian viewpoints. To demonstrate the performance, we consider two measures for different purpose. One measure is mean square error (MSE), which provides the mean errors between the estimates θ lt and the true θ lt , i.e., We first evaluate the performance of BCMIX estimates in the frequentist setting by considering the following four cases of hidden state {s t } with K=3.  N (z (l, st ) , v (l, st ) ) with (z (l,1) , z (l,2) , z (l,3) )=(1, 0, −1), v (l,1) =v (l,2) =v (l, 3) =0.22 (hence the standard deviation is about 0.47). We further assume L=10, T=1000, and generate observations y lt by (1)   ratios are about 70%), but they are typically smaller than ours.

Real data studies Analysis for Ovarian cancer data
Ovaries are reproductive organs that produce eggs in women, and ovarian cancer is the fifth leading cause of cancer death in women. Ovarian cancers display a high degree of complex genetic variations. The existing literature show that the most frequently affected chromosomes in ovarian cancer are chromosome 1, 8, and 17. We use our model to analyze the copy number alteration (CNA) data for Ovarian serous cystadenocarcinoma (OV) based on Array based-CGH technology. The data in our analysis, consisting of the CNAs from 15 OV cancer patients, were published on April 1 st , 2010 in the Cancer Genome Atlas (TCGA) database (http://cancergenome.nih.gov/). We analyze the whole 23 chromosomes. Since the existing literature shows that the most frequently affected chromosomes in ovarian cancer are chromosomes 1, and 17, we only present our result on these two chromosomes.
There are 55,274 and 20,009 probes on chromosomes 1 and 17. Let k=1, 2 and 3 denote amplification, baseline and deletion, respectively. We first use the proposed model and inference procedure to estimate the hyper parameters, and then the signal levels θ lt and probability of s t for chromosomes 1 and 17. We run our model on a desktop with Intel core i5-3210M and 4G memory, and it takes 223 and 109 seconds for chromosomes 1 and 17, respectively. The results are summarized in Web  Figures 2 and 3, respectively. We can see that our procedure analyzes all samples and estimates signal θ lt for each sample simultaneously, which avoid the weakness of two-stage analysis. As our interest here is the recurrent CNA region, we now focus on the estimated probabilities of s t , which are plotted in Figure 2. Those estimated probabilities indicate that the recurrent copy number amplifications involve regions 1p34. 2 [26][27][28]. It is important to note that for chromosome 17, we focus on detecting the recurrent regions of algorithm to estimate the hyper parameters and compute the BCMIX estimates with (M, m)= (10,5), (20,10), (30,15) and (40,20). For comparison purpose, we also compute oracle estimate which assume {s t } is known. We then run such simulation for each case 500 times, and summarize the MSE of two estimates and corresponding standard errors (in parentheses) in Web Table 1. We can see that the oracle and BCMIX estimates are quite similar, and the difference among BCMIX estimates with different (M, m) are quite small. Therefore, we will use BCMIX estimate with (M, m)= (20,10) in the following discussion.
We then evaluate the performance of the inference procedure under our model assumption. Let K= 3, (z (l,1) , z (l,2) , z (l,3) )= (1, 0, −1), v (l,1) =v (l,2) Tables 2 and 3 summarizes the MSE and IR of our estimate, and also provided in parentheses are the corresponding standard errors based on 500 simulations in each cell. We can see that the MSE are very small, and the IR is all larger than 84%.
Since data generation procedures in above studies do not deviate from our model assumption too much, to show the variability of our model, we also evaluate the performance of our algorithm on the data in Willenbrock and Fridly and [25], which are generated from a completely different procedure, and compute the MSE between the estimates θ lt and the true signals θ lt . The MSE of 100 datasets with 20 samples and each sample with 500 clones on Chromosome 1 is 0.011 with standard error 5.89e-4, indicating the estimates for signals in individual samples is still pretty good. Web Figure 1 demonstrates a randomly selected simulated y lt and θ lt for 20 samples. We next compare our model to the hierarchical hidden Markov model (HMM) in Shah et al. [7]. Specifically, we estimate all hyper parameters by the EM algorithm, and then fit the hierarchical HMM model to the simulated data generated in Scenarios S 1 -S 9 . Since Shah et al. assume the signal levels θ lt of individual CNAs follow a normal distribution with the mean and variance depending on the hidden state s t directly; it implies that the individual signal levels are fixed within the same segment of recurrent CNA regions. This is different from our model which allows signal levels θ lt of sample l have different values at different locations t even if their categorical states s t are same, which is more realistic in practice. Furthermore, our algorithm avoids the use of Markov Chain Monte Carlo algorithm, hence computationally is very fast. We run all simulations on a desktop with Intel core i5-3210M and 4G memory, for each simulation of 10 samples with sample length T=3000, 4000, 5000 and 6000, our algorithm takes about 2.8-6 seconds to obtain the estimates for θ lt and s l , while the hierarchical HMM model takes over 10-20 minutes to get its estimates. Web Table  3 summarizes the identification ratios and the corresponding standard errors (in parentheses) using Shah et al.'s model for different settings. Each cell is based on 100 simulations. We shall note that the hidden states s t in our setting are very close to each other due to the large signalto-noise ratios, hence it is not easy to make a correct state calling. The identification ratios of the hierarchical HMM are very good (all their copy number deletions, since the most common alterations for serous histology of OV cancer are deletions of 17p [26,28,29]. There are totally 178 and 136 unique known genes involving in recurrent CNA regions for chromosomes 1 and 17 respectively. These known genes are subjected to pathway exploration using the Ingenuity Pathway Analysis (IPA) software (Ingenuity Systems, Redwood City, CA). Significantly enriched pathways with Fishers exact P-values less than 0.05 are listed in this bar plot as shown in Figure 3. Yellow square line in the figure represents ratio which is the number of focus genes in the pathway divided by the total number of genes that make up that pathway. For chromosome 1, most of the pathways are related to cancer. Notably, as listed on the 7 th , the breast cancer signaling was found enriched. Furthermore, a few hormone metabolism pathways are involved, which includes PXR/RXR activation, Estrogen receptor signaling and Aldosterone signaling in Epithelial Cells. This is consistent with current knowledge of disrupted hormone metabolism pathways as important causal factors in breast cancer [30][31][32]. In addition, a few important cellular pathways are revealed: The NRF2mediated oxidative stress response turns out to be most significantly changed in the list, which has been related to breast cancer. The G-protein signaling pathway, which is well-known to be related to cancer, is listed on the second. A basic transcription factor related pathway is ranked on the 3 rd . And listed on the 4 th , Ubiquitination regulates degradation of cellular proteins by the ubiquitin proteasome system, controlling a proteins half-life and expression levels. A change of ubiquitination activity is associated with ovarian tumorigenesis, so the protein ubiquitination pathway might be involved in breast ovarian progression. Finally, one of the most important developmental pathway in mammals, Notch signaling also known to play a role in cancer [33]. For chromosome 17, the pathway enrichment result reveals some biological mechanisms and pathway changes involved in ovarian cancer. First obviously, the ovarian cancer signaling pathway was found enriched. Particularly, the GADD45 and p53 signaling pathways are enriched. Both these two factors, especially p53, are well established tumor suppressor proteins. More importantly, almost half of the pathways are basic and critical cellular processes such as DNA repair, cell cycle regulation and apoptosis. Changes in these pathways indicate severe disruptions of normal cellular functions. This could be either the cause or the result of cancer.

Analysis for lung cancer data
There are two main types of lung cancer, small cell lung carcinoma (SCLC) and non-small cell lung carcinoma (NSCLC) [34]. NSCLC is the most common type of lung cancer, accounting for about 85% of total lung cancers. NSCLC is mainly comprised of adnenocarcinoma, squamous cell carcinoma and large cell carcinoma. About 30% of lung cancers are squamous cell carcinoma. Previous cancer studies have revealed that multiple tumor suppressor genes are involved in deletions at multiple chromosomal regions in lung carcinogenesis, and the most frequent deletions in lung cancer tissues are at chromosome 3, 13 and 17.
We use our model to analyze the CNA data for Lung squamous cell carcinoma (one type of non-small cell lung cancer) based on Array based-CGH technology. The data used in our study, consisting of the CNAs from 10 cancer patients, were published on October 22 nd , 2010 in the Cancer Genome Atlas (TCGA) database (http://cancergenome. nih.gov/). We analyze the whole 23 chromosomes. Since the existing literature shows that the most frequently affected chromosome in this type of lung cancer is chromosomes 17, we present our result on chromosome 17.
There are totally 13,575 probes on chromosome 17. Let k=1, 2 and 3 denote the amplification, baseline and deletion. We first use the proposed model and inference procedure to estimate hyper parameters and then fit the model to the data. We run our model on a desktop with Intel core i5-3210M and 4G memory, and it takes 30 seconds for chromosome 17. The estimated signal levels θ lt for chromosomes 17 are summarized in Web Figure 4. We can see that our procedure analyzes  all samples and estimates signal θ lt for each sample simultaneously, which avoids the weakness of two-stage analysis. As our interest here is the recurrent CNA region, we now focus on the estimated probabilities of s t , which are plotted in Figure 4. Those estimated probabilities indicate that the recurrent copy number amplifications at long arm of chromosome 17, which contains the oncogene ERBB2 at 17q12. Two regions of deletion can be found at short arm and long arm of chromosome 17 respectively, which contain the well-known tumor suppressor genes TP53 at 17p13.1, BRCA1 and CRHR1 at 17q21.31. Our results are consistent with earlier studies [35,36].

Conclusions
We have developed a stochastic segmentation model and an associated inference procedure for recurrent CNA data. The model implies explicit recursive formulas for both the posterior distribution of individual samples' signal levels and the probabilities of the crosssample recurrent events at each probe. This further suggests the estimate of the recurrent states of CNAs. To speed up the computation for practical purpose, an approximation to the exact explicit formulas is developed, and the computational complexity is reduced to linear order. Estimation of hyper parameters involves an explicit EM algorithm which is described in the Web Appendix D.
In Section 4, we have analyzed two real datasets to illustrate the application of our model. In particular, we identify the recurrent CNAs regions using the copy number data for ovarian serous cystadenovarcinoma and non-small lung cancer carcinoma that are produced by the array-CGH technology. The estimated CNA regions by our model are consistent with the biological discovery in medical study. For ovarian serous cystadenovarcinoma, we further perform a canonical pathway analysis to evaluate our result, and find our pathway enrichment results yield significant pathways and most of them are cancer related pathways. Our result based on chromosomes 1 and 17 already reveals certain biological mechanisms and pathway changes involved in ovarian cancer. These facts demonstrate that our model can successfully capture recurrent CNA regions and generate promising results in biological context.