hsegHMM: hidden Markov model-based allele-specific copy number alteration analysis accounting for hypersegmentation

Background Somatic copy number alternation (SCNA) is a common feature of the cancer genome and is associated with cancer etiology and prognosis. The allele-specific SCNA analysis of a tumor sample aims to identify the allele-specific copy numbers of both alleles, adjusting for the ploidy and the tumor purity. Next generation sequencing platforms produce abundant read counts at the base-pair resolution across the exome or whole genome which is susceptible to hypersegmentation, a phenomenon where numerous regions with very short length are falsely identified as SCNA. Results We propose hsegHMM, a hidden Markov model approach that accounts for hypersegmentation for allele-specific SCNA analysis. hsegHMM provides statistical inference of copy number profiles by using an efficient E-M algorithm procedure. Through simulation and application studies, we found that hsegHMM handles hypersegmentation effectively with a t-distribution as a part of the emission probability distribution structure and a carefully defined state space. We also compared hsegHMM with FACETS which is a current method for allele-specific SCNA analysis. For the application, we use a renal cell carcinoma sample from The Cancer Genome Atlas (TCGA) study. Conclusions We demonstrate the robustness of hsegHMM to hypersegmentation. Furthermore, hsegHMM provides the quantification of uncertainty in identifying allele-specific SCNAs over the entire chromosomes. hsegHMM performs better than FACETS when read depth (coverage) is uneven across the genome. Electronic supplementary material The online version of this article (10.1186/s12859-018-2412-y) contains supplementary material, which is available to authorized users.


Background
Characterizing somatic copy number alterations (SCNAs) is important for understanding tumorgenesis ( [1]), cancer etiology and prognosis ( [2]). In normal cells, two copies of chromosome are inherited from both parents. In contrast, tumor cells frequently contain alterations in copy numbers across the chromosomes, such as deletions, insertions, or amplifications among others ( [3,4]). In addition, tumor tissues always contain normal cells (reduced tumor purity) and frequently show an abnormal number of chromosomes (aneuploidy). These characteristics of the cancer genome and tissue heterogeneity complicate the estimation of SCNAs, in contrast to germline copy number variations (CNVs) analysis where neither should be considered ( [5,6]). Allele-specific SCNA analysis estimates the integer copy number for

Revised Manuscript
Click here to access/download;Manuscript;hsegHMM_revisedManuscript.tex Click here to view linked References  1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64  65 each allele instead of the total copy number, and is essential to identify the copyneutral loss of heterozygosity (NLOH) ( [7,8]). Based on the total copy numbers only, NLOH will be misidentified as normal regions with the copy number two, when one chromosome is duplicated but the corresponding homologous region is deleted ([9]). In this paper, we consider next-generation sequencing (NGS) platform-based whole exome sequencing (WES) data for studying SCNAs. The NGS technology provides high resolution at the single base-pair, which comes with mapping bias and the tendency for hypersegmentation. Mapping bias occurs from higher mapping rates for the reference allele than those for the variant allele at heterozygous loci( [10]). This bias leads to incorrect interpretations of allele-specific SCNAs. Hypersegmentation is also a major challenge in NGS-based allele-specific SCNA. The quality of such data depends on the sample preparation, the library preparation, and polymerase chain reaction (PCR) techniques from applying NGS technology, and the exome enrichment platforms from WES. It has been reported that capture e ciency could vary across the percentage of guanine or cytosine contained in DNA [11]. These technical procedures have a limit of accurate quantification of sequences, which potentially increase measurement errors that in turn result in excessive segmentations.
A number of papers have been proposed to address these challenges and complexities. While PennCNV ( [12]) and QuantiSNP ( [13]) are based on the assumption of 100% tumor purity, ASCAT ( [14]), GPHMM ( [5]), and MixHMM ( [15]) account for both the tumor purity and the ploidy. However, these methods do not explicitly characterize the genotype at each allelic location. Furthermore, these methods use a B-allele frequency (BAF), which is sensitive to mapping bias. Shen and Seshan ( [16]) developed FACETS that uses log Odds Ratio (logOR) instead of BAF, since the logOR of tumor versus normal cells provides unbiased allelic information. FACETS uses a genotype mixture model, providing an allele-specific tumor copy number profile adjusted for the tumor purity and ploidy. However, since the segmentation and genotype mixture modeling are conducted by separate algorithms, it is not possible to assess uncertainty in the estimation of allele-specific SCNA.
In this paper, we propose a novel hidden Markov modeling approach (hsegHMM) for allele-specific SCNA analysis accounting for the hypersegmentation. hsegHMM embeds logOR and logR (log R ratio) into a hidden Markov model (HMM) framework, and simultaneously conducts the segmentation and genotype mixture modeling required to identify SCNAs across chromosomes. Similar to FACETS, the lo-gOR is applied instead of the BAF to adjust for mapping bias. Hypersegmentation, which may result from logR outliers, is accounted for by assuming a t-distribution for the distribution of logR. hsegHMM makes inference about allele-specific SCNAs using the E-M algorithm where we iterate between the E-and the M-step until convergence. The E-step is made tractable by using a recursive forward-backward algorithm that evaluates functions of the hidden locus-specific genotype states given the observed logR and logOR. Given a genotype state, the tumor purity and the ploidy are obtained in the M-step by maximizing the expectation of the conditional log-likelihood function.

Methods
Hidden Markov Model:hsegHMM Modeling with logRatio and logOdds Ratio as outcomes Let Z k be the unknown genotype as the hidden state of the k th genomic location and W k and X k be the corresponding logR and logOR, respectively. We specify 12 di↵erent states of Z k , given in Table 1. The index for hidden states, j is from 1 to J (in this paper, J = 12) and the index for genomic locations, k is from 1 to N . The expectations of logR and logOR are given in Van Loo et al. ( [14]) and Shen and Seshan ( [16]), respectively. Specifically, we can write where C N is the copy number of normal cells prespecified as C N = 2; C T,j is the copy number of tumor cells at the j th state; ↵ is the tumor purity proportion of the tumor tissue over a range of 0 to 1; is the ploidy. For example, if a tumor sample contains 100 % tumor cells and is diploidy, then µ 3 , the expectation of logR at the state j = 3 is 0 along with C T,3 = 2, = 2, and ↵ = 1. The expectation of logOR is given by where m j and p j are the maternal and paternal copy numbers of the tumor at the k th genomic location, respectively. We assume that Z = {Z 1 , Z 2 , Z 3 , · · · , Z N }, the genotype sequence across chromosomes follows a Markov chain with a transition probability, P ij = P (Z k = j|Z k 1 = i) and an initial probability, r 0j = P (Z 1 = j).
P ij indicates the probability that the j th genotype state occurs conditionally on the i th genotype state at the previous location.

Joint emission probability with conditional distributions of logR and logOR
We consider a t-distribution for logR in order to account for outliers that are apparent in NGS data and potentially lead to hypersegmentation. Following Peel and McLachlan ([17]), we specify a t-distribution for W k with a degree of freedom v by a mixture of a normal distribution N (W k |µ j ,  2 /u k ) with a gamma distribution We also specify a normal distribution for logR, W k |Z k = j ⇠ N (µ j , 2 ) to examine how di↵erently these two distributions of logR behave in terms of handling hypersegmentation .   1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64  65 We use the square of logOR due to the lack of the haplotype information ( [16]). The squared logOR (X 2 k ) follows a chi-square distribution, 2 1 (X 2 k /⌧ 2 |Z k = j, j ), with one degree of freedom and a non-centrality parameter j = ⇣ 2 j /⌧ 2 , with the mean and variance ⇣ j and ⌧ 2 , respectively. Finally, our joint emission probability with a t-distribution of logR at the k th location given the state j is When logR follows a normal distribution, the first part on the right-hand side in Equation 4 is replaced by N (W k |Z k = j, µ j , 2 ). Since logOR cannot be obtained in homozygous loci, the emission probability from Equation 4 is re-formulated, depending on the existence of logOR at the k th location (Equation 5).
Here, X k = NA indicates that the k th location is homozygous, and heterozygous otherwise.

Estimation with E-M algorithm
The E-M (expectation-maximization) algorithm ( [18]) is used to estimate the parameters of hsegHMM. In the E-step, the goal is to calculate the posterior probability, P (Z k = j | W , X 2 ) and the joint probability, P (Z k = j, Z k 1 = i| W , X 2 ), where W and X 2 indicate the sets of the logR and squared logOR values over chromosomes, respectively . These probabilities are evaluated by applying the forward-backward algorithm ( [19]) resulting in the following conditional probabilities, where a j (k) = P (W 1:k , X 2 1:k , Z k = j) and b j (k) = P (W k+1:N , X 2 k+1:N |Z k = j); W l:m and X 2 l:m indicate the observed values of logR and squared logOR from the l th locus to m th locus. a j (k) is evaluated by a forward recursion up to the k th observation and the b j (k) by a backward recursion from the last to the (k + 1) th observation. The recursion equations are Then, the loglikelihood is simply computed as log h P J l=1 a l (N ) i . The calculation of a j (k) and b j (k) may result in extremely small values that cause an underflow issue, particularly with large N as in our application. Therefore, the scaled HMM [20] is implemented for all the analyses in this paper.
In the M-step, given the hidden state values obtained from Equations 6 and 7, we maximize the expectation of the conditional log-likelihood function with respect to all the parameters. The expectation of the complete log-likelihood function is given by, where ✓ is a set of global parameters: ✓ = {↵, ,  2 , ⌧ 2 , v}. Given k and ⌘ i,j k in Equation 8, we can estimate all the parameters by maximizing Q in the M-step. For estimating the initial probability r 0j , we maximize Q 0 under the constraint that . Similarily, for estimating P ij under the constraint P J j=1 P (Z k = j|Z k 1 = i) = 1, we obtain the closed form as For estimating all the global parameters, we maximize Q ✓ in terms of them by using the L-BFGS-B optimization algorithm in the R function optim.

Simulations
Initially, we perform two simulation studies for assessing how accurately our proposed model identifies true genotype states when hypersegmentation occurs. For these two studies, we generate 500 datasets with true genotype sequences of size N = 10, 000 based on a Markov chain with a given transition probability, using the R-package markovchain. We consider four genotypes as true states: A, AA, AAAB, and AAAAB. The first two genotypes are chosen for hemizygous deletion and neutral LOH (loss of heterozygosity), while the last two are chosen for the amplification events. Moreover, AAAB (j = 7) and AAAAB (j = 10) give similar expectations for both logR and logOR, making it more challenging to distinguish between these two genotypes. In this study where = 1.6 and ↵ = 0.9, µ 7 and µ 10 are 1.25 and 1.55 for logR; ⇣ 7 and ⇣ 10 are 1.03 and 1.31 for positive values of logOR. For each simulation study, the hsegHMM-N and the hsegHMM-T models are applied; The observed logOR j is generated by X k = ⇣ j +" k where " k is normally distributed in these two simulation studies.

t-distribution-based logR
In this simulation, we simulate hypersegmentation using a t-distribution for logR and examine both hsegHMM-N and hsegHMM-T models. We start with generating 1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64  65 logR values from the t-distributions with v(= 4) degrees of freedom. The squared values of logOR are generated from the chi-square distribution with one degree of freedom. Similar to the TCGA-KL-8331 dataset used in the TCGA study section, we assign 90% of loci to be homozygous. For each locus, the allele-specific SCNA is identified by choosing the genotype with the largest posterior probability. In order to evaluate the accuracy of our models, we estimate the probability of a correct identification across the genome. First, we create the classification index variable which is set to be 1 if the estimated genotype is correctly assigned for each locus in one simulation, and zero otherwise. We then calculate the probability of identification of each genotype across locus by averaging the index across 500 datasets. Figure 4 shows the probability of identification for all genotype states. It is obvious that the red lines (by the hsegHMM-T model) are significantly closer to the true signal (black line) than the blue lines (by the hsegHMM-N model) for all genotypes. In particular, the accuracy plot for the genotype AAAB shows much lower blue lines, compared with the red ones. For instance, the 5909 th locus is truly assigned to genotype AAAB, and the probability of identification for AAAB at the location is 0.488 with the hsegHMM-N model but 0.940 with the hsegHMM-T model. This indicates that AAAB is harder to correctly identify when we use hsegHMM-N model rather than the correct hsegHMM-T model.
We examine the statistical properties of the global parameter estimates in Table 3. SEs are empirical standard errors and SE H are computed by averaging 500 asymptotic standard errors based on the Hessian matrices. The Hessian matrix for each dataset is numerically obtained by hessian in R package numDeriv. The estimates are unbiased even under the misspecified normal model. However, the asymptotic standard error estimates are sensitive to misspecification of the logR distribution (SE H is di↵erent from SEs under the misspecified hsegHMM-N model).

A mixture of normals-base logR
We examine the robustness of the t-distribution to alternative distributions that exhibit long tails. Specifically, we simulate under a mixture of normal distributions and examine the robustness of the hsegHMM-T model. Errors are generated by ✏ k ⇠ where ⇡ 1 and ⇡ 2 indicate the mixture proportions of the first and the second distributions, respectively. The means and variances for those two normal distributions are chosen under the condition of E(✏ k ) = 0 and In this simulation, ⇡ 1 and ⇡ 2 are set as 0.7 and 0.3 with µ ✏1 = µ ✏2 = 0, 2 ✏1 = 0.5, and 2 ✏2 = 0.5 ⇥ 3 2 . Then, the total error variance 2 ✏ is 2.25. Both the probability of identification ( Figure 5) and the summary of estimators with corresponding standard errors, SEs and SE H (Simulation 2 in Table 3) are shown in the same way as described in the previous Section (t-distribution-based logR). Figure 5 shows that all the red lines based on the hsegHMM-T model appear noticeably closer to the black lines than the blue lines based on the hsegHMM-N model for all the genotype states. In particular, detecting both AAAB and AAAAB with the hsegHMM-T model performs much better than the hsegHMM-N model.
A single dataset-based result is provided to investigate more closely how much more robust the hsegHMM-T model is than the hsegHMM-N model to cope with hypersegmentation ( Supplementary Figs S1-S4). The first two figures (Supplementary Figs S1 and S2) and the last two figures ( Supplementary Figs S3 and S4) show the copy number profiles of a particular dataset from the first and second simulation scenarios, respectively. Similar to the analysis results, hsegHMM-T appears to handle hypersegmentation much better than hsegHMM-N. This same pattern was observed for all simulated datasets (data not shown).
We also perform additional simulation studies for di↵erent values of purity (↵ = 0.3, 0.5, 0.7) and di↵erent numbers of reads (half and double related to the original from Figure 6). Based on 500 simulated datasets, our proposed model performs better as the purity increases in terms of a higher probabilities of correct genotype identification ( Figure S5).

A read counts-based simulation for the comparison of hsegHMM and FACETS
In this simulation, we compare our method, hsegHMM with FACETS which also constructs its model based on logR and logOR but with a segmentaion-based approach for allele-specific SCNA analysis. In order to make a fair comparison between these two methods, we generate datasets from read counts and read depth (coverage) in the beginning without relying on model assumptions of hsegHMM or FACETS. In this study, we choose the true profile of genotypes as A, AB, and AA orderly with di↵erent lengths over the entire chromosomes.
First, we consider the fact that the total coverages for normal cells and tumor cells are di↵erent. Thus, we assign di↵erent read depths for tumor and normal cells as 160 and 40 in this study. Then, for normal cells, the read depth is 40 and the read count of A allele is either 20 for AB (heterozygosity) or 40 for AA (homozygosity) regardless of the true genotypes for tumor. On the other hand, for tumor cells, both read depth and read counts are assigned depending on di↵erent genotypes. For the genotype AB, the read count and depth are set to be 80 and 160, respectively. For the genotype AA, the read count and depth are the same as 160. For the genotype A, the read count and depth are both 80 as the same number with a half of the total coverage since B allele is lost.
In order to generate read counts and depths, we use a uniform distribution with di↵erent intervals for normal and tumor samples. The intervals provide the variation occuring from measurement errors. For normal, read counts and depths are generated from a uniform distribution with the range of ±15 intervals for both the genotype AA and AB, and the range of ±20 for the genotype A. For tumor, read counts and depths are generated from either of these two uniform distributions with the range of ±30 and ±15 with probability of 70% and 30%, respectively. This setting provides logR and logOR values with di↵erent variances and asymmetric ranges between genotype A and the others, which makes more challenging to analyze. Finally, we round the decimal values of read counts and depths from these continuous uniform distributions to the nearest intergers.  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64  65 Through the preprocedure of FACETS, we obtain total 4,942 values of logR and logOR based on the integer values of read counts and depths, for which both hsegHMM and FACETS are applied. Figure 6 shows how di↵erent these two methods behave through the probability of identification plots for all the di↵erent genotype states. These plots are based on 500 simulated datasets. hsegHMM-N (red lines) and hsegHMM-T (blue lines) have almost the exact patterns of the identification plots with FACETS (green lines) for all the genotypes. On the other hand, when the read depth distribution was skewed (a rescaled beta distribution with shape parameter values of 1 and 6), FACETS did poorly in region identification as compared to hsegHMM. Specifically, we consider both the cases of short and wide region length. Figure S6 shows the probabilities of correct identification for a short region based on 500 simulated datasets. hsegHMM-T identified the mutation in approximately 96% of the datasets as compared with 5% using FACETS (the left panel in Figure S6). For the wider region, FACETS improved relative to hsegHMM, but there was still a marked improvements of our approach (the right panel in Figure  S6): 83% and 99% identification for FACETS and hsegHMM, respectively. These results show the advantage of hsegHMM as compared with FACETS for uneven coverage.
We also examine our method with di↵erent numbers of reads for read counts and depths by reducing half size and increasing double size of them ( Figures S7 and  S8). For both the half and double read size cases, our model shows similar results to those from the original size of reads ( Figures S7 and S8). FACETS showed similar behavior when the read counts were altered (Figures S7 and S8).

TCGA-KL-8331 renal cell carcinoma dataset
TCGA project (https://cancergenome.nih.gov) is a cancer genomic collaboration between the National Cancer Institute (NCI) and the National Human Genome Research Institute (NHGRI). This project includes critical genomic information of 33 types of cancers with more than two petabytes of TCGA genomic dataset to contribute cancer etiology, treatment, and diagnosis. In this research, we apply hsegHMM to whole-exome sequencing data from a chromophobe renal cell carcinoma (RCC) sample (TCGA-KL-8331).
TCGA-KL-8331 dataset consists of read counts and total depths for both normal and tumor paired tissues from the same patient over the entire chromosomes. This dataset contains 1, 217, 407 single nucleotide variants (SNVs). Through FACETS pre-processing step, these 1.2 MB SNVs were reduced to 369,131 SNVs, which are limited to the germline polymorphic sites and filtered by low quality including lower depth coverage positions (see the details in the Data pre-processing section in [16]). Thus, observed logR and logOR are calculated for N = 369, 131 loci. In this RCC sample, we find that approximately 13% (47,660) of loci are heterozygous with the corresponding logOR available. For computational feasibility, we perform a thinning process which keeps every 10 th observation. This also reduces auto-correlations between observations, which helps alleviate hypersegmentation. We apply the hsegHMM procedure to the final dataset N = 36, 914.  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64  65  Figures 1 and 2 show the results based on an assumed normal and t-distribution for the logR values given the hidden state (genotype state), respectively. We denote these two models as hsegHMM-N and hsegHMM-T, respectively. Each figure includes four panels corresponding to the estimated values of logR, logOR, copy numbers, and genotype status. With the hsegHMM-N model (Figure 1), estimated lines (brown color) show not only the main signals (longer bars) but also numerous dots across the chromosomes. These small dots occur due to the sensitivity of the hsegHMM-N model to extreme observations. The hsegHMM-T model reduces hypersegmentation with fewer short subsequences (Figure 2.(a)). However, a few numbers of short sequences still occur in using the hsegHMM-T model. Thus, instead of using the 12 genotype-state space, we consider only two major genotypes, A and AB identified by the hsegHMM-T with the 12 genotype states. It turns out that all the short dots are removed across the entire chromosomes ( Figure 2.(b)). Thus, the hsegHMM-T model with the two major genotype states manages hypersegmentation most e ciently among those three di↵erent model fits. Furthermore, according to the model fitting criteria, the hsegHMM-T model with A and AB genotype states (hsegHMM-T A/AB ) fits data best with the smallest AIC (62923.90) and BIC (62992.03). We also apply the FACETS method to compare the result with our method. The hsegHMM-T with the two major genotype states have almost the same allele-specific copy number profiles with FACETS in Figure 3.
The tumor sample purity ↵ is estimated to be about 87-88% for all the methods, which indicates a high proportion of the tumor cells in the tumor tissue. The estimated ploidy, b (⇡ 1.6) appears to be di↵erent from 2 in all the methods, which provides evidence for aneuploidy in this sequence. Note that the tumor purity, ploidy and variances of logR (V (W )) are similar in all the three hsegHMM models. This suggests that estimation of global parameters are robust to the distribution of logR and to an expanded genotype state space. This is in contrast to the allele specific genotype status that does appear to be sensitive to the distribution of logR and to the specification of an appropriate state space of genotypes, and hence to hypersegmentation.

Discussion
We have shown that the hidden Markov modeling approach provides an e↵ective way to identify allele-specific copy number alternations along the genome. As compared with FACETS, a segmentation-based approach, the hsegHMM provides an assessment of the uncertainty in parameter estimate (i.e. ploidy and purity), using likelihood-based estimate of variances as well as the ability to assess variability in copy number identification by computing posterior estimates of the genotype at each locus. It is also important to mention that hsegHMM is based on the output of WES, which in turn relies on the exome enrichment platforms where capture e ciency may still a↵ect SCNA estimation.
A major focus of the paper was demonstrating that hypersegmentation in allelespecific SCNA data can be substantially reduced by incorporating a long-tailed emission distribution (hsegHMM-T model) into a HMM framework. We also found that hypersegmentation could occur by choosing a state space (possible genotypes) that is more expansive than necessary. Thus, we recommend that the most parsimonious model with a limited number of genotype states be chosen. Of course, the 1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64  65 choice of this model should be based on using penalized likelihood methods such as AIC. Last, the hsegHMM assumes that logR and logOR measurements given genotype are independent across the entire chromosomes. This may not be true when loci are very close together, and failure of this assumption may lead to hypersegmentation. We therefore recommend thinning the sequence data (e.g., taking only one out of ten data points) to avoid this problem.
The application of hsegHMM can be extended in three future directions that have important applications in cancer genetics. First, hsegHMM can be applied to a population-based study where many subjects will be analyzed. In this case, we suggest that individual-specific analyses be conducted and the results combined in a final analysis. For example, evidence of a SCNA being related to a particular cancer may be suggested if a sizable proportion of the posterior probabilities of a genotype at a particular chromosome location are greater than a certain threshold (e.g. > 80%). Second, the relationship between a genetic factor and a subject-specific covariate may be examined in a second stage regression. For example, by using all the individual ploidy estimates obtained from the population-based study, we can construct a linear regression of the log ploidy estimate, log b with a set of any covariates such as log b = 0 + 1 age+ 2 BMI. As an illustration for a populationbased study, we have analyzed all 316 renal cell carcinoma samples from TCGA with the proposed model based on 5 copy number of state-space. We obtained the distribution of estimated ploidy across all the samples for any major copy number alteration event across the chromosomes. We also estimated the distribution of purity which is a measure of the quality of the tissue samples. Furthermore, we created a cytoband-based stacked histogram of allele-specific SCNA events for integrating allele-specific SCNA profiles from all the 316 samples. Each sample has its own allele-specific SCNA profile with di↵erent genotypes and regions. To standardize genetic locations across the samples, we used a cytoband file format which has predefined positions of cytobands across the whole chromosomes. For each cytoband, the corresponding allele-specific SCNA event is assigned within individual sample, and the number of times each event occurs is counted. After counting the frequencies of all the cytobands, we found that the most frequent mutation is a hemizygous deletion (genotype A) that has highest frequency on Chromosomes 3 and 14. In addition, we found a high frequency of a Gain (genotype AAB) in the region between q21.3 and q35.3 on Chromosome 5 ( Figure S9). Last, our model structure can be extended to infer tumor subclonal populations. In practice, a tumor sample contains a mixture of clones not just one main clone which is assumed in hsegHMM. Such an approach can also be embedded into a hidden Markov modeling framework, and is the subject of future research.

Conclusions
In this paper, we propose a hidden Markov model framework (hsegHMM) for estimating genotype status as well as copy number at each locus, incorporating the complexities of tumor samples as well as hypersegmentation. Specifically, under certain type of data with more fluctuated or irregular observations, hsegHMM-T model performs better than hsegHMM-N model in terms of such a remarkable reduction of hypersegmentation. As a byproduct of the hsegHMM estimation procedure ,   1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64  65 we can compute the posterior probabilities of allele-specific genotype status (the Method section) as well as provide a rigorous comparison of di↵erent models (e.g. normal versus t-distribution) by using AIC and BIC (the Result section). Hence, hsegHMM provides a rigorous framework for statistical inference and model assessment. hsegHMM can also expand the genotype state space so that it can handle a more flexible range of copy number alterations. Specifically, this flexibility is useful for analyzing data from certain type of cancers with high-level amplification events. Simulation studies showed that hsegHMM-T performed much better than FACETS in situation where the coverage (read depth) is uneven across the genome.
the asymptotic standard error of V (W ) with the hsegHMM-T is calculated by using the Delta method. 4 The distribution of the ploidy estimates is skewed so the SEs of the ploidy appears to be larger than SE H . Using the scaled MAD (median absolute deviation) gives a closer value (0.008) to SE H ; ✓m is the estimate for the m th dataset and b ✓ med is the median calculated from 500 simulated datasets .   1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64  65 Additional Files Figure S1 -Allele-specific SCNA analysis based on the hsegHMM-N model of a simulated dataset for the simulation study with logR generated from t-distribution The first two panels show the profiles of logR and logOR over the entire chromosomes; The last two panels indicate estimated copy numbers and genotype for each sequence over the entire chromosomes. Figure S2 -Allele-specific SCNA analysis based on the hsegHMM-T model of a simulated dataset for the simulation study with logR generated from t-distribution The first two panels show the profiles of logR and logOR over the entire chromosomes; The last two panels indicate estimated copy numbers and genotype for each sequence over the entire chromosomes. Figure S3 -Allele-specific SCNA analysis based on the hsegHMM-N model of a simulated dataset for the simulation study with logR generated from normal-mixture distribution The first two panels show the profiles of logR and logOR over the entire chromosomes; The last two panels indicate estimated copy numbers and genotype for each sequence over the entire chromosomes. Figure S4 -Allele-specific SCNA analysis based on the hsegHMM-T model of a simulated dataset for the simulation study with logR generated from normal-mixture distribution The first two panels show the profiles of logR and logOR over the entire chromosomes; The last two panels indicate estimated copy numbers and genotype for each sequence over the entire chromosomes.      1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64