Characterization of tumor heterogeneity by latent haplotypes: a sequential Monte Carlo approach

Tumor samples obtained from a single cancer patient, separated spatially or temporally, often consist of varying cell populations, each harboring distinct mutations that uniquely characterize its genome. Thus, in any given sample of a tumor, the presence of more than two haplotypes, each defined as a scaffold of single nucleotide variants (SNVs) on the same homologous genome, is evidence of heterogeneity: humans are diploid, so we would observe at most two haplotypes if all cells in a tumor sample were genetically homogeneous. We characterize tumor heterogeneity by latent haplotypes and present a state-space formulation of the feature allocation model for estimating the haplotypes and their proportions in the tumor samples. We develop an efficient sequential Monte Carlo (SMC) algorithm that estimates the states and the parameters of our proposed state-space model, which are equivalently the haplotypes and their proportions in the tumor samples. The sequential algorithm produces more accurate estimates of the model parameters when compared with existing methods. Also, because our algorithm processes the variant allele frequency (VAF) of a locus as the observation at a single time-step, VAFs of newly sequenced candidate SNVs from next-generation sequencing (NGS) can be analyzed to improve existing estimates without re-analyzing the previous datasets, a feature that existing solutions do not possess.

Algorithm 1: Sampling from $P(z_t \mid Z_{t-1}, \alpha)$ using the Indian Buffet Process
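As a minimal sketch of the kind of draw Algorithm 1 refers to, the Python code below samples one new row from the standard Indian buffet process conditional: row t switches on an existing feature k with probability m_k / t, where m_k is the number of earlier rows that have feature k, and then adds Poisson(α / t) new features. This is the textbook construction, assumed here for illustration rather than taken from the algorithm itself; the function names `sample_ibp_row` and `sample_poisson`, the list-of-lists representation of Z, and the reading of Z_{t-1} as the first t-1 rows are all assumptions.

```python
import math
import random


def sample_poisson(lam, rng):
    """Draw from Poisson(lam) with Knuth's multiplication method (fine for small lam)."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def sample_ibp_row(Z_prev, alpha, rng):
    """Sample row z_t given the previous t-1 rows under the standard IBP prior.

    Z_prev: list of t-1 binary rows, all of the same length K (may be empty).
    Returns (z_t, K_new): the new row of length K + K_new, and K_new.
    """
    t = len(Z_prev) + 1
    K = len(Z_prev[0]) if Z_prev else 0
    # Existing feature k is switched on with probability m_k / t,
    # where m_k counts the earlier rows that have feature k.
    counts = [sum(row[k] for row in Z_prev) for k in range(K)]
    z_t = [1 if rng.random() < m_k / t else 0 for m_k in counts]
    # Poisson(alpha / t) brand-new features, all switched on for row t.
    K_new = sample_poisson(alpha / t, rng)
    return z_t + [1] * K_new, K_new


# Hypothetical usage: grow a 5-row matrix, zero-padding earlier rows
# whenever a row introduces new columns.
rng = random.Random(0)
Z = []
for _ in range(5):
    z_t, K_new = sample_ibp_row(Z, alpha=1.0, rng=rng)
    Z = [row + [0] * K_new for row in Z] + [z_t]
```

In this sketch the caller is responsible for zero-padding the earlier rows, which keeps the sampling function free of side effects.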

S2
The procedure for resampling is itemized as follows:
• Interpret each weight $w_t^i$ as the probability of obtaining the sample index $i$.
• Draw $N$ particles from the discrete probability distribution $\{w_t^i\}$ and replace the old particle set with this new one.
• Set all weights to the constant value $w_t^i = 1/N$.
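The three steps above amount to multinomial resampling, which can be sketched in a few lines of Python. The function name `multinomial_resample` and the representation of particles as a plain list are assumptions made for illustration; the input weights are renormalized defensively in case they do not sum exactly to one.

```python
import random


def multinomial_resample(particles, weights, rng):
    """Resample N particles with replacement and reset the weights to 1/N."""
    n = len(particles)
    total = sum(weights)
    # Step 1: treat each (normalized) weight as the probability of index i.
    probs = [w / total for w in weights]
    # Step 2: draw N indices from that discrete distribution and
    # replace the old particle set with the selected particles.
    indices = rng.choices(range(n), weights=probs, k=n)
    new_particles = [particles[i] for i in indices]
    # Step 3: every surviving particle gets the same weight 1/N.
    new_weights = [1.0 / n] * n
    return new_particles, new_weights


# Hypothetical usage with three labelled particles.
rng = random.Random(0)
resampled, uniform = multinomial_resample(["a", "b", "c"], [0.1, 0.7, 0.2], rng)
```

Particles with large weights tend to be duplicated and those with small weights tend to be dropped, while the size of the particle set stays fixed at N.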

S3
In addition to the results presented in the main paper for the simulated datasets, we present additional results for more combinations of T, C, S and r. Table 1 shows the $e_{pts}$, $e_Z$ and $e_W$ computed for the proposed SMC-based, MCMC-based and MAP-based algorithms. In our experiments, we set a 0 = 0.4, a = 6, a 00 = 1, and b 00 = 100. The parameter α was set to 1 for the simulated datasets and to 0.4 for the CLL datasets.

S4
Here, we present the rest of the results obtained from the CLL datasets. Before presenting the results, we briefly describe the data pre-processing for the CLL  [3]). In addition, targeted sequencing was performed on selected somatic substitution sites in protein-coding genes from genomic DNA. The targeted sequencing was done to an average depth of 100,000× (details in [3]). The complete datasets for the WGS and the deeply sequenced somatic mutations are in [3], and the deeply sequenced datasets used in all the analyses in this paper are in Tables 23-28.
In the main paper, we presented part of the results obtained from analyzing the CLL datasets with the proposed SMC algorithm. Here, we present the rest of the results for CLL003, CLL077 and CLL006, together with the results obtained when the CLL datasets are analyzed with the MCMC-based and the MAP-based algorithms. The posterior point estimates of the matrix of haplotypes Z and the matrix of proportions W are presented in Tables 2-16.

S5
In this section, we provide the results obtained on each of the CLL datasets with the manual analysis carried out by [3] and with the method in [1] (PhyloSub). The results are shown in Tables 17-22. In the genotype matrices, each column denotes a subclone, as opposed to a haplotype in our analyses, and a 1 and a 0 denote the presence and absence of a mutation in a subclone, respectively. The clonal proportion matrices show the proportion of each subclone in each of the samples.
Table 9: CLL006: Estimates of the mutational profiles of haplotypes Z in the samples using the MCMC-based algorithm. The genes where the mutations are found are shown in the first column.
Table 13: CLL077: Estimates of the mutational profiles of haplotypes Z in the samples using the MAP-based algorithm.