A statistical approach for tracking clonal dynamics in cancer using longitudinal next-generation sequencing data

Abstract

Motivation: Tumours are composed of distinct cancer cell populations (clones), which continuously adapt to their local micro-environment. Standard methods for clonal deconvolution seek to identify groups of mutations and estimate the prevalence of each group in the tumour, while accounting for its purity and copy number profile. These methods have been applied to cross-sectional data and to longitudinal data after discarding information on the timing of sample collection. Two key questions are how we can incorporate such information into our analyses, and whether there is any benefit in doing so.

Results: We developed a clonal deconvolution method which explicitly incorporates the temporal spacing of longitudinally sampled tumours. By merging a Dirichlet Process Mixture Model with Gaussian Process priors and using as input a sequence of several sparsely collected samples, our method can reconstruct the temporal profile of the abundance of any mutation cluster supported by the data as a continuous function of time. We benchmarked our method on whole genome, whole exome and targeted sequencing data from patients with chronic lymphocytic leukaemia, on liquid biopsy data from a patient with melanoma and on synthetic data, and we found that incorporating information on the timing of tissue collection improves model performance, as long as data of sufficient volume and complexity are available for estimating free model parameters. Thus, our approach is particularly useful when collecting a relatively long sequence of tumour samples is feasible, as in liquid cancers (e.g. leukaemia) and liquid biopsies.

Availability and implementation: The statistical methodology presented in this paper is freely available at github.com/dvav/clonosGP.

Supplementary information: Supplementary data are available at Bioinformatics online.


Accounting for sequencing errors
It is not unusual for NGS technologies to introduce errors, which manifest as low-frequency artefacts in the generated sequencing data. A common strategy for handling these errors is to filter the data before further downstream analyses. For example, after initial screening of somatic variation data, the investigator may decide that variants with low VAF values (e.g. less than 1%) are noise and proceed to remove them from the dataset. Alternatively, such sequencing errors may be handled directly by appropriate modifications of the Beta-Binomial noise model. Assuming a small sequencing error rate $\epsilon$, the beta-binomial model can be written as $r_{ij} \sim \mathrm{BBin}(R_{ij},\, u_j g_{ij},\, u_j (1 - g_{ij}))$, where $g_{ij} = \epsilon + (1 - \epsilon) f(\phi_{ij})$. In the absence of overdispersion ($u_j \to \infty$), the above reduces to the binomial model: $r_{ij} \sim \mathrm{Bin}(R_{ij},\, \epsilon + (1 - \epsilon) f(\phi_{ij}))$. Under this formulation, the expected value of the VAF is modelled as a mixture of two components: tumour heterogeneity, which is summarised by the function $f(\phi_{ij})$, and the sequencing error $\epsilon$. When the error rate $\epsilon$ is small, $g_{ij} \approx f(\phi_{ij})$, and the above two models reduce to the forms presented in the main text.
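The error-mixture construction above can be sketched in a few lines of Python. This is a minimal illustrative sketch, not part of clonosGP; the function names are ours, and the beta-binomial draw uses the standard Beta-then-Bernoulli composition.

```python
import random

def adjusted_vaf(f, eps):
    """Expected VAF under the error-mixture model: g = eps + (1 - eps) * f."""
    return eps + (1.0 - eps) * f

def sample_reads_binomial(R, f, eps, rng):
    """r ~ Bin(R, g): mutated read count in the absence of overdispersion."""
    g = adjusted_vaf(f, eps)
    return sum(rng.random() < g for _ in range(R))

def sample_reads_betabinomial(R, f, eps, u, rng):
    """r ~ BBin(R, u * g, u * (1 - g)): overdispersed counts, obtained by
    first drawing a per-locus success probability from a Beta distribution."""
    g = adjusted_vaf(f, eps)
    p = rng.betavariate(u * g, u * (1.0 - g))
    return sum(rng.random() < p for _ in range(R))
```

As the concentration u grows, the Beta draw concentrates at g and the beta-binomial collapses to the binomial model; as eps approaches 0, g reduces to f, recovering the error-free models in the main text.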

Performance metrics
In our benchmarks on synthetic data, we have used three different performance metrics: the Adjusted Rand Index (ARI; [1]), the Adjusted Mutual Information (AMI; [2]) and the Fowlkes-Mallows Index (FMI; [3]). All three scores are robust against agreement-by-chance between any two clusterings and against anisotropic cluster shapes. Below, we give details on how these metrics are calculated. Given a set of $N$ mutations and two independent clusterings of these mutations $X = (X_1, \ldots, X_k)$ and $Y = (Y_1, \ldots, Y_l)$, we can construct a contingency table with entries $n_{ij}$, the number of mutations in common between $X_i$ and $Y_j$, and with row- and column-wise sums $a_i = \sum_j n_{ij}$ and $b_j = \sum_i n_{ij}$, respectively. Then, ARI is given by the following expression:
$$\mathrm{ARI} = \frac{\sum_{ij} \binom{n_{ij}}{2} - \left[\sum_i \binom{a_i}{2} \sum_j \binom{b_j}{2}\right] \Big/ \binom{N}{2}}{\frac{1}{2}\left[\sum_i \binom{a_i}{2} + \sum_j \binom{b_j}{2}\right] - \left[\sum_i \binom{a_i}{2} \sum_j \binom{b_j}{2}\right] \Big/ \binom{N}{2}}$$
where $\binom{c}{d}$ indicates the binomial coefficient. ARI takes values between -1 and 1, where negative or close-to-0 values indicate poor agreement between the two clusterings.
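The ARI expression above can be computed directly from the contingency table. Below is a minimal pure-Python sketch (function names are ours; library implementations such as scikit-learn's `adjusted_rand_score` compute the same quantity):

```python
from math import comb

def contingency(x, y):
    """Contingency table n_ij between two label vectors x and y."""
    table = {}
    for xi, yi in zip(x, y):
        table[(xi, yi)] = table.get((xi, yi), 0) + 1
    return table

def adjusted_rand_index(x, y):
    """ARI computed from pair counts, following the expression above."""
    n = contingency(x, y)
    N = len(x)
    a, b = {}, {}                      # row sums a_i and column sums b_j
    for (i, j), nij in n.items():
        a[i] = a.get(i, 0) + nij
        b[j] = b.get(j, 0) + nij
    index = sum(comb(nij, 2) for nij in n.values())
    sum_a = sum(comb(ai, 2) for ai in a.values())
    sum_b = sum(comb(bj, 2) for bj in b.values())
    expected = sum_a * sum_b / comb(N, 2)
    max_index = (sum_a + sum_b) / 2
    return (index - expected) / (max_index - expected)
```

Note that the score depends only on co-membership of pairs, so relabelling the clusters of either input leaves it unchanged.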
Next, the calculation of AMI requires first calculating the Mutual Information (MI) between clusterings $X$ and $Y$, which is defined as follows:
$$\mathrm{MI}(X, Y) = \sum_i \sum_j \pi_{ij} \log \frac{\pi_{ij}}{p_i q_j}$$
where $p_i = a_i / N$, $q_j = b_j / N$ and $\pi_{ij} = n_{ij} / N$. Then, the AMI is calculated as:
$$\mathrm{AMI}(X, Y) = \frac{\mathrm{MI}(X, Y) - \mathbb{E}[\mathrm{MI}]}{\frac{1}{2}(H_X + H_Y) - \mathbb{E}[\mathrm{MI}]}$$
where $H_X = -\sum_i p_i \log p_i$ and $H_Y = -\sum_j q_j \log q_j$ are the entropies of $X$ and $Y$, respectively. The expected MI (assuming $X$ and $Y$ are random) is given by the following expression:
$$\mathbb{E}[\mathrm{MI}] = \sum_i \sum_j \sum_{n = \max(1,\, a_i + b_j - N)}^{\min(a_i,\, b_j)} \frac{n}{N} \log\!\left(\frac{N n}{a_i b_j}\right) \frac{a_i!\, b_j!\, (N - a_i)!\, (N - b_j)!}{N!\, n!\, (a_i - n)!\, (b_j - n)!\, (N - a_i - b_j + n)!}$$
Values of AMI close to 0/1 indicate poor/excellent agreement between clusterings $X$ and $Y$.
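A corresponding sketch for the AMI is below (naming is ours). Two implementation notes, both our assumptions rather than statements from the source: we normalise by the arithmetic mean of the two entropies, which is one standard choice (e.g. scikit-learn's default), and we compute the expected MI via the hypergeometric probability of each table entry, which is algebraically equivalent to the factorial expression above.

```python
from math import comb, log

def adjusted_mutual_information(x, y):
    """AMI of two label vectors; E[MI] is computed under the
    hypergeometric model of random clusterings with fixed marginals."""
    N = len(x)
    a = {i: x.count(i) for i in set(x)}          # row sums a_i
    b = {j: y.count(j) for j in set(y)}          # column sums b_j
    n = {}
    for xi, yi in zip(x, y):
        n[(xi, yi)] = n.get((xi, yi), 0) + 1
    # MI and entropies from the empirical contingency table
    mi = sum(nij / N * log(N * nij / (a[i] * b[j]))
             for (i, j), nij in n.items())
    hx = -sum(ai / N * log(ai / N) for ai in a.values())
    hy = -sum(bj / N * log(bj / N) for bj in b.values())
    # Expected MI over random tables with the same marginals
    emi = 0.0
    for ai in a.values():
        for bj in b.values():
            for nij in range(max(1, ai + bj - N), min(ai, bj) + 1):
                p = comb(ai, nij) * comb(N - ai, bj - nij) / comb(N, bj)
                emi += p * nij / N * log(N * nij / (ai * bj))
    return (mi - emi) / ((hx + hy) / 2 - emi)
```

For identical clusterings, MI equals the common entropy, so the ratio evaluates to 1 regardless of the chosen normaliser.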
Finally, the FMI can be calculated easily as the geometric mean of the pairwise precision and recall, using the following expression:
$$\mathrm{FMI} = \sqrt{\frac{TP}{TP + FP} \cdot \frac{TP}{TP + FN}} = \frac{TP}{\sqrt{(TP + FP)(TP + FN)}}$$
where $TP$ is the number of pairs of mutations that belong to the same cluster in both $X$ and $Y$; $FP$ is the number of pairs that belong to the same cluster in $X$, but not in $Y$; and $FN$ is the number of pairs that belong to the same cluster in $Y$, but not in $X$. As with the AMI, values of FMI close to 0/1 indicate poor/excellent agreement between $X$ and $Y$.
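Pair counting makes the FMI straightforward to sketch (again, the function name is ours):

```python
from itertools import combinations
from math import sqrt

def fowlkes_mallows_index(x, y):
    """FMI = TP / sqrt((TP + FP) * (TP + FN)) via explicit pair counting."""
    tp = fp = fn = 0
    for (x1, y1), (x2, y2) in combinations(list(zip(x, y)), 2):
        same_x, same_y = x1 == x2, y1 == y2
        if same_x and same_y:
            tp += 1          # pair co-clustered in both X and Y
        elif same_x:
            fp += 1          # pair co-clustered in X only
        elif same_y:
            fn += 1          # pair co-clustered in Y only
    return tp / sqrt((tp + fp) * (tp + fn))
```

The explicit loop is O(N^2) in the number of mutations; for large N, the same counts can be obtained from the contingency table, as library implementations do.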

Generation of synthetic data
For a given number of samples $M$, mutations $N$ and mutation clusters $K$, data are simulated as follows: a) for each sample $j$, we randomly choose a purity value $\rho_j$ between 80% and 90% and a random collection time $t_j$ (with the first sample collected at time 0 and the last at time 1); b) for each cluster $k$, we sample a set of values $\{\psi_{jk}\}_j$ from a Gaussian process prior with squared amplitude $h^2$ and inverse squared time scale $\tau$; we calculate each $\phi_{jk}$ as a sigmoid function of $\psi_{jk}$; c) for each mutation $i$, we randomly sample a cluster membership indicator $z_i$ between 1 and $K$; d) finally, for each mutation $i$ in each sample $j$, we sample the total number of reads $R_{ij}$ from the empirical distribution of total reads in the data and then the number of mutated reads $r_{ij}$ from a Binomial distribution: $r_{ij} \sim \mathrm{Bin}(R_{ij}, \frac{1}{2} \rho_j \phi_{j z_i})$. We generated data with $M \in \{3, 6, 12\}$, $N \in \{25, 50, 100\}$ and $K \in \{2, 4, 8\}$. For $h^2$ and $\tau$, we used the values $\{1, 10, 20\}$ and $\{1, 10, 100\}$, respectively, which cover the range of values estimated from the actual data presented in the main text. For each of the 243 combinations of these parameters, we generated 3 replicates, leading to a total of 729 datasets.
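Steps a) to d) above can be sketched as follows. This is an illustrative reconstruction, not the authors' simulation code: the function name and defaults are ours, we use a squared-exponential kernel of the form $h^2 \exp(-\tau (t - t')^2)$ as one way to realise the stated amplitude/time-scale parameterisation, and, since the empirical read-depth distribution is not available here, total reads are drawn from a Poisson as a stand-in.

```python
import numpy as np

def simulate_dataset(M=3, N=25, K=2, h2=1.0, tau=10.0, seed=0):
    """Simulate a longitudinal VAF dataset following steps a)-d)."""
    rng = np.random.default_rng(seed)
    # a) purities in [0.8, 0.9]; collection times spanning [0, 1]
    rho = rng.uniform(0.8, 0.9, size=M)
    t = np.sort(np.concatenate([[0.0], rng.uniform(0.0, 1.0, M - 2), [1.0]]))
    # b) cluster trajectories from a GP prior, squashed through a sigmoid
    cov = h2 * np.exp(-tau * (t[:, None] - t[None, :]) ** 2) + 1e-8 * np.eye(M)
    psi = rng.multivariate_normal(np.zeros(M), cov, size=K)      # K x M
    phi = 1.0 / (1.0 + np.exp(-psi))
    # c) cluster memberships (0-based here, 1..K in the text)
    z = rng.integers(0, K, size=N)
    # d) total and mutated read counts; Poisson depth is our stand-in
    #    for the empirical read-depth distribution used in the paper
    R = rng.poisson(100, size=(N, M))
    r = rng.binomial(R, 0.5 * rho[None, :] * phi[z, :])
    return t, rho, phi, z, R, r
```

The sigmoid keeps each $\phi_{jk}$ in (0, 1), so the binomial success probability $\frac{1}{2}\rho_j\phi_{jz_i}$ is always a valid probability.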

Supplementary results
In the main text, we used synthetic data to compare the performance of the baseline (i.e. Flat) model against a group of GP0 models, which (unlike the Flat model) explicitly take into account information on the temporal spacing of longitudinally collected samples (see Fig. 6 in main text). Here, we extend these simulations by considering two additional baseline models: PyClone [4] and Canopy [5]. PyClone is a well-known software package for clustering mutations with similar VAF values in one or more tumour samples. Canopy is also a popular software package for clonal and phylogenetic tree reconstruction, which provides facilities for clustering the mutational profiles of tumour samples. Both models were applied to exactly the same data as those presented in Fig. 6 in the main text.
Our results from this set of benchmarks are illustrated in Suppl. Fig. 4. When few samples are available (M = 3), the Flat and PyClone models perform similarly to the GP0 models. At a high number of samples (M = 12), the performance of the baseline models (Flat, PyClone, Canopy) falls behind the GP0 models, and this difference is more pronounced at low mutation numbers (N = 25 or 50) and high data complexity (i.e. a high number of clusters, K = 4 or 8). At intermediate sample numbers (M = 6), the baseline models perform worse than the GP0 ones when few mutations (N = 25) and many clusters (K = 8) are considered, but again this gap in performance closes with increasing mutation numbers. Canopy tends to perform worse than the alternatives, although all models demonstrate very high (i.e. above 85%) agreement with the ground truth in almost all examined scenarios. As stated in the main text, these results indicate that, in the presence of non-trivial cluster dynamics, the baseline models are comparable to the GP0 models only when the number of samples or the data complexity (here, the number of clusters) is low. As illustrated in Suppl. Figs. 5 and 6, the same conclusions hold when agreement with the ground truth is measured using the Adjusted Mutual Information (AMI) or the Fowlkes-Mallows Index (FMI).

Supplementary Discussion
In the section Model nomenclature in the main text, we give the total number of parameters that need to be estimated in each model. In comparison to Flat, models in the GP0 group (which are the ones performing best in our benchmarks) include only two additional parameters, while models in the GP2 group have L + 1 + L(L − 1)/2 more parameters (L is the number of clusters with non-zero weights), making them the most complex models we examined in this paper. The total number of data points considered for parameter estimation is N × M, where M is the number of samples and N is the number of mutations (thus, for patient CLL003 with 28 mutations, we have 140 data points in total). The smallest dataset we studied was from patient CLL006 (18 mutations; 5 samples; therefore, 90 data points; Fig. 2Bii and Fig. 3, top-right panel). Models GP0 perform quite well on this dataset, but models GP2 performed quite badly (both on this dataset and on the other datasets presented in the same figure) and, therefore, they were omitted from the figure. On larger datasets (e.g. Fig. 4D), most GP2 models perform quite well (i.e. better than the Flat baseline), but comparably to the simpler GP0 models. Overall, the GP0 group of models perform at least as well as (and usually better than) the baseline Flat model on the datasets we analysed. In the case of synthetic data, models GP0 perform comparably to the baseline in the case of 25 mutations and 3 samples (75 data points; Fig. 6, top-left panel and Suppl. Figs. 4 to 6, top-left panel), and their performance gap (compared to the baseline) widens with increasing number of samples and data complexity.
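As a quick arithmetic check of the parameter and data-point counts quoted above (helper names are ours; the 5-sample figure for CLL003 is inferred from the quoted 140 data points):

```python
def gp2_extra_params(L):
    """Extra parameters of a GP2 model relative to Flat: L + 1 + L*(L-1)/2,
    where L is the number of clusters with non-zero weights."""
    return L + 1 + L * (L - 1) // 2

def n_datapoints(n_mutations, n_samples):
    """Total data points used for parameter estimation: N x M."""
    return n_mutations * n_samples
```

For example, 28 mutations over 5 samples give 140 data points (CLL003), 18 mutations over 5 samples give 90 (CLL006), and a GP2 model with L = 4 active clusters carries 11 more parameters than Flat.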
Below, we give some guidance on when to use each model. We recommend using the Flat model in the case of single tumours or cross-sectional samples, and either one of the GP0 models or the Flat model in the case of longitudinal data. In order to decide which model to use in the latter case, the user is advised to run a small benchmark on their longitudinal dataset, comparing the Flat model against the four GP0 models using the ELBO as the performance metric. If Flat is the best performing model, then the user should use this model. If one of the GP0 models performs at least as well as the Flat model, then the user is advised to use the GP0 model instead. If more than one GP0 model performs at least as well as the Flat model, but these have similar performance to each other, then the user is advised to use the smoothest model, because smoother models estimate narrower 95% credible intervals (see Fig. 1 and Suppl. Figs. 1 to 3). Notice that the GP0 models are ordered as GP0-Exp < GP0-Mat32 < GP0-Mat52 < GP0-ExpQ in terms of increasing smoothness, where GP0-Exp is not smooth (i.e. non-differentiable), while GP0-ExpQ is perfectly smooth (i.e. infinitely differentiable).