COSINE: A Web Server for Clonal and Subclonal Structure Inference and Evolution in Cancer Genomics

Cancers evolve from a single mutated cell through sequential clonal and subclonal expansions driven by the acquisition of somatic mutations. Inferring clonal and subclonal structures from bulk or single-cell tumor genomic sequencing data is central to studies of cancer evolution. Clonal state and mutational order can provide detailed insight into tumor origin and its future development. In the past decade, a variety of methods have been developed for subclonal reconstruction using bulk tumor sequencing data. Because these methods were developed in different programming languages and use different input data formats, using and comparing them can be difficult. We therefore established a web server for clonal and subclonal structure inference and evolution of cancer genomic data (COSINE), which includes 12 popular subclonal reconstruction methods. We decomposed each method into a detailed workflow of single processing steps with a user-friendly interface. To the best of our knowledge, this is the first web server providing online subclonal inference, including the most popular subclonal reconstruction methods. COSINE is freely accessible at www.clab-cosine.net or http://bio.rj.run:48996/cun-web.


Introduction
Cancer genomes originate from mutation of a single cell, followed by sequential clonal and subclonal expansions as somatic mutations are acquired during pathogenesis, which is thought to be a Darwinian evolutionary process [1][2][3][4]. Through next-generation sequencing (NGS) of tumor tissue, this evolutionary process can be characterized by statistical modelling, and the clonal state, somatic mutation order, and evolutionary processes can be identified [4][5][6]. Subclonal inference from bulk tumor genome sequencing data is an important part of tumor evolution research, providing a new avenue for studying the relative order of mutations and the mutational processes in tumorigenesis. This evolutionary process can be inferred from NGS data under the assumption of a "most recent common ancestor (MRCA)", adopted from classical population genetics.
In the past decade, a variety of subclonal reconstruction methods have been developed for bulk or single-cell genomic data from single or multiple tumor samples taken over time and/or from multiple locations [6][7][8][9][10][11][12][13][14][15][16][17][18]. The subclonal reconstruction process typically includes three steps: first, computing the variant allelic fractions of somatic mutations together with the related copy number alterations and tumor purity; second, estimating the cancer cell fraction (CCF) in the tumor, optionally using structural variant information for better accuracy; third, clustering the CCFs to identify subclonal structures and construct the related phylogenetic trees. Through the process of clonal and subclonal expansion, the landscape of normal cells (i.e., common ancestors in population genetics) evolves into different cancer cells. Thus, the accuracy and resolution of each feature inferred by subclonal reconstruction are determined by the experimental design and the specific mutation characteristics of the tumor. Among the above methods, most employ non-parametric Bayesian approaches (e.g., a Dirichlet process with stick-breaking representation) for clustering [6,8,10,14,15], together with Markov chain Monte Carlo (MCMC) resampling schemes that incur high computational costs, especially at high mutation burdens. A more economical way of clustering is to use modified Bayesian mixture models, as in SciClone [9]. Combinatorial phylogeny is another popular approach used for clustering, and includes TrAp [7], CITUP [11], and CloneFinder [13], although CloneFinder only uses single nucleotide variant (SNV) information. Deconvolution of the cancer cell SNV density shows high computational efficiency for subclonal inferencing, as first applied in Sclust [12] and later in FastClone [17]. Both deconvolution methods can complete the subclonal inference process in less than 5 s using simulation data with more than 500 SNVs.
Regularized maximum-likelihood estimation methods can also be used for subclonal inferencing, e.g., CLiP [16].
As subclonal reconstruction methods have been developed using different programming languages and generally run only on the Linux platform, many users may find it difficult to operate and compare them. In this paper, we established a web server for clonal and subclonal structure inference and evolution in cancer genomics, which includes 12 popular subclonal reconstruction methods [6][7][8][9][10][11][12][13][14][15][16][17], e.g., DPclust [6], PyClone [8], PhyloWGS [10], and Sclust [12]. Each method is decomposed into detailed operational steps and implemented through a corresponding operational interface, which allows both professional and non-professional bioinformatic scientists to run and compare methods using their own data. Although some subclonal inference methods have been compared in this field [18], online tools for subclonal inferencing remain scarce. To narrow the gap between model and user, we established the first web server to provide online subclonal inferencing, with the inclusion of the 12 most popular subclonal inferencing methods.

Subclonal inference from bulk genomic data
Mutations acquired before the MRCA are carried by all tumor cells in a sample and define the clonal population [4,18]. Driver and passenger mutations continue to accumulate during tumor growth (Figure 1A). Tumor cells with driver mutations trigger a clonal expansion, which creates a subpopulation of cells. A clonal mutation is carried by all tumor cells, whereas a subclonal mutation is carried by only a fraction of the tumor cells; a subclone is identified through the mutations its cells share. Clonal and subclonal mutations are shown in Figure 1B and 1C, respectively. In this example, there are 16 cells, of which 10 are tumor cells and six are normal cells, so the tumor purity is 10/16 = 0.625. The observed variant allele frequency (VAF) is 10/32 = 0.3125 for the clonal mutation and 5/32 = 0.15625 for the subclonal mutation; the expected VAF is 10/32 = 0.3125 for the clonal mutation and 5/16 = 0.3125 for the subclonal mutation.
For the clonal mutation, the observed VAF was equal to or close to the expected VAF.
For the subclonal mutation, the observed VAF was significantly less than the expected VAF.
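The arithmetic of the 16-cell example can be checked with a short Python sketch. It assumes a diploid genome with heterozygous mutations (one mutated allele per carrying cell) and reads the number of cells carrying the subclonal mutation (five) off the 5/32 fraction in the text; the variable names are illustrative only.

```python
# Worked example for the 16-cell illustration.
# Assumptions: diploid genome, heterozygous mutations, five tumor cells
# carrying the subclonal mutation (inferred from the 5/32 fraction).
n_cells = 16
n_tumor = 10
n_subclone = 5

purity = n_tumor / n_cells
total_alleles = 2 * n_cells  # two allele copies per cell

observed_vaf_clonal = n_tumor / total_alleles
observed_vaf_subclonal = n_subclone / total_alleles

# VAF expected for a heterozygous mutation present in every tumor cell.
expected_vaf_if_clonal = purity / 2

print(purity, observed_vaf_clonal, observed_vaf_subclonal, expected_vaf_if_clonal)
```

Under these assumptions the clonal mutation matches the expected VAF, while the subclonal mutation sits at half of it.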
A general workflow for clonal and subclonal structure inference usually includes five steps: 1) calling somatic mutations from tumor-normal matched NGS data; 2) calling gene copy number from NGS data; 3) estimating CCFs; 4) inferring clonal and subclonal structure via clustering of CCFs; 5) constructing clonal and subclonal evolutionary trees. The subclonal inference workflow of COSINE is depicted in Figure 2. Many tools are available for mutation and copy number calling in cancer genomics, for example GATK [19] and VarScan2 [20]. Generally, when tumor tissue is sequenced, normal cells are mixed in with the tumor cells. Thus, normal tissue adjacent to the tumor is needed to obtain an accurate measure of the VAF of mutated sites.
Several studies have estimated tumor-cell purity and somatic mutations in tumor-only samples based on machine learning [21,22]; however, tumor-only methods require high sequencing coverage. Under low sequencing coverage (i.e., sequencing depth < 30x), matched tumor-normal genomic data are recommended for estimating tumor purity and somatic mutations. Once somatic mutation and copy number information has been obtained, CCFs can be estimated. Different subclonal reconstruction methods have their own CCF estimation strategies, which lead to slight differences in subclonal inference results [18]. Structural variation information helps to improve subclonal inferencing results, as shown in the recent work on SVclone [15]. A phylogenetic tree can then be constructed after clustering of CCFs. Some of the 12 methods in COSINE can construct phylogenetic trees, for example PhyloWGS [10], FastClone [17], PhylogicNDT [14], CITUP [11], CloneFinder [13], and TrAp [7]. Details on the functions of the 12 methods in COSINE are summarized in Table 1. Tarabichi et al. provided a practical guide to subclonal inferencing from bulk cancer genomic sequencing data [23].
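As an illustration of how a CCF can be derived from VAF, purity, and copy number, the following Python sketch uses a commonly used point estimate. It is not the strategy of any particular method in COSINE, each of which has its own; the function name, parameter names, and default values are assumptions made for this example.

```python
def cancer_cell_fraction(vaf, purity, tumor_cn=2, normal_cn=2, multiplicity=1):
    """Point estimate of the CCF of a single mutation.

    vaf          observed variant allele frequency
    purity       fraction of tumor cells in the sample (0 < purity <= 1)
    tumor_cn     total copy number at the locus in tumor cells
    normal_cn    total copy number at the locus in normal cells
    multiplicity number of mutated copies per carrying tumor cell
    """
    if not 0 < purity <= 1:
        raise ValueError("purity must be in (0, 1]")
    if multiplicity < 1:
        raise ValueError("multiplicity must be at least 1")
    # Average number of allele copies per cell across the mixed sample.
    mean_copies = purity * tumor_cn + (1 - purity) * normal_cn
    # Rescale the VAF by the share of copies that mutated tumor cells contribute.
    return vaf * mean_copies / (purity * multiplicity)


clonal_ccf = cancer_cell_fraction(vaf=0.3125, purity=0.625)
subclonal_ccf = cancer_cell_fraction(vaf=0.15625, purity=0.625)
print(clonal_ccf, subclonal_ccf)
```

With a purity of 0.625 in a diploid region, a VAF of 0.3125 corresponds to a CCF of 1 (clonal) and a VAF of 0.15625 to a CCF of 0.5 (subclonal). With real read counts, sampling noise can push the estimate above 1, which is one reason the methods model read counts rather than point estimates.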
The clustering of CCFs and identification of clonal and subclonal mutations are the most important steps in subclonal inferencing. Non-parametric Bayesian methods are frequently employed in subclonal inferencing [6,8,10,14,15]. Somatic mutations in tumor cell populations are derived from an unknown number of subclones with unknown CCF distributions, and these unknown parameters can be jointly estimated via a Bayesian Dirichlet process model [24]. Given a set of observed somatic mutations, with the total read depth at each base and the number of variant-allele reads for each mutation, then: $y_i \sim \mathrm{Bin}(N_i, \zeta_i \pi_i)$, $\pi_i \sim \mathrm{DP}(\alpha P_0)$, where $y_i$ is the number of variant reads of the $i$-th mutation with total reads $N_i$ and expected fraction of reads $\zeta_i$; $\pi_i$ is the fraction of tumor cells carrying the mutation; $\alpha > 0$ is a scaling parameter; and $\mathrm{DP}(\cdot)$ is a Dirichlet process with base distribution $P_0$. This model can be considered as a stick-breaking representation of the Dirichlet process and can be applied for clustering via MCMC resampling, running at least 20 000 iterations, which leads to a high computational cost, especially when the number of SNVs exceeds 4 000.
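To make the stick-breaking representation concrete, the sketch below draws truncated cluster weights from a Dirichlet process prior in Python. It illustrates only the prior over cluster weights, not the full MCMC inference used by the methods cited above; the function name, the truncation at a fixed number of sticks, and the chosen value of the scaling parameter are assumptions for illustration.

```python
import random


def stick_breaking_weights(alpha, n_sticks, rng):
    """Draw truncated stick-breaking weights for a Dirichlet process.

    Each step breaks off a Beta(1, alpha) fraction of the remaining stick.
    Returns the weights and the length of stick left after truncation.
    """
    weights = []
    remaining = 1.0
    for _ in range(n_sticks):
        fraction = rng.betavariate(1.0, alpha)
        weights.append(remaining * fraction)
        remaining *= 1.0 - fraction
    return weights, remaining


rng = random.Random(0)  # fixed seed so the draw is reproducible
weights, leftover = stick_breaking_weights(alpha=1.0, n_sticks=20, rng=rng)
print([round(w, 3) for w in weights[:5]], leftover)
```

A small scaling parameter concentrates the weight on a few clusters, while a larger one spreads it over many, which is how the prior expresses uncertainty about the number of subclones.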
To overcome the high computational cost of Bayesian Dirichlet process methods, more economical clustering approaches have been employed, such as variational Bayesian mixture models or combinatorial phylogenetic methods, as in SciClone, TrAp, CITUP, and CloneFinder.
The most efficient approach is direct deconvolution of the CCF density of all cancer cells, which shows high computational efficiency for subclonal inferencing and is used in Sclust and FastClone. Of these two ultra-fast subclonal inferencing methods, Sclust can take additional structural variation information into account and jointly estimate copy number alterations, tumor-cell purity, and subclonal structure during subclonal inferencing. In a DREAM challenge benchmark study evaluating subclonal reconstruction methods [18], a regularized maximum-likelihood estimation method, CLiP, for which only a Python package and a short abstract are available, also proved to be an economical way to perform subclonal inferencing. Regularized maximum-likelihood estimation could therefore be another economical approach to subclonal inferencing.

Subclonal inferencing is a key step in understanding cancer evolution
The clonal and subclonal states of somatic mutations provide key information for understanding intratumor heterogeneity, with implications for prognostic management, therapeutic strategy, and drug resistance [3,4]. With the advance of computational models and longitudinal cancer genome studies, the micro-evolutionary history of tumors has become more predictable, which should help improve our understanding of the role of the immune microenvironment in tumorigenesis. Thus, subclonal inferencing is a key step after mutation and gene copy number calling. An online web server for subclonal inferencing is needed to accelerate and enhance cancer studies.

Preparing input data for subclonal reconstruction
After raw genomic sequencing data are mapped to the reference genome, the reads need to be filtered and adjusted with the standard GATK pipeline [19]. Because the mapping and mutation calling steps are highly time and resource consuming, we prepared a practical guide for using the BWA-MEM package [25] to map clean raw reads to the reference genome and then correcting the BAM file following GATK4 best practices. The GATK4-corrected BAM file is used for calling mutations and copy number alterations with the VarScan2 package [20]. A detailed practical guide for mapping, mutation calling, copy number calling, and subclonal inferencing is provided in the Supplementary Files.
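The mapping and BAM correction steps can be outlined as a list of command lines. The Python sketch below only assembles the argument lists and does not run anything; it is not the guide from the Supplementary Files. The file names, sample name, read-group fields, and the specific sequence of GATK4 tools (MarkDuplicates, BaseRecalibrator, ApplyBQSR) are assumptions chosen to illustrate a typical best-practice run.

```python
def preprocessing_commands(reference, fastq1, fastq2, known_sites,
                           sample="tumor", threads=4):
    """Return (argv, stdout_file) pairs for a mapping and BAM correction run.

    stdout_file is the file that should receive the command's standard
    output, or None when the tool writes its own output file.
    """
    # Read group header for BWA; the backslash-t sequences are expanded by bwa.
    read_group = f"@RG\\tID:{sample}\\tSM:{sample}\\tPL:ILLUMINA"
    sam = f"{sample}.sam"
    sorted_bam = f"{sample}.sorted.bam"
    dedup_bam = f"{sample}.dedup.bam"
    recal_table = f"{sample}.recal.table"
    final_bam = f"{sample}.recal.bam"
    return [
        # 1) Map paired-end reads with BWA-MEM (SAM is written to stdout).
        (["bwa", "mem", "-t", str(threads), "-R", read_group,
          reference, fastq1, fastq2], sam),
        # 2) Coordinate-sort and convert to BAM.
        (["samtools", "sort", "-o", sorted_bam, sam], None),
        # 3) Mark duplicate reads.
        (["gatk", "MarkDuplicates", "-I", sorted_bam, "-O", dedup_bam,
          "-M", f"{sample}.dup_metrics.txt"], None),
        # 4) Build the base quality recalibration table.
        (["gatk", "BaseRecalibrator", "-I", dedup_bam, "-R", reference,
          "--known-sites", known_sites, "-O", recal_table], None),
        # 5) Apply the recalibration to produce the corrected BAM.
        (["gatk", "ApplyBQSR", "-I", dedup_bam, "-R", reference,
          "--bqsr-recal-file", recal_table, "-O", final_bam], None),
    ]


commands = preprocessing_commands("ref.fa", "r1.fq.gz", "r2.fq.gz", "known.vcf.gz")
for argv, stdout_file in commands:
    suffix = f" > {stdout_file}" if stdout_file else ""
    print(" ".join(argv) + suffix)
```

Each pair could be passed to `subprocess.run` with the second element opened as the standard output file; keeping the commands as data makes it easy to review them before starting a long-running job.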

Design and implementation of the COSINE web server
To facilitate the use of our previously developed Sclust approach and the 11 other methods, we developed an online subclonal inference web server called COSINE (freely available at www.clab-cosine.net). Among these 12 methods, seven use only one programming language (Sclust developed in C++; PyClone, FastClone, and CloneFinder developed in Python; DPclust and SciClone developed in R; TrAp developed in Java) and five use more than two programming languages. These methods all run on Linux systems, which can hinder non-professional users from achieving subclonal inferencing quickly. In the COSINE web server, we deployed the 12 subclonal inference methods on a high-performance computing cluster. Users can run a subclonal inference method with one to five commands entered directly in the method's frame box in the web interface, and then download the results once the run has completed. Due to security settings, users are required to register and log in to perform online subclonal inferencing when running large tasks.
A COSINE workflow is illustrated in Figure 3. We decomposed each subclonal reconstruction method into detailed running steps and implemented a corresponding running interface. Figure 3A uses pre-processed raw data to call SNVs and copy number (structural variation is needed for some methods), as described in the Supplementary Files. Figure 3C shows the COSINE interface, and the functions of the 12 methods are summarized in Table 1. As the input files for each subclonal reconstruction method differ from each other, we first divided each method into different running steps, and then created Python scripts to convert the somatic mutation VCF and copy number alteration files to the format of each method (see Supplementary Files).
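A format conversion of this kind can be sketched in a few lines of Python. The example below is not one of the COSINE conversion scripts; it assumes a VCF whose FORMAT column contains an AD field holding reference and alternate read depths as "ref,alt", that the tumor sample is the last column, and that the target method expects a tab-separated table of read counts. The function names and output column names are hypothetical.

```python
def vcf_line_to_counts(line, sample_index=-1):
    """Extract read counts for one VCF data line.

    Assumes the FORMAT column includes AD as "ref,alt" and that the tumor
    sample is the column selected by sample_index.
    """
    fields = line.rstrip("\n").split("\t")
    chrom, pos, _id, ref, alt = fields[:5]
    format_keys = fields[8].split(":")
    sample_values = fields[sample_index].split(":")
    sample = dict(zip(format_keys, sample_values))
    ref_count, alt_count = (int(x) for x in sample["AD"].split(",")[:2])
    return {
        "mutation_id": f"{chrom}:{pos}:{ref}>{alt}",
        "ref_counts": ref_count,
        "var_counts": alt_count,
    }


def records_to_tsv(records, columns=("mutation_id", "ref_counts", "var_counts")):
    """Render records as a tab-separated table with a header row."""
    lines = ["\t".join(columns)]
    for record in records:
        lines.append("\t".join(str(record[c]) for c in columns))
    return "\n".join(lines) + "\n"


# Hypothetical VCF line: normal sample first, tumor sample last.
example_line = "chr1\t12345\t.\tA\tT\t.\tPASS\t.\tGT:AD:DP\t0/0:30,0:30\t0/1:22,10:32"
record = vcf_line_to_counts(example_line)
table = records_to_tsv([record])
print(table, end="")
```

Callers that report reference and alternate depths in separate FORMAT fields would need a different lookup, and most methods also require copy number columns merged in from the copy number calls.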

COSINE web server usage
In COSINE, we developed a user-friendly online computational platform for subclonal structure inferencing, as shown in Figure 3C. Users can perform subclonal inferencing in COSINE as follows: 1) Visit the COSINE website and select (click) the desired method (Figure 3C-1); 2) Create a new task on the method page (Figure 3C-2); 3) Upload the data and run the program (Figure 3C-2); 4) Download the results upon completion of the task. We also created a dedicated page where users can post issues encountered when using the methods.

Future developments
We developed COSINE, an online computational platform for subclonal structure inferencing of the cancer genome, which integrates 12 popular subclonal inferencing methods behind an easily accessible, user-friendly interface. Many subclonal inference models have been proposed in recent years, which has created difficulties for biomedical researchers regarding method choice, installation, and program running. COSINE bridges the gap between model developers and end users, making subclonal inferencing easier and more convenient. In the future, we will add functions and methods for online plotting and adjustment of clonal evolutionary trees, and also include subclonal reconstruction methods for single-cell genomic sequencing data.

[Figure: subclonal inference workflow] 1) Somatic mutation calling in genomic data; 2) Gene copy number calling in genomic data (structural variation information, optional); 3) Calculating cancer cell fraction (CCF); 4) Clustering the CCF distribution to find clonal and subclonal structure in the cancer genome