Regional and temporal heterogeneity of epithelial ovarian cancer tumor biopsies: implications for therapeutic strategies

Stage III/IV epithelial ovarian cancer (EOC) is a systemic disease. The clonal relationship among different tumor lesions at diagnosis (spatial heterogeneity) and how tumor clonal architecture evolves over time (temporal heterogeneity) have not yet been defined. Such knowledge would help to develop new target-based strategies, as biomarkers which can adjudge the success of therapeutic intervention should be independent of spatial and temporal heterogeneity. The work described in this paper addresses spatial and temporal heterogeneity in a cohort of 71 tumor biopsies using targeted NGS technology. These samples were taken from twelve high grade serous (HGS) and seven non HSG-EOC, both at the time of primary surgery when the tumor was naïve to chemotherapy and after chemotherapy. Matched tumor lesions growing in the ovary or at other anatomical sites show very different mutational landscapes with branched tumor evolution. Mutations in ATM, ATR, TGFB3, VCAM1 and COL3A1 genes were shared across all lesions. BRCA1 and BRCA2 genes were frequently mutated in synchronous lesions of non HGS-EOC. Relapsed disease seems to originate from resistant clones originally present at the time of primary surgery rather than from resistance acquired de novo during platinum based therapy. Overall the work suggests that EOC continues to evolve. More detailed mapping of genetic lesions is necessary to improve therapeutic strategies.


Regional heterogeneity at the pathway level
We next shifted our attention from genes relevant to therapeutic strategies towards the pathways in which mutated genes are involved. Data reported in in Figure 3 were collapsed into the 19 pathways previously described (Supplementary Method 3.1.2.7). Supplementary Figure 2 reports in a false color scale the prevalence of mutated pathways according to clusters I and II, III and IV previously obtained ( Figure 3A and Figure 3B, respectively). In both HGS and non HGS group, there are differences between tumor lesions growing in the ovary or in other anatomical sites. Synchronous lesions were mainly characterized by mutations in homologous recombination, mismatch repair and miRNA biogenesis pathways. Differently, tumor biopsies withdrawn in the ovary are characterized by defects in signal transduction, drug resistance and EMT-TF pathways. Student's t-test analysis confirmed that these pathways were significantly different between synchronous lesions and ovaries, in both HGS and non HGS group (FDR < 0.01). On the other hand, DNA damaging signaling, TGF pathway, ECM interaction and transcription regulation were found ubiquitously mutated.

Allelic fraction heatmap
The allelic fractions (AFs) of the 736 identified somatic variants are shown in a false color scale heatmap. For each patient, tumor samples taken from different anatomical sites at primary surgery are grouped together accordingly to their histological features (histotype and grade). CCNE1 locus gene amplification and methylation status of BRCA1 promoter have been reported in literature [1] as poor prognostic markers in HGS-EOC. In our cohort of patients, with the exception of patient 10152, regional heterogeneity does not affect the locus gene amplification for CCNE1 or the promoter methylation status of BRCA1 gene.

Mutational load
For each gene, the total number of somatic mutations identified in ovary and in their matched synchronous lesions were counted. Data reported in Supplementary  Figure 4 shows that synchronous lesions are characterized by increased number of somatic mutations compared to their matched primary tumor site in the ovary. The ATM, ATR and FN1 genes resulted more frequently mutated.

Phylogenetic trees
In order to decipher temporal and spatial heterogeneity, a phylogenetic tree was depicted for each patient based on the allelic fraction of identified somatic mutations, as described in Supplementary Methods 3.1.2.4.

Analysis of mutational signatures related to chemotherapy exposure
To asses whether at relapse somatic mutations were acquired de novo or outgrowth by sub-clonal selections, we focused our attention on the total number of C>T, or G>A transition counted at primary surgery or at relapse. It has been reported that Pt-treatment is able to induce single base substitutions [2]. In Supplementary Figure 6 we compared the number of base changes in biopsies taken at first surgery (both ovary and synchronous lesions) with those taken at second/third surgeries (metachronous diseases) after Pt-based therapies. Results show no differences between the two groups suggesting that the high level of private mutations counted in the synchronous    Table 4 Supplementary

Genomic DNA extraction and purification
Genomic DNA (gDNA) was extracted and automatically purified with an automatic extraction system (Maxwell ® Rapid Sample Concentrator, Promega, Milan Italy). The concentration and the quality of gDNA was evaluated just before library preparation using Quant-iT ™ PicoGreen ® fluorescence assay (Thermo Fisher Scientific, Waltham, Massachusetts, USA).

Target re-sequencing libraries
TrueSeq Custom amplicons panel included 65 genes, arbitrary selected from the medical literature, on the basis of the presence of described mutations with an established role in response to targeted therapies and/ or in current treatment paradigms. The complete list of genes with their chromosome coordinates, accordingly to USCS hg19 are as previously reported [4]. Design was performed by Design Studio algorithm following manufacturing instructions. In Supplementary Table 3, genes are organized into functional pathways based on Ingenuity Variant call Analysis software (Qiagen, Milan Italy). Color palette is used in downstream analysis to make visual analysis easier. Briefly, selected panel covers a total of 341748 bp, with a total of 2549 amplicons. 250 bp length amplicons were designed on the entire coding region, exon-intron boundaries (+/−10 bp); in addition, 2 Kb of the flanking sequence in the 5' UTR was added to the target region of selected genes [4]. Adaptor-ligated libraries were prepared from 250 ng of gDNA following protocol instructions (Illumina).

High level overview
Raw de-multiplexed reads from the MiSeq sequencer were aligned to the reference human genome (UCSC build hg19, the latest available when the targeted regions were designed) using the Burrows-Wheeler Aligner (BWA, version 0.7.12-r1039; [5]), running in paired-end mode. To ensure a good call quality and to reduce the number of false positives, samples underwent Base Quality Score Recalibration (BQSR), using the Genome Analysis Toolkit (GATK, version 3.3; [6][7][8]). After BQSR, sequences around regions with insertions and deletions (indels) were realigned locally using the Smith-Waterman algorithm. In this step, synchronous metastasis samples were recalibrated along with a blood counterpart acting as control,. Putative somatic variant calls were detected with two separate programs, MuTect (version 1.1.5; [9]) and VarScan 2 (version 2.3.6; [10]), pairing each sample with its matched blood.
Variant calls were annotated with biological information using the Variant Effect Predictor (VEP; version 81 with genome version hg19; [11]). Mutations were annotated with the 1000 Genomes project, dbSNP (version 138) and Catalogue of Somatic Mutations in Cancer (COSMIC), version 68. At the end of the analysis, variant calls from each sample pair were merged together using GATK, and subsequently filtered to exclude nonsomatic calls, with an allelic fraction less than 1%, or with a read depth less than 200 fold (200×). A complete data set including all primary and synchronous lesions was built by merging the processed data with our previously published somatic variant calls [4].
The mutation data set was annotated with dbNSFP (version 3.2; [12]) and loaded into a database using the GEMINI software (version 0.17.0, [13]).

Analysis configuration
Synchronous metastases were analyzed using blood as normal control. A detailed sample sheet with individual sample configuration is available on GitHub. The analysis used the same gene panel as previously described [4]. In this case, FASTQ files from the two different panels were concatenated and all the regions from the two panels were assembled into a single BED file. Afterward, all samples were analyzed using the merged regions.
Marking of PCR duplicates was disabled. As additional configuration, we enabled clinical reporting for VEP in bcbio-nextgen, in order to provide amino acid and nucleotide changes in a format more easily understandable by clinicians.
The creation of per-sample GEMINI databases was disabled, as a more complete database was created after the analysis.

Analysis set-up and execution
FASTQ files and sample sheets were loaded on the two clusters part of the Cloud4CaRE project, a project between the Mario Negri Institute, the ACTO foundation and the computing departments of two major Italian banks to provide computational resources for NGS analysis. Cloud4CaRE is made up of two distinct computing platforms, a 214-CPU cluster (8 CPUs per node, 48 Gb RAM per node) or a 792-CPU cluster (24 CPUs per node, 48 Gb RAM per node), both running the Simple Linux Utility for Resource Management (SLURM) software for task scheduling.
Pre-computed genome indexes for BWA, along with genome sequences and associated files, were retrieved from the CloudBioLinux repository (http:// cloudbiolinux.org) through bcbio-nextgen's upgrade command. The analysis pipeline was then run using bcbio-nextgen (version 0.9.2) on the two clusters (one for primary samples, and one for synchronous diseases).
Alignment coverage was measured with Chanjo (version 2.1.1) and BEDtools (version 2.24.0), wrapped in the calculate_coverages program.

Variant processing and annotation
VCF files produced by the variant callers were pre-filtered during the quality control steps, flagging non somatic variants with the REJECT flag. Per-sample allelic fraction measures (FA for MuTect and FREQ for VarScan 2) were converted into a standardized FREQ format field to increase compatibility with downstream tools.
After the completion of the pipeline, the final set of variant calls underwent further processing: calls from different samples were joined together using the GATK's CombineVariants walker, and afterwards all variants were joined together in a single file. Multiallelic loci were decomposed into single alleles with the vt program (http://genome.sph.umich.edu/wiki/ Vt) and subsequently annotated with the Variant Effect Predictor.
Lastly, variants from synchronous disease were combined with the data of matching, previously published primary tumors [4], using the CombineVariants walker from the GATK, giving locus priority, in case of identical loci, to the primary tumor data. Multi-allelic loci were then again decomposed with vt.
The final set of annotated variants was loaded into GEMINI for downstream analysis.

Phylogenetic tree reconstruction
Primary tumors and metachronous metastases from our previous work were re-analyzed with MuTect and VarScan as described in previous sections. Additionally, one sample from an unrelated ovarian tumor cell line (OVCAR-8) was analyzed using a pool of healthy controls as outgroup. Afterwards, fraction information was extracted from the VCF file and used to build a matrix where samples were represented in rows, and individual loci in columns. Missing data (NA) was given a fraction of 0.0. The data from the OVCAR-8 sample were also added to ensure a common point of origin for all samples. The matrix was generated with the PyVCF library (https:// pyvcf.readthedocs.org/) coupled with the pandas data analysis package (version 0.17.1; [14]). Example code is available in the GitHub repository.
This matrix was then used to build distance measurements with the dist.gene function of the ape R package [15]. Individual sample groups were joined with a neighbor-joining tree estimation algorithm [16] as supplied by the nj function in the ape package. Trees were drawn with the plot.phylo function ("fan" tree type).

Data availability
The aligned sequences are available at the EBI European Nucleotide Archive (http://www.ebi.ac.uk/ ena/data/view/; ID pending) and the configuration files, and all the associated programs (written in Python 2 and Python 3) are freely available on GitHub (https:// github.com/lbeltrame/mnegri-ov198) under the GNU GPL or the BSD license. The complete joined mutation list is available as a Variant Call Format (VCF) file at https://github.com/lbeltrame/mnegri-ov198/tree/ master/data.

Clustering
Hierarchical clustering was performed with the fastcluster package [16] coupled with the seaborn visualization library (https://stanford.edu/~mwaskom/ software/seaborn/) to represent data as a heatmap. The chosen cluster metric was the Pearson's correlation coefficient, using complete linkage.

Pathway-based allelic fraction and statistical tests
From information on allelic fraction for each gene per sample, we calculated a normalized allelic fraction value (AF norm ) for each pathway per sample Given N as number of genes in the pathway and n as the number of mutated (with at least one non-synonymous mutation) genes inthe pathway, and AF mean as the mean allelic fraction for all the genes in the pathway, AF norm is expressed as: These information were used to compare the main groups from the clustering analysis (see previous section), we evaluated AF norm for each sample separately in both cluster groups we found, and significance of the difference was assessed by a t-test. p-values from the test were then corrected for multiple testing using the False Discovery Rate (FDR) method (Benjamini and Hochberg, 1995). PCR conditions were according to the Qiagen specifications with an annealing temperature of 61°C (experimentally determinate by a gradient PCR). After PCR, 5 μl of product was run on a 1% agarose gel to ascertain success of the reaction and successful reactions were taken forward to pyrosequencing. Pyrosequencing of methylated sites was performed using the PyroMark Q24 (QIAGEN) according to the manufacturer's protocol. Pyrosequencing runs were subjected to quality control using the Epitect ® Plus DNA Bisulfite Kit QIAGEN to draw a standard curve calibration where interpolate data.