Data on somatic mutations obtained by whole exome sequencing of FFPE tissue samples from Russian patients with prostate cancer

Prostate cancer (PCa) is the most frequently diagnosed malignant disease among men, yet it remains poorly characterized at the molecular level. Advanced PCa is not curable, and current treatment methods can only increase life expectancy by several months. Identification of the genetic aberrations in tumor cells provides clues to the mechanisms of PCa pathogenesis and a basis for developing new therapeutic approaches. Here we present data on somatic mutations, namely single nucleotide variations (SNVs) and small insertions and deletions, detected in prostate tumor tissue obtained from Russian patients with PCa. Moreover, we provide a raw dataset from whole exome and targeted DNA sequencing of tumor and non-tumor prostate tissue obtained from Russian patients with PCa and benign prostatic hyperplasia (BPH). This data is available at the NCBI Sequence Read Archive under Accession No. PRJNA506922.



Data
Matched tumor and non-tumor FFPE prostate tissue samples were obtained from 26 patients with PCa and 8 patients with BPH via radical prostatectomy or transurethral resection of the prostate (TURP), respectively. DNA extracted from these samples was used to construct 61 whole exome and 25 targeted DNA libraries, which were sequenced on the Ion Proton platform. The corresponding raw sequencing data (reads in FASTQ format) was deposited in the NCBI SRA database under project accession No. PRJNA506922.
The whole exome sequencing data for samples from PCa patients (50 matched samples from 25 patients) was analyzed to detect somatic mutations in prostate cancer tissue. Reads were mapped to the GRCh37 assembly of the human genome. Paired variant calling performed on matched samples made it possible to filter out germline variants and detect somatic variants in tumor tissue. The information on the identified somatic alterations is presented in VCF format in Supplementary File 1. A total of 1696 somatic mutations were detected across all 25 tumor samples, including 1686 (99.4%) SNVs, 8 (0.472%) insertions and 2 (0.118%) deletions. A summary of the detected somatic variants is shown in Table 1.
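A breakdown of a VCF file into SNVs, insertions and deletions, like the one reported above, can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, the classification compares REF and ALT lengths only, and the toy records below are invented rather than taken from Supplementary File 1.

```python
from collections import Counter


def classify_variant(ref: str, alt: str) -> str:
    """Classify one REF/ALT pair by comparing allele lengths."""
    if len(ref) == 1 and len(alt) == 1:
        return "SNV"
    if len(alt) > len(ref):
        return "insertion"
    if len(alt) < len(ref):
        return "deletion"
    return "other"  # e.g. multi-nucleotide substitutions of equal length


def summarize_vcf_lines(lines):
    """Count variant classes over the data lines of a VCF file."""
    counts = Counter()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        ref, alts = fields[3], fields[4]  # REF and ALT columns
        for alt in alts.split(","):  # ALT may list several alleles
            counts[classify_variant(ref, alt)] += 1
    return counts


def percentages(counts):
    """Convert class counts to percentages of the total, rounded to 3 places."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kind: round(100.0 * n / total, 3) for kind, n in counts.items()}


if __name__ == "__main__":
    # Invented toy records, for illustration only.
    toy = [
        "##fileformat=VCFv4.2",
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
        "1\t100\t.\tA\tG\t50\tPASS\t.",
        "1\t200\t.\tA\tAT\t50\tPASS\t.",
        "2\t300\t.\tCTG\tC\t50\tPASS\t.",
    ]
    summary = summarize_vcf_lines(toy)
    print(dict(summary), percentages(summary))
```

To apply the same functions to a real file, pass an open file handle (for example `open(path)`) instead of the toy list, since the function only iterates over lines.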
Moreover, variant annotation was performed using Variant Effect Predictor (VEP), which identifies the genes and transcripts affected by genetic alterations and predicts their consequences for protein sequences (Fig. 1). In addition, the SIFT and PolyPhen algorithms were used to predict the effect of the amino acid substitution caused by a variant on the structure and function of the protein (Fig. 2). The VEP annotation of each variant is included in Supplementary File 1. The data used to draw the bar charts is presented in Supplementary Table 1.

Value of the data
Detection of somatic mutations by whole exome sequencing is a widely recognized method for identifying genetic abnormalities in tumors across various types of cancer [1–3]. The data on somatic mutations presented here can serve as a basis for studying the pathogenesis of the disease and searching for new therapeutic targets.
The targeted DNA sequencing dataset also presented here could be valuable for reliable validation of the identified somatic mutations, owing to its much higher coverage compared to whole exome sequencing.
Data on samples from patients with BPH may be used as a control group for validating the detected genetic variants and identifying mutations specific to malignant prostate tissue.
The same tissue samples were previously subjected to transcriptome profiling by RNA sequencing [4]. Moreover, urine and plasma from these patients were also used for total RNA and targeted DNA sequencing [5]. Thus, this dataset can be valuable for an integrated analysis of DNA and RNA sequencing data obtained from multiple tissues of PCa and BPH patients.
The dataset can be readily incorporated into studies involving other sample cohorts and any computational algorithms of choice, since the data is available in raw format and the metadata includes comprehensive clinical patient information (serum PSA level, Gleason grade, TNM clinical and pathological stage, extraprostatic extension, seminal vesicle and perineural invasion, surgical margin status).

Sample collection and DNA extraction
All patients signed an informed consent form. Tissue samples were obtained from 26 patients with PCa and 8 patients with BPH at City Clinical Hospital No. 50 via radical prostatectomy or TURP, respectively. None of the patients had received specific therapy prior to sample collection. Clinical patient data, including serum PSA level, Gleason grade, TNM clinical and pathological stage, extraprostatic extension, seminal vesicle and perineural invasion, and surgical margin status, is provided in Supplementary Table 2. The postoperative material was fixed in formalin and embedded in paraffin. Thin sections of the FFPE tissue samples were examined by a pathologist, who marked the areas of tumor and adjacent non-tumor tissue. DNA was extracted from these marked regions using AllPrep DNA/RNA FFPE and GeneRead DNA FFPE kits (Qiagen). Table 2 provides information about the samples, the DNA extraction kits used and the corresponding libraries. For each patient, a maximum of two DNA samples were obtained: one from tumor tissue and one from adjacent non-tumor tissue. An exome library, a targeted panel library, or both were constructed from each DNA sample. Every library name corresponds to a single library and to a single FASTQ record in the NCBI SRA database.

Whole exome library preparation
Amplification of exonic regions was performed using the Ion AmpliSeq Exome RDY Kit (Thermo Fisher Scientific). Given the quality of FFPE-derived DNA, the number of cycles in this amplification step was raised to 13–15 instead of the 10 recommended by the manufacturer. Further steps of library preparation were carried out in accordance with the manufacturer's instructions.

Targeted DNA library preparation
The GeneRead DNAseq Targeted Human Prostate Cancer Panel (Qiagen) was used for targeted enrichment of the extracted DNA. As with the exome libraries, the amplification procedure was modified to account for the quality of DNA extracted from FFPE tissue samples: the number of PCR cycles was raised to 20–22 instead of the 18 recommended for standard DNA samples. Subsequent library construction was performed using the GeneRead Library Prep workflow (Qiagen) following the manufacturer's recommendations.

High-throughput sequencing
The quality of the constructed libraries was assessed on a 2100 Bioanalyzer (Agilent Genomics) using the Agilent High Sensitivity DNA Kit (Agilent Genomics). High-throughput sequencing was performed on the Ion Proton platform using the ION PI HI-Q Sequencing 200 Kit and the Ion PI Chip Kit v2 (Thermo Fisher Scientific) following the recommendations of the manufacturer. Base calling was performed with Torrent Suite 5.0, fastqCreator v3.4.56313.

Detection and annotation of somatic variants
Reads were mapped to the human genome (GRCh37 assembly) with the bwa mem tool from the BWA package using the following non-default parameters: -c 250 -M [6]. Paired somatic calling was performed using 4 variant callers: MuTect (v. 1 [9] and VarScan (v. 2.4.1) [10], which were run via the bcbio-nextgen (v. 0.9.7) somatic variant calling pipeline [11] with a minimal allele fraction of 0.1. The following additional filters were then applied to the call set of each caller: 1) DP > 10; 2) QUAL > 20; 3) AF in the normal sample < 0.005, or AF in the normal sample at least three times lower than AF in the tumor sample.
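The three additional filters can be expressed as a single predicate. The following Python sketch is an illustration only: the function and argument names are hypothetical, and it assumes that the depth, quality and allele fractions have already been extracted from each caller's VCF output (the exact field names differ between callers and are not specified here).

```python
def passes_additional_filters(dp: int, qual: float,
                              af_tumor: float, af_normal: float) -> bool:
    """Return True if a call satisfies all three additional filters.

    dp        -- read depth at the variant site (assumed pre-extracted)
    qual      -- variant quality score
    af_tumor  -- variant allele fraction in the tumor sample
    af_normal -- variant allele fraction in the matched normal sample
    """
    depth_ok = dp > 10
    qual_ok = qual > 20
    # Either the variant is essentially absent from the normal sample,
    # or its allele fraction there is at least three times lower than in tumor.
    normal_ok = af_normal < 0.005 or af_normal * 3 <= af_tumor
    return depth_ok and qual_ok and normal_ok


if __name__ == "__main__":
    # Invented example values, for illustration only.
    print(passes_additional_filters(dp=35, qual=60.0, af_tumor=0.25, af_normal=0.0))
    print(passes_additional_filters(dp=8, qual=60.0, af_tumor=0.25, af_normal=0.0))
```

Writing the third condition as a disjunction keeps calls with a small non-zero allele fraction in the normal sample, provided the tumor fraction is sufficiently higher.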
A mutation was included in the final somatic call set only if at least two callers identified it as somatic. The resulting lists of somatic variants were filtered against the target regions of the AmpliSeq Exome Kit provided by the manufacturer, and off-target variants were excluded. The final sets for each individual were combined into a single multi-sample VCF file (see Supplementary File 1). Variant annotation, including SIFT and PolyPhen functional effect predictions, was performed with VEP software [12] using data from ENSEMBL release 91.
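The two-caller consensus rule and the restriction to target regions can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used for the dataset: the function names and caller labels are hypothetical, variants are keyed as (chromosome, position, REF, ALT) tuples, and target regions are assumed to come from a BED file with 0-based, half-open intervals.

```python
from collections import Counter


def consensus_calls(call_sets, min_callers=2):
    """Keep variants reported as somatic by at least `min_callers` callers.

    call_sets maps a caller label to an iterable of variant keys,
    each key being a (chrom, pos, ref, alt) tuple.
    """
    support = Counter()
    for variants in call_sets.values():
        support.update(set(variants))  # each caller votes once per variant
    return {variant for variant, n in support.items() if n >= min_callers}


def in_targets(chrom, pos, targets):
    """Check whether a 1-based VCF position lies inside any target interval.

    targets maps chrom -> list of (start, end) BED intervals
    (0-based start, exclusive end), so position p is covered
    when start < p <= end.
    """
    return any(start < pos <= end for start, end in targets.get(chrom, []))


def final_call_set(call_sets, targets, min_callers=2):
    """Apply the consensus rule, then drop off-target variants."""
    return {
        v for v in consensus_calls(call_sets, min_callers)
        if in_targets(v[0], v[1], targets)
    }


if __name__ == "__main__":
    # Invented calls and regions, for illustration only.
    calls = {
        "caller_a": [("1", 150, "A", "G"), ("1", 900, "C", "T")],
        "caller_b": [("1", 150, "A", "G")],
        "caller_c": [("1", 900, "C", "T"), ("2", 50, "G", "A")],
    }
    regions = {"1": [(100, 200)]}
    print(sorted(final_call_set(calls, regions)))
```

The linear scan over intervals is adequate for a sketch; for a full exome target file, an interval tree or a sorted list with binary search would be a more efficient choice.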

Acknowledgments
This work was supported by the Ministry of Education and Science of the Russian Federation (grant no. 14.607.21.0068, Sep 23, 2014, unique ID RFMEFI60714X0068) and by the Federal Medical Biological Agency, grant code "Panel".