Identification of hallmarks of lung adenocarcinoma prognosis using whole genome sequencing.

In conjunction with clinical characteristics, prognostic biomarkers are essential for choosing optimal therapies to lower the mortality of lung adenocarcinoma. Whole genome sequencing (WGS) of 7 cancerous-noncancerous tissue pairs was performed to explore the comparative copy number variations (CNVs) associated with lung adenocarcinoma. The frequencies of top ranked CNVs were verified in an independent set of 114 patients and then the roles of target CNVs in disease prognosis were assessed in 313 patients. The WGS yielded 2604 CNVs. After frequency validation and biological function screening of top 10 CNVs, 9 mutant driver genes from 7 CNVs were further analyzed for an association with survival. Compared with the PBXIP1 amplified copy number, unamplified carriers had a 0.62-fold (95%CI = 0.43-0.91) decreased risk of death. Compared with an amplified TERT, those with an unamplified TERT had a 35% reduction (95% CI = 3%-56%) in risk of lung adenocarcinoma progression. Cases with both unamplified PBXIP1 and TERT had a median 34.32-month extension of overall survival and 34.55-month delay in disease progression when compared with both amplified CNVs. This study demonstrates that CNVs of TERT and PBXIP1 have the potential to translate into the clinic and be used to improve outcomes for patients with this fatal disease.


INTRODUCTION
Lung adenocarcinoma contributes to over 500,000 deaths annually and an average 5-year survival rate of less than 15% despite great advances in cancer therapy [1]. Identifying and characterizing prognostic determinants are essential for aiding in developing better therapeutic strategies to lower its mortality [2]. While clinical and pathological characteristics are considered major determinants in the variability in outcomes of this disease, genetic factors may also contribute [3].
In past decades, molecular biomarkers, especially single nucleotide polymorphisms (SNPs), assessing an individual's genetic predisposition for diseases have shown potential for guiding clinical treatment of lung cancer [4][5][6]. Recently, increasing evidence suggests that copy number variations (CNVs) may also account for a large proportion of the heritability of lung cancer [7][8][9].
A CNV is a large structural genetic aberration that consists of duplications or deletions covering more than 1 kb and may result in phenotypic variation through the alteration of biological function or gene expression [10,11]. Various methods have been developed to detect CNVs at genome-wide and locus-specific levels, including hybridization, polymerase chain reaction (PCR) amplification and fluorescence resonance energy transfer technology (FRET) [12,13]. Recently, sequencing, especially next generation sequencing technology, has provided us a better tool with which to completely characterize genomic CNVs. Besides, the technology assists in overcoming the hurdle of the unfixed design and imprecise boundaries of these CNVs [14]. Sequencing has presented advance in detection of biomarkers to progression of various cancers. By whole genome sequencing (WGS) or target sequencing, some studies have demonstrated the prognostic prediction role www.impactjournals.com/oncotarget of high copy number alteration burden in the prostate cancer relapse [15], FGFR1 and PIK3CA amplifications in oral cavity squamous cell carcinoma [16], and MYC amplification in pancreatic adenosquamous carcinoma and lung adenocarcinomas [17,18].
Here, we performed comprehensive WGS to detect CNVs associated with carcinogenesis of lung adenocarcinoma. After the assessment of CNV frequency and screening for potential biological function, the most notable CNVs were detected to find the hallmarks of lung adenocarcinoma prognosis in Chinese patients through survival analysis.

Participant characteristics
Seven male lung adenocarcinoma patients who were smokers and had no family history of cancer were recruited to participate in WGS. At the time of diagnosis, their age ranged from 47 to 64 years with a mean age of 56 years. Four of the 7 had stage II lung adenocarcinoma, while the rest had stage III. Samples from 114 patients with lung adenocarcinoma that had been diagnosed by histology were used to validate the frequencies of top 10 CNVs by quantitative polymerase chain reaction (qPCR). Among these patients, 69 (60.52%) were male, 59 (51.75%) were smokers, 35 (30.70%) were drinkers and 17 (14.91%) had a family history of cancer. Three hundred thirteen patients with lung adenocarcinoma were recruited to determine if there was an association between target CNVs previously identified and disease prognosis. Among the 313 patients, 180 (57.51%) were male, 158 (50.48%) were smokers, 93 (29.71%) were drinkers, 39 (12.46%) had a family history of cancer, 113 (36.10%) died and 153 (48.88%) presented disease progression during the follow-up period ( Table 1). 260 of 313 lung adenocarcinoma patients (83.07%) completed the last follow-up assessment or died during the follow up, with a median survival time of 46.36 months. There were no difference of characteristics between patients lost of follow-up and patients with completed follow-up.

Overview of the somatic CNV landscape
WGS was performed on 7 cancerous and noncancerous paired tissues. The number of reads for a genomic region in the tumor compared to adjacent normal tissue was calculated in order to identify CNVs. The mean sequencing coverage was 2.5× with a range of 2.0× to 5.1 × . 1272, 824 and 2756 CNVs were detected by the CNVseq, BICseq and CNVer algorithm, respectively. After matching CNV areas detected by the three algorithms, 2604 somatic CNVs were identified differently expressed between cancerous and noncancerous tissues, among which, 2488 were amplifications and 116 were deletions. There were 4,6,25,65,132,310,650,1281 and 131 CNVs were detected with total frequency of 11, 10, 9, 8, 7, 6, 5, 4 and 3, respectively. The distribution of CNVs is presented as Circos plots in Figure 1. The locations of these CNVs on chromosomes are listed in Supplementary Table 1. Chr5_262301_297746,  chr5_565873_681306,  chr5_1607662_1663720,  chr19_37756707_37760335,  chr5_209198_257662,  chr5_482152_508485, chr5_843417_1023251, chr5_1157300_1368500, chr1_154919397_154921901 and chr3_129101148_129103476 were the top 10 most frequently mutant CNVs detected by the three algorithms ( Table 2). The top 100 CNVs and mutant driver genes located in these CNVs are listed in Supplementary Table 2.

Target genotyping of CNVs most frequently detected by qPCR
The biological functions of the single genes located in chr3_129101148_129103476, chr5_1607662_1663720 and chr19_37756707_37760335 are undefined, therefore, only the remaining 7 CNVs of interest were followedup on. Totally, there were 17 pairs of primers were successfully designed to amplify the 7 CNVs. However, only 12 primers targeting 12 genes presented optimal amplification, which included PBXIP1, SDHA, PDCD6, SLC9A3, CEP72, TPPP, BRD9, TRIP13, LOC100506688, SLC6A19, SLC6A18 and TERT. Detailed information on the primers used for the 12 genes is presented in Supplementary

CNVs associated with lung adenocarcinoma patient prognosis
CNVs with low amplification (TPPP and SLC6A18) and undefined biological function (LOC100506688) were excluded from further studies on the associations between target CNVs and lung adenocarcinoma prognosis.
From the Cox proportional hazards analysis, significant associations between clinical stage and overall survival and progression free survival were found (Supplementary Table 4 (Table 4 and Figure 2C). Further analysis found that CNV of TERT was correlated with progression free survival of lung adenocarcinoma. Compared with an amplified TERT, those with an unamplified TERT had a 35% reduction (95% CI = 3%-56%) in risk of disease progression. A similar effect was seen after adjusting for age, sex, cigarette use, and clinical staging with an HR of 0.67 (95% CI = 0.45-0.99). In comparison to the amplified variant, the unamplified TERT was associated with a median delay in disease progression of 23.77 months. The 3-year progression free survival rate of patients with an unamplified TERT was 59.0% (95% CI: 45.0%-70.7%) compared to 38.0% (95% CI: 30.0%-45.9%) of those with an amplified TERT (Table 4 and Figure 2D). In addition, the combined effect of the CNVs of PBXIP1 and TERT on lung adenocarcinoma prognosis was assessed. Patients with no amplifications in either gene had a 34.32-month longer median survival time with an HR of 0.51(95% CI = 0.28-0.93) compared with those with both genes amplified. Similarly, cases with both unamplified genes had a 34.55-month delay in disease progression compared with patients with amplifications in both genes (HR = 0.55, 95% CI = 0.32-0.94). Even patients with only one nonamplification had a 11.63-month longer overall survival compared with those with both amplifications (HR = 0.54, 95% CI = 0.35-0.84) ( Table 5 and Figure 2E-2F).
We also explored the effect of the two genes in CNVs and clinical stage on lung adenocarcinoma prognosis. Compared with stage III-IV patients with an amplified PBXIP1, stage I-II patients with an unamplified PXBIP1 had an 85% lower risk of death (95% CI = 72%-92%) and a 73% lower risk of progression (95% CI = 56%-83%). Similarly, stage I-II patients with an unamplified TERT had a 65% lower risk (95% CI = 38%-81%) of death than stage III-IV patients with an amplified TERT and a 70% lower risk (95% CI = 48%-82%) of disease progression (Table 5 and Figure 2G-2J).
The genes CEP72, BRD9, TRIP13, SLC9A3, SDHA, SLC6A19 and PDCD6 did not have any predictive role in lung adenocarcinoma prognosis (Table 4). There was no association between the 9 mutant driver genes and the prognosis of patients with lung squamous cell carcinoma (Supplementary Table 5), which supports the concept that lung adenocarcinoma and lung squamous cell carcinoma are genetically heterogeneous cancer types.

CNVs and expression of PBXIP1 and TERT based on TCGA
From The Cancer Genome Atlas (TCGA) database, whole genome copy number variations of 511 lung adenocarcinoma and whole gene expressions of 512 lung adenocarcinoma were downloaded. Based on the data from TCGA, the amplification rates of PBXIP1 and TERT were 58.8% and 64.1%. Along with an increased copy number, expression of PBXIP1 and TERT mRNA increased significantly. The median gene expressions for unamplified and amplified PBXIP1 were 3448.71 and 4764.52, respectively (P < 0.001). Similarly, the median

DISCUSSION
To our knowledge, this is the first study on the association between CNVs and the prognosis of lung adenocarcinoma in Chinese patients using WGS. Overall, we found a gain in the short arm of chromosome 5 (5p) correlated with lung adenocarcinoma carcinogenesis. Also, the CNVs of PBXIP1 and TERT presented significant upregulation of corresponding gene expressions and were found to be independently predictive of lung adenocarcinoma patient survival. Furthermore, these two structural mutations strengthen the clinical role of stage in disease prognosis.
Deep sequencing is a well-known technology that has resulted in the most comprehensive collection of biomarkers for various diseases, including cancers [19]. The development of sequencing technologies has opened the door to novel methods for detecting genetic mutations using low-coverage sequencing. Recent published studies have demonstrated the comparability of low-coverage sequencing to deep sequencing for detecting large structural mutations, especially for mutations larger than 20 kb [20]. Additionally, three CNV calling algorithms, CNVseq, CNVer and BICseq, were incorporated into the analysis to improve upon statistical power. CNVseq and BICseq captured many more CNVs than CNVer, and BICseq had a higher detection rate than CNVseq for the same CNVs. Assessment of CNVs using three algorithms may reduce the false positives and negatives normally obtained when using a single algorithm.
In this study, a significant role for the short arm of chromosome 5 (5p) was found in carcinogenesis. Previous molecular cytogenetic studies have shown that chromosomal aberrations occur on 5p in all major lung tumor types [21][22][23][24][25]. Besides a role in carcinogenesis, aberrations in 5p are also biomarkers for cancer prognosis. A genome-wide analysis revealed that copy number gains in chromosome 5p were correlated with to the survival of early-stage non-small cell lung cancer [26]. Sarit et al. even presented direct evidence for this with the finding that amplification of genes in chromosome 5p may be responsible for the malignant progression of bronchioloalveolar carcinoma, which is a subtype of lung adenocarcinoma [27]. Telomerase is an enzyme consisting of a reverse transcriptase called telomerase reverse transcriptase (TERT) and an RNA component that adds repeats of a DNA sequence (TTAGGG) to the ends of chromosomes in order to prevent shortening. Telomerase activity is high in embryonic and stem cells but nearly undetectable in most somatic cells, due primarily to transcriptional downregulation of TERT. However, recent identification of highly recurrent point mutations in the TERT promoter in multiple cancer types suggests one potential mechanism for up-regulation of telomerase via reactivation of TERT [21,28]. The importance of TERT reactivation in cellular immortalization and carcinogenesis is supported by its expression in more than 90% of immortal cell lines and tumors. The gain of TERT is the most frequent amplification event occurring in early stage cancers [29]. A recently published whole genome study directly supports our finding by demonstrating amplification of TERT in lung adenocarcinoma [27]. Additionally, overexpression of TERT is a biomarker for the progression of and poor outcomes from lung cancer [30][31][32][33].
Pre-B-cell leukemia homeobox (PBX) interacting protein 1 (PBXIP1) is a scaffolding protein of the PBX-family interacting microtubule-binding protein. It promotes cell migration which is necessary for cancer cell proliferation, migration and invasion through activation of the PI3K/ AKT/mTOR and Raf/MEK/ERK pathways [34]. Little direct evidence has been published for the carcinogenic role of PBXIP1 in lung adenocarcinoma. However, a recent study indicated that the gene was overexpressed in breast infiltrative ductal carcinoma, as well as promoted cell adhesion and migration through modulation of focal adhesion dynamics. Similar overexpression of PBXIP1 was also found in high-grade glioma and ependymoma [35], oral squamous cell carcinoma [36] and liver cancer [37]. Moreover, an amplified copy number of PBXIP1 was found to be predictive of poor outcomes in undifferentiated pleomorphic sarcomas and leiomyosarcomas [38]. Overall, the above evidence supports plausibly role for PBXIP1 in promoting lung adenocarcinoma.
We acknowledge several limitations to our study. First, the modest sample size of the WGS may not have had optimal statistical power to identify and validate some well-known lung cancer-related genes, such as TP53, EGRF, KRAS and BRAF. In our study, these genes were captured by the low-coverage sequencing with relative low frequencies of 19.0%, 28.6%, 28.6% and 9.5%, which was not significant enough to followup on in further association studies. Second, given the small sample size in WGS, we picked up 7 typical lung adenocarcinoma patients with similar histology to scan potential somatic copy number variations associated with the disease. Further validation of the positive findings was conducted in general lung adenocarcinoma patients to ensure the good extrapolation of final results. Since the characteristics of the discovery set and validation Patients with different copy numbers were divided into the two groups of amplification and nonamplification, which were distinguished by a cut-off point of 2 −ΔΔCt as 1.3.
sets were not consistent especially in gender and smoking status, CNVs contributing to female or nonsmoking lung adenocarcinoma may be underestimated. Third, although three copy number calling algorithms were used during analysis, low-coverage sequencing is not as sensitive and specific as deep-sequencing at detecting small structural mutations. This may explain why most of the target CNVs detected were larger than 20 kb. Fourth, we only selected top 7 frequently detected CNVs in the discovery set for further validation. The selection may omit some important CNVs with lower mutation frequency.
In conclusion, this study advances the complete characterization of the genomic CNVs in lung adenocarcinoma in Chinese patients and expands our understanding of tumor biology. Furthermore, a prognostic significance for the CNVs of TERT and PBXIP1 in lung adenocarcinoma was found, which may lead to translation into the clinic and improve outcomes for patients with this fatal disease.

Ethics statement
This study protocol was reviewed and approved by the Institutional Review Board of Huazhong University of Science and Technology. All patients in this study gave written informed consent. This study was carried out in accordance with the recommendations of the Declaration of Helsinki for biomedical research involving human subjects.

Study population
This study included three populations (discovery set, validation set I and validation set II). Discovery set   Progression free survival was calculated by subtracting the date of first treatment from the date of recurrence of, metastasis of or death from lung adenocarcinoma.  was used to scan the differently expressed copy number variations between cancerous and paired noncancerous tissues by WGS. Validation set I was used to verify the frequency of top CNVs found in the discovery set. Validation set II was applied to detect the correlation between mutant driver genes from target CNVs with prognosis of lung adenocarcinoma. The population of discovery set and validation sets were from the same library of lung adenocarcinoma patients, but recruited in different stages. . Participants who had smoked ≥100 cigarettes in their lifetime were defined as "ever smokers", while those who had smoked fewer were classified as "never smokers". Similarly, participants who consumed alcoholic beverages at least once a week for ≥1 year were defined as "ever drinkers", while the remaining cases were "never drinkers". Patients with any first and/or second-degree relative(s) with a history of cancer were defined as ''with a family history of cancer", while the remaining subjects were ''without a family history of cancer''. Patient's clinical data were obtained from medical records. Tumors were staged according to the Union for International Cancer Control (UICC) tumor-node-metastasis (TNM) staging system. The primary endpoint of follow-up was overall survival and the secondary outcome was progression free survival. Overall survival was calculated by subtracting the date when the patient was first treated from the date of death, and patients were censored when lost of follow-up. Progression free survival was calculated by subtracting the date of first treatment from the date of recurrence of, metastasis of or death from lung adenocarcinoma. Patients were censored if death was due to other causes or the annual follow-up was unsuccessful.

Detection of CNVs by WGS
Paired cancerous and noncancerous tissues from 7 typical lung adenocarcinoma patients with similar histology types recruited in 2007 were enrolled into WGS. Using haematoxylin and eosin (H&E) staining, cancerous tissues were identified as areas made up of more than 80% tumor cells, while noncancerous tissues were defined as areas lacking tumor cells. DNA extraction was then performed using the TIANGEN DNA kit (TIANGEN BIOTECH, Beijing, China) according to the manufacturer's instructions. Sequence capture, enrichment and elution from 14 genomic DNA (gDNA) samples were performed by IntegraGen using Agilent in-solution enrichment (SureSelect Human All Exon Kit v2) with the provided biotinylated oligonucleotide probe library (Human All Exon v2-46 Mb). Briefly, 3 μg of each gDNA sample were sonicated and purified to yield fragments of 150-200 bp. Adaptor oligonucleotides were ligated onto A-tailed fragments and enriched for using 4-6 PCR cycles. The purified libraries, 500ng/library, were hybridized to the SureSelect library for 24 h. Then the eluted fraction was PCR amplified for 10-12 cycles and sequenced on an Illumina HiSeq2000 sequencer as pairedend 75-bp reads [39]. Image analysis and base calling were performed using the Illumina Real Time Analysis (RTA) Pipeline version 1.9 with default parameters. Initial analysis of WGS was based on the Illumina pipeline (CASAVA1.7) against the reference genome of hg19. Because none of the algorithms were optimal for the detection of CNV and to improve the power and compensate for the disadvantage of using a single algorithm, three algorithms, CNVseq [40], CNVer [41] and BICseq [42], were used to identify CNVs in each tumor against the matched noncancerous tissues. The packages used for CNVseq, CNVer and BICseq were CNV-seq, BIC-seq2.1.1 and CNVer-0.81, respectively. CNVs called by each algorithm were produced with the corresponding frequency among 7 patients and then matched with each other to find the common parts. The total frequency of each common CNV was summed from the frequencies called by three algorithms. Then CNVs were ranked according to their total frequency. Circos plots were generated for each patient to summarize the results from the CNV analyses [43].

Detection of CNVs by qPCR
To verify the findings from WGS, the frequency of the top CNVs in cancerous tissues from 114 lung adenocarcinoma patients was measured by qPCR. After excluding target genes that had only low levels of amplification or an undefined biological function, verified CNVs were further analyzed for an association with survival in 313 lung adenocarcinomas. To explore whether predictive biomarkers for survival of lung adenocarcinoma may also be applicable to lung squamous cell carcinoma, target CNVs were also measured in 303 lung squamous cell carcinomas by qPCR. Primer Premier 5.0 was used to design primers for each CNV. At least one optimal primer was picked for each target CNV. The primer performance was confirmed to have an r 2 > 0.99 and an amplification efficiency of 90%-110%. The qPCR reaction was performed in a total of 20 ul containing 10 μl SYBR Green I Master mix (Toyobo, China), 0.8 mM primers and approximately 50 ng of template DNA. The housekeeping gene β-glubin was used as an internal control for normalization. The pooled DNA from peripheral blood lymphocytes from 100 healthy subjects was used as the standard. PCR reactions for each sample were performed in triplicate using a StepOnePlus Real-time PCR System (Applied Biosystem). The raw data were analyzed using StepOne™ Software v2.1. Amplification levels were calculated using the 2 −ΔΔCt method, where ΔΔ Ct for a target gene was defined as ( Δ Ct of lung cancer sample −Δ Ct of standard) and Δ Ct was the difference in threshold cycles for the sample in question normalized against the reference gene of β-glubin. Patients with different copy numbers were divided into the two groups of nonamplification and amplification, which were distinguished by a cut-off point of 1.3.

CNV and gene expression data from TCGA
To verify the role of CNVs in gene expressions, whole genome copy number alterations (Affymetrix SNP 6.0 SNP array) and mRNA expressions (RNA seq V2 RSEM) of lung adenocarcinoma were downloaded from The Cancer Genome Atlas Project (TCGA) (https://tcga-data.nci.nih. gov/tcga/tcgaHome2.jsp). According to the recommended cut off point, we divided the segment ratio into CNV as follow rule: unamplification was called if the probe log-ratio ≤ 0.18, otherwise, amplification was called as usual recommended.

Statistical analysis
The Kaplan-Meier curve and log-rank test were used to estimate the differences in overall survival and progression free survival based on individual CNV. The single effect of CNV and combination effects of CNVs and clinical stage on lung adenocarcinoma prognosis were evaluated by Cox proportional hazards model. Furthermore, the Wilcoxon signed-ranks test was used to analyze the association between target CNVs and their gene expression based on data from TCGA. All tests were two sided and with a P < 0.05 was considered significant. All statistical analyses were performed using SAS (version 9.4; SAS Institute, Cary, NC).