Integrated molecular portrait of non-small cell lung cancers

Background Non-small cell lung cancer (NSCLC), a leading cause of cancer deaths, represents a heterogeneous group of neoplasms, mostly comprising squamous cell carcinoma (SCC), adenocarcinoma (AC) and large-cell carcinoma (LCC). The objectives of this study were to utilize integrated genomic data including copy-number alteration, mRNA, microRNA expression and candidate-gene full sequencing data to characterize the molecular distinctions between AC and SCC. Methods Comparative genomic hybridization followed by mutational analysis, gene expression and miRNA microarray profiling were performed on 123 paired tumor and non-tumor tissue samples from patients with NSCLC. Results At DNA, mRNA and miRNA levels we could identify molecular markers that discriminated significantly between the various histopathological entities of NSCLC. We identified 34 genomic clusters using aCGH data; several genes exhibited a different profile of aberrations between AC and SCC, including PIK3CA, SOX2, THPO, TP63, PDGFB genes. Gene expression profiling analysis identified SPP1, CTHRC1and GREM1 as potential biomarkers for early diagnosis of the cancer, and SPINK1 and BMP7 to distinguish between AC and SCC in small biopsies or in blood samples. Using integrated genomics approach we found in recurrently altered regions a list of three potential driver genes, MRPS22, NDRG1 and RNF7, which were consistently over-expressed in amplified regions, had wide-spread correlation with an average of ~800 genes throughout the genome and highly associated with histological types. Using a network enrichment analysis, the targets of these potential drivers were seen to be involved in DNA replication, cell cycle, mismatch repair, p53 signalling pathway and other lung cancer related signalling pathways, and many immunological pathways. Furthermore, we also identified one potential driver miRNA hsa-miR-944. Conclusions Integrated molecular characterization of AC and SCC helped identify clinically relevant markers and potential drivers, which are recurrent and stable changes at DNA level that have functional implications at RNA level and have strong association with histological subtypes.


Background
Lung cancer is one of the most prevalent and deadliest cancers; it accounts for 13% of all cancer cases and 18% of the deaths in 2008 [1]. Among European men, in 2011 lung cancer deaths are predicted to reach~182,000, with a standardized rate of 37.6/100,000 person-years [2]. The rate among European women is still expanding and may reach 14 to 15 per 100,000 person-years in 2015 [3]. Despite recent advances in surgical and chemo/radiation therapies, the prognosis is very poor, with a 5-year overall survival rate of only~15%, which has not improved over several decades.
The inability to cure lung cancer is related to the advanced stage of the disease at the time of diagnosis and the resistance of the disease to currently available anti-cancer drugs [4]. In its early stages lung cancer tends to be asymptomatic, so at the time of diagnosis most tumors are locally advanced (stage IIIB) or metastatic (IV). Furthermore, even among early-stage patients who are treated primarily by surgery with curative intent, 30-55% will develop and die of metastatic recurrence.
Human lung cancers are classified into two major subtypes, small cell lung cancer (SCLC) and non-small cell lung cancer (NSCLC), with the latter accounting for~80% of all primary lung cancers. NSCLC represents a heterogeneous group of cancers, consisting mainly of squamous cell carcinoma (SCC) and adenocarcinoma (AC), and a much smaller fraction of large-cell carcinoma (LCC). Current clinical management and therapeutics depend on histopathological classification, which is reliable for surgical specimens, but often difficult to assess for small biopsies. This is a serious issue, because in~70% of cases only small biopsies or cytology specimens are available [5].
The objectives of this study were to utilize integrated genomic data including copy-number alteration, mRNA, microRNA expression and candidate-gene full sequencing data to characterize the molecular basis of the distinctions between AC and SCC. Cancer genomes are often unstable and accumulate a large number of mutations and structural or copy-number changes. Thus we cannot tell which genomic changes are 'drivers' and which are 'passengers.' Availability of data from multiple sources has the potential to solve these problems, where, for example functional or physiological impacts of the DNA-level genomic changes can immediately be verified at RNA level. Integrated molecular characterization of AC and SCC may help identify clinically relevant potential drivers, which are recurrent and stable changes at DNA level that have functional implications at RNA level and have strong association with histological subtypes.

Patients and tissue samples
The present study was organized by the CHEMORES initiative (Chemotherapy resistance consortium), which is an EU funded (FP6) Integrated Project involving 19 academic centres, organizations for cancer research, and research-oriented biotechnology companies in 8 European countries.
Tissue samples from a cohort of 123 patients who underwent complete surgical resection at the Institut Mutualiste Montsouris (Paris, France) between 30 January 2002 and 26 June 2006 were analysed. Clinical characteristics are given in Table 1. The median age of patients was 63 years (range 41-85), 34 (28%) were female and 89 (72%) were male. The histopathology of all tumors was reviewed by the same pathologist (JvdO): 50 patients had SCC, 57 AC, 13 LCC and 3 unclassified. Using the new 7th edition TNM staging 56 were stage I, 25 stage II, 28 stage III and 4 stage IV. Adjuvant platinum based chemotherapy was administered to 61 patients. Fifty-nine patients experienced a relapse. Two-year relapse-free survival was 64%, and the median time to recurrence for the cohort was 5.2 years. After a median follow up of 40 months (range 0-92) 36 patients had died and 23 patients were alive with recurrence.
This study was performed using snap-frozen tumor and adjacent normal lung tissue. Samples were handled according to the Tumor Analysis Best Practices Working Group [6]. Haematoxylin and eosin stained frozen sections, taken before and after the cutting of slides for analysis, revealed a median cell content of 85% (an inter-quartile range of 65% to 95%). All tissues were banked after written informed patient consent, and the study was approved by the Ethics Committee of Institut Gustave Roussy (IGR). Genomic investigations were performed at IGR, leader of the Genomic work-package of Chemores consortium, in the genomic center core facility certified ISO9001, labelled European reference and training center for Agilent aCGH data processing and analysis Resulting raw signals and log2 (ratio) profiles were normalized and centered according to their dye composition (Cy5/Cy3) and local GC content. These profiles were segmented with the Circular Binary Segmentation algorithm [7] through its implementation in the DNAcopy package for R v2.8.1 using default parameters. DNA copy number imbalances were detected considering a minimum of 3 consecutive probes and a minimal absolute amplitude threshold that was specific for each profile, accordingly with its internal noise. This specific internal noise was computed as one-fourth of the median of the absolute log2 (ratio) distances across consecutive probes on the genome. Of the 128 aCGH hybridizations performed, 17 were discarded: 7 due to their clinical annotations, 2 due to anomalies in their normal reference, and 8 due to the bad quality of their profile, resulting in 111 usable profiles. All aCGH coordinates in this study are mapped against the human genome as defined by the UCSC build hg18.
To assess the discovery of the genomic regions with differential anomalies between the AC, LCC and SCC populations, ANOVA tests were performed on the segmented aCGH dataset. To account for multiple testing, p-values were transformed to false discovery rate (FDR) [8].

Gene expression and microRNA microarray assay
The lysis of 40 to 50 frozen sections of 10 micron-thickness, cut from each NSCLC tissue sample was done using a Polytron homogenizer (Ultraturrax, IMLAB, Lille, France).
For dual color Cy3 (normal samples) and Cy5 (tumor samples) labelling, Agilent Fluorescent Low Input Linear Amplification kit adapted for small amounts of total RNA (500 ng total RNA per reaction) was used, followed by purification of labelled probes by Qiagen RNeasy Mini kit and by a protocol provided by Agilent. Gene expression profiling was performed with dye-swap, using dual-color 244K Human exon array from Agilent (custom design with the content of the 44K Human genome plus 195,000 probes, one for each exon as defined in refGene list of UCSC build hg18 (http://genome.ucsc.edu/)). Hybridization was carried out for 17 hours at 65°C at 10 rpm, followed by washing steps. Scanned microarray images were analyzed by using Feature Extraction software version 10.5.1.1 (Agilent).
For the microRNA analysis, normal and tumor samples were hybridized on separate arrays. Agilent miRNA Microarray System with miRNA complete labelling and hybridization kit was used for Cy3 labelling. Briefly, isolated total RNAs were dephosphorylated, labelled with pCp-Cy3 and hybridized to Agilent 8x15K arrays for 20 hours at 55°C in a rotating oven (Robbins Scientific) at 20 rpm. Slides were washed and scanned for gene expression using an Agilent G2565C DNA microarray scanner using defaults parameters.

Gene mutations analysis
Sequencing was performed at IGR and at the Royal Institute of Technology (Stockholm, Sweden). DNA was extracted with QIAamp DNA Mini Kit (Qiagen, Hilden, Germany). After PCR amplification of target exons, sequencing reactions were carried out using the BigDye® Terminator Cycle Sequencing Kit (Applied Biosystems, Forster City, CA). The primer sequences are available on request. Sequencing reactions were run on a 48-capillary 3730 DNA Analyzer®. Sequence analysis and alignment was performed with SeqScape® software (Applied Biosystems). All detected mutations were confirmed in at least one independent PCR reaction. In all 123 samples, full coding sequences of exons including oncogenic mutational hotspots were analyzed corresponding to: TP53 (NM_000546.

Gene-expression data processing and normalisation
All processing methods used for gene expression analysis were performed on the median signal from Agilent Feature Extraction raw data files using functions and packages collected in the R Bioconductor project [9] as well as custom written routines.
For gene expression data, dye-swap arrays were first combined (by taking the average of intensities) to obtained only one array per condition. This combination has the result of centering the M values (log2ratios) on zero. Then, flagged spots as well as control spots were removed. Normalization was then performed using the normalize-WithinArrays function from R package LIMMA [10].
For miRNA data, control spots were systematically removed, and flagged spots (gIsFeatNonUnifOL and gIsSaturated columns from raw files) were considered as missing values ("NA"). Array normalization was performed using the least-variant-set method [11].

Differential expression analyses of miRNA expression
To assess differentially-expressed miRNA, we first estimated the fold changes and standard errors between two groups of samples by fitting a linear model for each probe with the lmFit function of LIMMA package in R. Then we applied an empirical Bayes smoothing to the standard errors from the linear model previously computed with eBayes function.

Integrated genomics using Driver-Gene Search algorithm
Motivated by Akavia et al. [12], we considered each gene in an altered region a potential driver; its functional impacts on other genes can then be due to its direct copy number alterations or indirectly via the corresponding expression of the gene. It was then assumed that the functional impact of the driver is mediated through a biological pathway, such as proliferation or invasion, which would contain genes that are correlated to the driver. We identified such pathways by clustering genes into the candidate drivers. Thus the output of this clustering process is a collection of potential drivers together with a set of target genes associated with each driver. We developed the Driver-Gene Search (DGS) algorithm for this purpose, employing transparent statistical analyses to identify the so-called candidate target and driver genes. A complete workflow is shown in Figure 1.
The DGS algorithm consists of 3 broad steps as follows.
Step 1: Identify recurrent copy-number alterations (CNAs) across individuals. This is achieved using the combination of two computational algorithms implemented in MPSS and cnvpack, available from http://www.meb.ki.se/~yudpaw/. Based on a correlated random-effect model for the unobserved patterns, MPSS takes a robust smooth segmentation approach to identify whether a segment is a true CNA [13]. The segmentation threshold is fixed at −0.15 and 0.15 of the intensities for deletions and duplications, respectively. Segments with FDR less than 1e-05, number of probes less than 10, length of segments less than 1 kb are filtered out. As shown in Figure 1, these abnormal regions for individuals become input of step 1 in the workflow. We then apply the cnvpack to identify recurrent CNA regions [14], defined as alterations found in at least 10% of the subjects in whom at least one abnormal region is found by MPSS. Genes / miRNAs in recurrent CNAs are chosen as candidate drivers.
Step 2: Set up initial modules. First identify candidate drivers from CNA regions from Step 1, and compute pair-wise correlation between candidate driver and target genes. Each target gene is grouped with the highest significantly correlated candidate driver gene, using P < 0.001. From this initial pairing, we know that genes within each module are all highly correlated with at least one candidate driver.
Step 3: Find final modules. It is reasonable to assume that these target genes are also possibly correlated with other candidate drivers. In this step, we identify the correlation between several candidate drivers and the target genes in every module using the canonical correlation analysis (CCA). We used an R package SCCA to perform sparse canonical correlation analysis [15].

Network enrichment analysis
Next we functionally characterize the potential driver genes using the network enrichment analyses (NEA) [16]. Target genes are likely enriched with diverse biological processes and pathways that reveal the physiological roles of the driver and the associated genes. We use neaGUI to test whether there is an over-representation of network links between genes associated with their driver and genes in KEGG pathways. The tool is available for download from https://r-forge.r-project.org/projects/neagui/. To maintain an adequate number of genes for testing and a moderate correlation with the driver, we select a threshold of correlation coefficient based on Additional file 1: Figure S4 such that the number of target genes is around 100. For driver miRNA, the tested gene set contains biologically regulated genes obtained from miRBase [17]. The NEA method uses topological information of the gene interaction network, so target miRNAs identified by the DGS algorithm are not included in the NEA for testing.

Validation of expression data analyses
For mRNA expression we used previously published NSCLC expression dataset GSE3141 [18] from the Gene Expression Omnibus website in order to validate some of the results based on mRNA expression. The dataset comprised of 58 AC and 53 SCC samples, and mRNA was extracted from frozen tissue of primary lung tumors. The gene expression was measured on Affymetrix Human Genome U133 Plus 2.0 Array, which we subsequently normalized using global normalization at the log-expression level. If a gene is represented by several Affymetrix probes, we take the average of the log-expression value to represent the gene. For miRNA expression we validated our results using Ming You et al's data GSE29135 [19].

Validation of candidate driver genes
For the candidate driver genes we also attempted to verify the CNA status and the correlation of CNA and expression in an independent dataset previously published by Chitale et al. [20]. Frequency of CNAs was validated in 199 AC cases hybridized to Agilent 44K CGH arrays, where probes were less dense than the 244K arrays used in our sudy. Of the 199 samples, 102 were hybridized to the HG-U133A 2.0 Affymetrix oligonucleotide arrays. The robust multichip average method, which outputs log2-transformed expression values, was used to summarize probe measurements into one measurement for each probe. We assessed copy number status of the 500 kb region surrounding each driver gene. On average, 20 probes were targeted at the driver gene region. A region was considered as amplified if its regional averaged probe intensity was significantly higher than zero, which was the global mean after standardization of the LogRatio intensities within samples. One-sided Welch t-test was used to assess the statistical significance.

Copy-number alteration profiles
ANOVA testing was performed to compare the different histological groups according to their genomic anomalies from the aCGH data. At FDR < 1.0e-5, we identified 168 differential genomic regions. These regions were then merged into 34 genomic clusters within which the frequencies of aberrations for each genomic state and each subpopulation did not vary for more than 1%. These 34 clusters distributed into two main loci: chromosomal regions 3q26.2-29 and 22q12.1-13.1 ( Figure 2). Additional file 1: Table S1 shows the top 34 differential genomic clusters and their genomic annotation of interest, and the gene composition is given in Additional file 1: Table S2.
mRNA expression analyses Figure 3 shows the plots of the first 3 principal components (PC) of mRNA expression data. These plots indicate that the transcriptomic variability in NSCLC is dominated by histological types. Particularly, we observed in Figure 3(a) that AC was well separated from SCC. Indeed, a 15-gene classifier between AC and SCC achieved a cross-validated AUC of 96%, which means that these two types could almost perfectly be separated. The classifier was built using L1-penalized logistic regression, such that the estimates of many regression coefficients are shrunk towards zero. The selected 15 genes have the strongest effect with little shrinkage and can be used to predict AC and SCC. The clear separation between AC and SCC is also shown by the color map in Additional file 1: Figure S1. We validated this result using Bild et al's NSCLC data GSE3141 [18], where the 15-gene signature achieved Candidate targets expression: rest of genes/miRNAs + Input: Figure 1 Flowchart of DGS algorithm. Driver and targets are identified in a three step process, as shown. Candidate drivers are firstly selected from genes/miRNAs that reside in copy-number altered (CNA) regions and filtered by various procedures, for example based on fold-change and consistency between expression level and copy number status. The rests of genes/miRNAs are candidate targets, which are grouped based on correlation with the candidate drivers. Correlation between all drivers and targets in each module is highlighted using sparse canonical correlation analysis (SCCA).
an AUC of 92%. We performed a Monte-Carlo test by taking 1000 random selections of 15 genes and considered each as a signature. For these random signatures we obtained a median AUC~73%, and all had AUCs < 92%, thus giving a p-value < 0.001 for the observed signature. The list of the 15 classifier genes and the probes from both platforms are given in Additional file 1: Table S3.
In Figure 3(b) it appears that LCC and SCC form separate groups. LCC was only a small (n = 12/123) subset of non-small-cell lung cancers; interestingly, while its mRNA profile was not easily distinguished from that in AC (AUC < 50%), it was well separated from SCC (AUC =91%).
The gene expression profiling analysis identified a panel of 5 secreted biomarker candidates SPP1, CTHRC1, GREM1, SPINK1 and BMP7, and 5 non-secreted genes KRT6A, TP63, LGALS7, GCNT3, SPRR2D as IHC biomarker candidates ( Table 2). The distributions of the markers are shown in Additional file 1: Figures S2 and S3. As shown in these figures, the corresponding distributions of these markers in Bild et al's NSCLC data GSE3141 [18] were largely similar to those in our data. The list of the genes and the probes from both platforms are given in Additional file 1: Table S4. SPP1, CTHRC1 and GREM1 are candidate biomarkers to identify the cancer. Additionally, based on the AUC values in Table 2, SPINK1 and BMP7 would help to distinguish AC from SCC. The non-secreted IHC biomarkers can be used for annotation of small biopsies in order to identify the cancer and, except for GCNT3, they are informative for distinguishing AC from SCC.

Driver genes analyses
In total we identified 864 CNAs occurring at least once among 86 subjects, where 55 CNAs occurred in at least 12 (10%) of the total 121 subjects. These 55 highly recurrent CNAs covered~50% of the genes contained in the original 864 CNAs. To further increase our confidence we excluded regions that contained both amplifications and deletions: this resulted in 7 CNAs containing only amplifications and 6 CNAs with only deletions. These CNA regions contained 93 genes. The corresponding genes were further filtered out if their mRNA expression did not vary across subjects (i.e., standard deviation < 0.25); this step resulted in 75 genes in amplified regions and 7 genes in deleted regions.
To explore the impact of CNAs on gene expression, we analyzed the patterns of gene expression in samples with mutations compared to samples with normal copy number. We first note that many amplified genes do not necessarily have a higher expression in tumor compared to in normal tissue and vice versa, suggesting that further filtering is necessary. We thus kept only those genes that (1) exhibited significantly up-regulated gene expression in samples with amplifications compared to non-mutated samples, based on P < 0.05 using one-sided Welch's t test and fold-change > 2; vice versa for samples with deletions, and (2) showed consistent over-expression in tumors with amplifications; vice versa in tumors with deletions. Using these requirements, all genes with recurrent copy number losses were excluded, and 8 genes with recurrent  copy number gains were identified as a final list of candidate driver genes: RNF7, NDRG1, FAM49B, MRPS22, SLC25A36, ACPL2, PPP1R16A and LRRC14. An initial pairing of candidate target and driver genes based on correlation formed 8 modules consisting of 8,600 target genes. The biggest module contained 2,147 target genes and smallest module 336 genes. Using SCCA, we reduced the number of target genes in the modules to an average of 815 target genes. The number of target genes in a module gives us some idea of the importance of a driver gene. It is tempting to speculate that the higher number of targets whose expression is affected by a driver the more likely the mutated driver in this module might play a major role in tumor development. We plotted the number of target genes as a function of the correlation threshold in Additional file 1: Figure S4. The ranking of the candidate driver genes remained the same across the correlation threshold.
We then compared the mRNA expression in samples with vs those without altered genes, and selected four candidate driver genes with fold-change > 4 and P < 0.001: MRPS22, RNF7, NDRG1, FAM49B; see Additional file 1: Figure S5. Next we assessed their predictive value in predicting tumor histology. The individual AUCs using of these four genes were 0.81 (Pvalue = 0.02), 0.75 (0.05), 0.73 (0.07), 0.40 (0.76), respectively; see Table 3. The Pvalues were obtained from a Monte-Carlo test by computing the AUCs of 1000 random genes. We validated this result using Bild et al's data GSE3141, where the candidate driver genes achieved AUCs of 0.74 (Pvalue = 0.04), 0.81 (0.01), 0.72 (0.05), 0.40 (0.77). Thus MRPS22, RNF7 and NDRG1 were significant predictors of histology, but FAM49B does not appear to be specific to any histological type. MRPS22 was over-expressed in 78% and amplified in 16% of the SCC tumors, but showed little amplification or over-expression in AC (Additional file 1: Table S5). RNF7 and NDRG1 showed more amplification and mRNA over-expression in SCC compared to AC.
Next we used an independent data set published by Chitale et al's data [20] to get a bioinformatics validation of the potential driver genes. To assess the copy-number alteration profiles, Chitale et al. used Agilent 44K CGH arrays, which are much less dense than the 244K arrays in our study. Because the sensitivity of CNV detection algorithms is limited by the resolution of the array, we decided to validate the frequency of copy number gains for the candidate driver genes directly, as well as their properties, including the number of correlated genes and relationship between the copy number status and gene expression. (See more details in the Methods Section.) We did find significant copy-number gains for these driver genes. Using a threshold of p-value < 0.001, the frequency of copy number gains was 11.6%, 28.1% and 7.5% for MRPS22, NDRG1 and RNF7, similar or exceeding as we reported in Additional file 1: Table S5 for AC patients from our study. We then performed a one-sided Welch t-test to compare the gene expression level in patients with copy number gains vs the non-mutated samples. We obtained p-values of 0.07, 7.5 × 10 -6 , and 0.2 for MRPS22, NDRG1 and RNF7, respectively. Had we used the p-value threshold of 0.05 in defining the copy number gain, all three candidate driver genes would exhibit significantly up-regulated gene expression in samples with amplifications, with corresponding P-values 0.002, 6.7 × 10 -7 , and 0.0009, respectively, suggesting that expression of the three drivers exhibit the expected positive correlation between the copy number gains and up-regulated gene expression. We also find that the number of genes correlated with driver gene MRPS22, NDRG1 and RNF7 is 395, 219 and 311, respectively, at a correlation coefficient at least 0.4. This large number of correlated genes is similar with what we observe in our data in Additional file 1: Figure S4.

Driver miRNAs
We used the same procedure to identify potential driver miRNAs and examined their predictive ability on histology type. Because of the total number of miRNAs was much smaller than the number of genes in the genome, less stringent filtering procedures were applied. We started with the 864 common CNAs. To further increase our confidence we excluded regions that were amplified in some individuals and deleted in others, but by allowing a mixture of amplifications and deletions at a threshold of less than 10% mixture, we arrived at 84 CNAs containing mainly amplifications and 254 CNAs with mainly deletions. These CNA regions contained 33 miRNAs. Expression of all these altered miRNAs has a standard deviation > 0.25.
To explore the impact of CNAs on miRNA expression, we analyzed the patterns of miRNA expression in samples with mutations compared to samples with normal copy number. We kept only those miRNAs that (1) exhibited up-regulated expression in samples with amplifications compared to non-mutated samples, based on P < 0.01 using one-sided Welch t test; vice versa for samples with Table 3 AUCs of the candidate driver genes and miRNAs deletions, and (2) showed over-expression on average in tumors with amplifications, and under-expression in tumors with deletions. Using these requirements, three miRNAs hsa-miR-16-2*, hsa-miR-570 and hsa-miR-944 with recurrent copy number gains were found, and one miRNA hsa-miR-31* with recurrent copy number losses was identified as a final list of candidate driver miRNAs. Next we assessed the predictive ability of the four potential driver miRNAs in classifying tumor histology; Table 3. Only hsa-miR-944 showed a significant AUC of 88% (Pvalue = 0.001 from the Monte-Carlo test). We validated this result using Ming You et al's data GSE29135 [19], where hsa-miR-944 achieved a median AUC of 78% (Pvalue = 0.001). None of the other miRNAs was significant on the validation data.

Network enrichment analyses
The network enrichment analysis was used to functionally characterize the candidate driver genes and miRNA. In this analysis we assessed the gene-network interaction between the target genes of each candidate driver with the genes in KEGG pathways; this was previously shown to be more powerful than the gene-set enrichment analysis [16]. For each of the three potential driver genes that were also good predictors of histological subtypes, we used the number of target genes in Additional file 1: Figure S4 to guide the choice of threshold of correlation coefficient. A total of 72 target genes of MRPS22 were selected at a correlation threshold 0.6, and 68 and 113 target genes of NDRG1 and RNF7, respectively, at a correlation threshold 0.45. A set of 641 genes was regulated by hsa-miR-944, which best classifies tumor histology among the potential driver miRNAs. The expressions of a miRNA and its target mRNAs are not necessarily correlated, so we did not filter the set of target mRNAs of hsa-miR-944 by correlation coefficient.
The results are presented in Additional file 1: Table S6. For MRPS22 and hsa-miR-944, the top enriched pathways included many pathways known to be involved in cancer, such as DNA replication, cell cycle, mismatch repair, p53 signalling pathway and other lung cancer related signalling pathways. Intriguingly, the top pathways associated with NDRG1 included few known cancer-related pathways, but a large number of immunological pathways, such as Fc gamma R-mediated phagocytosis, B-cell receptor signalling pathway, T cell receptor signaling pathway, Fc epsilon RI signaling pathway and chemokine signaling pathway. RNF7 appeared to be associated with cell shape and motility-related pathways, such as focal adhesion, ECMreceptor interaction and regulation of actin cytoskeleton and adherens junction pathways. Figure 4 shows an example of network from the three driver genes and 23 genes in the pathway of mismatch repair, which was significantly enriched in the target genes of MRPS22. We observed more links connecting to MRPS22 than to RNF7 and NDRG1, in line with the results that the mismatch repair pathway was among the top enriched pathways for MRPS22, but not for the latter two drivers.

Mutation analyses
We observed the following number of mutations from the sequence analyses: KDR 5, AKT1 0, KRAS 20, EGFR 13, PIK3CA 2, BRAF 4, ERBB2 1 and TP53 29 mutations. Further analyses were performed with the KRAS, EFGR and TP53 mutations. There was no evidence of association with time to relapse and no evidence of interaction with adjuvant chemo therapy (smallest P > 0.21 for the three and combined mutations). There was no significant association with tumor stage or lymph-node status.
The most significant result was the association with histology: almost all (18/20) KRAS mutations were found in AC, and none in SCC; 19% of AC had TP53 mutations vs 28% of SCC, but this was not statistically significant (P = 0.41). Since AC and SCC had significantly distinct mRNA expression profiles, we also found significantly distinct mRNA profiles between KRAS + vs KRAS-tumors (AUC = 0.86). For TP53 mutations, there were numerous DE genes (the top 200 genes had FDR~8%), but the AUC for the L1 classifier was only~0.67. This suggests that the multi-functionality of TP53 and its aberrations led to a broad spectrum of transcriptional responses.

Discussion
In this study we analyzed a cohort of corresponding tumor and non-tumor tissue samples at DNA, RNA and miRNA level from NSCLC patients. Integrated molecular characterization of AC and SCC had identified clinically relevant markers and potential drivers, which are recurrent and stable changes at DNA level that have functional implications at RNA level and have strong association with histological subtypes.
The common feature of all of the previous studies was the exploration of tumor samples only. In contrast, the strength of our study design is the use of paired tumor and adjacent normal lung tissue from the same patient. This investigation was performed, for each patient, by comparing perfectly defined histological tumor and normal frozen lung samples obtained during curative surgery. The main advantage of this unique feature is that noise related to the genetic background variability is reduced, which should lead to more tumor-specific molecular data and more sensitive statistical analysis.
From a methodological perspective, our study is the few recent studies [21][22][23][24] to employ an integrative systems biology approach in NSCLC. We found distinct molecular characteristics of AC and SCC, a result supported also by others [25,26]. We also have evidence that LCC does not form a distinct molecular subgroup, but appears to be a subgroup of AC. Among the 34 genomic clusters identified by aCGH, several genes exhibited a different profile of aberrations between AC and SCC, including PIK3CA, SOX2, THPO, TP63, PDGFB genes. These chromosomal regions could be targeted by FISH probes, which might help pathologists to distinguish between the two entities. Thus far the FISH method was mainly used on the one hand to check for certain chemotherapy targets such as EGFR [27] or HER-2 [28]; and on the other hand to define a set of four genetic markers (frequent copy number gains in chromosomes 1q32, 3q26, 5p15, and 8q24) applied to spiral CT-guided FNA cytology samples, which were highly sensitive for the diagnosis of lung cancer and highly specific in their ability to exclude cancer within a given specimen [29].
Our results provide evidence that there are multiple potential biomarkers for early diagnosis and to help pathologists to distinguish between AC and SCC in small biopsies or in blood samples. Besides confirming some existing non-secreted SCC markers such as TP63 and KRT6A, novel findings include using gene expression profiling analysis to identify five secreted genes, among which SPP1, CTHRC1 and GREM1 would be candidate biomarkers to identify the cancer using blood samples; additionally, SPINK1 and BMP7 would help to distinguish AC from SCC as complementary to existing SCC markers. High expression of SPP1, also known as osteopontin, was associated with poor survival of patients with stage I NSCLC [30]. CTHRC1 is known to have an aberrant expression in many tumor types, such as cancers of the gastrointestinal tract, breast, lung and thyroid, but until now this marker was not investigated in differential diagnostics of NSCLC [31]. GREM1 is a member of the aberrantly activated Hedgehog signalling pathway, and has been reported to act in an oncogenic manner in lung adenocarcinoma and can induce cell migration, invasion and proliferation [32,33]. SPINK1 or tumor-associated trypsin inhibitor (TATI) did not appear to be a good tumor marker in lung cancer, since its sensitivity was poor and the correlation between TATI serum levels and stage of the disease and histological type was weak [34]. A recent publication indicated that BMP7 plays a key role in the regulation of lung cancer progression, linking its expression level to lymph node involvement [35].
Using integrated genomics approach we found in recurrently altered regions a list of three potential driver genes, MRPS22, NDRG1 and RNF7, which were consistently over-expressed in amplified regions, had wide-spread correlation with an average of~800 genes throughout the genome and highly associated with histological types. Using a network enrichment analysis, the targets of these potential drivers were seen to be involved in DNA replication, cell cycle, mismatch repair, p53 signalling pathway and other lung cancer related signalling pathways, and many immunological pathways. NDRG1 has been recently reported to predict tumor angiogenesis and poor outcome in patients with lung cancer [36]. Expression of RNF7 has been found to be a prognostic marker in non-small cell lung cancer [37].
Multiple sets of molecular signatures are presented in this study. They are identified in parallel using different approaches with different intentions and could help to better characterize NSCLC from various aspects. Ten markers, composed of a panel of 5 secreted and 5 non-secreted biomarker candidates, are identified as top-ranking differentially expressed genes, i.e. by comparing the direction of relative mRNA expression between tumour vs normal tissues. In clinical practice, these potential biomarkers may be used for early diagnosis of the cancer, and additionally to distinguish between AC and SCC in small biopsies or in blood samples. In contrast, three potential drivers are selected using an integrated genomics approach, called Driver-Gene Search algorithm, which combines information on both of DNA and RNA levels. The DGS algorithm attempts to select potential driver genes that may have functional impact on the expression of other genes due to copy-number alterations of the driver genes. There is no guarantee that the driver genes are ranked as the top differentially expressed genes. The findings of the drivers provide a clue for downstream experimental analysis to understand the molecular mechanisms of the development of lung cancer.

Conclusions
In summary, an understanding of the molecular mechanisms involved in formation of various NSCLC subtypes is crucial for the development of efficient differential diagnostics methods to better distinguish between tumor entities even in small biopsies. Our results provide evidence that there are multiple molecular signatures which could help pathologists to diagnose small tissue samples with NSCLC. Novel findings include differentially expressed sets of secreted and non-secreted genes that may help in the diagnosis and classification of NSCLC on serum or tissue samples. The driver-gene search algorithm for integrating genomic data, mRNA and miRNA expression identified potential driver genes, which are useful for follow-up experimental validation. The findings of this study should help to instigate others to implement these in clinical practice.

Data availability
The microarray data related to this paper have been submitted to the Array Express data repository at the European Bioinformatics Institute (http://www.ebi.ac.uk/ arrayexpress/) under the accession numbers E-MTAB-1132 (GE), E-MTAB-1133 (CGH) and E-MTAB-1134 (MIR).

Additional file
Additional file 1: Figure S1. Color map of the log2-expression ratio for AC and SCC patients using 657 gene-probes. Each column represents a patient and each row a probe. Figure S2: Boxplots of the log2expression ratio for the top 5 secreted biomarkers found in Chemores and the corresponding log-expression values in Bild et al's NSCLC data GSE3141. Figure S3: Boxplots of the log2-expression ratio for the top 5 non-secreted biomarkers and the corresponding log-expression values in Bild et al's NSCLC data GSE3141. Figure S4: Drivers and number of their targeted genes given on the y-axis. Figure S5: P-values of Welch's t test on -log10 scale (left) and fold change of driver genes' expression (right) are given on x-axis; Number of targeted genes with a correlation coefficient at least 0.3 are given on the y-axis. Genes in the topright area are considered in a predictive model of lung cancer histology. Table S1: List of the 34 clusters of the most differential genomic regions between AC, LCC and SCC populations. Table S2: List of the 34 clusters of the most differential genomic regions between AC, LCC and SCC populations with the known genes within each cluster. Table S3: List of the 15 classifier-genes with the corresponding probes on Agilent 244K and Affymetrix U133 Plus 2.0 arrays. Table S4: List of the 10 potential biomarker genes with the corresponding probes on Agilent 244K and Affymetrix U133 Plus 2.0 arrays. Table S5: List of 4 candidate driver genes and 4 candidate driver miRNAs, their tumor expression levels and copy-number alteration status in AC and SCC. Table S6: Network enrichment analysis of target genes of MRPS22, NDRG1, RNF7 and hsa-miR-944. The top and bottom 20 ranked pathways are shown.