Identification of lncRNA biomarkers for lung cancer through integrative cross-platform data analyses

This study was designed to identify lncRNA biomarker candidates using lung cancer data from RNA-Seq and microarray platforms separately. Lung cancer datasets were obtained from the Gene Expression Omnibus (GEO, n = 287) and The Cancer Genome Atlas (TCGA, n = 216) repositories, only common lncRNAs were used. Differentially expressed (DE) lncRNAs in tumors with respect to normal were selected from the Affymetrix and TCGA datasets. A training model consisting of the top 20 DE Affymetrix lncRNAs was used for validation in the TCGA and Agilent datasets. A second similar training model was generated using the TCGA dataset. First, a model using the top 20 DE lncRNAs from Affymetrix for training and validated using TCGA and Agilent, achieved high prediction accuracy for both training (98.5% AUC for Affymetrix) and validation (99.2% AUC for TCGA and 92.8% AUC for Agilent). A similar model using the top 20 DE lncRNAs from TCGA for training and validated using Affymetrix and Agilent, also achieved high prediction accuracy for both training (97.7% AUC for TCGA) and validation (96.5% AUC for Affymetrix and 80.9% AUC for Agilent). Eight lncRNAs were found to be overlapped from these two lists.


INTRODUCTION
Lung cancer is the leading cause of cancer-related mortality worldwide [1,2]. According to a report on 2018 global cancer statistics, lung cancer was the most commonly diagnosed cancer in 37 countries, making up to about 11.6% of total cancer cases for both sexes [2]. Based on histology, lung cancer can be divided into two types: small cell lung cancer (SCLC) and non-small cell lung cancer (NSCLC) which account for about 15% and 85% of lung cancer, respectively [3]. NSCLC can be further subdivided into several subtypes: adenocarcinoma (ADC), squamous cell carcinoma (SCC), adenosquamous carcinoma, undifferentiated carcinoma and large cell carcinoma [4]. In 2014, more than 25% of cancer deaths were attributed to NSCLC [5][6][7]. The overall 5-year survival rate after curative tumor resection is relatively low in lung cancer patients [8] because most already have locally advanced or metastatic disease when diagnosed [9]. Only around 20%-30% of patients have potentially operable, earlystage disease at presentation [9]. Currently, the clinical diagnosis of lung cancer relies mainly on chest X-ray, low dose computed tomography (CT) scans, and other imaging technology which, unfortunately, are encumbered by the harmful effects of radiation and high costs. Although there are invasive methods for auxiliary diagnoses, such as bronchoscopy and biopsy, these methods are painful and time-consuming. Moreover, overlapping symptoms between lung cancer and other chronic respiratory conditions such as cough, dyspnea, chest pain, fatigue, chest infection, hemoptysis, and AGING weight loss often complicate and delay diagnosis [9]. These reasons underscore the important need for noninvasive, sensitive and reliable biomarkers for the early diagnosis of lung cancer.
Although the central dogma of biology states that the flow of genetic information hardwired in the DNA occurs by transcription into RNA and translation into proteins, noncoding RNAs (ncRNAs) are not translated. The many types of ncRNAs are broadly classified into long and small ncRNAs [32]. In recent years, ncRNAs have been studied as potential biomarkers for diagnosis, prognosis, and subtyping [33]. LncRNAs not only participate in a broad range of biological processes such as cell proliferation, migration, invasion, survival, differentiation, and apoptosis [34] but are also involved in tumorigenesis and metastasis in many cancer types [34][35][36]. Certain lncRNAs have been proposed as potential biomarkers associated with tumor initiation, progression or prognosis [37]. Indeed, lncRNA discovery is a very active field in cancer biology research [38] and here, we explored the possibility of lncRNAs as potential diagnostic biomarkers in lung cancer through a meta-analysis of publicly available microarray and RNA-Seq data, using integrative cross-platform data analyses, machine learning, and independent validation.
The majority of papers reporting meta-analyses assembled differentially expressed gene (DEG) lists from published experimental studies and then articulated consistently reported DEGs; or integrated multiple datasets from different microarray platforms and then executed statistical tests to discover consistently expressed DEGs [39]. By contrast, our study was designed to test whether microarray and RNA-Seq generate similar results to identify lncRNA biomarkers and whether these two platforms could validate each other. Using data-mining and machinelearning approaches, we identified 8 lncRNAs as potential diagnostic biomarkers. To test the efficiency of the biomarkers of interest, we evaluated and compared their sensitivity and specificity [40]. We also performed function analysis using The Atlas of ncRNA in Cancer (TANRIC) [41], the Database for Annotation, Visualization and Integrated Discovery (DAVID) [42,43] and Tumor Alterations Relevant for Genomics-driven Therapy (TARGET, accessible at https://software.broadinstitute.org/cancer/cga/target).

Combining datasets
Patient information from the downloaded datasets are summarized in Table 1  Principal component analysis (PCA) performed on the three microarray datasets (GSE18842, GSE19188, and GSE70880) before normalization showed distinct separation of the red, green and yellow components ( Figure 2A). After per sample and per gene normalization, PCA revealed that while the two microarray datasets GSE19188 and GSE18842 from the Affymetrix platform could merge well, the GSE70880 dataset from Agilent did not cluster with the Affymetrix AGING ones. As shown in Figure 2B, the red and green components merged together while the yellow one remained separated.

Identification of most correlated lncRNAs
Based on the PCA, we merged the two Affymetrix datasets, increasing the sample size to a total of 247. We used this merged microarray Affymetrix dataset for training to identify lncRNA biomarkers that are differentially expressed between lung cancer and normal samples, then used the Agilent and TAGA datasets for validation. For the alternative analysis, we used the RNA-seq TCGA dataset, which had a comparable sample size of 216, for training and validated on the Affymetrix and Agilent datasets. The Agilent database  AGING was only used for validation in both analysis streams because of its small sample size (n=40). Results of the first analysis where training was done on the Affymetrix dataset using a two-sample t-test are listed in Supplementary Table 1

Identification of Diagnostic Signature and classifiers
As stated above, we first trained on the merged Affymetrix dataset and validated in both Agilent and TCGA datasets. We used the top 20 differential lncRNAs ( Figure 3) to build a classification model using the BayesNet algorithm. The training model showed good resultsit was able to distinguish cancer from normal samples with a sensitivity of 0.971, specificity of 0.991, and AUC (ROC area) of 0.991 ( Table 2). Results of the validation performed on the TCGA and Agilent datasets were as expected ( Table 2). Validation performed on the TCGA dataset had a sensitivity of 0.991, a specificity of 0.880 and AUC of 0.992; the Agilent dataset had a sensitivity of 0.850, a specificity of 0.900, and AUC of 0.928.
Similarly, when the top 20 differentiated lncRNAs from training done on the TCGA dataset were used to build a classification model using the Voted Perceptron algorithm, we also achieved very good accuracy in separating cancer from normal samples. The training sensitivity was 0.991, specificity was 0.954 and AUC was 0.995 (Table 3). Validation on the Affymetrix and Agilent datasets also gave the expected results. For the Affymetrix dataset, sensitivity = 0.949, specificity = 0.964 and AUC = 0.965, while for the Agilent dataset, sensitivity = 0.600, specificity = 0.950 and AUC = 0.809. Overall, these results suggest that the lncRNAs used in the models are significantly associated with lung cancer and could be used to discriminate tumors from normal samples.
Comparing the top 20 lncRNAs from the Affymetrix and TCGA datasets revealed 8 overlapped lncRNAs ( Figure 3) which were all downregulated in cancer (Figures 4, 5). Interestingly, except for a few lncRNAs      Figure 4C). By hierarchical clustering, we found that the 8 lncRNAs could completely differentiate the microarray datasets ( Figure 5A) and the TCGA dataset ( Figure 5B) into normal and tumor groups. Therefore, we used them for further functional analysis. The Receiver Operating Characteristic (ROC) curves illustrate the diagnostic ability of the two models is pretty strong (Figure 6).

Survival and function analyses
We sought to determine if lncRNAs could predict patient survival (Table 4). LINC02555 has p = 0.0299 which shows statistical significance as a prognostic biomarker. However, since the p equals 0.2564 for this cox regression model, there's no statistical significance between these 8 lncRNAs' high expression levels and low expression levels. In conclusion, only LINC02555 could be a potential prognostic biomarker. The hazard ratio for LINC02555 is 1.026036, which means that around 1.026036 times as people with higher LINC02555 expression level are dying as people with lower LINC02555 expression level. From the Kaplan Meier plot, we can see that the two lines intersect at some points (Figure 7). This means the prognostic ability is not very good.
We also performed functional analysis using the common 8 lncRNAs. This analysis revealed three significant pathways for 3 lncRNAs (Table 5): AC008268.1 in complement and coagulation cascades,

DISCUSSION
Biomarkers from easily accessible tissues such as blood and other body fluids are useful and economical screening tools for various diseases. Such tools are especially important for diseases such as lung cancer where existing diagnostic methods are not able to identify patients at early stages when intervention can be more effective. Indeed, a growing number of studies are using high throughput next-generation sequencing (NGS), especially microarray and RNA-Seq data, to identify diagnostic or prognostic biomarkers for lung cancer [4,37,[44][45][46][47][48]. Most of these studies examined data from only one technology, either microarray or RNA-Seq.
Microarray and RNA-Seq are two popular ways to measure gene expression. These two technologies have been compared in terms of technical reproducibility, variance structure, absolute expression levels, detection of isoforms, and the ability to identify DEGs and develop predictive models [49]. In general, they are comparable when reporting for high-intensity genes; however, microarrays have been shown to have some systematic biases in their estimation of differential expression for low-intensity genes [50]. Identifying mRNA gene markers for lung cancer has been done by many studies, but very few studies focus on integrative data analysis of lncRNA on lung cancer. This is the reason why we choose lncRNA for the study. Here, through bioinformatics integrative analysis of microarray and RNA-Seq datasets, we identified 8 lncRNAs that could be used as diagnostic or prognostic biomarkers for lung cancer. We used Correlation Attribute Eval feature selection on the statistically significant results to find the top 20 most related lncRNAs. At the same time, we chose the most significant 20 lncRNAs according to their p-values, interestingly, we found that they were exactly the same as what feature selection selected. So, in the case    The top 20 lncRNAs from each training were then used in machine learning to build two training models. We used two classifiers, Voted Perceptron algorithm and Bayes Network learning, for machine learning. All the classifiers were tried in Weka and the best results were chosen for each training dataset. Interestingly, we noted that the best results for the two training datasets were from different classifiers, with Bayes Network learning working better for the Affymetrix dataset and Voted Perceptron algorithm for the TCGA dataset. Overall, we found that the model built from Affymetrix was better than the model from TCGA. We also noted that the Agilent dataset as a validation dataset performed comparatively worse in all models. The sample size of the Agilent dataset was small compared to the others and it did not cluster with the other two datasets when PCA was done after normalization. This could be due to Affymetrix and Agilent are different platforms and the ways for them to design the arrays are not the same, either. Also maybe the Agilent dataset sample size is not big enough. The batch effects even exist for the same platform coming from different labs and more often existed from different platforms. Still, because this study aims to find biomarkers for global lung cancer, we decided to include the Agilent dataset in our analysis.
The training models built separately from the Affymetrix and TCGA datasets and validated on the rest of the datasets resulted in two lists of 20 lncRNAs which include 8 common ones that could be diagnostic lncRNAs. Given that both the sensitivity and specificity are greater than 0.9, we can say that these 8 biomarkers can help predict whether the tissue sample is lung cancer or healthy. Moreover, the good performance of these 8 lncRNA biomarkers strongly suggests that they should work as biomarkers for all lung cancer samples, perhaps including subtypes, although subtypes were not explored in this study. Some of these 8 lncRNAs have previously been described in connection with cell biology and cancer. For example, Qiao et al. reported that TBX5-AS1 was down-regulated in lung cancer tissues compared to non-tumor lung tissues, and its expression was linked to unfavorable prognosis in never-smoking female lung cancer patients [51]. Liu et al. reported that GATA6-AS1 was spatially correlated with the transcription factor GATA6 across the genome [52]. In another study, the long non-coding antisense transcript of GATA6-AS was revealed to interact with epigenetic regulator LOXL2 to regulate endothelial gene expression via changes in histone methylation [53]. Also, Chen et al. found that GATA6-AS1 was down-regulated in LUSC patients and was significantly linked to survival time [54]. ADAMTS9-AS2 was found to correlate with bladder cancer patient survival in an analysis of significantly differentiating RNAs [55] and might play a role in early-stage digit development [56]. In a glioma study, ADAMTS9-AS2 was found to be significantly downregulated in tumor tissues compared with normal ones and reversely associated with tumor grade and prognosis. Their analysis showed that low ADAMTS9-AS2 was an independent predictor of poor survival in glioma [57].
Previous studies have described the function of lncRNAs, but in general, their clinical potential is underexplored [37]. Here, we showed for the first time that lncRNAs are promising biomarkers for the diagnosis of global lung cancer that significantly augment CT imaging which often fails to clearly distinguish between benign and cancer states. By performing in silico analysis on existing normal and tumor tissue samples from GEO NCBI and building prediction models, we identified 8 lncRNAs as promising candidate biomarkers with good diagnostic AGING power based on their high sensitivity and specificity. Our results suggest that in the future, by simply testing the expression level of these 8 lncRNAs in the blood or other body fluids and then generating the prediction model, we may be able to tell if there is lung cancer or not. This will be especially useful because, in the clinic, patients prefer non-invasive detection methods like blood tests rather than invasive methods like a biopsy. As such, it is important that these biomarker candidates be experimentally validated in the laboratory using body fluid samples. We are hopeful that if validated in blood samples, we may be able to create a simple blood test to diagnose lung cancer.
Our study is significant because it reports promising biomarker candidates with solid cross-validation bioinformatics data analysis on different platforms of a pretty large sample size. These biomarkers could be tested in blood serum or plasma samples in future studies. Besides providing potential novel diagnostic biomarkers for lung cancer, our study also provides novel candidate molecules and pathways for mechanistic studies on lung cancer development and carcinogenesis and for the development of new targets for lung cancer treatment.

CONCLUSIONS
We identified 8 lncRNAs as potential diagnostic biomarkers for NSCLC through integrative crossplatform data analyses. This data mining and machine learning approach would be an efficient and economical screening method for tumor biomarker discovery. Moreover, we are now in an exciting time in bioinformatics when both high-throughput tools and data are increasingly accessible for tumor biomarker discovery. Our study can also help understand the development of lung cancer and provide potential novel targets for lung cancer treatment.

Overview of the workflow
To detect lncRNAs differentially expressed between healthy and lung cancer tissues, we employed a onefactor (cancer/normal) experimental design in which datasets containing lung cancer samples and adjacent normal tissue samples were selected ( Figure 8). This approach narrows the variation of data and allowed sufficient statistical power. Based on this design, we AGING downloaded three lung cancer microarray datasets, GSE19188 [58], GSE18842 [59] and GSE70880 [44,60], with a total of 287 samples, from the GEO repository, and 216 RNA-Seq samples from TCGA (http://cancergenome.nih.gov/). For the datasets from array-based platforms, GSE19188 and GSE18842 were combined as Affymetrix dataset and GSE70880 was named as Agilent dataset.
LncRNA names were obtained from BioMart and a total of 963 lncRNAs common to every dataset were identified for further analyses. At first, we used the Affymetrix dataset for training. Differentially expressed lncRNAs at FDR adjusted p-value by the Benjamini-Hochberg procedure less than 0.05 using Student's ttests were selected. We uploaded the data to Weka (version 3-8-2) [61], then used correlation Attribute Eval feature selection to get the most statistically significant related lncRNAs. We selected the top 20 differentially expressed lncRNAs to build a model using Bayes Network learning, then performed validation on the TCGA and Agilent datasets.
Next, we used the TCGA dataset for training and selected differentially expressed lncRNAs at FDR adjusted p-value less than 0.05 using Student's t-tests. We applied Correlation Attribute Eval feature selection on the statistically significant results to find the most related lncRNAs and used the top 20 to build the model using the Voted Perceptron algorithm incorporated in Weka. This time, validation was performed on the Affymetrix and Agilent datasets. Of the top related 40 lncRNAs from the two analyses, we identified 8 overlapping lncRNAs. These were further interrogated for survival and function analysis.

The cancer genome atlas (TCGA) datasets as RNA sequencing (RNA-seq) dataset
Lung adenocarcinoma (LUAD) and lung squamous cell carcinoma (LUSC) datasets from TCGA incorporating RNA-Seq FPKM data from 216 lung tissue samples and matched adjacent normal tissue samples were downloaded. The annotations of the TCGA dataset were acquired from BioMart in R (version 3.4.3).

Gene expression omnibus (GEO) datasets as microarray dataset
We conducted a search of the GEO Microarray database using keywords "ncRNA, lung cancer and RNA seq*" to find microarray datasets. We also searched for papers in PubMed that were related to lncRNA, RNA sequencing and lung cancer. We only downloaded microarray datasets with at least 20 lung cancer samples and adjacent normal tissue samples because studies with a smaller sample size would be challenging to merge due to batch effects. The downloaded GEO datasets are: GSE70880, GSE19188, GSE18842. GSE19188 and GSE18842 contain NSCLC samples but the lung cancer types in GSE70880 are unknown. The annotations of each dataset were obtained from BioMart in R (version 3.4.3). The TCGA or GEO datasets were merged based on transcript names and a total of 963 lncRNAs common in all the datasets were included for further analyses.

Data normalization for GEO datasets
First, we used per sample every sample median value across all the lncRNAs in GSE19188, GSE70880, and GSE18842 datasets. Then we performed per gene normalization based on every lncRNA expression median value across all the samples in these three microarray datasets. We used RBoxPlot in Array Studio software to check the normalization results. LncRNAs in the GSE70880 dataset with missing data were excluded. Principal component analysis based on the common 963 lncRNAs was done on the combined datasets. GSE19188 and GSE18842, both from the Affymetrix platform, clustered well, but GSE70880 from the Agilent platform could not cluster with the other two. So, GSE19188 and GSE18842 were combined and named as Affymetrix dataset and GSE70880 was named Agilent dataset for further analyses. RBoxPlot were obtained using Array Studio 10 (Supplementary Figure 1)

Data normalization for TCGA dataset
The lung cancer RNA-seq data from TCGA was normalized based on the Fragments Per Kilobase of transcript per Million mapped reads (FPKM). The FPKM data from TCGA were log 2 transformed after adding 0.1.

Screening for differentially expressed lncRNAs
For each dataset, the difference in expression of lncRNAs between cancer and normal was examined by a twosample t-test. The fold change and regulation direction were then reported. Each of the datasets was tested for differential expression by a two-sample t-test using Array Studio 10. Statistically significant differentially expressed lncRNAs were selected with a False Discovery Rate (FDR) adjusted p-value less than 0.05 and fold change greater than 1.3 in at least one dataset.

Training datasets and validation
The Affymetrix dataset containing differentially expressed lncRNAs was used as a training dataset first.

AGING
Feature selection correlation method was done using Weka (version 3.8.2) and the top 20 lncRNAs were selected for further analysis. Hierarchical cluster analysis was performed using Array Studio 10. Bayesian network classifier (BayesNet) was used on the top 20 Affymetrix lncRNAs and validated in TCGA and Agilent datasets. Sensitivity and specificity were calculated based on the Bayesian network results. The formulas for sensitivity and specificity are: Likewise, the TCGA dataset with differentially expressed lncRNAs was used as a training dataset and Weka (version 3.8.2) feature selection correlation method was used to identify the top 20 lncRNAs. Voted Perceptron was used for classification, and then validated in both Affymetrix and Agilent datasets separately. Sensitivity and specificity were calculated based on Voted Perceptron results. Subsequently, hierarchical cluster analysis was performed using Array Studio to check the expression levels of the eight overlapping lncRNAs. The Receiver Operating Characteristic (ROC) curves were plotted using Weka software.

Survival analysis
Cox regression analysis was performed using the survival package in R. The lncRNAs with a p-value of less than 0.05 were considered associated with survival. This analysis includes all lung cancer samples with survival information available from TCGA.

Function analysis
Overlapping lncRNAs from the top 20 lncRNAs obtained after feature selection from Affymetrix and TCGA datasets respectively, were used for functional analysis.
We used TANRIC to find correlated mRNA, miRNA, protein and somatic mutation with the common lncRNAs in LUSC and LUAD datasets, respectively. The lists of correlated genes were then used in TARGET and DAVID for functional analysis.

AUTHOR CONTRIBUTIONS
YD envisioned the project and designed the work. TZ conducted the data analysis and wrote the manuscript with the help of VK and YD. All authors have read, revised, and approved the final manuscript.