Predicting the Clinical Outcome of Lung Adenocarcinoma Using a Novel Gene Pair Signature Related to RNA-Binding Protein

Adenocarcinoma is the most common type of lung cancer, and patients have varying prognoses. RNA-binding proteins (RBP) are deemed to be closely associated with tumorigenesis and development, but the exact mechanism is currently unknown. This study was aimed at constructing a new robust prognostic model based on RNA-binding protein-related gene pair scores for better clinical guidance. The model for this study was constructed based on data of lung adenocarcinoma from The Cancer Genome Atlas (TCGA) database. Prognosis-related RBP gene pair models were created based on differentially expressed genes, and the accuracy of the models was verified in a different age, staging, and other subdatasets. A total of 379 RNA-binding protein-related genes were differentially expressed in tumor tissue. From these genes, we constructed a prognostic model consisting of 33 gene pairs, which were found to be significantly associated with survival in TCGA dataset (P < 0.0001, hazard ratio (HR) = 4.380 (3.139 to 6.111)) and different subdatasets. As expected, the results were verified in the GEO validation cohort (P = 7.8 × 10−3, HR = 1.597 (1.095 to 2.325)). We found that the signature exhibited an independent prognostic factor in both the univariate and multivariate Cox regression analyses (P < 0.001). CIBERSORT was applied to estimate the fractions of infiltrated immune cells in bulk tumor tissues. CD8 T cells, activated dendritic cells, regulatory T cells (Tregs), and activated CD4 memory T cells presented a significantly lower fraction in the high-risk group (P < 0.01). Patients in the high-risk group had significantly higher tumor mutational burden (TMB) (P = 4.953e − 04) and lower levels of immune cells (P = 3.473e − 05) and stromal cells (P = 0.005) in the tumor microenvironment than those in the low-risk group. Furthermore, the Protein-protein interaction (PPI) network and various enrichment analyses have genuinely uncovered the interrelationships and potential functions of the RBP genes within the model. The results of the present study validated the importance of RNA-binding proteins in tumorigenesis and progression and support the RBP gene-related signature as a promising marker for prognosis prediction in lung adenocarcinoma.


Introduction
Non-small-cell lung cancer typically includes lung adenocarcinoma and lung squamous carcinoma, with lung adenocarcinoma accounting for about 40-55% of all lung cancer patients [1]. Unlike lung squamous carcinoma, lung adenocarcinomas tend to originate in the smaller bronchial tubes and are therefore more likely to occur in the peripheral lobe of the lung and are more likely to occur in women and nonsmokers. The prognosis of patients with lung adenocarcinoma is usually closely related to the pathological type of the tumor, the pathological stage, the genetic characteristics of the patients, and the selectivity of surgery or targeted therapy options [1,2]. Overall, patients with larger inoperable tumors or early metastases usually have a poorer prognosis and reduced survival. For earlier, more localized lung adenocarcinoma lesions, surgery has been the preferred option with excellent results [3]. In inoperable patients, a puncture biopsy of the lesion, genetic testing, and immunohistochemical testing are usually performed clinically to determine whether molecular targeted therapy or immunotherapy can be used properly [4]. Commonly used targeted therapeutic agents primarily target tumor angiogenesis and epidermal growth factor receptors. Immune checkpoint inhibitors targeting the PD-1/PD-L1 pathway have shown significant efficacy in the treatment of some patients [5]. Patients with poor immunotherapy or targeted therapy will require systemic chemotherapy and chemotherapy.
RNA-binding protein (RBP) is a category of proteins that accompany RNA to regulate the metabolic process and bind to RNA. Approximately 60% of RNA-binding proteins were found to interact extensively with chromatin and are enriched in gene promoter and enhancer regions [6]. RNAbinding proteins are core components of various ribonucleoprotein (RNP) complexes and are essential for posttranscriptional gene regulation (PTGR). RNA-binding proteins are involved in multiple aspects of RNA metabolism, including RNA splicing, RNA translocation, intracellular localization, and translation control [7]. Although RNA plays an indispensable role in RNA metabolism in organisms, its role in different diseases, especially in the development and progression of cancer, is still poorly understood. However, from the few relevant studies, it appears that RNA-binding proteins are closely related to both tumor origin and progression [7][8][9]. For instance, Zhang et al. found that a natural compound, neobractatin (NBT), significantly upregulates the expression of the RNA-binding protein Muscleblind-like 2 (MBNL2), which in turn inhibits tumor metastasis [10]. Another study on melanoma found that the RNA-binding protein ELAVL1 was overexpressed in tumor tissue and significantly correlated with tumor progression and prognosis, promoting tumor formation and inhibiting cancer cell senescence [11]. The critical role of RBP has also been found in lung cancer research; for instance, Li et al. found that the expression of several essential RBP genes is strongly associated with the prognosis of lung adenocarcinoma [12]. However, the diversity of data from different sequencing platforms and the heterogeneity of tumors also affect the integration and analysis of large amounts of gene expression data. The standardization of crossplatform data is also a critical and challenging point for analysis. Recently, a new method based on relative sequencing of gene expression levels has been developed to overcome the shortcomings of traditional gene expression data processing and has yielded stable and reliable results in several studies [13][14][15].
In our research, the expression levels of a range of RBPrelated genes within each tumor sample were compared in pairs using a novel method, ultimately generating a score for each gene pair [13,15]. Scoring for this gene pair-based approach is based entirely on gene expression profiles within a single tumor sample. It does not need to be normalized across samples to account for differences between multiple samples or sequencing platforms [15]. We used The Cancer Genome Atlas (TCGA) RNA-seq dataset to construct the gene pair signature and to validate it by stratifying the dataset and by using the Gene Expression Omnibus (GEO) dataset. Subsequently, we confirmed the efficacy of this immunomarker in predicting tumor prognosis by comparing it with other clinicopathological information. The relationship between the signature and other prognosis-related factors, including tumor-infiltration lymphocyte cell content, tumor mutational burden (TMB), and tumor microenvironment, was further explored.

2.
1. Data Sources of Lung Adenocarcinoma. The HTSeq-FPKM RNA-seq expression data, Masked Somatic Mutation data based on the "VarScan2 Variant Aggregation and Masking" workflow, and corresponding clinical data of 522 lung adenocarcinoma patient samples were retrieved from The Cancer Genome Atlas (TCGA) program dataset (https:// portal. https://gdc.cancer.gov). Another validation dataset (GSE72094) was extracted from the Gene Expression Omnibus (GEO) database with corresponding survival information (http://www.ncbi.nlm. http://nih.gov/geo). The GSE72094 dataset of 442 lung adenocarcinoma patients was published on Oct 21, 2015, and based on the GPL15048 platform. Patients with overall survival time (OS) less than one month or missing survival information were excluded from the study. In total, 477 cases retrieved from TCGA database and 386 cases from the GEO database were recruited and analyzed in the present study.

Gene Expression Data
Processing. The RNA-seq expression data was HTSeq-FPKM type. The expression profile data for each gene was converted to the corresponding gene symbol from the probe level according to the annotation file. No further standardization of the expressed data is required. If the patient has multiple samples, take the average expression value of each gene to represent the gene expression level of the patient. If there are multiple probes for a single gene, the average expression value will be taken as the expression level for that gene.

2.3.
Modeling of the RBP-Related Gene Pair (IRGP) Signature. The extraction of RNA-binding protein-related genes was determined based on outstanding research published by Gerstberger et al.'s team in 2014. In total, they identified 1542 RNA-binding protein-related genes that are highly relevant to RNA metabolism [16]. We then calculated genes with differential expression between tumor samples and normal samples by applying the R package "limma." We used these genes as candidate genes for the next gene pair analysis. The specific method of constructing the prognostic model is as described in the previous study [15]. Briefly, we performed a pairwise comparison to obtain a score for each gene pair between the gene expression values within each sample in the TCGA cohort. The score of a specific gene pair was set to one when the expression level of the first gene was higher than the other; otherwise was zero. We would discard the pairs if more than 90% of the scores of a pair were identity in the samples. We eventually constructed a signature of 33 IRGPs using the Lasso Cox proportional risk regression model. We then stratified patients into low-and high-risk groups using the most appropriate cutoff value. We used the "survivalROC" R package (using R package "survival-ROC," version 1.0.3) to obtain the cutoff value by the timedependent receiver operating characteristic (ROC) curve analysis at three years for overall survival in TCGA dataset.

Prognostic
Value of the Signature in the TCGA Cohort and Subcohorts. Survival analysis by the log-rank test was performed between the different immune risk groups. Subsequently, both univariate and multivariate Cox proportional hazard regression analyses of the risk factor and other clinical factors for the overall survival were performed in TCGA cohort. Pathologic stage and gender were converted as continuous variables. Stage I to stage IV were transformed into 1 to 4. We also randomly divided TCGA dataset into two different subcohorts and even split it into different subdatasets based on clinical characteristics such as age, gender, and pathological staging to verify the accuracy and validity of our model building in different subdatasets. Patients were divided into older and younger groups according to the median value of their age.

Dataset
Validation of the Signature. To further prove the prognostic value of the signature in different cohorts, we applied the risk model to another independent cohort from the GEO database (GSE72094) for validation. We performed the same univariate and multivariate analyses as in TCGA cohort. Patients without matching clinical information will be excluded from the study.

Estimation of Immune Cell Abundance in Tumor Tissue.
To analyze whether there were differences in the immune cell abundance of the tumor tissue in different risk groups, we used CIBERSORT (https://cibersort.stanford.edu/) to evaluate the relative fraction of predefined cell types in mixed solid tissues. The data used were normalized gene expression data of the tumor tissue [17]. We used the default LM22 leukocyte gene signature matrix from the CIBERSORT website. LM22 contains 547 genes distinguishing 22 types of immunerelated cells. Disabling quantile normalization was checked. We set the number of permutations to 1000 for robust analyses. Then, CIBERSORT enumerated the relative proportions of the 22 infiltrating immune cells, including B cells, dendritic cells, T cells, natural killer cells, and macrophages.

2.7.
Estimation of the Tumor Microenvironment (TME) and TMB. TMB usually refers to the number of somatic cell mutations detected per million bases, including gene coding errors, base substitutions, gene insertion, or deletion errors. After calculating TMB values for all samples, patients were grouped according to the previous risk cutoff values to explore whether there was a difference in TMB between highand low-risk groups. TME is a general term for immune infiltrating cells and stromal cells in the tumor tissue other than tumor cells. Based on the RNA expression data, we used R package "estimate"(version 1.0.13) to score the immune microenvironment of all tumor tissue samples and scored the immune cell content and stromal cell content, respectively, to calculate the final tumor purity. We then grouped patients according to the previous prognostic model to compare whether there were differences in the TME between high-and low-risk groups.
2.8. Gene Ontology (GO) and the Kyoto Encyclopedia of Genes and Genomes (KEGG) Pathway Functional Enrichment Analysis and Gene Set Enrichment Analyses (GSEA). GO and KEGG enrichment analysis was performed utilizing genes in the signature. We completed the GO and KEGG pathway analysis using R packages ("enrichplot," "cluster-Profiler," and "ggplot2") [18]. We used P value < 0.05 and Q − value < 0:05 as the threshold for GO and KEGG enrichment analysis. GSEA is used to assess the distribution trends of genes in a predefined set of genes in a gene set sequenced for phenotypic relevance and thus to determine their contribution to the phenotype [19]. We applied the GSEA software (Version 4.0.3, http://software.broadinstitute.org/gsea/) with 1,000 phenotype permutations for GSEA. The threshold of statistically significant gene sets was set to nominal P value < 0.05 with an FDR-adjusted Q − value < 0:25. We classified the patients into two groups according to their risk values. We then performed a GESA to compare whether there were pathways of differential enrichment between the two groups. MSigDB oncogenic signature gene sets (version 7.1, https:// www.gsea-msigdb.org/gsea/downloads.jsp) were applied in the GSEA.

Protein-Protein Interaction (PPI) Network of Genes in the
Signature. The genes that make up the prognostic model were used to construct the PPI network to analyze the intrinsic function of the model. The PPI network construction was based on the STRING database (https://string-db.org/), and we subsequently applied Cytoscape software (version 3.8.0) for the reconstruction and visualization of the network.

Statistical Analyses.
Statistical analyses were mainly performed on R software (version 3.6.3, http://www.r-project .org). Survival analyses were performed using the "survival" package (version 3.1-11) with the Kaplan-Meier method. We used the Student two-sample t-test or Wilcoxon ranksum test to compare the continuous variables. The "survival" package also calculated the RMS curve and time ratio. For all analyses, the statistical threshold was set to P value < 0.05.

Differential Expression Analysis of RBP-Related Genes.
A total of 1542 RBP-related genes that are highly relevant to RNA metabolism were recruited in the study. Overall, 379 RBP genes were differentially expressed in tumor tissues compared to normal tissues (FDR P value < 0.05 while LogFC value > 0:5). Two hundred forty-six genes were upregulated in expression, and 133 were downregulated in the tumor tissue (see Table S1 for details). The heat map and volcano plot of gene differential expression are detailed in Figures 1(a) and 1(b).

Construction of the RBP-Related Gene Pair Signature.
Gene expression data with corresponding clinical data of the TCGA cohort (n = 477) was used as an exploratory dataset. Genes with an average expression greater than 0 and genes with median absolute deviation ðMADÞ > 0:5 are included in the subsequent analysis. Patients with OS less than 30 days or without corresponding survival information were excluded. We used a total of 380 differentially expressed RBP genes as candidate genes for constructing the prognostic model. After rigorous screening to remove relatively small variation genes (MAD = 0), only 339 candidate genes were left for further study. Finally, a series of 33 gene pairs were recruited in the risk model using Lasso Cox proportional hazard regression from TCGA cohort (Table 1). Using the model, we can calculate a risk score for each sample. The fittest cutoff value of the IRGP risk score was set at −0.075 using a timedependent ROC curve analysis. We then stratified the dataset into the high-or low-risk group according to the cutoff value (see Figure S1 in Supplementary Materials). Significantly, compared to the low-risk group, the high-risk group in the exploratory TCGA cohort exhibited an even worse OS (P < 0:0001, hazard ratio ðHRÞ = 4:380 (3.139 to 6.111)) ( Figure 2). The risk curves plotted in TCGA dataset based on model scores are shown in Figure 3(a). Significant differences in the pathologic tumor stage, T stage, and N stage were demonstrated between groups related to OS in the univariate Cox analysis (P < 0:001). However, in the multivariate Cox, only the phenotype of the signature exhibited a robust independent prognostic factor (P < 0:001) (see in Figure S2(a) and Table 2 for details). Highly similar results were found in the two randomly divided TCGA subcohort and other subcohorts stratified by age, gender, pathologic stage, and N stage. Risk subgroups obtained in each subcohort were consistently significantly correlated with survival prognosis (see in Figure 2).

Signature Validation in the GEO Dataset.
Using the risk score cutoff, we stratified the patients in the GEO validation cohort into high-and low-risk groups. Consistent with the findings previously obtained in TCGA dataset, a significant difference of OS was found between the two groups (P = 7:8 × 10 −3 , HR = 1:597 (1.095 to 2.325)) (see in Figure 2). Notably, the RBP gene pair signature remained an independent predictive value of OS in both the univariate and multivariate Cox analyses in the validation dataset (P < 0:001) (see in Table 2 and Figure S2(b)). And the risk curves were similar to TCGA dataset, and high-risk patients also had a poorer prognosis (Figure 3(b)).

Immune Cell Infiltration between Different Risk Groups.
CIBERSORT was used to estimate the fractions of 22 infiltrated immune cells using the RNA-sequence data. We used a threshold of P < 0:05 to rule out unreliable results. Among the 535 tumor samples in TCGA, only 477 tumor samples were eligible for further analysis. The relative abundance of parts of the 22 infiltrated immune cells exhibited significant differences between the high-and low-risk groups (Figure 4). Compared to the low-risk group, the proportion of M0 macrophages, M1 macrophages, activated CD4 memory T cells, activated mast cells, eosinophils, and resting NK cells exhibited higher fraction in the high-risk group (P < 0:01). Conversely, memory B cells, resting dendritic cells, resting mast cells, monocytes, activated NK cells, regulatory T cells (Tregs), and restingCD4 memory T cells presented a significantly lower fraction in the high-risk group (P < 0:01) (Figure 4). Naive CD4 T cells were present in only two samples, so we did not include this type of cell in the statistical analysis.
3.5. Estimation of TME and TMB. Based on the prognostic risk model, we divided both TMB and TME data into high and low-risk groups and used the Wilcoxon signed-rank test    BioMed Research International to explore differences across risk groups. Notably, both the TMB-and TME-related indicators in the high and low-risk groups showed significant differences. Immune and stromal cells scored lower in tumor tissues in the high-risk group than in the low-risk group, and the corresponding average purity of tumors was higher than in the low-risk group (Figures 5(a)-5(c)). Besides, higher TMB values were present in the higher risk group than in the lower risk group ( Figure 5(d)).

GO and KEGG Functional Analysis of Genes in the
Signature. For the above 49 genes that make up the 33 genes pair prognostic model, we used GO and KEGG analyses to explore the closely related functions and pathways of these genes. Results of GO enrichment analysis revealed that the signature genes were enriched significantly in ncRNA processing, catalytic activity (acting on RNA), cytoplasmic ribonucleoprotein granule, and other GO terms (P < 0:05 and Q − value < 0:05) (Figure 6(a), Figure S3(a), and Figure S4). Results of the KEGG functional enrichment analysis revealed that genes in the signature were significantly enriched in five KEGG pathways (P < 0:05 and Q − value < 0:05). Among them, the pathways of RNA transport and RNA degradation were most significantly enriched ( Figure 6(b), Figure S3(b), and Table S2).

GSEA Based on Risk Scoring.
Since the established signature was found to be highly correlated with prognosis, we then attempted to explore their functional implication and intrinsic association through enrichment analysis. We classified patients into high-and low-risk groups based on gene pair risk score cutoff value and used this risk classification as a phenotype for GSEA of TCGA cohort. As a result, we found the enrichment of three oncogenic signatures gene sets in the high-risk group, including "JAK2_DN.V1_DN," "MTOR_UP.N4.V1_DN," and "CSR_EARLY_UP.V1_DN" (significant at FDR Q − value < 25% and nominal P value < 1%) (Figure 7 and Table S3), which suggests a crucial role in lung adenocarcinoma progression and prognosis of these significantly enriched gene sets.

PPI Network Construction of the Genes in the Signature.
To better explore the potential function of the genes in the signature, we constructed a protein-protein interaction network using the STRING database. A total of 40 gene nodes and 69 edges are included in this PPI network ( Figure 8 and Table S4). Most of the genes that make up the PPI network are overexpressed in the tumor tissue.

Discussion
As the most common pathological type of lung tumor, lung adenocarcinoma usually occurs around the lobe of the lung. In particular, a large proportion of lung cancers detected at an early stage are lung adenocarcinomas. The prognostic profile of lung adenocarcinoma is diverse, and overall, patients who are identified early and can undergo surgery survive much longer than patients who are no longer considered for surgery. Patients who can benefit from targeted therapy or immunotherapy have more prolonged survival and higher quality of life than those on regular chemotherapy, based on gene mutations and immunohistochemistry. According to previous studies, an excellent prognostic marker or model helps in predicting patient survival and better clinical management [20,21]. Recently, Ling et al. used differentially expressed genes closely related to the tumor microenvironment to construct models to predict patient prognosis and explore the relationship between patient responsiveness to immunotherapy and the tumor microenvironment [21]. Zhao et al. created a tumor immunoscore clinical prognostic signature of lung adenocarcinoma using 109 immune-associated genes and validated its accuracy in different datasets [20]. However, the clinical applicability of these biomarkers remains limited due to tumor heterogeneity and sequencing technical problems. In particular, the issue of standardization of the data from different sequencing platforms is also a challenge in clinical applications. Therefore, in our study, to eliminate the influence of different platforms and interindividual standardization on the results, we introduced the concept of gene pairs. And by assigning the size of a particular pair of RBP gene expression values, we obtained a new predictive model that is more suitable for individual studies and clinical application. As described by Li Figure 2: Survival curves for different risk groups in TCGA and GEO datasets. According to the optimal cutoff value, patients from different cohorts were stratified into the high-or low-risk group. Kaplan-Meier curves were used for survival analyses between different risk groups in different datasets: TCGA cohort (a), two TCGA validation subcohort (b, c), and the GEO validation cohort (d) groups stratified by pathologic stage (e, f), age (g, h), gender (i, j), and N stage (k, l). 7 BioMed Research International within a single sample [15]. Several previous studies have confirmed the availability and accuracy of this immune gene pair method in predicting overall survival in different types of cancer, including serous ovarian carcinoma and hepatocellular carcinoma [13][14][15]22].
Although the exact mechanism is not yet clear, a growing body of research suggests that RBP plays an essential role in tumorigenesis and development [23,24]. A study of glioblastoma found that expression levels of eight RBP-related genes were strongly associated with tumor prognosis, and  IGF2BP1|SRSF12  IGF2BP1|PABPC3  IGF2BP1|ZC3H12D  IGF2BP3|RPP40  IGF2BP3|OASL  MRPL38|TSEN54  MRPS12|DCP1A  DDX52|STRBP  OAS1|POP7  WDR4|PARS2  MAGOHB|PARS2  PPIL4|DARS2  WDR46|RAE1  SKIV2L|MRPS24  SKIV2L|DCPS  TDRKH|ZC3H12C  TDRKH|URB1  TDRKH|ZC3HAV1L  SNRPA1|PABPC1L  MRPL54|OAS1  MRPL54|DDX56  MRPL54|BZW2  TFB2M|RBPMS  DARS2|HINT3  RNPC3|ERI2  FASTK|SPATS2  SMAD9|ERI2  FASTKD3|ZC3H12C  PABPC3|ZC3H12D  INTS8|SKIV2L   identified PTRF and FNDC3B among them as potential prognosis-related biomarkers [23]. The RNA-binding protein PSF was found to play a pathophysiological role in ERpositive breast cancer by posttranscriptional regulation of the expression of its target genes ESR1 and SCFD2. Furthermore, PSF and SCFD2 were recognized as potential diagnos-tic and therapeutic targets for primary and hormonerefractory breast cancer [24]. To explore the function of RBPs in lung squamous cell carcinoma (LUSC), Li et al. obtained 300 RBP-associated genes that are aberrantly expressed in tumor tissues and ultimately screened 9 nine genes and successfully constructed a prognostic signature [7]. In our study,   (Table 1) was identified to predict overall survival for lung adenocarcinoma patients. Some of these genes have been reported in previous studies to be associated with tumor development and prognosis [25,26]. It was found that increased expression of INTS8 in the tumor tissue was associated with a more reduced overall survival and prognosis in patients with HCC. The authors suggested that the primary mechanism is to accelerate the epithelial-to-mesenchymal transition (EMT) by upregulating the TGF-β signaling pathway to promote tumor progression and metastasis [25]. IGF2BP1 also plays crucial roles in the generation and outcome of human tumors, such as the presence of high expression of IGF2BP1 in pancreatic tumor tissues, and its expression level is significantly associated with patient prognosis [27]. Similarly, DCAF13 has also been found to be highly expressed in a variety of tumors, including breast cancer, leading to a poorer tumor prognosis [28].
Researchers have conducted relevant studies in many of these cancers, quantifying these tumor-infiltrating cells and correlating their abundance with tumor features and outcomes [29]. Previous studies have provided substantial evidence to support a favorable prognosis and outcome for malignant melanoma with abundant infiltration of TILs [30,31]. In the current study, by using the CIBERSORT platform, we have estimated the relative fractions of 22 TILs in tumor tissues from TCGA cohort using the CIBERSORT platform. Using the prognostic signature cutoff, we divided the 22 TILs that resulted in high-and low-risk groups and compared whether differences exist in the content of each type of cell between the two groups. Significant differences in the relative fraction of infiltrated immune cells in tumor tissue were observed between the two different risk groups. In our study, we found more M1 macrophage, activated CD4 memory T cells, activated mast cells, and eosinophil infiltrates in the high-risk group.
Tumor-associated macrophages (TAM) not only prevent T cells from attacking tumor cells but also secrete growth factors that nourish tumor cells and promote tumor angiogenesis, leading to tumor cell expansion and metastasis [32][33][34].   [32]. Both eosinophils and mast cells belong to inflammatory cells. These inflammatory cells are sought to contribute to barriers to antitumor immunity [35], because the inflammatory environment in the tumor tissue is believed to promote the development and progression of the tumor according to previous studies [36]. And conversely, we found that monocytes, activated NK cells, and Tregs presented a significantly lower fraction in the high-risk group. NK cells are critical immune cells in the body and are lymphocytes that can nonspecifically kill tumor cells and viral cells without presensiti-zation. Lower levels in the tumor tissue predicted reduced tumor-killing capacity, favoring tumor progression, and a poorer prognosis. Also, the high-risk group in our study had significantly increased TMB levels, which may suggest whether tumor patients could benefit from tumor immunotherapy, such as programmed death receptor 1 (PD-1) inhibitors, because, based on clinical and prior knowledge, patients with higher TMB values have a significantly higher response rate to immune checkpoint inhibitors than patients with lower TMB [37,38]. According to the results of the tumor microenvironment in this study, tumor purity was significantly higher in the high-risk group than in the low-risk group. This suggests that the lower the amount of immune and stromal cells surrounding the tumor cells, the faster the tumor proliferates and progresses, and the worse the patient's prognosis may be. And this is relatively consistent with previous studies [20,21].

BioMed Research International
Consistent with previous studies, GSEA of TCGA cohort revealed that three oncogenic signatures gene sets, including "MTOR_UP.N4.V1_DN," were significantly enriched in the high-risk group, suggesting a crucial role in lung adenocarcinoma progression and metastasis of these gene sets. The present study found several JAK2-related genes in this signature. The JAK2-STAT signaling pathway is a cytokinestimulated signal transduction pathway that has been identified in recent years and is involved in critical biological processes such as cell proliferation, apoptosis, and immune regulation [39]. The JAK2/STAT3 signaling pathway has been reported to be closely associated with the development and metastasis of non-small-cell lung cancer [40].
There are several limitations of the present study. The first is that the model built in this study is based on TCGA database. Although its feasibility was validated in a GEO dataset, it needs to be approved in more independent data in the future. Secondly, there is limited clinical data in public databases, and the impact of factors such as surgery, chemotherapy, and neoadjuvant therapy on clinical survival was not taken into account. These could have some effect on the accuracy of the results. Thirdly, there is a need to conduct functional analysis of genes in RBP-related prognostic models from multiple perspectives in the future to explore their intrinsic mechanisms related to survival, so as to provide more accurate clinical guidance.

Conclusions
Based on the differentially expressed RBP gene in the tumor tissue, we constructed a stable prognostic model for lung adenocarcinoma using a novel gene pair approach. We successfully validated its accuracy and availability in another independent dataset. This is strong evidence of the critical role of RBP in tumorigenesis and development. Notably, the method is based on pairwise comparison and assignment of different gene expressions within a single sample, which reduces the process and complexity of data processing for different platforms or sources and, therefore, can better help clinical treatment and management in the future.

Data Availability
The data that support the findings of this study are openly available in TCGA at https://portal.gdc.cancer.gov/ and GEO database at http://www.ncbi.nlm.nih.gov/geo/.