Prediction of competing endogenous RNA coexpression network as prognostic markers in AML

Recently, competing endogenous RNAs (ceRNAs) hypothesis has gained a great interest in the study of molecular biological mechanisms of cancer occurrence and progression. However, studies on leukemia are limited, and there is still a lack of comprehensive analysis of lncRNA-miRNA-mRNA ceRNA regulatory network of AML based on high-throughput sequencing and large-scale sample size. We obtained RNA-Seq data and compared the expression profiles between 407 normal whole blood (GTEx) and 151 bone marrows of AML (TCGA). The similarity between two sets of genes with trait in the network was analyzed by weighted correlation network analysis (WGCNA). MiRcode, starBase, miRTarBase, miRDB and TargetScan was used to predict interactions between lncRNAs, miRNAs and target mRNAs. At last, we identified 108 lncRNAs, 10 miRNAs and 8 mRNAs to construct a lncRNA-miRNA-mRNA ceRNA network, which might act as prognostic biomarkers of AML. Among the network, a survival model with 8 target mRNAs (HOXA9+INSR+KRIT1+MYB+SPRY2+UBE2V1+WEE1+ZNF711) was set up by univariate and multivariate cox proportional hazard regression analysis, of which the AUC was 0.831, indicating its sensitivity and specificity in AML prognostic prediction. CeRNA networks could provide further insight into the study on gene regulation and AML prognosis.

AGING tumors, such as liver, gastric, breast, colon, pancreatic and bladder cancer.
In chronic myeloid leukemia (CML), lncRNA SNHG5 promoted imatinib resistance via acting as a ceRNA against miR-205-5p [7]. LncRNA UCA1 was also identified as an important modulator of MDR1 to promote imatinib resistance through completely binding miR-16 [8]. In AML, lncRNA NEAT1 modulated cell proliferation and apoptosis by regulating miR-23a-3p/SMC1A [9]. LncRNA UCA1 contributed to the chemoresistance, through activating glycolysis by the miR-125a/HK2 pathway [10]. In addition, aberrant upregulation of CCAT1 was detected in French-American-British (FAB) M4 and M5 subtypes of AML patients. CCAT1 repressed monocytic differentiation and promoted cell growth by upregulating c-Myc via its ceRNA activity on miR-155 [11]. Sen et al. explored the major cross-talking edges of ceRNA networks in CML and AML utilizing patient sample data, which shed light on progression and prognosis of leukemia [12]. Therefore, studies have showed that the lncRNA-miRNA-mRNA ceRNA regulatory network is implicated in the leukemia development. However, studies on leukemia are limited, and there is still a lack of comprehensive analysis of lncRNAs, miRNAs and mRNAs related to AML based on high-throughput sequencing and large-scale sample size.
In this study, we obtained RNA-Seq data and compared the expression profiles between 151 bone marrows (BMs) of AML (The Cancer Genome Atlas, TCGA) [13] and 407 normal whole blood (Genotype-Tissue Expression, GTEx) [14,15]. Following, mRNAs and lncRNAs between the normal samples and AML patients were applied to weighted correlation network analysis (WGCNA) to enrich modules which were most related with AML [16]. And, miRNA database was used to predict target mRNA. Finally, we identified 108 lncRNAs, 10 miRNAs and 8 mRNAs to construct a lncRNA-miRNA-mRNA ceRNA network. Among the network, a survival model with 8 target mRNAs (HOXA9 +INSR+KRIT1+MYB+SPRY2+UBE2V1+WEE1+ZNF7 11) was set up for predicting AML prognosis.

Different gene expression from data between TCGA and GTEx is analyzed
The expression levels of RNAs in 151 bone marrow samples with AML and 407 normal whole blood samples were explored. The clinicopathological and molecular characteristics of AML patients were shown in Table 1 and Table 2. All gene read counts were normalized to the trimmed mean of M values (TMM) by edgeR. We found that 2667 significantly up-regulated mRNAs and 2456 down-regulated mRNAs were identified. Figure 1A showed the distribution of all the significantly different expressed mRNAs on the two dimensions of -log10 (false discovery rate, FDR) and log2 (fold change, FC) through a volcano map. The gene modules in the network are often enriched with specific functions, which are of biological significance. To test the biological function of the identified genes, information from differentially expressed genes was applied to Gene Ontology (GO) analysis. Up-regulated mRNAs were enriched in organelle fission, nuclear division and pattern specification process in biological process (BP) ( Figure  1B). Figure 1C showed the gene symbols and their interactions in BP of up-regulated mRNAs. Moreover, cell cycle, fanconi anemia pathway and homologous recombination related genes were up-regulated while hematopoietic cell lineage, natural killer cell mediated cytotoxicity, necroptosis and NOD-like receptor signaling pathways were downregulated by Kyoto Encyclopedia of Genes and Genomes (KEGG)-Gene Set Enrichment Analysis (GSEA) ( Figure 1D).

WGCNA is applied to analyze gene modules
Gene modules were analyzed using the WGCNA among the first 40% mRNAs by variance comparison. As shown in Figure 2A, softpower 7 and module size cut-off 25 were chosen as the threshold to identify coexpressed gene modules. 19 gene color modules were identified and the heatmap plot of topological overlap matrix (TOM) was shown in Figure 2B. Then, genes in the 19 color modules were continuously used to analyze the module-trait (AML and normal) coexpression similarity and adjacency. Cyan module and turquoise module showed high relationship with AML, which included 1659 mRNAs ( Figure 2C). These 1659 mRNAs were further used to GO-GSEA to display the gene enrichment, gene symbols and their interactions in BP, as shown in Figure 2D and 2E. The genes were most related to embryo development, reproductive process and reproduction. In addition, genes were highly enriched in cell cycle, transcriptional misregulation in cancer, ubiquitin mediated proteolysis and RNA transport by KEGG analysis ( Figure 2F).

LncRNAs modules are analyzed by WGCNA
Next, we continued to investigate coexpression network of lncRNAs. LncRNA modules were analyzed by WGCNA among the first 60% lncRNAs by variance comparison. As shown in Figure 3A, softpower 6 was chosen as the threshold and we identified 8 coexpressed lncRNA modules. Correlation analysis showed that turquoise module displayed highest relationship with AML ( Figure 3B and 3C; r=0.98). The numbers of lncRNAs in every module were shown in Figure 3D. The  turquoise module contained the highest numbers (2662) of lncRNAs. We then used miRcode to predict the miRNAs sponged by 2662 lncRNAs to obtain lncRNAs-miRcode-miRNAs relationship. Meanwhile, we used TCGA miRNA-Seq to analyze the first 400 miRNA with highest expression. Then the overlapped miRNAs between 400 miRNAs and lncRNAs-miRcode-miRNAs (155) were selected to obtain lncRNAs-miRNAs (47). We further explored and obtained 1710 predicted target mRNAs by starBase, miRDB, miRTarBase and Targetscan dataset, which might be bound by 47 miRNAs ( Figure 3E). Importantly, as shown in Figure 3F, we chose the overlapped target mRNAs by analyzing the predicted target mRNAs (1710), WGCNA-turquoise-cyan mRNAs (1659), as well as the significant differentially upregulated mRNAs (2667) and down-regulated mRNAs (2456) by edgeR. Lastly, we got 111 up-regulated mRNAs and 9 down-regulated mRNAs (Supplementary Table 1). The expression of these 120 genes in 558 samples was shown in Figure 3G by heatmap.

Cox regression analysis is conducted to clarify the patients' survival
Next, a univariate cox proportional hazard regression analysis was conducted to clarify the association of the expression levels of 120 genes with overall survival (OS). 22 genes were obtained by the threshold of p value <0.05 and gene ID <15000 (NCBI). The above mentioned 22 genes were brought into further multivariate cox proportional hazard regression analysis (Table 3). We then set up a survival model for 3-year OS with 8 genes: HOXA9+INSR+KRIT1+MYB+ SPRY2+UBE2V1+WEE1+ZNF711. We showed that HOXA9, INSR, KRIT1, MYB, SPRY2, WEE1 and ZNF711 were up-regulated while UBE2V1 was downregulated in AML patients ( Figure 4A). The correlationship of each gene in the 8-genes model was shown in Figure 4B and 4C. The patients from TCGA were classified into predicted low and high risk groups according to the multivariate cox score result in Figure  4D. Furthermore, the expression heatmap of the 8 genes in high risk or low risk group was shown in Figure 4E. We then estimated the accuracy of the 8-genes signature on predicting survival. Kaplan-Meier survival curves showed that patients with predicted high risk (n=75) had significantly shorter OS than those with low risk (n=76, p=0.00, Figure 4F). Receiver operating characteristic (ROC) analysis to compare the sensitivity and specificity of the survival prediction of our models was performed. TCGA dataset revealed that the area under receiver operating characteristic curve (AUC) of the 8genes signature was 0.831. Previous reports showed that gene mutation was correlated with the prognosis of AML [17]. Thus, we divided the patients into groups according to gene mutations and we found that the 8genes signature worked well in DNMT3A, FLT3 or RAS mutation, as well as NPM1 wildtype patient subgroups (Supplementary Figure 1).   Figure 5A). Since TCGA and GTEx also provided the data of lncRNAs, the differentially expressed lncRNAs were also analyzed by edgeR. 2412 up-regulated lncRNAs and 788 down-regulated lncRNAs were identified. Then these 3200 lncRNAs were overlapped with the lncRNAs (174) predicted from 10 miRNAs, and we got 108 lncRNAs ( Figure 5B). At last, a lncRNA-miRNA-mRNA ceRNA network was constructed by 108 lncRNAs, 10 miRNAs and 8 mRNAs, as shown in Figure  5C. AGING DISCUSSION Important advance in ceRNA network research developed rapidly, suggesting that the involvement of ceRNA network in human diseases, especially tumors, could be far more prevalent. The disruption of the equilibrium of ceRNA network was critical for tumorigenesis. Thus, understanding the intricate interplay among diverse ceRNA network will lead to significant insight into gene regulatory networks and have implications in cancer treatment [2]. Here, we identified 108 lncRNAs, 10 miRNAs and 8 mRNAs to construct a lncRNA-miRNA-mRNA ceRNA network by database. Given the fact there was no large-scale of public RNA-Seq database or studies with normal BMs, further validation of normal BMs by large cohorts are needed.
In lung cancer, lncRNA BARD1 9'L, transcribed from an alternative promoter in intron 9 of the BARD1 gene and shared part of the 3'UTR with the protein coding BARD1 mRNAs, counteracted the effect of miR-203 and miR-101, to promote tumor development [18]. LncRNA HOTAIR, functioned as a ceRNA, sponging miR-331-3p to derepress HER2, which was correlated with advanced gastric cancers [19]. SNHG7, whose high expression was correlated with poor prognosis, acted as a target of miR-34a to increase GALNT7 level and regulate PI3K/Akt/mTOR pathway in colorectal cancer progression [20]. Thus, ceRNA network displayed essential role in cancer progress and provided potent targets for cancer therapy.
In the present study, the significantly different expression levels of mRNAs in AML were calculated (Figure 1). Importantly, 120 overlapped genes were obtained from the predicted target mRNAs, WGCNA-turquoise-cyan mRNAs, as well as the significantly different up-regulated mRNAs and down-regulated mRNAs (Figure 2 and 3).
To further investigate the relationships of these 120 genes with prognosis, univariate and multivariate cox proportional hazard regression analysis were applied. Then a survival model for 3-year OS with 8 genes: HOXA9+INSR+KRIT1+MYB+SPRY2+UBE2V1+WEE 1+ZNF711, was set up (Figure 4). Finally, a ceRNA network was constructed by 108 lncRNAs, 10 miRNAs and 8 mRNAs ( Figure 5), which could act as biomarkers based on the patients' prognosis.
Among the 8 target genes, HOXA9, WEE1 and MYB had been demonstrated to be essential in leukemogenesis and disease process. HOXA9 had an important role in hematopoietic stem cell expansion, of which aberrant expression was a prominent feature of AML driven by diverse oncogenes. With continued study in HOXA9mediated AML, there was a wealth of opportunity for developing novel therapeutics applicable for AML with HOXA9 overexpression [24]. MiR-182 was reported to regulate percentage of myeloid and erythroid cells in CML [25]. Thus, the relationship between HOXA9 and miR-182 needed to be investigated in AML as predicted. WEE1 kinase was crucial in the G2-M cell-cycle checkpoint arrest for DNA repair before mitotic entry. WEE1 was expressed at high levels in various cancer types including leukemia and was a validated target of the miR-17-92 cluster in leukemia [26], giving support to our prediction of miR-17-WEE1 axis in AML. MLL fusion proteins negatively regulated miR-150 production, and forced expression of miR-150 inhibited leukemic cell growth and delayed MLL-fusion-mediated leukemogenesis likely by targeting MYB, suggesting a miR-150-regulated MYB signaling underlying the pathogenesis of leukemia [27]. MiR-21 was considered to be an important miRNA, which was frequently elevated in all types of myeloid leukemia, while lncRNA MEG3 inhibited proliferation of CML cells by sponging MiR-21 [28]. Primary FLT3-ITD + AML clinical samples had significantly higher miR-155 levels compared with FLT3 wild-type AML samples. MiR-155 collaborated with FLT3-ITD to promote myeloid cell expansion in vivo [29]. Besides, miR-106, miR-195, miR-424, miR-454 and miR-497 were all involved in the disease process of leukemia or solid tumors [30][31][32][33][34]. Therefore, many AGING previous studies had given great experimental support to our prediction of the ceRNA network. Pivotally, Kaplan-Meier survival curves of our predicted model showed that patients with predicted high risk had significantly shorter OS time than those with low risk. Although the studies of lncRNAs in AML were limited, these predicted lncRNAs provided novel pathways or networks to study the function of 8-genes survival model in AML development and treatment.
This study defines ceRNA network from multiple dimensions, and provides possible prognostic markers for predicting patient outcome, which will help to increase our comprehension about ceRNA networkmediated leukemogenesis. Via this study, a novel perspective will be produced to make clear leukemia mechanisms and suggest approaches to regulate ceRNA networks for leukemia therapeutics.

TCGA RNA sequence dataset
The RNA sequence data of 151 BMs with AML (Hematopoietic and reticuloendothelial systems) were retrieved from TCGA data repository (https://portal.gdc.cancer.gov/), which were derived from IlluminaHiSeq RNA-Seq platform. RNA-Seq data, miRNA-Seq and clinical data such as patient survival time and FAB classification information were obtained from TCGA.

GTEx RNA sequence dataset
All data of normal tissue samples were obtained from 407 whole blood in GTEx V7 release version (https://gtexportal.org/home/datasets). Complete description of the donor genders, multiple ethnicity groups, wide age range, the biospecimen procurement methods and sample fixation were described in GTEx official annotation.

Identification of differentially expressed genes
The ensemble ID of samples was converted by using GENCODE Gene Set-11.2017 version. LncRNAs and mRNAs ensemble ID that was not included in the GENCODE database were excluded. R package (edgeR) was used to identify significant differentially expressed genes in AML and normal samples. All q values use FDR to correct the statistical significance of the multiple test. Absolute log2FC ≥ 2 and FDR < 0.05 were considered significant [35][36][37].
For the obtained differentially expressed mRNAs, we generated volcano map using the ggplot2 packages in the R platform.

Gene Ontology, Kyoto Encyclopedia of Genes and Genomes, and Gene Set Enrichment Analysis
ClusterProfiler was used for GO, KEGG and GSEA [38][39][40]. GO was used to describe gene functions along three aspects: biological process (BP), cellular component (CC) and molecular function (MF). The KEGG-GSEA was searched for pathways at the significance level set at p<0.05.

Weighted correlation network analysis
WGCNA was an algorithm used in gene coexpression network identification by high-throughput expression profiles mRNAs or lncRNAs with different traits. Weighted coexpression relationship among all dataset subjects in an adjacency matrix was assessed using the pairwise Pearson correlation analysis. In this study, WGCNA was used to analyze mRNAs and lncRNAs to obtain the most related mRNAs or lncRNAs with AML patients.

Cox regression analysis
A univariate cox proportional hazards regression analysis was employed to identify the relationship between the expression level of mRNAs and patient's OS. Thereafter, multivariate cox analysis was employed to evaluate the contribution of the selected genes. The analysis was conducted using the R package of survival.