Imputing missing RNA-sequencing data from DNA methylation by using a transfer learning–based neural network

Abstract
Background: Gene expression plays a key intermediate role in linking molecular features at the DNA level to phenotype. However, owing to various experimental limitations, RNA-seq data are missing in many samples for which high-quality DNA methylation data exist. Because DNA methylation is an important epigenetic modification that regulates gene expression, it can be used to predict RNA-seq data. Many methods have been developed for this purpose. A common limitation of these methods is that they mainly focus on a single cancer dataset and do not fully utilize information from large pan-cancer datasets.
Results: Here, we have developed a novel method, TDimpute, to impute missing gene expression data from DNA methylation data through a transfer learning–based neural network. In the method, the pan-cancer dataset from The Cancer Genome Atlas (TCGA) was utilized to train a general model, which was then fine-tuned on the specific cancer dataset. By testing on 16 cancer datasets, we found that our method significantly outperforms other state-of-the-art methods in imputation accuracy, with a 7–11% improvement under different missing rates. The imputed gene expression was further shown to be useful for downstream analyses, including the identification of both methylation-driving and prognosis-related genes, clustering analysis, and survival analysis on the TCGA dataset. More importantly, our method was shown to generalize in an independent test on the Wilms tumor dataset from the Therapeutically Applicable Research to Generate Effective Treatments (TARGET) project.
Conclusions: TDimpute is an effective method for RNA-seq imputation with limited training samples.

As an epigenetic modification, DNA methylation plays an important role in regulating gene expression. Integrative analysis of DNA methylation and gene expression can capture the associations between the two omics, and thus provides a comprehensive view of the molecular basis underlying cancers. However, one type of omics data is commonly missing due to various experimental limitations, which prevents downstream analyses that need a complete dataset. Imputation from one type of omics data to another is becoming important, but current methods mainly focus on a single cancer dataset with limited sample size, and are thus limited in their ability to capture information from large pan-cancer datasets. Here, we present TDimpute, a novel transfer learning–based neural network to impute missing gene expression data from DNA methylation data. In the method, the pan-cancer dataset from TCGA was utilized to train a general model for all cancers, which was then fine-tuned on the specific dataset for each cancer. By testing on 16 cancer datasets, we found that our method significantly outperforms other state-of-the-art methods in terms of imputation accuracy (7%-11% increase under different missing rates). The imputed gene expression was also validated to be useful for downstream analyses, including the identification of both DNA methylation-driving and prognosis-related genes, clustering analysis, and survival analysis. Our method was further validated on the Wilms tumor dataset from the TARGET project.
biological omics data such as genomics, transcriptomics, epigenetics, proteomics, and metabolomics for a single patient. Compared with single-omics analysis, integrative analysis of multi-omics data provides comprehensive insights into cancer occurrence and progression, and thus strengthens our ability to predict cancer prognosis and to discover biomarkers at various levels.
However, due to technical limitations of experimental settings or the high cost of acquiring omics data, most samples are not measured with all types of omics data and lack one or more entire omics types (called "block missing"). This problem is prevalent in publicly available multi-omics datasets, such as The Cancer Genome Atlas (TCGA). Since gene expression affects clinical outcome and phenotype more directly than molecular features at the DNA level (e.g., methylation and genetic variants) [1], we focused on imputing gene expression data from DNA methylation data.
When data are missing at random within a single omics dataset, many methods have been proposed to impute the missing values by exploiting the correlation structure among matrix entries, such as singular value decomposition (SVD) imputation and k-nearest neighbor (KNN) imputation [2].
However, these traditional methods may not be suitable when a whole set of features is missing. To address this issue, several methods have been specifically designed. Voillet
transfer learning is usually considered a promising method, in which parameters trained for a task with a large amount of data are reused as the initial parameters for a similar task with limited data [13]. Transfer learning has been widely used in computer vision, including object detection [14] and image segmentation [15].
For the omics data analysis of cancers, the transfer learning strategy has been applied to different tasks. Li
These results confirm that TDimpute succeeds in transferring related information from pan-cancer data to target cancer data.
TOBMI [5], and SVD imputation method [2]. The default or suggested parameters were used for these methods.

Comparisons on the imputation accuracy
As shown in Fig 2A, Lasso achieves similar but consistently lower RMSE than TOBMI, which indicates that penalized regression better captures the regulatory relationship between methylation and gene expression. SVD performs worse: its RMSE rises slowly from 1.06 to 1.10 as the missing rate changes from 10% to 70%, and then increases sharply to 1.24, which is even higher than that of the Mean method. Overall, the Mean method has the worst performance, which is consistent with the trend in the original paper [5].
By comparison,  over TDimpute-noTF. The RMSE by TDimpute is also 2%-5% lower than
When measured by the squared correlation ($R^2$) between the imputed and actual values for each sample (Fig 2B), TDimpute is consistently the best, followed by TDimpute-self. In contrast, SVD ranks third except at a missing rate of 90%, where it has the lowest $R^2$ of 0.909. The Mean imputation remains the worst. Hereafter, we focus on the comparison with the SVD, Lasso, and TOBMI methods.
The error bar shows the standard error of the mean.

Impact on the identification of prognosis-related genes
We investigated the recovery power of different imputation methods in identifying genes significantly related to prognosis. To evaluate the selected genes, we compared the genes identified from the imputed data with those identified from the actual data. Consistent with the imputation accuracy results, Tables 2 and S5.1 show that the TDimpute method achieves 2%-28% higher PR-AUC values and 4%-54% more overlapping genes than the TOBMI method. Lasso achieves lower values than TOBMI, except at a missing rate of 90%, where Lasso performs slightly better than TOBMI.
We also investigated the enrichment, relative to random expectation, of the top 100 genes (ranked by p-values) that overlap with the prognosis-related gene list downloaded from The Human Protein Atlas [20]. Table 3 demonstrates that TDimpute achieves the largest enrichment factors (see Methods section for definition), indicating its ability to identify validated prognosis-related genes.

Impact on the performance of clustering analysis and survival analysis
We also evaluated the effects of different imputation methods on clustering analysis and survival analysis. Using the top 100 prognosis-related genes as input, the K-means algorithm is used to divide the samples into two clusters. The adjusted Rand index (ARI), which evaluates the concordance between the clusters obtained from the imputed and actual data, is shown in Fig 4A. For all methods, accuracy decreases with increasing missing rates, which is consistent with the previous study [10]. As expected, TDimpute consistently achieves the highest clustering concordance among the five imputation methods under different missing rates.
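The concordance measure used here can be illustrated with a minimal sketch of the adjusted Rand index in plain Python. The function name `adjusted_rand_index` and the toy cluster labels are hypothetical and chosen only for illustration; they are not taken from the TDimpute code base.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two cluster labelings of the same samples."""
    n = len(labels_true)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # fewer than two samples: the labelings trivially agree

    # Pairs of samples placed together in both labelings, in the first, in the second
    sum_joint = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())

    expected = sum_true * sum_pred / total_pairs  # agreement expected by chance
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. a single cluster in both labelings
    return (sum_joint - expected) / (max_index - expected)


# Toy example (assumed labels): clusters from actual data vs. from imputed data
clusters_actual = [0, 0, 0, 1, 1, 1]
clusters_imputed = [1, 1, 0, 0, 0, 0]  # label names differ, one sample misassigned
print(adjusted_rand_index(clusters_actual, clusters_imputed))
```

Because the index only counts whether pairs of samples are grouped together, it is invariant to how the two clusters are named, which is why it suits comparing independent K-means runs.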
A further survival analysis (Fig 4B) shows that TDimpute achieves the best C-index, followed by TDimpute-self, SVD, TOBMI, and Lasso. Despite its worse imputation accuracy, SVD performs better than TOBMI on this evaluation metric. In addition, the C-indices of TDimpute, TDimpute-self, and SVD are relatively robust to the missing rate compared with that of TOBMI, which shows a 9% decrease in C-index when 90% of samples are missing gene expression values.
For all the experiments above, the results per cancer dataset are detailed in S1-S5 Figs and S2-S6 Tables.

Independent test on Wilms tumor from the TARGET dataset
Our method was further tested on the TARGET dataset, which comes from an independent source. We constructed a model by fine-tuning the TCGA pan-cancer model on 59 randomly selected samples (50% of the dataset) with the same hyper-parameters optimized in the TCGA experiments. As expected, TDimpute achieves the lowest RMSE of 0.955 (TDimpute-self: 0.98; SVD: 1.064; Lasso: 1.006; TOBMI: 1.018). The K-means method is used to cluster the 118 samples after imputation, and the two resulting clusters are
Table S1.
$$h_l = f(W_l h_{l-1} + b_l)$$
where $h_{l-1}$ denotes the output of the previous layer $l-1$, $f(\cdot)$ is the activation function such as the sigmoid and ReLU functions, and $W_l$ and $b_l$ are the weight matrix and bias vector, respectively. $W_l$ and $b_l$ are parameters that need to be learned.
The loss function for training is the root mean squared error (RMSE):
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$
where $y_i$ and $\hat{y}_i$ are the experimentally measured and predicted expression values for gene $i$, and $n$ is the dimension of the output vector (i.e., the number of genes). The network can be considered as a highly nonlinear regression function that maps DNA methylation data (input) to gene expression data (output).
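The layer computation and the RMSE loss described above can be sketched in a few lines of plain Python. This is only an illustrative toy: the layer sizes, weights, and the choice of a sigmoid hidden layer with a linear output layer are assumptions, not the architecture or parameters used by TDimpute.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def identity(x):
    return x


def dense_layer(inputs, weights, biases, activation):
    """One fully connected layer: activation(W * inputs + b), one output per weight row."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def rmse(measured, predicted):
    """Root mean squared error over the output dimensions (genes)."""
    n = len(measured)
    return math.sqrt(sum((m - p) ** 2 for m, p in zip(measured, predicted)) / n)


# Toy network (assumed sizes): 3 methylation features -> 2 hidden units -> 2 genes
methylation = [0.2, 0.8, 0.5]
hidden = dense_layer(methylation, [[0.5, -0.2, 0.1], [0.3, 0.4, -0.6]], [0.0, 0.1], sigmoid)
predicted_expression = dense_layer(hidden, [[1.0, -1.0], [0.5, 0.5]], [0.2, -0.1], identity)

measured_expression = [0.1, 0.5]  # made-up values for illustration
print(rmse(measured_expression, predicted_expression))
```

Training would adjust the weights and biases to reduce this loss; the sketch only shows the forward mapping from methylation input to expression output and how the loss is evaluated.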

Transfer learning-based models.
To train the prediction model for one target cancer in the TCGA, the datasets of the other cancer types are combined to train a multi-cancer model, which is then fine-tuned on the target cancer data (Fig 1). The data of the target cancer were excluded from training the multi-cancer model because different portions of the target cancer data need to be removed to evaluate our imputation model.
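The pretrain-then-fine-tune procedure can be illustrated with a deliberately small sketch. It replaces the neural network with a one-variable linear model trained by gradient descent, and all data, names, learning rates, and epoch counts are hypothetical; it only shows the idea of reusing parameters learned on the other cancers as the starting point for the target cancer.

```python
def train_linear(xs, ys, w=0.0, b=0.0, lr=0.1, epochs=20):
    """Fit y = w*x + b by gradient descent on mean squared error, starting from (w, b)."""
    n = len(xs)
    for _ in range(epochs):
        residuals = [w * x + b - y for x, y in zip(xs, ys)]
        grad_w = 2.0 * sum(r * x for r, x in zip(residuals, xs)) / n
        grad_b = 2.0 * sum(residuals) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def mse(xs, ys, w, b):
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical "other cancers" data: many samples, target cancer excluded
source_x = [i / 10 for i in range(11)]
source_y = [-2.0 * x + 3.0 for x in source_x]

# Hypothetical target cancer: related but slightly different relationship, few samples
target_train_x = [0.1, 0.5, 0.9]
target_train_y = [-2.2 * x + 3.1 for x in target_train_x]
target_test_x = [0.3, 0.7]
target_test_y = [-2.2 * x + 3.1 for x in target_test_x]

# Step 1: train the general model on the combined source data
pre_w, pre_b = train_linear(source_x, source_y, epochs=2000)

# Step 2: fine-tune on the target cancer, initialized from the pretrained parameters
ft_w, ft_b = train_linear(target_train_x, target_train_y, w=pre_w, b=pre_b, epochs=20)

# Baseline: same training budget on the target cancer, but from scratch
sc_w, sc_b = train_linear(target_train_x, target_train_y, epochs=20)

finetuned_mse = mse(target_test_x, target_test_y, ft_w, ft_b)
scratch_mse = mse(target_test_x, target_test_y, sc_w, sc_b)
print("fine-tuned:", finetuned_mse, "from scratch:", scratch_mse)
```

With only three target samples and a short training budget, the model initialized from the pretrained parameters reaches a lower held-out error than the one trained from scratch in this toy setting.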

Preservation of methylation-expression correlations and methylation-driving genes
Here

Impact on clustering analysis and survival analysis
We evaluated the association of genes with cancer survival using the p-values from the univariate Cox model. The expression values of the top 100 genes were used to divide the samples into two clusters by K-means. The clustering performance was assessed by the adjusted Rand index (ARI), which measures the agreement between the predicted cluster labels (on the imputed dataset) and the true cluster labels (on the original full dataset). We further performed survival prediction with the significantly related genes (p ≤ 0.05) using the ridge-regularized Cox model.
Here, the glmnet package [32] in R, which is suitable for fitting regression models with high-dimensional data, was used for model construction.
The performance of the Cox model was assessed by Harrell's concordance index (C-index), which measures the concordance between predicted survival risks and actual survival times. We used 5-fold cross-validation (CV) to evaluate the performance.
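As an illustration of the evaluation metric, the following is a minimal sketch of Harrell's C-index in plain Python. It is not the glmnet-based pipeline used in the study; the function name, the toy data, and the simplification of skipping pairs with tied survival times are assumptions made for this example.

```python
def concordance_index(times, events, risks):
    """Harrell's C-index: fraction of comparable pairs ordered correctly by predicted risk.

    times  - observed survival or censoring times
    events - 1 if the event (death) was observed, 0 if censored
    risks  - predicted risk scores (higher means shorter expected survival)
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored sample cannot be the earlier member of a pair
        for j in range(n):
            if times[i] < times[j]:  # pairs with tied times are skipped (simplification)
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy example with made-up values: the third patient is censored
toy_times = [5, 3, 9, 7]
toy_events = [1, 1, 0, 1]
toy_risks = [0.6, 0.9, 0.1, 0.4]
print(concordance_index(toy_times, toy_events, toy_risks))
```

A value of 0.5 corresponds to random risk ordering and 1.0 to perfect ordering, so in a cross-validation setting the index would be computed on each held-out fold and averaged.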

Availability of source code and pretrained model
All code and pretrained pan-cancer models are available on GitHub: https://github.com/sysu-yanglab/TDimpute.

The results are averaged over 5 random replicas. Best results are highlighted in boldface. * indicates statistical significance (p-value < 0.05) between TDimpute and other methods.