Integrative Deep Learning for Identifying Differentially Expressed (DE) Biomarkers

As a large amount of genetic data are accumulated, an effective analytical method and a significant interpretation are required. Recently, various methods of machine learning have emerged to process genetic data. In addition, machine learning analysis tools using statistical models have been proposed. In this study, we propose adding an integrated layer to the deep learning structure, which would enable the effective analysis of genetic data and the discovery of significant biomarkers of diseases. We conducted a simulation study in order to compare the proposed method with metalogistic regression and meta-SVM methods. The objective function with lasso penalty is used for parameter estimation, and the Youden J index is used for model comparison. The simulation results indicate that the proposed method is more robust for the variance of the data than metalogistic regression and meta-SVM methods. We also conducted real data (breast cancer data (TCGA)) analysis. Based on the results of gene set enrichment analysis, we obtained that TCGA multiple omics data involve significantly enriched pathways which contain information related to breast cancer. Therefore, it is expected that the proposed method will be helpful to discover biomarkers.


Introduction
With the development of base sequence measurement tools, it has become possible to process a large amount of gene data at high speed. is has enabled the accumulation of large amounts of genetic data and facilitated the development of various analytical techniques and tools for analyzing such accumulated data. e use of high-level analysis techniques and tools is required to interpret large quantities of genetic data. For this reason, it is very important to analyze such genetic data using the most advanced computing methods and mathematical and statistical techniques available for quickly processing genetic big data.
Furthermore, it is important to discover the significant genes associated with diseases in various genetic data. Genetic big data contain sparse genes or proteins relating to the etiology of diseases, which sometimes could be difficult to identify. ese significant genes are called biomarkers. Biomarkers are indicators that could distinguish between normal and morbid conditions, predict and evaluate treatment responses, and objectively measure certain cancers or other diseases. Moreover, biomarkers could objectively assess the responses of drugs to normal biological processes, disease progress, and treatment methods. Some biomarkers also serve as disease identification markers that could detect early changes of health conditions.
In this paper, we propose the integrative deep learning for identifying biomarkers, a deep learning algorithm with a consolidation layer, and compare it with other machine learning methods based on a simulation along with real data (TCGA) analysis. Artificial neural networks (ANNs) are one of the main tools used in machine learning. Artificial neural networks (ANNs) are computing systems which are inspired by the biological neural networks of animal brains. An ANN consists of a set of processing elements, also known as neurons or nodes, which are interconnected [1]. Artificial neural networks (ANNs) which consist of an input layer, more than one hidden layers, and an output layer are called as deep neural networks. Training them is called as deep learning. In this study, we use a single hidden layer. Deep learning is widely applied in bioinformatics area. For example, Lee et al. [2] employed deep learning neural networks with features associated with binding sites to construct a DNA motif model. In addition, Khan et al. [3] developed a method of classifying cancers to specific diagnostic categories based on their gene expression signatures using artificial neural networks (ANNs).
In our method, the learning process proceeds in the following order: first, feedforward calculation is performed from the input layer to the output layer by using the weights in each layer. At this time, when the signal is passed from the input layer to the hidden layer and from the hidden layer to the output layer, the activation function is used to determine the intensity of the signal. e backpropagation algorithm is then used to reduce the difference between the output and actual values, starting from the output layer. e gradient descent optimization algorithm is used to modify the weights and minimize the errors. e feedforward and backpropagation algorithms are repeatedly carried out as many times as necessary for learning, and the learning is performed by updating the weights, which are the parameters used in each step. e algorithms are explained further in detail in Section 2.2.
Data analysis for single omics data is limited to correlation analysis, and it mostly represents the result of the reaction process rather than the cause process [4]. For this reason, in this study, we used integrated multiple omics data that integrate single omics data. As a method for omics data integration, the network biology approach emphasizes the interactions of genomic data such as genes and proteins. is provides a framework for data integration to analyze disease, and it is an approach to modeling genomic data [5]. In contrast to the integrated omics data, the size of each sample is limited for single omics data, and thus common markers were seldom found in studies on the same cancer. By integrating the data and increasing the sample size, more reliable biomarkers can be found than those found through the use of single omics data alone [6]. Various statistical methodologies are applied to analyze an integrated omics dataset, including the application of a group lasso penalty [7] and the proposed method for the lasso model [7].
Meta-analysis is another method that is useful for analyzing omics data. Meta-analysis is a method that involves objectively and quantitatively aggregating the results of many studies involving the same or similar subjects. Since life phenomena usually occur through interactions with other organs, the use of meta-analysis for omics data is very effective. A meta-analysis combines not only individual hypotheses but also the associated assumptions for significant results [8]. Furthermore, the MetaKTSP predictive model with ranking-based algorithms shows excellent performance in detecting biomarkers [9]. In addition, Kim et al. [10] proposed MetaPCA based on a statistical method.
Various methods of machine learning analyses have been proposed to analyze data accumulated in large quantities.
Machine learning analysis methods have widely been used to analyze data in a variety of fields of biology [11]. Moreover, the analytical methods of machine learning and related algorithms for processing such genetic big data have been developed. For example, see the support vector machine (SVM) [12], meta-SVM and metalogistic regression [7], and various machine learning models [13]. In addition, the MLSG (machine learning system genomics) approach, in which a machine learning method is combined with biological methods for analyzing the multiple integrated omics data, is more useful than an approach using single omics data alone [14]. e development of various machine learning methods has enabled the modeling of genes related to diseases, and they have obtained meaningful analysis results. Microarray and next generation sequencing (NGS) data are important for finding useful molecular patterns, and more studies on gene expression data are needed to identify the biomarkers associated with cancer [15]. Gene expression data could also be used to identify various diseases and potentially cancerous genes [16]. We could also see that gene expression data have a strong influence on identifying biomarkers [17].

Experimental Design.
RStudio was used for analysis along with R packages such as coda, MASS, foreach, iterators, parallel, doMC, e1071, MCMCpack, penalized, and glmnet. Deep learning, a method of machine learning, was used to classify gene expression data and other data related to breast cancer. A simulation study was conducted before analyzing the actual breast cancer data. We randomly generated data of 80 integrated layers, consisting of 16 signal genes (three in each of the two clusters) and 64 nonsignal genes. For the random study, we generated data with no signal gene.
e β values of the integrated layer for the nonsignal genes will be close to zero, whereas the β values of the signal genes will not be close to zero. In addition, several values of σ and λ, lasso penalty, were used during the data generation. As the λ value becomes very large, β values of all of the layers converge to zero. Finally, we obtained the value of the Youden J index using the actual values and the predicted values from the algorithm execution. e value of the Youden J index can be used to evaluate the performance of deep learning.

Structure.
It is common that the artificial neural networks (ANNs) are composed of fully connected layers through the architecture. Yet our proposed model, by extension, aims at not only prediction but also variable selection by means of penalization to particular gene modules. To this end, the model purposely accommodates the weights at the last layer, whose module counts are as the number of genes at the consolidating layers. e deep learning structure with a consolidation layer is constructed by adding a consolidation layer to the usual deep learning structure, as shown in Figure 1. Each consolidation layer combines the results from three genes of the datasets at the same position. e superscript is the gene number while the subscript is the dataset number. e initial value of each weight starts with 1.

Issues regarding Initial Weights.
In deep learning, it is important to set initial weight properly. ere is no guarantee that objective function is convex due to the nature of deep learning algorithm. Moreover, local minima might exist at several points. If we start from the arbitrary point, there is no guarantee that it converges to global optimum, and it even cannot converge anywhere. erefore, rather than arbitrary initial weight, we suggest more systemic method that sets initial weight: Make design matrix C (N×(K+1)) from generated K vectors and 1 vector for bias term. (4) Due to the nature of omics data, we often cannot fit the linear regression as the number of variables (p) might be greater than the number of samples (n). erefore, we fit the Ridge regression by generated design matrix C: (1) (5) Use β 0 for initial β 0 , (β 1 , . . . , β k ) for initial weights.

Feedforward
(1) Input to Hidden Layer. e values of the hidden layers are calculated from the input values as follows: first, let K be the number of genes in each dataset and let M k be the number of hidden nodes in the k-th gene. en, the values of the hidden layers can be calculated by where and σ * is ReLU (rectified linear unit) activation function. (2) Hidden to Consolidation. Using the values in (1), the values of the integrated layer are calculated by (3) Consolidation to Output. Using the values in (2), the output value is calculated by and the predicted value is calculated by (4) Objective Function. e objective function for parameter estimation is given by where We used the objective function with lasso penalty given by

Backpropagation
(1) β k . We calculate the first and second partial derivatives of R λ with respect to β k . e first partial derivative of R λ with respect to β k is given by e second partial derivative of R λ with respect to β k is calculated as follows. e second partial derivative of R λ with respect to β k is given by Since sign(β k ) is not differentiable at 0, we use a differential sigmoid function to approximate it. Let z(β) be a sigmoid function with scale parameter s given by where s is a small number, say 10 − 3 . Since as s ⟶ 0, z(β) ⟶ 1 for β > 0, and z(β) ⟶ − 1 for β > 0, we note that z(β) is approximately equal to sign(β) for small s. Since the first partial derivative of z(β) with respect to β is given by Computational and Mathematical Methods in Medicine and the second partial derivative of R λ with respect to β k is approximated by From the second partial derivative, the updated value of β k is given by In updating β k , we use the idea from the Newton-Raphson method. (2) β 0 . e first partial derivative of R λ with respect to β 0 is given by and the updated value of β 0 is given by (3) w k m . e first partial derivative of R λ with respect to w k m is given by and the updated value of w k m is given by and the updated value of v k pm is given by

Datasets.
e data generation was conducted as follows: MVN(0, σ 2 I), multivariate normal distribution with mean 0 and covariance matrix σ 2 I, was used to generate nonsignal genes, and MVN(μ, σ 2 Σ) was used to generate signal genes. 16 genes have signals, and the remaining 64 genes do not have signals. We generated three datasets and then put them in the input layer. When random study data were included, we generate datasets which have no signal genes. e three datasets for σ � 0.1 are shown in Figure 2. In addition, Figure 3 shows the heatmap for σ � 0.3 so that we can note that the heatmap for a larger σ has more noise. In light of simulated data, we randomly sampled training (70%) and testing (30%) data of three methods (i.e., meta-SVM, metalogistic regression, and integrative deep learning) in accordance with predetermined experiment designs and applied to the identical data to guarantee fair comparison.

Input layer
Hidden layer Output layer   Tables 1-3 show the simulation results of metalogistic regression, meta-SVM, and the integrative deep learning. Sensitivity is the correct classification rate for signal genes, specificity is the correct classification rate for nonsignal genes, and the Youden J index is sensitivity + specificity − 1. Table 1 shows the simulation results when no random study was included, and Tables 2 and 3 show the results when 1 and 2 random studies were included, respectively. Based on the Youden J index, integrative deep learning performs better than meta-SVM when the variance of the data is higher and performs better than metalogistic for all variances (σ) considered. Furthermore, the proposed method has more balanced values of sensitivity and specificity than metalogistic and meta-SVM. Table 1 shows the simulation results of the case with no random studies included. We noted that integrative deep learning is better than meta-SVM, except for when data are sampled with low variance (for σ � 0.1, 0.9859 for meta-SVM, and 0.5973 for integrative deep learning). Meta-SVM performs quite well on that condition, but the Youden J index decreases radically as the variance of the data (σ) increases (0.1597, 0.1491, 0.5100, and 0.0120 for meta-SVM; 0.4427, 0.4119, 0.3666, and 0.3427 for integrative deep learning). Metalogistic results in a low Youden J index due to the low sensitivity. In comparison, integrative deep learning has a good balance between sensitivity and specificity, and the Youden J index decreases relatively slowly. Tables 2 and 3 show the simulation results of the inclusion of one and two random studies, respectively. When random studies were included, the results are similar to those with no inclusion of a random study. In Table 2 (the inclusion of one random study), meta-SVM performs better than deep learning when the variance of data is low (0.8045 and 0.6345 for meta-SVM; 0.6052 and 0.4114 for integrative deep learning). However, the Youden J index decreases radically as the variance increases (0.3435, 0.0138, and 0.3354 for meta-SVM; 0.3697, 0.3375, and 0.3411 for integrative deep learning). Metalogistic still has a low Youden J index. In Table 3 (the inclusion of two random studies), integrative deep learning performs better than meta-SVM, except for when data are sampled with low variance (σ � 0.1), and it also performs better than metalogistic for all experiment scenarios. Meta-SVM even results in zero for certain values of σ. All together, integrative deep learning always performs better than metalogistic, yet, meta-SVM performs better, particularly when data are sampled with low variance. However, the results of meta-SVM lack stability for the variance of the data, as they decrease radically. By contrast, integrative deep learning Computational and Mathematical Methods in Medicine performs stably. Based on these results, we can identify that the integrative deep learning method is robust for the variance of the data. All simulations were repeated 30 times. is feature can be powerful for discovering significant biomarkers.

Tuning λ Values.
e magnitudes of the estimated signal β changes nonzero to zero values after some λ values, while the magnitudes of the estimated nonsignal β are approximately zero for all of the λ values as shown in Figure 4. In all cases, the estimated β values converge to zero as λ becomes   Computational and Mathematical Methods in Medicine increasingly large. We can find the optimal λ value that distinguishes between signal and nonsignal through cross validation.

Applications to Real Genomic Data
In this section, we apply integrative deep learning methods to real examples of breast cancer expression profiles provided by e Cancer Genome Atlas (TCGA) including mRNA, copy number variation (CNV), and epigenetic DNA methylation (http://cancergenome.nih.gov/; 300 samples of estrogen receptor binary outcome (i.e., ER+ and ER− )). ere are three types of data: mRNA, methylation, and CNV. We preprocessed the TCGA data according to Kim et al. [18]. e data obtained from TCGA data portal contain CNV for 23,235 genes, methylation levels of 22,529 probes, and mRNA expression levels for 17,814 genes. We filtered out genes with low-expressed (mean < 0.9) or noninformative (standard deviation < 0.85) features in the mRNA expression data, and thereby, we obtained 1,345 methylation probes and 828 CNV genes by matching 828 mRNA gene symbols. Table 4 presents the data descriptions. Each type of data consists of 234 controls and 66 cases for a total of 300 samples. We align three genomic data by the common cohort in the context of vertical integration. For methylation data, we selected only one gene data among the same gene data by leaving the one with the greatest IQR (interquartile range).

Results.
We applied gene set enrichment analysis to TCGA breast cancer data in order to determine whether our identified gene sets are consistent with the underlying biological pathways from the KEGG (2016) database (https:// www.genome.jp/kegg/pathway.html). e result of gene set enrichment analysis is described in Kim et al. [18] identified that TCGA multiple omics data are significantly enriched in the ABC transporter pathways, which is already well known to be correlated to breast cancer mechanisms and particularly related to estrogen receptors and drug resistance. Similarly, we found that our selected gene set enriched the ABC transporter pathways from the KEGG database. We also found that it enriched CAMs (cell adhesion molecules) pathways. According to Saadatmand et al. [19], CAM (cell adhesion molecule) pathways are known to play an important role in the process of metastasis. CAMs are a subset of proteins located on the cell surface. e major cause of breast cancer death is metastasis.
e prognostic values of the tumor expression of N-cadherin, E-cadherin, carcinoembryonic antigen (CEA), and epithelial CAM   Computational and Mathematical Methods in Medicine (Ep-CAM) were evaluated in patients with breast cancer. ere are four subfamilies of CAMs: cadherins, integrins, selectins, and immunoglobulins, such as carcinoembryonic antigen (CEA). N-cadherin and E-cadherin belong to cadherins, and CEA belongs to immunoglobulins. Ep-CAM is a type of CAM but does not belong to any of the four subfamilies mentioned above. Among these, combining E-cadherin and CEA tumor expression provides a prognostic parameter with high discriminative power that is a candidate tool for predicting prognosis in breast cancer. In addition, Li and Feng [20] identified CAMs in the paradigm of breast cancer. In breast cancer, the reduced expression of E-cadherin has been reported in approximately 50% of invasive ductal carcinomas, whereas invasive lobular carcinomas showed complete loss of E-cadherin expression in nearly 90% of cases and have also been shown to contribute to metastasis. Additionally, Li and Feng [20] stated that research on immunoglobulins in breast cancer has identified several members that are upregulated during cancer progression and that are potentially associated with an unfavorable prognosis and CEA comes under that. Overall, the results that we observe are consistent with existing biological truth.
us, integrative deep learning is found to be an efficient method for discovering significant biomarkers of disease.

Gene Networks.
NetBox is an analytic software well suited to detect connecting genes to a network, identifying statistically significant linker genes on the basis of four public data sources: NCI-Nature Pathway Interaction Database, Human Protein Reference Database, MSKCC Cancer Cell Map (http://www.mskcc.org/), and Reactome Pathway Database. Figure 5 shows the gene networks, which present the relationships among significant genes, via NetBox. e violet nodes are the selected linker genes out of the 59 genes listed in Table 5. e yellow nodes indicate linker genes that are not present in the original input list but are significantly connected to members of the input list. In gene networks, it is notable that the ESR1 gene is connected to many other genes. It appears that ESR1 plays an important role in these networks. According to Clatot et al. [21], ESR1 mutations are prominent in breast cancer. In particular, ESR1 mutations have recently emerged as a key mechanism of AIs (aromatase inhibitors) resistance in ER + metastatic breast cancer. Additionally, the ESRRG gene is also prominent in gene networks. Madhavan et al. [22] identified that ESRRG signaling is associated with poor distant metastasis-free survival in ER + as well as tamoxifen-treated breast cancer. Overall, our gene networks consist of genes related to breast cancer. We also construct gene networks by using software called String tool. (see Supplementary Figure S1).

Discussion
In order to predict the disease, it is crucial to identify genes related to the disease. In the analysis of such gene data, a machine learning method capable of processing genetic big data and statistical knowledge capable of interpreting these data are required. As technology advances, genomic data generation tools become more diverse and data generation speeds up faster. For this reason, a higher level of analysis is required. Moreover, it is generally known that combining multiple studies can improve statistical power and provide validated conclusions. In addition to metalogistic and meta-SVM, other methods to detect differentially expressed biomarkers have been devised. For example, Jia and Tseng [23] suggested an adaptively weighted (AW) statistic, and Song and Tseng [24] identified the rth ordered p value, rOP. By using meta-analysis-based methods, the statistical power (sensitivity) can be improved. In this study, we proposed an integrative deep learning method that adds a consolidating layer to the existing deep learning method. We also used the backpropagation algorithm to update the weights in the integrative deep learning. We applied the lasso penalty to the objective function for parameter estimation. In order to evaluate the performance of the proposed method, we conducted a simulation study to compare it with the performances of the meta-SVM and metalogistic regression based on the Youden J index. We observe that integrative deep learning is robust for the variance of data. Furthermore, integrative deep learning even performs well when there is noise in the datasets that do not have any signal gene among the three datasets. Generally, the simulation results of integrative deep learning are stable. We also conducted real data (TCGA) analysis. Jia and Tseng [23] mentioned that they only considered combining multiple microarray studies and that it can be extended to combinations of multiple genomic, epigenomic, and/ or proteomic studies. As we use the three different types of data, we accomplished the extended study. Based on the results of gene set enrichment analysis, we obtained that TCGA multiple omics data involve significantly enriched pathways which contain information related to breast cancer-like ABC transporter, CAMs. Overall, the results of real data analysis are consistent with existing biological truth. erefore, they show that the proposed method, the integrated deep learning, can discriminate signal genes from the nonsignal genes.
Disclosure e theme of this article is inspired by the author Jiyeon Kim's master thesis at Keimyung University.

Conflicts of Interest
e authors declare that there are no conflicts of interest regarding the publication of this manuscript.