Gene shaving using a sensitivity analysis of kernel based machine learning approach, with applications to cancer data

Background
Gene shaving (GS) is an essential and challenging tool for biomedical researchers due to the large number of genes in the human genome and the complex nature of biological networks. Most GS methods are not applicable to non-linear and multi-view data sets. Positive definite kernel based methods can overcome the non-linearity problem and improve the prediction process, but a well-founded positive definite kernel based GS method has yet to be proposed for biomedical data analysis.

Methods and findings
Since kernel based methods on genomic information can improve the prediction of diseases, we propose a novel method, "kernel based gene shaving", which is based on the influence function of kernel canonical correlation analysis. To compare the performance of the proposed method with state-of-the-art methods in gene shaving, we analyzed extensive simulated and real microarray gene expression data sets. Performance metrics, including the true positive rate, true negative rate, false positive rate, false negative rate, misclassification error rate, false discovery rate and area under the curve, were computed for each method. In the colon cancer data analysis, the proposed method identified a significant subset of 210 genes out of 2000 and showed superior performance compared with the other methods. The proposed method can be applied to the study of other disease processes where two-view data are a common task.

Conclusions
We addressed the challenge of constructing a well-founded kernel based GS method by using the influence function of kernel canonical correlation analysis. The proposed method has been shown to perform better than state-of-the-art methods in gene shaving and has identified many more significant gene interactions, suggesting that genes function in a concerted effort in colon cancer. In similar biomedical data analyses, kernel based methods could be applied to select a potential subset of genes.

Introduction
Gene shaving (GS), the identification of significant subsets of genes, is an important research area in the analysis of DNA microarray gene expression data for biomedical discovery. GS methods aim to remove redundant and irrelevant genes so that subsequent supervised learning is more accurate [1,2]. Unlike hierarchical clustering and other methods widely used for analyzing gene expression in genome-wide association studies, GS leads to gene discovery relevant for a specific target annotation and so contributes to better medical diagnosis and prognosis. The genes selected by GS play an important role in gene expression data analysis since they can differentiate samples from different populations [3][4][5][6]. Despite their successes, these studies are often hampered by relatively low reproducibility, nonlinearity and multi-view data.
The incorporation of statistical machine learning approaches into genomic analysis is a rather recent area of study. Since large-scale microarray data present significant challenges for statistical analysis, advanced methods are needed in addition to the classical approaches. Kernel methods (methods based on a positive definite kernel) are appropriate tools for such data sets: they map the data from the input space to a high dimensional feature space using a nonlinear feature map. The main advantage of these methods is that they combine statistics and geometry in an effective way [7][8][9]. As a machine learning approach, kernel canonical correlation analysis (kernel CCA) has been studied extensively for decades for analyzing multi-view data sets [10][11][12]. Using the influence function (IF) of kernel CCA, we propose a novel kernel method to select a significant subset of genes in biomedical data analysis.
Nowadays, IF based methods (e.g., sensitivity analysis) are used to detect influential observations. The IF is used to find the set of vectors that have a much greater effect than others on the estimator of a parameter [13]. A visualization method for detecting influential observations using the IF of kernel principal component analysis has been proposed by Debruyne et al. [14]. Filzmoser et al. also developed a method for outlier identification in high dimensions [15]. However, these methods are limited to single-view data sets. Owing to the properties of the eigendecomposition, kernel CCA and its variants remain widely used methods for biomedical data analysis [16][17][18].
The contribution of this paper is three-fold. First, we address the IF of kernel CCA. Second, we use distribution based methods to confirm the influential observations. Finally, the proposed method is applied to identify a set of genes in both synthesized and real gene expression data. The proposed method shows superior accuracy compared to the state-of-the-art methods in gene shaving based on the area under the curve (AUC). In the colon cancer data analysis, we used the proposed method to identify genes and performed pathway analysis [the gene ontology (GO) biological process categories and the Kyoto Encyclopedia of Genes and Genomes (KEGG)] and gene-gene interaction network analysis. We found that the identified genes function in a concerted effort and have biological relevance to colon cancer. In addition, classification based on the selected genes is superior to classification based on genes selected by other methods, as well as to classification using all genes. For any biomedical data analysis, the proposed method could be applied to select a potential subset of genes.
The remainder of the paper is organized as follows. In the materials and methods section, we provide a brief review of positive definite kernels, kernel CCA and the IF of kernel CCA. The utility of the proposed method is demonstrated with both simulated data and real data from a colon cancer study in the experimental results section. In the discussion section, we summarize our findings and give a perspective on future research.

Positive definite kernel
In kernel methods, a nonlinear feature map is defined by a positive definite kernel. It is known that a positive definite kernel $k$ is associated with a Hilbert space $\mathcal{H}$, called the reproducing kernel Hilbert space (RKHS), consisting of functions on $\mathcal{X}$ such that function values are reproduced by the kernel [19]. For any function $f \in \mathcal{H}$ and a point $X \in \mathcal{X}$, the function value is $f(X) = \langle f(\cdot), k(\cdot, X)\rangle_{\mathcal{H}}$, where $\langle \cdot, \cdot \rangle_{\mathcal{H}}$ is the inner product of $\mathcal{H}$; this is called the reproducing property. Replacing $f$ with $k(\cdot, \tilde{X})$ yields $k(\tilde{X}, X) = \langle k(\cdot, \tilde{X}), k(\cdot, X)\rangle_{\mathcal{H}}$ for any $X, \tilde{X} \in \mathcal{X}$. A symmetric kernel $k(\cdot, \cdot)$ defined on a space $\mathcal{X}$ is called positive definite if, for an arbitrary number of points $X_1, X_2, \ldots, X_n \in \mathcal{X}$, the Gram matrix $(k(X_i, X_j))_{ij}$ is positive semi-definite. To transform data for extracting nonlinear features, the mapping $\Phi: \mathcal{X} \to \mathcal{H}$ is defined as $\Phi(X) = k(\cdot, X)$, which is a function of the first argument. This map is called the feature map, and the vector $\Phi(X)$ in $\mathcal{H}$ is called the feature vector. The inner product of two feature vectors is then $\langle \Phi(X), \Phi(\tilde{X})\rangle_{\mathcal{H}} = k(X, \tilde{X})$. This is known as the kernel trick: by this trick the kernel evaluates the inner product of any two feature vectors efficiently without knowing an explicit form of $\Phi(\cdot)$ [7][8][9].
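To make the kernel trick concrete, the following minimal Python sketch (illustrative only; the analyses in this paper were carried out in R) builds the Gaussian Gram matrix for a small sample and verifies that it is symmetric and positive semi-definite, as the definition above requires:

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    """Positive definite Gaussian kernel k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    diff = np.asarray(x, float) - np.asarray(y, float)
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))

def gram_matrix(X, sigma=1.0):
    """Gram matrix (k(X_i, X_j))_ij for the rows X_i of X."""
    n = X.shape[0]
    K = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = gaussian_kernel(X[i], X[j], sigma)
    return K

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))
K = gram_matrix(X)

# Positive semi-definiteness: all eigenvalues are non-negative
# (up to floating point precision).
eigvals = np.linalg.eigvalsh(K)
```

Note that only kernel evaluations are needed; the infinite dimensional feature vectors $\Phi(X_i)$ are never formed explicitly.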

Kernel canonical correlation analysis
Kernel CCA has been proposed as a nonlinear extension of linear CCA [10]. Researchers have extended the standard kernel CCA with efficient computational algorithms [20], and over the last decade kernel CCA has been used for various tasks [21][22][23]. Given two sets of random variables $X$ and $Y$ and two functions in the RKHSs, $f_X(\cdot) \in \mathcal{H}_X$ and $f_Y(\cdot) \in \mathcal{H}_Y$, the optimization problem for the random variables $f_X(X)$ and $f_Y(Y)$ is
$$\rho = \max_{f_X,\, f_Y} \operatorname{corr}\big(f_X(X), f_Y(Y)\big). \qquad (1)$$
The optimizing functions $f_X(\cdot)$ and $f_Y(\cdot)$ are determined up to scale. Using a finite sample, we can estimate the desired functions. Given an i.i.d. sample $(X_i, Y_i)_{i=1}^{n}$ from a joint distribution $F_{XY}$, taking the inner product with elements or "parameters" in the RKHS gives the features $f_X(X) = \langle f_X, k_X(\cdot, X)\rangle_{\mathcal{H}_X}$ and $f_Y(Y) = \langle f_Y, k_Y(\cdot, Y)\rangle_{\mathcal{H}_Y}$, where $k_X(\cdot, X)$ and $k_Y(\cdot, Y)$ are the associated kernel functions for $\mathcal{H}_X$ and $\mathcal{H}_Y$, respectively. The kernel Gram matrices are defined as $\mathbf{K}_X := (k_X(X_i, X_j))_{i,j=1}^{n}$ and $\mathbf{K}_Y := (k_Y(Y_i, Y_j))_{i,j=1}^{n}$. We need the centered kernel Gram matrices $\mathbf{M}_X = \mathbf{C}\mathbf{K}_X\mathbf{C}$ and $\mathbf{M}_Y = \mathbf{C}\mathbf{K}_Y\mathbf{C}$, where $\mathbf{C} = \mathbf{I}_n - \frac{1}{n}\mathbf{B}_n$ with $\mathbf{B}_n = \mathbf{1}_n\mathbf{1}_n^{T}$ and $\mathbf{1}_n$ the vector of $n$ ones. The empirical estimate of Eq (1), with a regularization coefficient $\kappa > 0$ to avoid degenerate solutions, is then given by
$$\hat{\rho} = \max_{\alpha_X,\, \alpha_Y} \frac{\alpha_X^{T}\mathbf{M}_X\mathbf{M}_Y\alpha_Y}{\sqrt{\alpha_X^{T}(\mathbf{M}_X^{2} + \kappa\mathbf{M}_X)\alpha_X}\,\sqrt{\alpha_Y^{T}(\mathbf{M}_Y^{2} + \kappa\mathbf{M}_Y)\alpha_Y}},$$
where $\alpha_X$ and $\alpha_Y$ are the directions of $X$ and $Y$, respectively.
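The empirical estimate can be computed as a whitened singular value problem. The following Python fragment is a sketch of this computation, not the implementation used in the paper; the regularization value `kappa` and the small jitter term added for numerical stability are our own illustrative choices:

```python
import numpy as np

def gaussian_gram(X, sigma):
    """Gram matrix of the Gaussian kernel for the rows of X."""
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def first_kernel_cc(KX, KY, kappa=0.1):
    """Leading kernel canonical correlation from two Gram matrices.

    Maximizes a'MX MY b subject to a'(MX^2 + kappa MX)a = 1 and
    b'(MY^2 + kappa MY)b = 1, via Cholesky whitening and an SVD.
    """
    n = KX.shape[0]
    C = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    MX, MY = C @ KX @ C, C @ KY @ C
    RX = MX @ MX + kappa * MX + 1e-8 * np.eye(n)  # regularized denominators
    RY = MY @ MY + kappa * MY + 1e-8 * np.eye(n)
    LX, LY = np.linalg.cholesky(RX), np.linalg.cholesky(RY)
    # G = LX^{-1} (MX MY) LY^{-T}; its largest singular value is rho-hat.
    G = np.linalg.solve(LX, np.linalg.solve(LY, (MX @ MY).T).T)
    return np.linalg.svd(G, compute_uv=False)[0]

rng = np.random.default_rng(0)
x = rng.normal(size=(50, 1))
y = x + 0.01 * rng.normal(size=(50, 1))   # two strongly dependent views
rho = first_kernel_cc(gaussian_gram(x, 1.0), gaussian_gram(y, 1.0))
```

For two nearly identical views the estimated correlation is close to one, while the regularization keeps the estimate below the degenerate value $\hat{\rho} = 1$ that unregularized empirical kernel CCA would always attain.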

Influence function of the kernel canonical correlation analysis
Since 1974, the IF has played an important role in detecting outlying multivariate observations in statistical analysis. The IF is defined, to first order, for estimators of parameters of a multivariate population, and it indicates where in the $n$-dimensional space of observations an observed vector has a large effect on the value of the estimator. For an estimator $T$ at a distribution $F$, the IF at a point $z$ is
$$\mathrm{IF}(z; T, F) = \lim_{\epsilon \to 0} \frac{T\big((1-\epsilon)F + \epsilon\,\delta_z\big) - T(F)}{\epsilon},$$
where $\delta_z$ denotes the point mass at $z$. For a sample of observation vectors, we can define the IF based on the empirical distribution (EIF) to find the set of vectors that have a much greater effect than others on the estimator; these are the outlier vectors [13]. In many situations the outliers are the points of special interest, and their recognition is the main goal of the investigation. Although there are several approaches to identifying outliers in multivariate data analysis, the goal of this paper is to identify a set of outlier observations for a two-view data set using the IF of kernel CCA.
Using the ideas of the IF of linear PCA, kernel PCA and linear CCA, the IF of kernel CCA was derived by Alam et al. [18]. Given two sets of random variables $(X, Y)$ with distribution $F_{XY}$, the $j$-th kernel canonical correlation $\rho_j$ and the kernel canonical variates $f_{jX}(X)$ and $f_{jY}(Y)$, the influence function of the kernel canonical correlation at $Z_0 = (X_0, Y_0)$ is given (in a form paralleling the linear CCA result of [26]; see [18] for the precise statement) by
$$\mathrm{IF}(\rho_j; Z_0) = f_{jX}(X_0)\, f_{jY}(Y_0) - \frac{\rho_j}{2}\Big(f_{jX}^{2}(X_0) + f_{jY}^{2}(Y_0)\Big). \qquad (2)$$
This result is proved on the basis of previously established ones: the IF of linear PCA [24,25], the IF of linear CCA [26] and the IF of kernel PCA. The detailed proof is given in [18].
Using the above result, we can identify a set of observations based on their influence values. To demonstrate this, we propose a novel method with application to DNA microarray gene expression data. This method can be applied to the study of any disease process where two-view data analysis is a common task. The proposed approach consists of two basic steps: a step that calculates the influence value of each gene and a step that determines the outlier genes. For the first step, we use the EIF in Eq (2); for the second, any univariate outlier detection tool can be used. To extract the outlier genes, we have considered distribution based tools.
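The second, distribution based step can be sketched as follows. Tukey's boxplot fence is used here as one concrete "distribution based tool" applied to the one-dimensional absolute influence values; the paper does not fix this exact rule, so the cutoff is an illustrative assumption:

```python
import numpy as np

def flag_outlier_genes(influence, k=1.5):
    """Flag genes whose absolute influence value lies above a Tukey
    upper fence (q3 + k * IQR) of the univariate influence sample.
    The fence factor k = 1.5 is an illustrative choice."""
    v = np.abs(np.asarray(influence, dtype=float))
    q1, q3 = np.percentile(v, [25, 75])
    fence = q3 + k * (q3 - q1)
    return np.flatnonzero(v > fence)

# Toy usage: 995 background genes with small influence values and
# 5 genes (indices 995-999) with large influence values.
rng = np.random.default_rng(1)
infl = np.concatenate([rng.normal(0.0, 0.05, 995), rng.normal(3.0, 0.1, 5)])
selected = flag_outlier_genes(infl)
```

Any other univariate rule (e.g., a quantile cutoff or a mean-plus-three-standard-deviations rule) could be substituted in the same place.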

Kernel choice
In kernel based learning, choosing a suitable kernel is key to favorable results, and most unsupervised kernel methods suffer from the problem of kernel choice. The linear kernel simply uses the underlying Euclidean space to define the similarity measure. When the dimensionality of the input space $\mathcal{X}$ is very high, this might allow more complexity in the function class than we could measure and assess otherwise, and it retains the limitation of linearity. Using a polynomial kernel, it is possible to exploit higher order correlations between the data for different purposes; however, due to its finite, bounded degree, such a kernel does not provide guarantees of a good dependence measure. In addition, both the linear and the polynomial kernel are non-robust.
The Gaussian kernel is a radial basis function kernel that maps $\mathcal{X}$ into an infinite dimensional space. It is defined as
$$k(x, y) = \exp\left(-\frac{\|x - y\|^2}{2\sigma^2}\right),$$
where $\sigma > 0$ is the bandwidth. This most widely applied kernel has a number of desirable theoretical properties (e.g., boundedness, consistency, the characteristic property, universality and robustness) [27]. In this paper we use the Gaussian kernel and take the median of the pairwise distances as the bandwidth [28,29].
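The median heuristic for the bandwidth can be computed as in the following short sketch:

```python
import numpy as np

def median_bandwidth(X):
    """Median heuristic: set the Gaussian kernel bandwidth sigma to the
    median of the pairwise Euclidean distances among the rows of X."""
    X = np.asarray(X, dtype=float)
    sq = np.sum(X ** 2, axis=1)
    # Squared pairwise distances, clipped at zero for numerical safety.
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    iu = np.triu_indices_from(d2, k=1)   # distinct pairs only
    return float(np.median(np.sqrt(d2[iu])))
```

The heuristic is data driven and scale aware, which is one reason it is a common default in unsupervised kernel methods.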
The assumption of kernel methods (methods based on a positive definite kernel) is only that the data form a non-empty set; the methods are independent of the dimensionality. They allow us to construct spaces of functions on an arbitrary set with the appropriate structure of a Hilbert space. By the reproducing property, computing the inner product on the RKHS is easy, and the computational cost depends only on the sample size. It is true that kernel methods may face computational issues for very large data sets when handling Gram matrices of the sample size. However, recent developments in approximation methods, such as random Fourier features, enable us to apply kernel methods to data sets with millions of observations.

Relevant approaches
While the proposed approach is designed for two-view data sets, we compare its performance against relevant algorithms for univariate or multivariate (one-view) data only, since no directly comparable two-view method is available. To demonstrate the performance of the proposed method, we examine four popular gene selection methods: the T-test, significance analysis of microarrays (SAM), Linear Models for Microarray and RNA-Seq Data (LIMMA) and principal components for identifying outliers (PCout) [15,[30][31][32]. Computing a t-test statistic can be problematic because the variance estimates can be skewed by genes having a very low variance [30]. For each gene, SAM gives a score on the basis of the change in gene expression relative to the standard deviation of repeated measurements; for genes with scores greater than an adjustable threshold, SAM uses permutations of the repeated measurements to estimate the percentage of genes identified by chance, the false discovery rate (FDR) [31]. LIMMA contains rich features for handling complex experimental designs and for borrowing information to overcome the problem of small sample sizes; this linear modelling strategy has been found to have many applications beyond the intended analysis of gene expression data [32]. PCout is a computationally fast procedure for identifying outliers that is particularly effective in high dimensions; it exploits simple properties of principal components in the transformed space, needs less computational time than existing outlier detection methods, and is suitable for very large data sets, but it is limited to linearity and a single-view data set [15]. We compared all of these methods with the proposed method.

Experimental results
We used both simulated data and a real microarray gene expression data set of colon cancer [33]. For the relevant approaches (T-test, SAM, LIMMA and PCout) we used four R packages, STATS, SAMR, LIMMA and PCout, respectively. The performance measures, including the true positive rate (TPR), true negative rate (TNR), false positive rate (FPR), false negative rate (FNR), misclassification error rate (MER), FDR and AUC, were evaluated for each of the methods as previously described [34]. To compute the performance measures, we used R packages available in the Comprehensive R Archive Network or Bioconductor.

Simulation study
To investigate the performance of the proposed method in comparison with the four popular methods mentioned above for k = 2 groups, we considered gene expression profiles generated from both the normal distribution and the t-distribution. We considered both small- and large-sample cases with different percentages of differentially expressed (DE) genes.

Simulated gene expression profiles generated from normal distribution
We used a one-way ANOVA model to generate simulated data sets from the normal distribution:
$$x_{ijk} = \mu_{ik} + \epsilon_{ijk}, \qquad (3)$$
where $x_{ijk}$ is the expression of the $i$-th gene for the $j$-th sample in the $k$-th group, $\mu_{ik}$ is the mean expression of the $i$-th gene in the $k$-th group, and $\epsilon_{ijk}$ is the random error, which follows a normal distribution with mean zero and variance $\sigma^2$.
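A minimal generator for such data sets, following Eq (3) with the group means 3 and 5 and variance σ² = 0.1 used in this study, might look like the following sketch (the genes × samples layout and the placement of the DE genes in the first rows are our own conventions):

```python
import numpy as np

def simulate_expression(G=1000, n1=3, n2=3, pDEG=0.02,
                        mu=(3.0, 5.0), sigma2=0.1, seed=0):
    """Simulate x_ijk = mu_ik + eps_ijk with eps ~ N(0, sigma2).

    The first round(pDEG * G) genes are differentially expressed
    (mean mu[0] in group 1 vs mu[1] in group 2); the remaining genes
    have mean mu[0] in both groups. Returns the genes-by-samples
    matrix, the group labels and the indices of the true DE genes.
    """
    rng = np.random.default_rng(seed)
    nDE = int(round(pDEG * G))
    mean1 = np.full(G, mu[0])
    mean2 = np.full(G, mu[0])
    mean2[:nDE] = mu[1]                 # DE genes differ between groups
    sd = np.sqrt(sigma2)
    grp1 = mean1[:, None] + rng.normal(0.0, sd, (G, n1))
    grp2 = mean2[:, None] + rng.normal(0.0, sd, (G, n2))
    X = np.hstack([grp1, grp2])         # genes x samples
    labels = np.r_[np.ones(n1, int), 2 * np.ones(n2, int)]
    return X, labels, np.arange(nDE)

X, labels, de = simulate_expression()   # small-sample case, pDEG = 0.02
```

Repeating this 100 times with fresh seeds reproduces the simulation design described below.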
To investigate the performance of the proposed method in comparison with the four popular methods for k = 2 groups, we generated 100 simulated data sets for both the small (n1 = n2 = 3) and large (n1 = n2 = 15) sample cases using Eq (3). The group means and the common variance were set to (μi1, μi2) = (3, 5) and σ2 = 0.1, respectively. Each data set represented the gene expression profiles of G = 1000 genes with n = n1 + n2 samples. The proportion of DE genes (pDEG) was set to 0.02 or 0.06 for each of the 100 data sets. We computed the average values of the performance measures TPR, TNR, FPR, FNR, MER, FDR and AUC based on the 20 and 60 estimated DE genes obtained by the five methods (T-test, SAM, LIMMA, PCout and the proposed method) for each of the 100 data sets. Fig 1a and 1b show the ROC curves based on 20 estimated DE genes for the small- and large-sample cases, respectively. We observe that the proposed method performed better than the other four methods in the small-sample case (Fig 1a), while in the large-sample case (Fig 1b) it performed about as well as the other four methods. Fig 2 shows boxplots of the AUC values over the 100 simulated data sets for the small- and large-sample cases; Fig 2a and 2b show the boxplots for pDEG = 0.02 and 0.06, respectively. The boxplots lead to conclusions similar to those from the ROC curves for every pDEG value. We also note that the performance of the methods increases when pDEG is increased from 0.02 to 0.06. Furthermore, the average values of the performance measures based on 20 (pDEG = 0.02) and 60 (pDEG = 0.06) estimated DE genes for each method are summarized in Table 1.
In this table, the values outside and inside the brackets indicate the averages of the performance measures for the small- and large-sample cases, respectively. The interpretations are similar to those of the ROC curves and boxplots (Table 1).

Simulated gene expression profiles generated from t-distribution
We also investigated the performance of the proposed method in comparison with the other four methods for the non-normal case. Accordingly, we generated 100 simulated data sets from the t-distribution with 10 degrees of freedom, setting the means and variance as previously mentioned. We estimated the performance measures TPR, TNR, FPR, FNR, MER, FDR and AUC; the results are summarized in Table 2. From this table we note that the performances of all the methods become progressively worse when the data sets come from the t-distribution, but the proposed method still performed better than the other four methods. We also observed from the boxplots that the proposed method has less variability than the other four methods. From this analysis we may conclude that the performance of the proposed method is improved over the four well-known gene selection methods.

Application to colon cancer microarray data
The data consist of the expression levels of 2000 genes obtained from a microarray study of 62 colon tissue samples collected from colon cancer patients [33]. Among the 62 colon tissues, the 40 tumor tissues and the 22 normal tissues were coded 2 and 1, respectively. The goal here is to characterize the underlying interactions between genetic markers for their association with colon cancer patients and healthy persons. In the simulation studies, we observed that the multivariate approaches (PCOut and the proposed method, KCCAOut) performed better than the univariate approaches. In addition to PCOut and KCCAOut, we applied linear CCA based shaving (LCCAOut) to the colon cancer data. We used these three methods to calculate the influence value of each gene. Fig 4 visualizes the absolute influence values of the 2000 genes. Applying the outlier detection technique to the one-dimensional influence values of each method, we obtained 31, 133 and 210 genes using PCOut, LCCAOut and KCCAOut, respectively. To compare the selected genes, we made a Venn diagram of the genes selected by the three methods. Fig 5 presents the Venn diagram of the PCOut, LCCAOut and KCCAOut methods. From this figure, we observe that the numbers of genes selected exclusively by PCOut, LCCAOut and KCCAOut are 19, 61 and 144, respectively. The numbers of common genes between PCOut and LCCAOut, PCOut and KCCAOut, and LCCAOut and KCCAOut were 7, 1 and 61, respectively. All methods selected 4 common genes: J00231, T57780, M94132 and M87789. Genes do not function alone; rather, they interact with each other, and when genes share a similar set of gene ontology (GO) annotations, they are more likely to be involved in similar biological mechanisms.
To verify this, we extracted the GO biological process categories and KEGG pathway annotations of the 210 genes detected by the proposed KCCAOut method using the Database for Annotation, Visualization and Integrated Discovery (DAVID) [35]. The GO analysis revealed that most of the genes are significantly enriched in biological adhesion, cell adhesion, viral process, multi-organism cellular process, regulation of the cellular amide metabolic process, etc. (see supplementary S1 Table). Table 3 presents the KEGG pathway analysis. From the table, we find that these genes are mostly enriched in toxoplasmosis, antigen processing and presentation, proteoglycans in cancer, the neurotrophin signaling pathway, small cell lung cancer, etc. (also see supplementary S2 Table). We also constructed gene-gene interaction networks using STRING [36]. STRING imports protein association knowledge from databases of both physical interactions and curated biological pathways; its simple interaction unit is the functional relationship between two proteins or genes that can contribute to a common biological purpose. Fig 6 shows the gene-gene network based on the protein interactions among the selected 210 genes. In this figure, the color saturation of the edges represents the confidence score of a functional association. This network of genes has significantly more interactions than expected, which indicates that the genes may function in a concerted effort. The proposed method can be applied to the study of other disease processes where two-view data are a common task. To confirm this, we applied the proposed method to another real data set: an RNA-sequencing study of osteoporosis risk (source: Tulane Center of Bioinformatics and Genomics). The details of the data and the results are provided in the supplementary material, S1 File.
In addition, the data set was used to classify the colon cancer patients and the healthy controls via the PCOut, LCCAOut and KCCAOut feature extraction techniques, followed by two classifiers, the k-nearest neighbors (KNN) and the linear support vector machine (SVM). We used the 31, 133 and 210 features with influence effects obtained by PCOut, LCCAOut and KCCAOut, respectively. PCOut, LCCAOut and KCCAOut serve as feature extraction tools, after which the classifier is used to separate patients from healthy controls. Table 4 presents the classification error using cross-validation (2-fold and 5-fold). From these results, it is evident that the KCCAOut based classification is significantly more accurate than the other methods, as well as than classification using all features, demonstrating that the proposed method is a better tool for feature extraction.
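The feature-extraction-then-classify protocol can be sketched with a minimal KNN classifier and k-fold cross-validation. The samples × genes layout, the neighbor count and the fold assignment below are illustrative assumptions, not the settings used to produce Table 4:

```python
import numpy as np

def knn_predict(Xtr, ytr, Xte, k=3):
    """Minimal k-nearest-neighbor classifier with Euclidean distance
    and majority voting over integer class labels."""
    d2 = np.sum((Xte[:, None, :] - Xtr[None, :, :]) ** 2, axis=2)
    idx = np.argsort(d2, axis=1)[:, :k]
    votes = ytr[idx]
    return np.array([np.bincount(v).argmax() for v in votes])

def cv_error(X, y, selected, k=3, folds=2, seed=0):
    """Cross-validated misclassification error after restricting the
    samples (rows of X) to the `selected` feature columns."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(y))
    errs = []
    for f in range(folds):
        te = order[f::folds]
        tr = np.setdiff1d(order, te)
        pred = knn_predict(X[np.ix_(tr, selected)], y[tr],
                           X[np.ix_(te, selected)], k)
        errs.append(np.mean(pred != y[te]))
    return float(np.mean(errs))

# Toy usage: only the first five features carry class signal.
rng = np.random.default_rng(2)
y = np.r_[np.zeros(30, int), np.ones(30, int)]
X = rng.normal(size=(60, 50))
X[:, :5] += 3.0 * y[:, None]
err_signal = cv_error(X, y, np.arange(5))       # informative features
err_noise = cv_error(X, y, np.arange(10, 15))   # pure-noise features
```

As in the real analysis, classifying on a well-chosen feature subset yields a far lower cross-validated error than classifying on uninformative features.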

Discussion
Kernel based machine learning methods are vital for biomedical data analysis: they provide more powerful and reproducible outputs, although the interpretation of the results remains challenging. In this paper, a gene shaving method based on the influence function of kernel CCA is proposed. The performance of the proposed method was evaluated on both simulated and real data sets. The extensive simulation studies show the power gained by the proposed method relative to the alternative methods. The utility of the proposed method is further demonstrated by applying it to cancer microarray data, namely the colon cancer microarray data. According to the influence values, the proposed method is able to rank the influence of each gene, and the identified genes are highly related to the disease. Using a distribution based outlier detection method, the proposed method extracts 210 genes out of 2000, which are considered to have a significant impact on the patients. Incorporating biological knowledge (e.g., GO) provides additional evidence for the results. By conducting GO, pathway and network analyses, including visualization, we find evidence that the selected genes have a significant influence on the manifestation of colon cancer and can serve as distinct features for the stratification of colon cancer patients from the healthy controls.