
A novel dimension reduction algorithm based on weighted kernel principal analysis for gene expression data

  • Wen Bo Liu,

    Roles Formal analysis, Methodology, Writing – original draft

    Affiliations School of Mathematics and Statistics, Qiannan Normal University for Nationalities, Duyun, Guizhou, China, Key Laboratory of Complex Systems and Intelligent Computing, Qiannan Normal College of Nationalities, Duyun, Guizhou, China

  • Sheng Nan Liang,

    Roles Data curation, Investigation, Writing – review & editing

    Affiliations School of Mathematics and Statistics, Qiannan Normal University for Nationalities, Duyun, Guizhou, China, Key Laboratory of Complex Systems and Intelligent Computing, Qiannan Normal College of Nationalities, Duyun, Guizhou, China

  • Xi Wen Qin

    Roles Conceptualization, Methodology, Writing – review & editing

    qinxiwen@ccut.edu.cn

    Affiliation School of Mathematics and Statistics, Changchun University of Technology, Changchun, Jilin, China

Abstract

Gene expression data have the characteristics of high dimensionality and a small sample size and contain a large number of redundant genes unrelated to a disease. The direct application of machine learning to classify this type of data not only incurs a great time cost but also sometimes fails to improve classification performance. To counter this problem, this paper proposes a dimension-reduction algorithm based on weighted kernel principal component analysis (WKPCA), constructs kernel function weights according to kernel matrix eigenvalues, and combines multiple kernel functions to reduce the feature dimensions. To further improve the dimension reduction efficiency of WKPCA, t-class kernel functions are constructed, and the corresponding theoretical proofs are given. Moreover, the cumulative optimal performance rate is constructed to measure the overall performance of WKPCA combined with machine learning algorithms. Naive Bayes, k-nearest neighbour, random forest, iterative random forest and support vector machine classifiers are used to analyse 6 real gene expression data sets. Compared with the all-variable model, linear principal component dimension reduction and single kernel function dimension reduction, the results show that WKPCA dimension reduction can effectively improve the classification performance of the 5 machine learning methods mentioned above.

1 Introduction

DNA is organized structurally into chromosomes and functionally into genes, which are essentially pieces of DNA containing genetic information [1]. In humans, genes carry the genetic information that determines hair and eye colour, among many other traits, as well as information about when the body’s cells grow, divide and die. When a gene is turned on, this is called gene expression. Genetic mutations in normal cells of the human body are closely related to environmental stimuli, age, smoking, diet and other external factors, which can lead to the uncontrolled reproduction of normal cells and, ultimately, to cancer (malignant tumours) [2]. In February 2018, the National Cancer Center of China released the registration data of the National Cancer Registry for 2014, which indicated that there were approximately 3.804 million cases of cancer in 2014, including approximately 2.114 million men and 1.69 million women [3]. The development of sequencing technology has had a huge impact on cancer research, enabling researchers to analyse the expression levels of thousands of genes simultaneously and to correlate gene expression patterns with clinical phenotypes, resulting in multiple tumour gene expression profiles. How to effectively analyse tumour gene expression data and how to mine and discover the information and knowledge contained therein are hot topics in bioinformatics research. Such analysis could help distinguish cancer from normal tissue, predict cancer outcomes, detect cancer recurrence and monitor cancer treatment responses. However, gene expression data have the characteristics of high dimensionality and small sample sizes. Each sample records the expression levels of all the detectable genes in the histocyte, but only a few genes are actually related to the sample categories. These genes, known as "classification feature genes", contain the classification information of the samples. At present, most research concerns how to select these informative genes from thousands of genes, which is the problem of feature selection. Many researchers have done a great amount of fruitful work on unsupervised [4–6], semi-supervised [7–9] and supervised [10–12] gene feature selection. Different from feature selection, this paper mainly studies the dimension reduction of gene expression data from the perspective of feature extraction to improve the recognition rate of sample categories. Feature extraction starts from the known features and, through a specific algorithm, obtains a lower-dimensional subset that fully represents the original features. Moreover, the features in this subset are independent of each other. The main feature extraction algorithms are as follows.

Principal component analysis (PCA) is one of the most classic feature extraction algorithms. Its basic idea is to use fewer principal components (comprehensive variables) to replace more original features, where these principal components contain as much information about the original features as possible and are uncorrelated with each other [13, 14]. PCA is good at processing linear data with a Gaussian distribution. To make up for the deficiencies of PCA, many related studies have been proposed to improve the PCA algorithm. Compared with PCA, independent component analysis (ICA) is more suitable for processing non-Gaussian data. Hyvarinen [15, 16] proposed the FastICA algorithm, which quickly finds the optimal solution by iteration and is currently a mature linear blind source separation algorithm. The above methods are all linear dimension reduction algorithms. In many practical tasks, however, data often present a nonlinear distribution. If linear dimension reduction is still adopted, the intrinsic low-dimensional structure will be lost. Therefore, several nonlinear dimension reduction techniques have been proposed, among which the most typical representatives are the nonlinear feature extraction methods based on the kernel technique. Schölkopf et al. [17] proposed kernel principal component analysis (KPCA), which maps linearly inseparable data in the low-dimensional space to a high-dimensional space through a nonlinear mapping and realizes linear separability in the high-dimensional space. Mika et al. [18] proposed kernel linear discriminant analysis (KLDA), which combines the kernel function with LDA to extract features. Xu et al. [19, 20] proposed fast KPCA methods by introducing the key-sample idea. The above nonlinear dimension reduction methods are all based on a single kernel function. To further improve the dimension reduction and classification performance of kernel methods, multiple kernel learning algorithms have been proposed. Gönen et al. classified and summarized these algorithms and concluded through experimental analysis that combining multiple kernel functions is better than using single kernel functions [21]. Zhang et al. introduced the power kernel function, proposed a combined kernel function principal component analysis method, realized data mapping from the low-dimensional to the high-dimensional space, and then applied the feature extraction to nonlinear data [22]. Li et al. proposed pulmonary nodule recognition based on a multiple kernel learning support vector machine with particle swarm optimization and obtained better recognition efficiency [23].

Although the above kernel methods have achieved remarkable practical results in many fields, these methods are all single kernel methods based on a single feature space. Different kernel functions have different characteristics, so their performance differs greatly across applications, and there is no perfect theoretical basis for the construction or selection of a kernel function. In addition, when the sample features contain heterogeneous information, the sample size is large, the multi-dimensional data are unnormalised, or the data are distributed non-flatly in the high-dimensional feature space, it is not reasonable to process all the samples by mapping them with a single simple kernel. In view of these problems, much research has been conducted on kernel combination methods, namely multiple kernel learning methods. Multiple kernel models are a kind of kernel-based learning model with stronger flexibility. Recent theory and applications have proved that using multiple kernels instead of a single kernel can enhance the interpretability of the decision function, take advantage of the feature mapping ability of each basic kernel, and obtain better performance than a single kernel model or a single-kernel machine combination model [24].

In view of the advantages of multiple kernel learning, this paper proposes a novel dimension reduction algorithm based on weighted kernel principal component analysis (WKPCA). Its basic idea is to use a vectorization method to calculate the kernel matrices, construct the kernel function weights according to the eigenvalues of the kernel matrices, and combine multiple kernel functions; a theoretical proof for the weighted kernel function is given. Moreover, the t-class kernel function is constructed as a component of the weighted kernel function. Through a large number of comparison experiments on 6 real data sets, the results show that, compared with the all-variable model, linear principal component dimension reduction and single kernel function dimension reduction, the WKPCA algorithm proposed in this paper can effectively improve the classification prediction performance of the current mainstream machine learning methods.

2 Kernel principal component analysis

Traditional dimension reduction methods assume that the mapping from the high-dimensional feature space to the low-dimensional feature space is linear. However, in many practical tasks, nonlinear mapping may be needed to find the appropriate low-dimensional embedding [25]. To compensate for the limitations of linear dimension reduction, nonlinear dimension reduction methods based on kernel functions have been proposed, among which kernel principal component analysis is the most commonly used. Its basic idea is to map the original data set to an appropriate high-dimensional feature space through a nonlinear function; because a kernel function of known form is introduced, the concrete expression of the nonlinear mapping does not need to be known. Then, the kernel matrix and its eigenvectors are calculated, giving the projection of the data set in the high-dimensional space based on the eigenvectors.

Suppose that the original data is D = {x1, x2, …, xm}, where xi = {xi1, xi2, …, xip}′, m is the sample size, i is the sample number, and p is the data dimension. In the high-dimensional feature space, the mapping of xi is zi = ϕ(xi), and the data set is D′ = {z1, z2, …, zm}, whose covariance matrix is (1)

The goal of KPCA is to solve the eigenvalue problem of Σ, where ωj is the eigenvector corresponding to the eigenvalue λj of Σ, and zj is the jth coordinate component of sample x after projection. The key to solving for zj is how to calculate ωj and obtain the expression of ϕ(x). However, ϕ(x) is often unknown, but it can be replaced through the kernel function, whose form is known. The calculation process of the kernel principal component is as follows.

(2)(3)

Here, .

(4)

Introducing the kernel function (5)

Common kernel functions can be found in the literature [26].

Substituting Eqs (4) and (5) into Eq (2), we get (6), where K = (κ(xi, xj))m×m is the kernel matrix of κ(xi, xj) and αj is the eigenvector corresponding to the jth largest eigenvalue λj of the kernel matrix K.

After projection, the jth coordinate component of sample x is (7), where αj is the normalized eigenvector. It can be seen from Eq (7) that, to obtain the projection of a new sample, the kernel function must be summed over all the original data, so the calculation cost is large. However, in the algorithm designed in Section 3.2, vectorized programming specific to the R language can be adopted to improve the calculation efficiency [27].
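To make the projection in Eqs (2) to (7) concrete, the following is a minimal R sketch of single-kernel KPCA (not the authors' code; the Gaussian kernel, its bandwidth sigma, the retained dimension d and the feature-space centring step are illustrative assumptions):

```r
# Minimal single-kernel KPCA sketch (illustrative; sigma and d are arbitrary choices)
kpca_gauss <- function(X, d = 2, sigma = 1) {
  m <- nrow(X)
  Dist2 <- as.matrix(dist(X))^2                        # squared Euclidean distances
  K <- exp(-Dist2 / (2 * sigma^2))                     # Gaussian kernel matrix
  One <- matrix(1 / m, m, m)
  Kc <- K - One %*% K - K %*% One + One %*% K %*% One  # centre in feature space
  eig <- eigen(Kc, symmetric = TRUE)
  lambda <- eig$values[1:d]
  alpha <- eig$vectors[, 1:d, drop = FALSE]
  # rescale eigenvectors so the projection uses normalized vectors, as in Eq (7)
  alpha <- sweep(alpha, 2, sqrt(pmax(lambda, .Machine$double.eps)), "/")
  Z <- Kc %*% alpha                                    # coordinates after projection
  list(scores = Z, eigenvalues = eig$values)
}

set.seed(1)
X <- matrix(rnorm(100 * 20), nrow = 100)               # toy data: 100 samples, 20 features
res <- kpca_gauss(X, d = 3, sigma = 2)
dim(res$scores)                                        # 100 x 3
```

The eigenvectors are divided by the square roots of their eigenvalues so that the projected coordinates correspond to the normalized vectors αj used in Eq (7).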

3 Weighted kernel function method

3.1 Weighted kernel function

To further improve the low-dimensional embedding ability of a single kernel function for the original data and make the selection of the kernel function more flexible, this paper proposes a weighted kernel function method to reduce the dimensionality of extremely high-dimensional gene expression data; its principle is given in the form of the following theorems.

Theorem 1 [28] Let X be the input space, and let κ(·,·) be a symmetric function defined on X × X. Then, κ(·,·) is a kernel function if and only if, for any dataset D = {x1, x2, …, xm}, the "kernel matrix" K is positive semi-definite.

Theorem 1 shows that as long as the kernel matrix of a symmetric function is positive semi-definite, the function can be used as a kernel function.

Theorem 2 If κ1(x, y), κ2(x, y), …, κn(x, y) are kernel functions, then the weighted combination (8) is a kernel function, where the weights wi are non-negative.

Proof: Supposing that the original data is D = {x1, x2, …, xm}, the corresponding data matrix can be expressed as D = (xij)m×p.

The kernel matrix of κi(x, y) is Ki = (κi(xi, xj))m×m; thus, the kernel matrix of κ(x, y) is (9)

According to Theorem 1, if κ(x, y) is the kernel function, the kernel matrix K is positive semi-definite.

Let Kx = λx. Then x and λ are the eigenvectors and eigenvalues of K, respectively, so (10)

Expanding Eq (10), we therefore obtain (11), where λi is an eigenvalue of the matrix Ki.

Since the kernel matrices K1, K2, …, Kn are positive semi-definite, their eigenvalues λ1, λ2, …, λn are non-negative. According to Eq (11), all the eigenvalues of K are non-negative, so the matrix K is positive semi-definite. Because κ(x, y) = κ(y, x), κ(x, y) is symmetric; hence, by Theorem 1, κ(x, y) is a kernel function. □

When the weighted kernel function of Eq (8) is used to reduce the dimensionality of the original data, the question of how to set the weights arises. The basic criterion of weight construction is the ratio of the eigenvalues of each Ki in the weighted kernel to the sum of the eigenvalues of all of them. The detailed construction process is as follows.

Assume that all the eigenvalues of the kernel matrix Ki are arranged in descending order, where i = 1, 2, …, n, p is the dimension of the original data set, and d is the dimension retained after the kernel function reduction. Generally, d < p or d ≪ p. The weight of the kernel function is (12)

The value range of the final number d of extracted features is determined through the concept of the "weighted kernel function dimension reduction efficiency".

Definition Suppose that the eigenvalues of the kernel matrix K = (κ(xi, xj))m×m are λ1 ≥ λ2 ≥ … ≥ λm ≥ 0. Then, rj = λj / (λ1 + λ2 + … + λm) (13) is the dimension reduction efficiency of the kernel discriminant function zj, and Rd = (λ1 + λ2 + … + λd) / (λ1 + λ2 + … + λm) (14) is the cumulative dimension reduction efficiency of the first d (d ≤ m) kernel discriminant functions z1, z2, ⋯, zd. By analogy with the cumulative contribution rate of principal component analysis [29], the number of features d after dimension reduction can be chosen to make Rd reach 0.8 ~ 0.9.
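The following R sketch illustrates one reading of Eqs (12) to (14) (an interpretation, not the authors' code): each kernel's weight is taken as the share of its leading d eigenvalues in the corresponding sum over all candidate kernel matrices, and d itself is chosen as the smallest value whose cumulative dimension reduction efficiency Rd exceeds a threshold in the 0.8 ~ 0.9 range.

```r
# Illustrative reading of Eqs (12)-(14): eigenvalue-based kernel weights and
# the cumulative dimension reduction efficiency R_d (all settings are assumptions).

kernel_weights <- function(kernel_list, d) {
  # weight of each kernel = share of its top-d eigenvalues in the total
  top_sums <- sapply(kernel_list, function(K) {
    ev <- eigen(K, symmetric = TRUE, only.values = TRUE)$values
    sum(pmax(ev[1:d], 0))
  })
  top_sums / sum(top_sums)
}

choose_d <- function(K, threshold = 0.85) {
  # smallest d whose cumulative efficiency R_d (Eq 14) reaches the threshold
  ev <- pmax(eigen(K, symmetric = TRUE, only.values = TRUE)$values, 0)
  which(cumsum(ev) / sum(ev) >= threshold)[1]
}

set.seed(1)
X  <- matrix(rnorm(100 * 500), nrow = 100)      # toy data: 100 samples, 500 features
D2 <- as.matrix(dist(X))^2
K1 <- exp(-D2 / 2)                              # Gaussian kernel matrix
K2 <- 1 / (1 + D2)                              # Cauchy-type (t-class) kernel matrix
w  <- kernel_weights(list(K1, K2), d = 10)      # weights of the two kernels
Kw <- w[1] * K1 + w[2] * K2                     # weighted kernel matrix (Eq 8)
choose_d(Kw)                                    # number of features to retain
```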

3.2 T-class kernel function

The weighted kernel function is a combination of multiple single kernel functions, and the selection of the single kernel functions directly affects the dimension reduction effect of the weighted kernel function. Therefore, we construct new kernel functions to improve the ability of the weighted kernel function to reduce the dimensionality of high-dimensional data and thereby improve the classification performance of subsequent machine learning algorithms. According to the following Theorem 3 and the probability density function of the t distribution, the t-class kernel function can be constructed.

Theorem 3 [30] Suppose that f: X → R is a bounded continuous integrable function. Then, k(x, x′) = f(x − x′) is a kernel function if and only if its Fourier transform is non-negative.

Theorem 4 When n → +∞, the probability density function of the t distribution (15) is the kernel function.

Proof: First, the density function in Eq (15) is bounded, continuous and integrable. We just have to prove that its Fourier transform is non-negative as n → +∞.

Because , we have (16)

Let , where x ~ N(0, 1). Then, we have (17)

Upon substituting Eq (17) into Eq (16), we have

According to Theorem 3, the probability density function of the t distribution is the kernel function. □ In practice, generally, n ≥ 30.

Corollary 1 When n = 1, the density function of the t distribution is (18)

Then, Eq (18) is the kernel function.

Proof: Eq (18) is the Cauchy distribution density function, whose Fourier transform is [31]

Therefore, Eq (18) is the kernel function.

Theorem 5 When n → +∞, the function (19) is a kernel function.

Proof: , where is the Laplace kernel function.

According to Theorem 3 we have

When n → ∞, Eq (19) is a kernel function.

We call Eq (19) the pseudo t function.

Corollary 2 When n = 1, the pseudo t function (20) is the kernel function.

Proof: According to Theorem 1, we just have to prove that .

Let t = −x. Then, we have and

Now, the key problem is whether is positive or negative. We have

Because , we can determine that

Because we have

Therefore, is the kernel function.□

The kernel function in Corollary 2 can be generalized by introducing a constant c, which gives the following corollary.

Corollary 3 When c > 0, the function (21) is the kernel function.

The constant c in the above equation can be regarded as a scale parameter, so the kernel function in Corollary 3 is a multi-scale kernel function, which can adapt to samples with drastic changes when the scale parameter is small and to samples with gentle changes when the scale parameter is large. The multi-scale t kernel function with different parameters is shown in Fig 1.

It can be seen from Fig 1 that the kernel function gradually flattens as the scale parameter increases. The multi-scale t-class kernel function constructed in Corollary 3 offers rich scale choices, which gives it better adaptability when processing complex data.

Fig 1. Multi-scale t kernel function under different parameters.

https://doi.org/10.1371/journal.pone.0258326.g001
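For illustration only, the sketch below implements one plausible parameterisation of the multi-scale t-class kernel of Corollary 3, assumed here as κ(x, y) = 1/(1 + ‖x − y‖²/c) with scale parameter c > 0; the exact expression in Eq (21) may differ, and the parameter values are chosen only to reproduce the qualitative behaviour described for Fig 1.

```r
# Assumed form of the multi-scale t-class kernel of Corollary 3 (illustrative only):
#   kappa(x, y) = 1 / (1 + ||x - y||^2 / c),  c > 0 the scale parameter.
t_kernel <- function(x, y, c = 1) {
  1 / (1 + sum((x - y)^2) / c)
}

# Qualitative behaviour described for Fig 1: the kernel flattens as c grows.
r <- seq(-5, 5, by = 0.1)
plot(r, 1 / (1 + r^2 / 0.5), type = "l", ylim = c(0, 1),
     xlab = "x - y", ylab = "kernel value")
lines(r, 1 / (1 + r^2 / 2),  lty = 2)
lines(r, 1 / (1 + r^2 / 10), lty = 3)
legend("topright", legend = c("c = 0.5", "c = 2", "c = 10"), lty = 1:3)
```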

If only traditional kernel functions such as the polynomial kernel and the hyperbolic tangent kernel are combined linearly, there is no basis for the selection and combination of the kernel function parameters, and the uneven distribution of samples still cannot be handled satisfactorily, which limits the expressive ability of the decision function. The t-class kernel functions constructed here can ultimately be generalized to multi-scale kernel functions. With the gradual maturation of wavelet theory and multi-scale analysis theory, the multi-scale kernel method has a sound theoretical background through the introduction of scale space.

Some t-class kernel functions are constructed in this section, and they can be part of the weighted kernel function. By the experimental analysis in Section 4, the t-class kernel function can reduce the dimensionality of gene expression data effectively and improve the classification performance of subsequent machine learning methods.

3.3 WKPCA dimension reduction algorithm

According to the theory of kernel principal component analysis and weighted kernel function construction, the basic framework of the WKPCA dimension reduction algorithm is shown in Fig 2.

Fig 2. The frame of the WKPCA dimension reduction algorithm.

https://doi.org/10.1371/journal.pone.0258326.g002

3.3.1 WKPCA dimension reduction algorithm design.

Obviously, the kernel principal component depends on the selection of the kernel function. When constructing the weighted kernel function to reduce the dimensionality, kernel functions such as the Gaussian kernel, Laplace kernel, hyperbolic tangent kernel and polynomial kernel functions are generally selected. We can also choose the t-class kernel function, which is constructed in Section 3.2. Since the weighted kernel principal component requires calculating the eigenvalues and eigenvectors of the weighted kernel matrix, first, the corresponding weighted kernel matrix should be computed using the training samples.

(22)

According to Eq (22), if the sample size is only a few hundred, for example, m = 400, then the kernel matrix will contain 160,000 entries. As the sample size increases, the time cost of calculating the weighted kernel matrix increases greatly.

To improve the computational efficiency of the algorithm, the following approach can be adopted. The Gaussian kernel and t-class kernel functions can be regarded as functions of the distance between any two samples, while the hyperbolic tangent kernel and polynomial kernel functions can be regarded as functions of the inner product of any two samples. Take the pseudo-t kernel function with n = 1 as an example: (23)

Its kernel matrix is (24) where distij = ‖xixj‖ is the Euclidean distance of any two samples and Dist = (distij)m×m is the distance matrix of the sample set.

Let M = (xij)m×n, and define the matrix function as (25)

According to Eqs (24) and (25), the kernel matrix based on the pseudo-t kernel can be regarded as a function of the distance matrix, i.e., (26)

Similarly, the kernel matrix based on the hyperbolic tangent kernel and polynomial kernel can be regarded as a function of the inner product matrix, i.e., (27)

Therefore, the distance matrix or the inner product matrix can be substituted into the kernel function as a whole to get the corresponding "kernel matrix". The above process is called the vectorized computation method. In terms of algorithm design, vectorization is faster and more efficient than explicit nested loops over all sample pairs.
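The sketch below shows this vectorized computation in R (illustrative; the kernel parameter values are arbitrary): the distance matrix and the inner product matrix are each built once with matrix operations and then passed through the kernel functions elementwise, with no explicit loop over sample pairs.

```r
# Vectorized kernel matrices: apply each kernel to the whole distance
# or inner-product matrix at once (no explicit loops over sample pairs).

set.seed(1)
X <- matrix(rnorm(200 * 50), nrow = 200)     # toy data: 200 samples, 50 features

Dist  <- as.matrix(dist(X))                  # Euclidean distance matrix (m x m)
Inner <- tcrossprod(X)                       # inner-product matrix X %*% t(X)

# pseudo-t / Cauchy-type kernel as a function of the distance matrix (cf. Eq 26)
K_t <- 1 / (1 + Dist^2)

# Gaussian kernel, also a function of the distance matrix
sigma <- 2
K_gauss <- exp(-Dist^2 / (2 * sigma^2))

# hyperbolic tangent kernel as a function of the inner-product matrix (cf. Eq 27)
beta <- 0.001; theta <- -1
K_tanh <- tanh(beta * Inner + theta)

# polynomial kernel as a function of the inner-product matrix
K_poly <- (Inner + 1)^2
```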

The flow of the WKPCA dimension reduction algorithm is shown in Table 1.

First, the input of the WKPCA algorithm includes 3 to 4 parts: the original data matrix D = (xij)m×p, the number p of features contained in the data, and the distance matrix and/or inner product matrix corresponding to the original data set D. If both a t-class kernel (or Laplace kernel) and a hyperbolic tangent kernel (or polynomial kernel) are defined in step 1 of the algorithm, both the distance matrix and the inner product matrix are needed; otherwise, only one type of matrix is input.

In the first line of the algorithm, two or three kernel functions are generally defined to keep the algorithm simple. Based on the distance matrix and inner product matrix, the kernel matrices and their eigenvalues are computed in Lines 2 to 4. In Line 5, d represents the selected dimension after feature reduction, where d < p. The weight of each kernel function is determined in Lines 6 to 8. The kernel matrix of the weighted kernel function and its eigenvalues and eigenvectors are calculated in Lines 9 to 11. The d-dimensional coordinates of all the samples in the new feature space are calculated in Line 12.

Time complexity analysis: Owing to vectorized computation, the time used to calculate the distance and inner product matrices in Line 2 is O(1), and the time used in Lines 1 to 4 is O(q). The time consumption of the WKPCA algorithm mainly occurs in Lines 5 to 13, and its time complexity is O((m + q)p). Since the number q of kernel functions is much smaller than the sample size m, the total time complexity of this algorithm is O(mp). It is important to point out that, in general, m > p or m ≫ p, but for some data sets, such as gene expression data sets, m ≪ p. The experimental analysis in Section 4 shows that, after WKPCA dimension reduction, the number of retained dimensions d only needs to be a few percent of the total number of variables to achieve a better classification prediction effect, and the time cost is moderate.
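Putting the steps together, the following is a minimal, self-contained R interpretation of the WKPCA flow in Table 1 (not the authors' implementation; the two kernels, their parameters and the eigenvalue-based weight rule follow the assumptions stated earlier):

```r
# Minimal WKPCA sketch (an interpretation of the algorithm in Table 1).
wkpca <- function(X, d = 10, sigma = 2, c = 1) {
  Dist2 <- as.matrix(dist(X))^2                       # Line 2: distance matrix
  kernels <- list(
    gauss  = exp(-Dist2 / (2 * sigma^2)),             # Gaussian kernel matrix
    pseudo = 1 / (1 + Dist2 / c)                      # assumed t-class kernel matrix
  )
  # Lines 2-8: leading eigenvalues of each kernel matrix -> weights (Eq 12)
  top_sums <- sapply(kernels, function(K)
    sum(eigen(K, symmetric = TRUE, only.values = TRUE)$values[1:d]))
  w <- top_sums / sum(top_sums)
  # Lines 9-11: weighted kernel matrix and its eigen-decomposition
  Kw <- Reduce(`+`, Map(`*`, w, kernels))
  eig <- eigen(Kw, symmetric = TRUE)
  lambda <- pmax(eig$values[1:d], .Machine$double.eps)
  alpha  <- sweep(eig$vectors[, 1:d, drop = FALSE], 2, sqrt(lambda), "/")
  # Line 12: d-dimensional coordinates of all samples in the new feature space
  Z <- Kw %*% alpha
  list(scores = Z, weights = w, eigenvalues = eig$values)
}

set.seed(1)
X <- matrix(rnorm(120 * 2000), nrow = 120)            # toy data: 120 samples, 2000 genes
res <- wkpca(X, d = 10)
res$weights
dim(res$scores)                                       # 120 x 10
```

In practice, the kernel list, the parameters sigma and c, and the retained dimension d would be chosen per data set, for example with the wrapper-style cross-validation search described later in Section 4.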

4 Experimental results and analysis

In this section, the t-class kernel functions constructed in Section 3.2 are weighted and combined. WKPCA dimension reduction is performed on 6 real gene expression data sets based on the t-class weighted kernel function to obtain uncorrelated principal components. According to Eq (14), the number d of principal components to retain is determined. Then, the current mainstream machine learning methods, including naive Bayes (NB) [32], support vector machines (SVM) [33], k-nearest neighbour (KNN) [34], random forest (RF) [35] and iterative random forest (IRF) [36–38], are used to make classification predictions on the subset after dimension reduction. These machine learning algorithms are applied to the all-variable (AV) data set and to the data subsets obtained by linear principal component analysis (PCA) dimension reduction, single kernel principal component analysis (SKPCA) dimension reduction and weighted kernel principal component analysis (WKPCA) dimension reduction.

4.1 Experimental design

The experiments were conducted on a machine equipped with the Windows 10 64-bit operating system, an Intel i7-10510U CPU at 2.3 GHz and 16 GB of memory. The algorithm was implemented in the R language (R 3.6.3). The 6 real data sets used in this paper are from the Broad Institute Genome Data Analysis Center (http://portals.broadinstitute.org/cgi-bin/cancer/datasets.cgi). See Table 2 for detailed information.

To compare the performance of the machine learning classification algorithms in different dimensions, the classification macro accuracy, macro precision, macro recall and macro F1 are used, and their specific definitions are as follows.

Suppose that the data set D has k categories. The ith category is considered the positive class, and the remaining k − 1 categories are deemed the negative class. We use Pi, Ri and F1i to denote the precision, recall and F1 of the ith category, respectively.

(28)(29)(30)

From Eqs (28) to (30), it can be seen that the "macro" measures are obtained by calculating the precision, recall and F1 of each category and then averaging them, so as to evaluate the performance of the algorithm on multi-class problems. The larger the macro precision, macro recall and macro F1, the better the performance of the algorithm. The AUC, the area under the ROC curve, is also used as an evaluation criterion [39].
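As a concrete reading of Eqs (28) to (30), the hedged R sketch below computes the macro measures from a confusion matrix by treating each class in turn as the positive class and averaging the per-class precision, recall and F1:

```r
# Macro-averaged precision, recall and F1 from a confusion matrix
# (rows = true classes, columns = predicted classes); illustrative sketch.
macro_metrics <- function(truth, pred) {
  classes <- union(levels(factor(truth)), levels(factor(pred)))
  cm <- table(factor(truth, classes), factor(pred, classes))
  P  <- diag(cm) / pmax(colSums(cm), 1)              # per-class precision P_i
  R  <- diag(cm) / pmax(rowSums(cm), 1)              # per-class recall R_i
  F1 <- ifelse(P + R > 0, 2 * P * R / (P + R), 0)    # per-class F1_i
  c(macro_precision = mean(P),
    macro_recall    = mean(R),
    macro_F1        = mean(F1),
    accuracy        = sum(diag(cm)) / sum(cm))
}

truth <- c("A", "A", "B", "B", "C", "C")
pred  <- c("A", "B", "B", "B", "C", "A")
macro_metrics(truth, pred)
```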

Since the number of categories in each of the 6 datasets is greater than 2, the definition of the AUC for multi-classification problems given by Hand and Till [40] is adopted. A nonlinear SVM based on the Gaussian kernel function is used. The parameters of the SVM and KNN classification methods are tuned with the parameter-tuning functions tune.svm and tune.kknn in the R language [41]. In tune.svm, the parameter grid search range is set to 0.1 to 4 with a step length of 0.1. In tune.kknn, the parameter grid search range is set to 1 to 30 with a step length of 1. RF is set to 500 trees by default, and the number of IRF iterations is set to 6.

To evaluate the overall classification performance of WKPCA combined with various machine learning algorithms, the definition of the optimal performance rate (OPR) of WKPCA is given in this paper: (31) where MN is the number of machine learning algorithms, DN is the number of data sets, EN is the number of evaluation indexes, and PN is the number of times that WKPCA dimension reduction reaches the maximum under each evaluation index.

By extending Eq (31), the cumulative optimal performance rate (COPR) of WKPCA is given as (32) where PNi is the number of times that WKPCA dimension reduction reaches the ith-largest value under each evaluation index and s is the number of methods compared with WKPCA.

4.2 Comparison experiment

Based on the t-class weighted kernel function, WKPCA dimension reduction is performed on the 6 gene expression data sets in Table 2. A large number of comparative experiments show that, for different data sets and different classification methods, different kernel combination formulas for dimensionality reduction result in different classification performance. To achieve the relatively optimal performance of the classification algorithms after kernel principal component dimension reduction, the following three forms of kernel combination are mainly adopted.

(33)(34)(35)

The above combinations are used to reduce the dimensionality of the original data sets based on kernel principal components and are compared with the traditional Gaussian kernel; the experimental results are shown in Tables 3 to 8.

Table 3. Performance measurement comparisons of machine learning methods based on the Breast data set after WKPCA dimension reduction.

https://doi.org/10.1371/journal.pone.0258326.t003

Table 4. Performance measurement comparisons of machine learning methods based on the DLBCL-B data set after WKPCA dimension reduction.

https://doi.org/10.1371/journal.pone.0258326.t004

Table 5. Performance measurement comparisons of machine learning methods based on the DLBCL-D data set after WKPCA dimension reduction.

https://doi.org/10.1371/journal.pone.0258326.t005

Table 6. Performance measurement comparisons of machine learning methods based on the Leukaemia data set after WKPCA dimension reduction.

https://doi.org/10.1371/journal.pone.0258326.t006

Table 7. Performance measurement comparisons of machine learning methods based on the Multi-A data set after WKPCA dimension reduction.

https://doi.org/10.1371/journal.pone.0258326.t007

Table 8. Performance measurement comparisons of machine learning methods based on the Lung data set after WKPCA dimension reduction.

https://doi.org/10.1371/journal.pone.0258326.t008

For SKPCA dimension reduction, the selected single kernel function is the Gaussian kernel (36)

The weights in Eqs (33), (34) and (35) are determined according to Eq (12). For the determination of the scale parameters c1, c2 and γ, a wrapper learning approach is used: the parameter selection of the kernel function is combined with the subsequent machine learning classification algorithm, and the parameters that make the classification performance optimal are selected through cross validation. Finally, these parameters are set to c1 = 0.1, c2 = 0.2 and γ = 0.1.
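The sketch below is one hedged way to carry out such a wrapper search in R with the e1071 SVM: it evaluates a small grid of candidate scale parameters by the cross-validated accuracy of the classifier trained on the WKPCA scores. The grid values, the fold count and the wkpca() function (from the sketch in Section 3.3.1) are illustrative assumptions, not the authors' exact settings.

```r
# Wrapper-style selection of kernel scale parameters via 5-fold CV accuracy
# (illustrative; reuses the wkpca() sketch from Section 3.3.1 and e1071::svm).
library(e1071)

cv_accuracy <- function(Z, y, folds = 5) {
  idx <- sample(rep(seq_len(folds), length.out = length(y)))
  accs <- sapply(seq_len(folds), function(f) {
    fit <- svm(x = Z[idx != f, , drop = FALSE], y = factor(y[idx != f]))
    mean(predict(fit, Z[idx == f, , drop = FALSE]) == y[idx == f])
  })
  mean(accs)
}

set.seed(1)
X <- matrix(rnorm(120 * 2000), nrow = 120)                       # toy expression data
y <- factor(sample(c("tumour", "normal"), 120, replace = TRUE))  # toy labels

grid <- expand.grid(sigma = c(1, 2, 4), c = c(0.1, 0.2, 0.5))
grid$acc <- apply(grid, 1, function(g) {
  # note: for simplicity WKPCA is fitted on all samples before CV;
  # a stricter evaluation would refit the reduction inside each training fold.
  Z <- wkpca(X, d = 10, sigma = g["sigma"], c = g["c"])$scores
  cv_accuracy(Z, y)
})
grid[which.max(grid$acc), ]    # parameter pair with the best CV accuracy
```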

The experimental results in Tables 3 to 8 show that it is not difficult to find relatively optimal parameters.

The above five machine learning methods are used to classify and predict four versions of each data set: (1) the version with all variables; (2) the version obtained by linear principal component analysis dimension reduction; (3) the version obtained by single kernel function dimension reduction; and (4) the version obtained by weighted kernel function dimension reduction. The comparison results obtained through nested 5-fold cross validation are shown in Tables 3 to 8, in which the optimal performance index values are bolded.

We combine the five machine learning methods with AV, PCA, SKPCA and WKPCA, so each table (Tables 3 through 8) contains 20 methods. In these six tables, the machine learning classification algorithms combined with WKPCA correspond to the best performance in most cases. Taking the Breast data set as an example, compared with AV, PCA and SKPCA, NB_WKPCA, SVM_WKPCA, KNN_WKPCA and RF_WKPCA were the largest on all four evaluation indexes. However, IRF_WKPCA did not reach the maximum on all four evaluation indexes, and the other tables show similar results. According to the experimental results in Tables 3 through 8, among the 5 machine learning methods combined with WKPCA, there are 4, 14, 5, 13, 10 and 3 evaluation index values, respectively, that do not reach the maximum on the four evaluation indexes. Therefore, according to Eq (31), the optimal performance rate of WKPCA on these 6 data sets is

According to Eq (32), the cumulative optimal performance rate of WKPCA on these 6 data sets is

From the OPR and COPR values, it can be concluded that the WKPCA algorithm is optimal in 71 cases and suboptimal in 37 cases, and the cumulative optimal performance rate of the first two positions reaches 95%. This indicates that WKPCA dimension reduction can effectively improve the classification performance of the current mainstream machine learning algorithms. In other words, WKPCA is superior to AV, PCA and SKPCA in most cases.

It should be noted that, for the SVM classification algorithm, if all variables are involved in the modelling without dimension reduction, the classification accuracies of SVM_AV on the 6 data sets are only 0.5200, 0.4833, 0.3803, 0.3188, 0.1933 and 0.7053. After WKPCA dimension reduction, the SVM classification accuracies were greatly improved, reaching 0.9184, 0.9556, 0.8071, 0.9758, 0.9805 and 0.9796, respectively. This shows that when the number of features in a data set is much larger than the number of samples, the classification performance of some algorithms degrades, or the algorithms even become invalid, if all variables are involved in the model. However, after WKPCA dimension reduction, a few mutually uncorrelated principal components are retained, redundant information (noise interference) is eliminated and the main information related to the sample category is kept, which improves the classification performance of the machine learning algorithms. In Tables 5 and 7, NB_AV has missing values (NA) on the four performance indexes. According to the experimental analysis, the reason for this problem is that the sample variance is 0 for at least one column variable; if all variables are included in the NB model for classification, the algorithm fails. However, after dimension reduction by WKPCA, PCA or SKPCA, zero-variance variables are avoided, and normal classification results can be obtained.

To intuitively compare the classification effects of AV, PCA, SKPCA and WKPCA combined with the above five machine learning methods, the SVM, KNN and RF classifiers are taken as examples (the other classifiers behave similarly). Bar charts of the nested 5-fold cross-validation AUC values are drawn for these six data sets, and the results are shown in Figs 3 to 5.

Fig 3. Comparison of the SVM_AV, SVM_PCA, SVM_SKPCA and SVM_WKPCA method AUC values.

https://doi.org/10.1371/journal.pone.0258326.g003

Fig 4. Comparison of the RF_AV, RF_PCA, RF_SKPCA and RF_WKPCA method AUC values.

https://doi.org/10.1371/journal.pone.0258326.g004

Fig 5. Comparison of the KNN_AV, KNN_PCA, KNN_SKPCA and KNN_WKPCA method AUC values.

https://doi.org/10.1371/journal.pone.0258326.g005

As seen from Fig 3, except that SVM_WKPCA is slightly inferior to SVM_PCA on the Leukaemia data, the AUC values of SVM_WKPCA on the other 5 data sets reach the maximum, which is significantly better than those of SVM_AV and SVM_SKPCA and slightly better than that of SVM_PCA. In Fig 4, the AUC value of RF_WKPCA on the Breast data set is lower than those of RF_AV and RF_PCA, while on the other 5 data sets, the AUC values of RF_WKPCA all achieve the optimal values, although the advantage is not very large. As seen from Fig 5, the AUC values of KNN_WKPCA reach the maximum on the Multi-A and Lung data sets. The AUC value of KNN_WKPCA is similar to that of KNN_AV or KNN_PCA on the Breast, DLBCL-B and Leukaemia data sets. On the DLBCL-D data set, the AUC value of KNN_WKPCA is the lowest. This shows that, for different data sets, WKPCA dimension reduction cannot make all classification algorithms achieve optimal performance.

From Figs 3 to 5, overall it can be concluded that the AUC values of the SVM, RF and KNN classifiers can be improved after WKPCA dimension reduction for most data sets. The results show that WKPCA dimension reduction can effectively improve the predictive performance of the current mainstream machine learning classification algorithms.

5 Conclusion

Aiming at the characteristics of high dimensionality, high redundancy and small sample sizes in gene expression data sets, a principal component dimension reduction algorithm based on the weighted kernel function is proposed in this paper to improve machine learning classification prediction performance and reduce the complexity of the classification process. By calculating the eigenvalues of the kernel matrices, the kernel function weights are constructed, and the t-class kernel function is also constructed to further improve the dimension reduction efficiency of WKPCA. Finally, the cumulative optimal performance rate is constructed to evaluate the overall classification level of WKPCA combined with mainstream machine learning algorithms. The analysis of the experimental results on 6 real data sets shows that, compared with the all-variable model, traditional linear principal component analysis dimension reduction and single kernel principal component analysis dimension reduction, the WKPCA dimension reduction algorithm proposed in this paper can effectively improve the classification prediction performance of the current mainstream machine learning methods.

The key to WKPCA dimension reduction lies in how to choose a ‘suitable kernel function’. Our weighted kernel function makes the form of the kernel function more diversified and the selection more flexible, which allows better adaptation to data sets in different fields. In real-world problem analysis, to achieve the desired performances of machine learning on each data set in this paper, we have to attempt different kernel function combinations with different parameter settings. In other words, the best algorithm configuration is dataset-dependent. However, our WKPCA dimension reduction algorithm is quite insensitive to parameter settings.

Supporting information

References

  1. Watson J D, Crick F H. Molecular structure of nucleic acids: a structure for deoxyribose nucleic acid. Nature. 1953; 248(5451): 623–624. pmid:13054692
  2. Hanahan D, Weinberg R A. The hallmarks of cancer. Cell. 2000; 100: 57–71. pmid:10647931
  3. Chen W, Sun K, Zheng R, et al. Cancer incidence and mortality in China, 2014. Chinese Journal of Cancer Research. 2018; 30(1): 1–12. pmid:29545714
  4. Rui X, Damelin S, Nadler B, et al. Clustering of high-dimensional gene expression data with feature filtering methods and diffusion maps. Artificial Intelligence in Medicine. 2010; 48(2–3): 91–98. pmid:19962867
  5. Shen Q, Mei Z, Ye B X. Simultaneous genes and training samples selection by modified particle swarm optimization for gene expression data classification. Computers in Biology & Medicine. 2009; 39(7): 646–649. pmid:19481202
  6. Liu J, Cheng Y, Wang X, Zhang L, Wang Z J. Cancer characteristic gene selection via sample learning based on deep sparse filtering. Scientific Reports. 2018; 8(1): 8270. pmid:29844511
  7. Hindawi M, Benabdeslem K. Local-to-global semi-supervised feature selection. In: Proceedings of the ACM International Conference on Information & Knowledge Management. 2013; p. 2159–2168.
  8. Helleputte T, Dupont P. Partially supervised feature selection with regularized linear models. In: Proceedings of the International Conference on Machine Learning (ICML 2009), Montreal, Quebec, Canada. 2009; p. 409–416.
  9. Liu B, Wan C, Wang L. An efficient semi-unsupervised gene selection method via spectral biclustering. IEEE Transactions on Nanobioscience. 2006; 5(2): 110–114. pmid:16805107
  10. Lan L, Vucetic S. Improving accuracy of microarray classification by a simple multi-task feature selection filter. International Journal of Data Mining and Bioinformatics. 2011; 5(2): 189–208. pmid:21544954
  11. Shreem S S, Abdullah S, Nazri M Z A, et al. Hybridizing relief, mRMR filters and GA wrapper approaches for gene selection. Journal of Theoretical & Applied Information Technology. 2013; 46(2): 1258–1263.
  12. Liu J, Cheng Y, Wang X, Cui X. Supervised penalty matrix decomposition for tumor differentially expressed genes selection. Chinese Journal of Electronics. 2018; 27(4): 845–851.
  13. Wold S, Esbensen K, Geladi P. Principal component analysis. Chemometrics and Intelligent Laboratory Systems. 1987; 2(1–3): 37–52.
  14. Diamantaras K I, Kung S Y. Principal component neural networks. New York: Wiley; 1996.
  15. Hyvarinen A. A family of fixed-point algorithms for independent component analysis. In: IEEE International Conference on Acoustics, Speech, and Signal Processing. IEEE; 1997; 5: 3917–3920.
  16. Hyvarinen A. Fast and robust fixed-point algorithms for independent component analysis. IEEE Transactions on Neural Networks. 1999; 10(3): 626–634. pmid:18252563
  17. Scholkopf B, Smola A, Muller K R. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation. 1998; 10(5): 1299–1319.
  18. Mika S, Ratsch G, Weston J, et al. Fisher discriminant analysis with kernels. In: Neural Networks for Signal Processing IX: Proceedings of the 1999 IEEE Signal Processing Society Workshop. IEEE; 2002. p. 41–48.
  19. Xu Y, Lin C, Zhao W. Producing computationally efficient KPCA-based feature extraction for classification problems. Electronics Letters. 2010; 46(6): 452–453.
  20. Xu Y, Zhang D. Accelerating the kernel-method-based feature extraction procedure from the viewpoint of numerical approximation. Neural Computing & Applications. 2011; 20(7): 1087–1096.
  21. Gönen M, Alpaydin E. Multiple kernel learning algorithms. Journal of Machine Learning Research. 2011; 12: 2211–2268.
  22. Zhang Z F, Ma L N, Xing L N. Infrared face recognition based on composite kernel function KPCA. Computer Simulation. 2013; 30(2): 369–372.
  23. Yang L, Zhichuan Z, Alin H, et al. Pulmonary nodule recognition based on multiple kernel learning support vector machine-PSO. Computational & Mathematical Methods in Medicine. 2018; 2018: 1–10. pmid:29853983
  24. Lanckriet G R G, Cristianini N, Bartlett P, Ghaoui L E, Jordan M I. Learning the kernel matrix with semidefinite programming. The Journal of Machine Learning Research. 2004; 5(1): 27–72.
  25. Wang X, Kang Z, Liu L, et al. Multi-channel Raman spectral reconstruction based on Gaussian kernel principal component analysis. Acta Photonica Sinica. 2020; 49(3): 0330001.
  26. Zhou Z H. Machine learning. Tsinghua University Press; 2016. p. 126–128.
  27. Kabacoff R I. R in Action (2nd edition). Posts and Telecom Press; 2016. p. 442–443.
  28. Schölkopf B. Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT Press; 2003.
  29. Gao H Y. Applied multivariate statistical analysis. Peking University Press; 2005.
  30. Sheng Wang Guo. Properties and construction methods of kernels in support vector machines. Computer Science. 2006; 33(6): 172–175.
  31. Mao S S, Fu X L, Cheng Y M. Probability and mathematical statistics. Higher Education Press; 2012.
  32. Domingos P, Pazzani M. On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning. 1997; 29(2–3): 103–130.
  33. Cortes C, Vapnik V. Support-vector networks. Machine Learning. 1995; 20(3): 273–297.
  34. Venables W N, Ripley B D. Modern Applied Statistics with S. Springer; 2010.
  35. Breiman L. Random forests. Machine Learning. 2001; 45: 5–32.
  36. Basu S, Kumbier K, Brown J B, et al. Iterative random forests to discover predictive and stable high-order interactions. Proc Natl Acad Sci U S A. 2018; 115(8): 1943–1948. pmid:29351989
  37. Anaissi A, Kennedy P J, Goyal M, et al. A balanced iterative random forest for gene selection from microarray data. BMC Bioinformatics. 2013; 14(1): 1–10. pmid:23981907
  38. Shah R D, Meinshausen N. Random intersection trees. JMLR.org; 2014.
  39. Bradley A P. The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition. 1997; 30(7): 1145–1159.
  40. Hand D J, Till R J. A simple generalisation of the area under the ROC curve for multiple class classification problems. Machine Learning. 2001; 45(2): 171–186.
  41. Lesmeister C. Mastering Machine Learning with R (Second Edition). Packt Publishing; 2017.