
Improved linear classifier model with Nyström

  • Changming Zhu ,

    Roles Conceptualization, Formal analysis, Funding acquisition, Methodology, Writing – original draft

    cmzhu@shmtu.edu.cn

    Affiliation College of Information Engineering, Shanghai Maritime University, Shanghai, China

  • Xiang Ji,

    Roles Investigation, Writing – review & editing

    Affiliation College of Information Engineering, Shanghai Maritime University, Shanghai, China

  • Chao Chen,

    Roles Software

    Affiliation College of Information Engineering, Shanghai Maritime University, Shanghai, China

  • Rigui Zhou,

    Roles Methodology

    Affiliation College of Information Engineering, Shanghai Maritime University, Shanghai, China

  • Lai Wei,

    Roles Investigation

    Affiliation College of Information Engineering, Shanghai Maritime University, Shanghai, China

  • Xiafen Zhang

    Roles Data curation

    Affiliation College of Information Engineering, Shanghai Maritime University, Shanghai, China

Abstract

Most data sets consist of interlaced, multi-class samples. Since such samples cannot always be classified correctly by a linear hyperplane, we call them nonlinearly separable data sets and call the corresponding classifiers nonlinear classifiers. Traditional nonlinear classifiers adopt kernel functions to generate kernel matrices and then obtain the optimal classifier parameters by solving these matrices. However, computing and storing kernel matrices brings high computational and space complexities. Since INMKMHKS adopts the Nyström approximation technique and NysCK changes nonlinearly separable data to linearly separable data so as to reduce these complexities, we combine their ideas to develop an improved NysCK (INysCK). Moreover, we extend INysCK to multi-view applications and propose multi-view INysCK (MINysCK). Related experiments validate the effectiveness of both methods in terms of accuracy, convergence, Rademacher complexity, etc.

Introduction

Background

In real-world applications, most data sets consist of interlaced samples from multiple classes. If samples cannot (can) be classified correctly with a linear hyperplane, we call them nonlinearly (linearly) separable samples. As we know, linear classifiers including HK, MHKS, and SVM [1] are feasible for processing linearly separable samples, while for nonlinearly separable ones, which are ubiquitous, nonlinear classifiers including NCC [2], FC-NTD [3], KMHKS [4], and KSVM [5] are more suitable. One family of nonlinear classifiers is the kernel-based one, including MultiV-KMHKS [6], MVMHKS [7], RMVMHKS [8], DLMMLM [9], UDLMMLM [10], etc. [11–13]; these classifiers first adopt kernel functions to generate kernel matrices and then obtain the optimal classifier parameters by solving these matrices. For convenience, Table 1 summarizes the full names and abbreviations of some terms.

Problem and previous solutions

Most kernel-based classifiers cost O(n^3) computational complexity to decompose the kernel matrices and O(Mn^2) space complexity to store them, where n is the number of samples and M is the number of kernel functions used. These complexities are too high for most real-world classification problems. Fortunately, some classifiers, including NMKMHKS [14], INMKMHKS [11], and NysCK, which is developed on the basis of the cluster kernel (CK) [15], have been developed to reduce the complexities. (1) NMKMHKS selects s samples from the n available ones and uses the Nyström approximation technique to obtain an approximation of each kernel matrix. With NMKMHKS, the computational complexity can be reduced to O(Mns^2) and the space complexity can be reduced to O(n^2). However, since the number and parameters of the kernel functions must be initialized beforehand and s is set at random, the performance of NMKMHKS may be poor in noisy cases and is sensitive to s. (2) INMKMHKS adopts a clustering technique to guide the generation of kernel functions and approximation matrices. This operation overcomes the defects of NMKMHKS and keeps a lower complexity. (3) NysCK decomposes each kernel matrix K as K = FF^T, where each row of F represents a linearly separable sample, so nonlinearly separable samples can be changed to linearly separable ones. [15] has validated that these linearly separable samples correspond to the original ones and can be classified by linear classifiers with high accuracy.

Motivation and novelty

Since INMKMHKS avoids setting s and the kernel parameters and NysCK changes nonlinearly separable samples to linearly separable ones, we combine them to develop an improved NysCK (INysCK) that reduces the complexities further. Moreover, multi-view data sets, in which each sample has multiple views and each view consists of multiple features, are widely used in the real world, and many corresponding multi-view classifiers have been developed [16–18]. Since INysCK cannot process multi-view data sets directly, we extend INysCK to multi-view applications and propose multi-view INysCK (MINysCK).

Since INMKMHKS (NysCK) was developed in 2015 (2017), their ideas and innovations are still relatively new. What's more, to the best of our knowledge, no existing method combines their ideas. In other words, the idea of our methods is novel and this is the first attempt of its kind. In our methods, for the original data set, we first adopt the ideas of INMKMHKS to generate several kernel functions and obtain the corresponding Nyström approximation matrices. Then, on the basis of these matrices, we adopt the ideas of NysCK to get F. In F, each row represents a linearly separable sample that corresponds to an original sample. We can then classify these linearly separable samples with linear classifiers. This operation is similar to classifying the original samples with nonlinear classifiers, and it does not adversely influence the classification results.

Contribution

The contributions of our work are: (1) a new way to process nonlinear classification problems that does not require many parameters to be initialized beforehand; (2) low computational and space complexities; (3) the first application of such an idea to multi-view problems.

Related work

Nyström approximation technique

For a kernel-based classifier, whether the solution is feasible depends on the eigendecomposition of the kernel matrix, and in general the eigendecomposition costs O(n^3) computation, where n is the number of samples. In order to cut down this cost, [19] develops the Nyström approximation technique to speed up the eigendecomposition. Simply speaking, one selects s samples from the whole data set to approximate the kernel matrix, and the computational complexity can then be reduced to O(ns^2). Recently, the Nyström approximation technique has been applied in multiple fields. For example, [20] uses it to obtain an approximate infinite-dimensional region covariance descriptor which significantly outperforms low-dimensional descriptors on image classification tasks; [21] introduces Nyström into kernel subspace learning and reduces the time and space complexities; [22] combines the Nyström method with a spectral clustering algorithm to decrease the computational complexity of spectral clustering while keeping a high clustering accuracy.
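To make this concrete, here is a minimal NumPy sketch of the Nyström idea: approximate the full kernel matrix from s landmark samples. The RBF form, the landmark count s, and the synthetic data are our own illustrative assumptions, not values taken from [19] or from this paper.

import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    # Pairwise RBF kernel values between rows of A and rows of B.
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / sigma ** 2)

def nystrom(X, s, sigma=1.0, k=None):
    """Rank-k Nystrom approximation of the full RBF kernel matrix on X."""
    n = X.shape[0]
    idx = np.random.choice(n, s, replace=False)   # s landmark samples
    C = rbf_kernel(X, X[idx], sigma)              # n x s block [W; K21]
    W = C[idx]                                    # s x s block
    k = k or s
    # Best rank-k pseudo-inverse of W via its eigendecomposition.
    vals, vecs = np.linalg.eigh(W)
    order = np.argsort(vals)[::-1][:k]
    lam = np.maximum(vals[order], 1e-12)          # guard against tiny eigenvalues
    W_k_pinv = (vecs[:, order] / lam) @ vecs[:, order].T
    return C @ W_k_pinv @ C.T                     # approximates the n x n kernel

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 10))
    K_full = rbf_kernel(X, X)
    K_tilde = nystrom(X, s=50)
    print("relative Frobenius error:",
          np.linalg.norm(K_full - K_tilde) / np.linalg.norm(K_full))

Only the n x s block C and the s x s block W are ever formed, which is where the O(ns^2) saving comes from.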

NysCK

Suppose there is a nonlinearly separable data set X = [Xl, Xv], where Xl consists of l labeled samples, Xv consists of v unlabeled samples, l < v, and n = l + v. The objective of NysCK is to change nonlinearly separable samples into linearly separable ones with the Nyström approximation technique and to predict the class labels of the unlabeled samples. For these n samples, NysCK first constructs a kernel matrix K and then selects s samples from Xl to decompose K as K = [W, K12; K21, K22], where the s × s block W corresponds to the s selected samples. After that, NysCK carries out SVD on W and gets the Nyström approximation matrix of K, i.e., K̃ = [W; K21] Wk+ [W, K12], where Wk+ denotes the pseudo-inverse of Wk, which is the best rank-k approximation of W. Finally, NysCK decomposes K̃ as K̃ = FF^T, where F represents the n linearly separable samples. Each row of F corresponds to an original sample, i.e., Fl corresponds to Xl while Fv corresponds to Xv. Finally, we can train a linear classifier on Fl and classify Fv. According to [15], the computational and space complexities are O(n(nd + k^2)) and O(n(d + k)) respectively, where d is the dimension of each sample.
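The following sketch illustrates this pipeline as we read it: build the Nyström blocks from s samples of the labeled part, factor the approximation as FF^T, train a linear classifier on Fl, and classify Fv. The data set (scikit-learn's two-moons), the landmark count, k, and the use of logistic regression as the linear classifier are illustrative assumptions, not the paper's choices.

import numpy as np
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression

def rbf_kernel(A, B, sigma=0.5):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / sigma ** 2)

def nystrom_features(X, landmark_idx, k, sigma=0.5):
    """F (n x k) such that F @ F.T equals the rank-k Nystrom approximation of K."""
    C = rbf_kernel(X, X[landmark_idx], sigma)     # n x s block [W; K21]
    W = C[landmark_idx]                           # s x s block
    vals, vecs = np.linalg.eigh(W)
    order = np.argsort(vals)[::-1][:k]
    lam = np.clip(vals[order], 1e-10, None)
    V = vecs[:, order]
    return C @ (V / np.sqrt(lam))                 # K_tilde = F F^T

# Hypothetical data: a nonlinearly separable set with l labelled and v unlabelled samples.
X, y = make_moons(n_samples=600, noise=0.1, random_state=0)
l = 200                                           # labelled part; the rest plays the role of Xv
idx = np.random.default_rng(0).choice(l, 40, replace=False)   # s samples taken from Xl
F = nystrom_features(X, idx, k=20)
clf = LogisticRegression(max_iter=1000).fit(F[:l], y[:l])     # linear classifier on Fl
print("accuracy on Fv:", clf.score(F[l:], y[l:]))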

INMKMHKS

The procedure of INMKMHKS consists of four steps. (1) For a data set X with n samples, INMKMHKS adopts kernel clustering to cover X with M clusters such that the samples in each cluster share the same class label. Then INMKMHKS regards the midpoint and width of each cluster as the parameters of an RBF kernel, so that M kernel functions are generated without setting initial kernel parameters. (2) Using the M kernel functions, INMKMHKS generates M kernel matrices Kp (p = 1, 2,…, M) and gets the corresponding Nyström approximation forms K̃p without setting an initial s. (3) INMKMHKS calculates a coefficient for each K̃p and constructs the ensemble kernel matrix G from the K̃p and their coefficients. (4) INMKMHKS applies G to the KMHKS-based process and gets the final discriminant function.

INysCK and MINysCK

INysCK

Generating kernel functions without setting initial kernel parameters.

Suppose there is an L-class data set X including l training samples Xl and v test samples Xv (n = l + v). Here Xl = {X1, X2,…, XL} = {x1, x2,…, xl} and the class labels are Y = {y1, y2,…, yL}. Each class Xc (c = 1, 2,…, L) consists of nc samples, i.e., Xc = {xc1, xc2,…, xcnc}, where nc is the number of samples in Xc and l = n1 + n2 + … + nL. xcj is the jth sample of Xc, where j ∈ {1, 2, …, nc}. Then, on the basis of the l training samples, we generate kernel functions in the following way.

For generating the first kernel function, we compute the midpoint μ of all training samples, i.e., μ = (1/l) Σ_{i=1}^{l} xi, and the distance between each xi and μ, i.e., di = ‖xi − μ‖. The distances are sorted in ascending order, i.e., d(1) ≤ d(2) ≤ … ≤ d(l), and the corresponding samples are denoted as x(1), x(2),…, x(l). If the class labels of x(1), x(2),…, x(u) are the same while the label of x(u+1) differs from that of x(u), then we set the parameters of the first kernel function from μ and d(u) (in our work, the kernel functions are RBF kernels). For this kernel function, we let the corresponding samples x(1), x(2),…, x(u) be its basic samples and u be its basic number. In order to generate the second kernel function, we remove x(1), x(2),…, x(u) from Xl and repeat the previous steps. We repeat these steps until every training sample belongs to the basic samples of some kernel function. With this generation procedure, we get M new kernel functions, i.e., k1(xi, xj),…, kp(xi, xj),…, kM(xi, xj) with p = 1, 2,…, M, together with the corresponding M σ's and μ's.
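A rough sketch of this generation procedure is given below. It assumes (our reading, since the exact formulas did not survive extraction) that each kernel's width is taken from the distance d(u) of the last same-label sample, and it uses the standard RBF form; variable names are our own.

import numpy as np

def generate_rbf_kernels(X_train, y_train):
    """Peel off groups of same-label samples around the running midpoint.

    Returns a list of (mu, sigma, basic_count) tuples, one per generated
    RBF kernel function, following the INysCK generation idea.
    """
    X, y = np.asarray(X_train, float), np.asarray(y_train)
    kernels = []
    while len(X) > 0:
        mu = X.mean(axis=0)                      # midpoint of the remaining samples
        d = np.linalg.norm(X - mu, axis=1)       # distances to the midpoint
        order = np.argsort(d)                    # ascending distances
        labels = y[order]
        u = 1                                    # length of the leading run of identical labels
        while u < len(labels) and labels[u] == labels[0]:
            u += 1
        sigma = max(d[order[u - 1]], 1e-12)      # assumed width: the distance d(u)
        kernels.append((mu, sigma, u))
        keep = order[u:]                         # remove the u basic samples and repeat
        X, y = X[keep], y[keep]
    return kernels

def rbf(xi, xj, sigma):
    # Standard RBF kernel value; the exact parameterisation in the paper is assumed.
    return np.exp(-np.linalg.norm(xi - xj) ** 2 / sigma ** 2)

The number of tuples returned plays the role of M, and each basic_count is the u of the corresponding kernel.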

Constructing kernel matrices with Nyström approximation technique.

(1) We construct kernel matrices according to these M kernel functions. Suppose that for the pth kernel function, its parameters are μp and σp and its basic samples are x(1),…, x(u). With all n samples, the corresponding n × n kernel matrix is Kp = (kp(xi, xj))n×n, whose ith row and jth column element is kp(xi, xj), where i, j = 1,…, n. For convenience, in Kp, {x1,…, xl} corresponds to the l training samples and {xl+1,…, xn} corresponds to the v test ones.

(2) We centralize and normalize Kp with Eqs (1) and (2), simply for convenient calculation. Here 1n×n is an n × n-dimensional identity matrix, trace indicates the trace of a matrix, and for convenience we continue to use Kp to denote the centralized and normalized matrix Ktrp.

(3) For Kp, if both xi and xj are in the set {x(1),…, x(u)}, we collect these kp(xi, xj) values into a u × u-dimensional matrix Wp. If only one of xi and xj is in this set, we collect the values into an (n − u) × u-dimensional matrix Kp21, and then Kp12 = Kp21^T since Kp is symmetric. If neither xi nor xj is in this set, we collect the values into an (n − u) × (n − u)-dimensional matrix Kp22. The relative positions of the kp(xi, xj) values in Wp, Kp12, Kp21, and Kp22 are unchanged. With the above definitions, we decompose Kp with Eq (3), i.e., Kp = [Wp, Kp12; Kp21, Kp22], and let s = u without initializing s.

(4) We carry out SVD on Wp, i.e., Wp = Up Λp Up^T. Here Λp = diag(σp1, ⋯, σpu), σpi is the ith largest singular value, Up is composed of the eigenvectors of Wp corresponding to the σpi, and diag indicates the diagonalization operation.

(5) We get the rank-k Nyström approximation matrix K̃p for Kp by Eq (4), i.e., K̃p = [Wp; Kp21] Wp,k+ [Wp, Kp12], where Wp,k+ = Σ_{i=1}^{k} (1/σpi) Up(i) Up(i)^T is the rank-k pseudo-inverse of Wp, k (k ≤ u) is the rank of K̃p, and Up(i) is the ith column of Up.

(6) After repeating the previous steps for each p, we get all the K̃p, and for each Kp we have a corresponding basic number u. We then let the largest u be s.

(7) According to [14] and the Nyström approximation error between Kp and K̃p, we calculate the coefficient αp of each K̃p by Eq (5), where ‖·‖F represents the Frobenius norm, η > 0 is a predefined parameter, and a normalization factor is used so that Σ_{p=1}^{M} αp = 1.

(8) Finally, we get the ensemble kernel matrix G with Eq (6), i.e., G = Σ_{p=1}^{M} αp K̃p. A compressed code sketch of steps (1)–(8) follows.
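The sketch below covers steps (1)–(8) in compressed form. It omits the centralization and normalization of Eqs (1) and (2), assumes the samples are ordered so that each kernel's basic samples are contiguous, takes the kernel list from the generation sketch above, and reads Eq (5) as a normalized exponential of the relative approximation error; these are our assumptions about the elided formulas, not the paper's exact definitions.

import numpy as np

def rbf_matrix(X, sigma):
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / sigma ** 2)

def rank_k_nystrom(K, basic_idx, k):
    """Approximate K using the blocks built from its basic (landmark) samples."""
    C = K[:, basic_idx]                 # n x u block [Wp; Kp21]
    W = C[basic_idx]                    # u x u block Wp
    vals, vecs = np.linalg.eigh(W)
    order = np.argsort(vals)[::-1][:k]
    lam = np.maximum(vals[order], 1e-12)
    W_k_pinv = (vecs[:, order] / lam) @ vecs[:, order].T
    return C @ W_k_pinv @ C.T

def ensemble_kernel(X, kernel_params, k, eta=1.0):
    """Weighted combination G = sum_p alpha_p * K_tilde_p over the M kernels."""
    approximations, errors = [], []
    start = 0
    for _mu, sigma, u in kernel_params:           # (midpoint, width, basic count) per kernel
        K_p = rbf_matrix(X, sigma)
        basic_idx = np.arange(start, start + u)   # assumes basic samples are contiguous in X
        K_p_tilde = rank_k_nystrom(K_p, basic_idx, min(k, u))
        errors.append(np.linalg.norm(K_p - K_p_tilde) / np.linalg.norm(K_p))
        approximations.append(K_p_tilde)
        start += u
    alpha = np.exp(-eta * np.array(errors))       # assumed reading of Eq (5)
    alpha /= alpha.sum()                          # normalisation so the coefficients sum to 1
    return sum(a * Kt for a, Kt in zip(alpha, approximations))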

Getting corresponding linearly separable samples.

(1) Once we get G, we let D = diag(D11,…, Dnn), where Dii is computed from the ith row of G and Gij represents the ith row and jth column element of G (i, j = 1,…, n). Then we decompose G with Eq (7), i.e., G = [WG, G12; G21, G22], where WG is s × s, G12 is s × (n − s), G21 is (n − s) × s, and G22 is (n − s) × (n − s). Since s is obtained in sub-step (6) of the previous subsection and G is a combination of multiple K̃p, the elements of WG, G12, G21, and G22 are fixed here. Since s < l, n − s = l + v − s > v, so the v × v part in the lower right corner of G22 corresponds to the v test samples.

(2) We carry out SVD on WG and obtain its best rank-k approximation WGk = UWG,k ΣWG,k UWG,k^T, where ΣWG,k is a diagonal matrix whose diagonal consists of the first k approximate eigenvalues and UWG,k consists of the corresponding k approximate eigenvectors. Then we get the Nyström approximation matrix of G with Eq (8), i.e., G̃ = [WG; G21] WGk+ [WG, G12], where WGk+ denotes the pseudo-inverse of WGk.

(3) After that, we compute Σk and Uk using Eq (9), where ΣWG,k+ is the pseudo-inverse of ΣWG,k. For each element σi of Σk, we apply Eq (10) to get its modified counterpart. Generally speaking, since v is larger than 9, using h = l + 9 is feasible.

(4) From the modified Σk and Uk obtained above, we finally get the linearly separable data set F by Eq (11). According to [15], each row of F corresponds to an original sample, i.e., Fl corresponds to Xl and Fv corresponds to Xv. Once F is obtained, we can adopt linear classifiers to train on Fl and classify Fv. For convenience, Table 2 shows the framework of INysCK; a partial code sketch follows.
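Since several intermediate formulas (Eqs (9)–(11)) did not survive extraction, the sketch below reproduces only the part that is unambiguous from the text: split G into blocks around the s basic samples, take a rank-k SVD of WG, form the Nyström approximation of G, and read off F from G̃ = FF^T. The D-normalization and the eigenvalue modification of Eq (10) are omitted, so this is a simplified reading rather than the full procedure.

import numpy as np

def linearly_separable_samples(G, s, k):
    """Rows of the returned F are the linearly separable samples (G_tilde = F F^T)."""
    W_G = G[:s, :s]                          # block of the s basic samples
    C = G[:, :s]                             # [W_G; G21], an n x s block
    U, svals, _ = np.linalg.svd(W_G)         # SVD of the symmetric block
    U_k = U[:, :k]
    s_k = np.clip(svals[:k], 1e-10, None)    # best rank-k singular values
    # G_tilde = C (W_Gk)^+ C^T = F F^T with F = C U_k diag(1/sqrt(s_k))
    return C @ (U_k / np.sqrt(s_k))

# Hypothetical usage on an ensemble kernel G built as in the previous sketch:
# F = linearly_separable_samples(G, s=60, k=20)
# F[:l] (training rows) feed a linear classifier, F[l:] (test rows) are classified.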

MINysCK

Suppose there is a multi-view data set X = {X1, X2,…, XV}, where V is the number of views and n is the number of samples. The gth view is Xg and the ith sample is xi = {xi1, xi2,…, xiV}, where xig represents the gth view of the ith sample. Each view Xg has dimension dg, which indicates that this view consists of dg features. In the procedure of MINysCK, we conduct INysCK on each view Xg and get the corresponding F, i.e., Fg. For V views, we thus get V such matrices, and the linear form of X is F = {F1, F2,…, FV}. Finally, we can adopt multi-view classifiers to process F. Table 3 shows the framework of MINysCK.
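A sketch of the per-view wrapper is shown below; the function name minysck, the inysck callable standing for the single-view procedure above, and the list-of-arrays representation of the views are our own conventions, not the paper's notation.

import numpy as np

def minysck(views, inysck, **kwargs):
    """Apply single-view INysCK to each view Xg and collect the results.

    views  : list of length V, views[g] is an (n x d_g) array for view g.
    inysck : callable implementing the single-view INysCK mapping Xg -> Fg.
    Returns the list F = [F1, ..., FV] to be fed to a multi-view classifier.
    """
    return [inysck(np.asarray(X_g), **kwargs) for X_g in views]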

Computational complexity and space complexity

According to [15], the computational and space complexities of NysCK are O(n(nd + k^2)) and O(n(d + k)) respectively, where d is the dimension of each sample. Compared with NysCK, the main added steps of INysCK are the generation of the kernel functions and matrices, so the added computational complexity is O(Ml^2) and the added space complexity is O(Ml^2). Since in real-world applications M ≪ n and l ≪ n, the computational and space complexities of INysCK are almost the same as those of NysCK. For MINysCK, which conducts INysCK on each view, the computational complexity is O(Σ_{g=1}^{V} n(ndg + k^2)) and the space complexity is O(Σ_{g=1}^{V} n(dg + k)). Thus, the computational and space complexities of INysCK and MINysCK are of the same order as those of NysCK.

Experiments

Experimental setting

We adopt four multi-view data sets (NUS-WIDE, YMVG, DBLP, Cora) and four UCI machine learning repository (UCI) [23] data sets (YCS, AA, BC, Arrhythmia) for targeted experiments. Among these data sets, half are large-scale and the rest are small-scale. Information on the UCI data sets is given in Table 4; the four multi-view data sets are described below, where D denotes dimensionality.

(1) NUS-WIDE is a web image data set which consists of 269648 samples (images), 5018 classes, and 6 views [24]. The six views are color histogram (Col-h, 64-D), color correlogram (Col-c, 144-D), edge direction histogram (Ed-h, 73-D), wavelet texture (Wav, 128-D), block-wise color moments (Bw-cm, 225-D), and bag of words based on SIFT descriptions (B-SIFT, 500-D).

(2) YMVG [25] is the abbreviation of YouTube multi-view video games and consists of 120000 samples (videos) from 31 classes (games). Each sample consists of 13 views: audio mfcc (A-m, 2000-D), audio sai boxes (A-s-b, 7168-D), audio sai intervalgrams (A-s-i, 4096-D), audio spectrogram stream (A-s-s, 1024-D), audio volume stream (A-v-s, 64-D), text description unigrams (T-d-u, 558936-D), text game lda 1000 (T-g-l, 1000-D), text tag unigrams (T-t-u, 422627-D), vision cuboids histogram (V-c-h, 512-D), vision hist motion estimate (V-h-m-e, 64-D), vision hog features (V-h-f, 647-D), vision hs hist stream (V-h-h-s, 1024-D), and vision misc (V-m, 838-D).

(3) DBLP [26, 27] is the abbreviation of digital bibliography and library project. The original DBLP is very large, so we select 5000 samples from 4 classes for experiments. Each sample has two views: paper name (P-n, 6167-D) and term (Te, 3787-D).

(4) Cora [27, 28] is adapted from the original Cora data set [29] and consists of 12004 scientific articles (samples) from 10 thematic classes; each sample has two views, i.e., content (Co, 292-D) and relational (Re, 12004-D).

These data sets are third-party ones, and others can access them in the same manner as we have done. Moreover, we confirm that we do not have any special access privileges that others would not have.

Then we adopt CK, NysCK, INysCK, and MINysCK to change nonlinearly separable samples into linearly separable ones. If we use the original data sets for experiments, we denote this by 'Null'. Here, we treat CK as the baseline method; when we only compare with NysCK, NysCK can be regarded as the baseline.

We use the classifiers shown in Table 5 for further processing. Linear classifiers are only feasible for linearly separable data sets, while nonlinear classifiers are feasible for both nonlinearly and linearly separable ones. Similarly, multi-view classifiers can process not only multi-view but also single-view data sets, while single-view classifiers are only feasible for single-view ones. We adopt SVM and MSVM as the two baseline classifiers in the respective experiments.

What's more, for each data set, 70% of the samples are chosen at random as training samples and the remaining ones are used for test. In order to obtain reliable experimental results, we adopt a 10-fold cross-validation strategy [10]. Moreover, the one-against-one classification strategy is used for multi-class problems [30–33]. In order to get average experimental results, we repeat the experiments 10 times. The computations are performed on an Intel Core 4 processor at 2.66 GHz with 4 GB DDR3 RAM, Windows 7, and MATLAB 2014.
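The following scikit-learn sketch mirrors this protocol on a synthetic stand-in data set: a random 70/30 split, 10-fold cross-validation, and a one-against-one multi-class strategy. The classifier and data set are illustrative, not the ones used in the paper.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.svm import SVC

# Stand-in data set; in the paper the (possibly INysCK-transformed) samples are used.
X, y = make_classification(n_samples=1000, n_features=20, n_classes=3,
                           n_informative=10, random_state=0)

# 70% of samples chosen at random for training, the rest for test.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.7, random_state=0)

# SVC uses a one-against-one decomposition for multi-class problems internally.
clf = SVC(kernel="linear", decision_function_shape="ovo")

# 10-fold cross-validation on the training part, then evaluation on the test part.
cv_acc = cross_val_score(clf, X_tr, y_tr, cv=10)
print("10-fold CV accuracy: %.3f +/- %.3f" % (cv_acc.mean(), cv_acc.std()))
print("test accuracy: %.3f" % clf.fit(X_tr, y_tr).score(X_te, y_te))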

Independent experiments

This part shows the performances of our proposed methods on different kinds of data sets.

Accuracy comparison on large-scale single-view data sets.

First, we show the effectiveness of INysCK on two large-scale single-view data sets, YCS and AA. For a fair comparison, we select the 8 single-view classifiers shown in Table 5 and adopt CK, NysCK, and INysCK to change the samples into linearly separable ones. Moreover, as is well known, accuracy, true positive rate (acc+), true negative rate (acc−), positive predictive value (PPV), F-Measure, G-Mean, etc. [44] are widely used to evaluate classification performance. Here we only show the accuracy results because the other evaluation criteria lead to similar conclusions. The top two sub-figures of Fig 1 show the results. According to these two sub-figures, for large-scale single-view data sets our INysCK brings the best accuracy no matter which classifier is used. Specifically, compared with CK and NysCK, the improvement of INysCK is small for the case AA with SVM, while for the other cases the improvement is larger. In order to explain this phenomenon, we analyze the distributions of YCS and AA. Since these two data sets are large-scale, we do not show the distributions of their samples in a figure and only describe them briefly. We find that for these two data sets the samples are distributed in an interlaced way with high nonlinearity. After carrying out the CK-related methods, most samples become linearly separable, and compared with CK and NysCK, the samples have a higher linearity with INysCK. Moreover, since the sizes of YCS and AA are large, the advantage in linearity derived from INysCK is larger. Thus, when we use those single-view classifiers, whether nonlinear or linear, to process the changed samples, the accuracies are higher. For the case AA with SVM, we find that with CK, NysCK, and INysCK used, the classification functions provided by the support vectors are similar, which is why INysCK brings only a small improvement in this case.

Fig 1. Accuracy with related classifiers and CK-based methods on used single-view data sets.

The CK-related method in italics represents the baseline one and the one in bold denotes the proposed one. For the classifiers, SVM is used as the baseline; we clarify this in words rather than in the font. Other figures and tables use similar representations.

https://doi.org/10.1371/journal.pone.0206798.g001

Accuracy comparison on small-scale single-view data sets.

Second, we adopt the data sets BC and Arrhythmia to validate the effectiveness of INysCK on small-scale single-view problems. As in the above experiments, the single-view classifiers shown in Table 5 are used, together with CK, NysCK, and INysCK. The bottom two sub-figures of Fig 1 show the results. According to these two sub-figures, for small-scale single-view problems our INysCK still brings the best accuracy on average. However, compared with the results of the above experiments, INysCK brings only a small improvement in more cases, and for the case Arrhythmia with KMHKS, NysCK outperforms INysCK. To explain this phenomenon, we also analyze the distributions of BC and Arrhythmia. We find that with INysCK used, the samples of these two data sets are linearly separable. But since their sizes are small, the advantage in linearity derived from INysCK is not obvious or is even absent. Thus, for some cases the improvement of INysCK is small, and for the case Arrhythmia with KMHKS, INysCK performs worse than NysCK because the advantage in linearity derived from INysCK does not exist for KMHKS.

Accuracy comparison on large-scale multi-view data sets.

Next, we use NUS-WIDE and YMVG to show the effectiveness of the proposed methods on large-scale multi-view problems. The classifiers used are the multi-view ones shown in Table 5, and here we use CK, NysCK, INysCK, and MINysCK to change the samples. Although NUS-WIDE and YMVG are multi-view data sets, in order to carry out CK, NysCK, and INysCK we regard all views as one whole view. The top two sub-figures of Fig 2 show the results. According to these two sub-figures, as a multi-view method, our MINysCK brings the best accuracy on average, and the improvement is much larger than in the previous results, especially for the cases NUS-WIDE with MSVM (MultiV-KMHKS, DLMMLM, MGGM, MV-LSSVM, SMVMED, KMLRSSC) and YMVG with MSVM (MultiV-KMHKS, MGGM). The main reason is that for multi-view data sets, MINysCK is more suitable than the single-view methods CK, NysCK, and INysCK. What's more, taking all views as one whole view in order to carry out CK, NysCK, and INysCK cannot reflect the differences among views; this operation can only keep the linearity of samples on the whole view and cannot guarantee linearity on each individual view. That is why MINysCK performs best on NUS-WIDE and YMVG.

Fig 2. Accuracy with related classifiers and CK-based methods on used multi-view data sets.

The CK-related method in italics represents the baseline one and the ones in bold denote the proposed ones. For the classifiers, MSVM is used as the baseline; we clarify this in words rather than in the font. Other figures and tables use similar representations.

https://doi.org/10.1371/journal.pone.0206798.g002

Accuracy comparison on small-scale multi-view data sets.

Now we adopt DBLP and Cora for the accuracy comparison on small-scale multi-view problems; the other settings are the same as in the above experiments. The bottom two sub-figures of Fig 2 show the results. From these two sub-figures, our MINysCK performs best on small-scale multi-view data sets as well. But comparing carefully with the above experimental results, since the sizes of DBLP and Cora are much smaller than those of NUS-WIDE and YMVG, the advantage in linearity derived from MINysCK is not obvious, and as a result the improvement of MINysCK is reduced.

Comparison about time cost.

Besides the accuracy comparisons, we show the time cost comparison here. As stated before, the computational and space complexities of INysCK and MINysCK are the same as those of NysCK, i.e., O(n(nd + k^2)) and O(n(d + k)) respectively, where d is the dimension of each sample, and all three NysCK-related methods are used to change nonlinearly separable samples into linearly separable ones. Table 6 shows their practical running time on different data sets. From this table, it is found that (1) since the procedures of INysCK and MINysCK are more complicated than that of NysCK, they both cost more time on average, but the increase is acceptable; (2) for multi-view data sets, MINysCK costs less time than INysCK; we think the main reason is that we do not merge the multiple views into a single whole view with fusion techniques but process each view as a smaller problem, which may bring a smaller total time cost.

Table 6. Comparison about time (in seconds) cost for the three NysCK-related methods.

https://doi.org/10.1371/journal.pone.0206798.t006

Comprehensive experiments

This part shows the average performance of our proposed methods on all used data sets.

Distributions of samples with different CK-related methods.

Here, we use a two-dimensional binary-class data set X to compare the performance of CK, NysCK, and INysCK when they change nonlinearly separable data to linearly separable data. The distributions of samples before and after carrying out the CK-related methods are shown in Fig 3. According to this figure, (1) all CK-related methods can change nonlinearly separable samples into linearly separable ones to some extent; (2) with INysCK used, most samples of the same class are located centrally in one area and only a few samples lie far from it. By calculation, we find that with CK used, 20% of the samples fall in areas belonging to other classes; for NysCK the ratio is 6.5%, while for INysCK it is only 3.5%. This means that with our methods used, the samples have a higher linearity. What's more, since it is hard to show a multi-view data set, or any data set whose dimensionality is larger than two, in a two-dimensional picture, we do not show the distribution of samples with MINysCK used and only use a two-dimensional data set here. This does not affect our conclusion.

Fig 3. Distributions of samples with different CK-related methods on a binary-class data set.

https://doi.org/10.1371/journal.pone.0206798.g003

Convergence analysis.

Convergence is an important criterion for assessing the effectiveness of a classifier: if a classifier can converge within a limited number of iterations with a good classification performance, we say this classifier is effective. What's more, the distribution of samples also affects convergence, and samples with a high linearity always accelerate the optimization of a classifier. Here, we adopt the empirical justification given in [45] to measure the convergence of classifiers with our methods used, and Table 7 shows the results. Each cell in this table denotes the average number of iterations of a classifier on all used data sets with a CK-related method used. According to this table, and combining the results given before, with our proposed INysCK and MINysCK the changed samples have a higher linearity, and these samples accelerate the optimization of the classifiers, which is reflected in a smaller number of iterations. What's more, since MINysCK is more suitable for multi-view data sets, the multi-view classifiers converge faster with MINysCK used.

Rademacher complexity analysis.

As stated in [14] and [11], Rademacher complexity reflects the generalization risk bound and performance behavior of a classifier. A smaller Rademacher complexity indicates a better performance and a lower generalization risk bound. Here, we adopt the method given in [11] to compute the Rademacher complexity of the classifiers with different CK-related methods used. Fig 4 shows the results. According to this figure, (1) for single-view classifiers, since samples with INysCK used have a higher linearity, the classifiers have smaller Rademacher complexities; (2) for multi-view classifiers, since MINysCK is more suitable and also makes the samples have a higher linearity, the related Rademacher complexities are smaller.

Significance analysis.

We adopt the Friedman-Nemenyi statistical test [46] to validate that the difference between our proposed methods and the previous work is significant. In the Friedman-Nemenyi statistical test, the Friedman test analyzes whether the differences among all compared algorithms on multiple data sets are significant, while the Nemenyi test analyzes whether the difference between two compared algorithms on multiple data sets is significant.

In order to carry out the Friedman test, we treat each CK-related method as an 'algorithm' and regard each classifier as a 'data set'. Then, according to the average accuracy of each 'algorithm' on each 'data set', the Friedman test ranks the 'algorithms' for each 'data set' as shown in Table 8. (1) For the single-view cases, with 4 'algorithms' and 8 'data sets', we carry out the Friedman test and get the Friedman statistic and FF = 59.37 (the computation equations of the Friedman statistic and FF can be found in [46]). With 4 'algorithms' and 8 'data sets', FF is distributed according to the F distribution with 4 − 1 = 3 and (4 − 1) × (8 − 1) = 21 degrees of freedom. The critical value F0.05(3, 21) at α = 0.05 is 3.0725 and F0.10(3, 21) at α = 0.10 is 2.3649. As FF > 3.0725 and FF > 2.3649, we conclude that for the single-view cases, the differences between all compared CK-related methods on multiple classifiers are significant. (2) Similarly, for the multi-view cases, with 5 'algorithms' and 10 'data sets', FF = 63.91, F0.05(4, 36) = 2.6335, and F0.10(4, 36) = 2.1079. Since FF > 2.6335 and FF > 2.1079, we conclude that for the multi-view cases, the differences between all compared CK-related methods on multiple classifiers are also significant.

Table 8. Average rank comparisons for different CK-related methods and classifiers.

https://doi.org/10.1371/journal.pone.0206798.t008

Then we use the Nemenyi test for pairwise comparisons. (1) For the single-view cases, when α = 0.05 the critical value q0.05 is 2.569 (see Table 9) and the corresponding CD is 2.569 × sqrt(4 × 5 / (6 × 8)) ≈ 1.66; when α = 0.10 the critical value q0.10 is 2.291 (see Table 9) and the corresponding CD is approximately 1.48. Then, according to the principle of the Nemenyi test, since 1.84 < 1.16 + 1.66 = 2.82 < 3.84 and 1.84 < 1.16 + 1.48 = 2.64 < 3.84, the differences between INysCK and CK are significant while those between INysCK and NysCK are not. (2) For the multi-view cases, according to Table 9, since q0.05 = 2.728 and q0.10 = 2.459, the corresponding CDs are approximately 1.93 and 1.74 respectively. Then, since 1.56 < 3.34 < 1.44 + 1.93 = 3.37 and 1.56 < 1.44 + 1.74 = 3.18 < 3.34, the differences between MINysCK and CK are significant and those between MINysCK and NysCK are significant to a certain extent.

Table 9. Critical values for the two-tailed Nemenyi test.

https://doi.org/10.1371/journal.pone.0206798.t009
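The critical-difference values quoted above follow directly from Demšar's formula CD = qα × sqrt(k(k+1)/(6N)) [46]; the short sketch below reproduces them, with the q values taken from Table 9.

import math

def nemenyi_cd(q_alpha, k, N):
    """Critical difference CD = q_alpha * sqrt(k(k+1) / (6N)) from Demsar's test."""
    return q_alpha * math.sqrt(k * (k + 1) / (6.0 * N))

# Single-view case: 4 'algorithms' (CK-related methods) over 8 'data sets' (classifiers).
print(round(nemenyi_cd(2.569, 4, 8), 2))   # approx. 1.66 at alpha = 0.05
print(round(nemenyi_cd(2.291, 4, 8), 2))   # approx. 1.48 at alpha = 0.10

# Multi-view case: 5 'algorithms' over 10 'data sets'.
print(round(nemenyi_cd(2.728, 5, 10), 2))  # approx. 1.93 at alpha = 0.05
print(round(nemenyi_cd(2.459, 5, 10), 2))  # approx. 1.74 at alpha = 0.10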

In summary, according to the Friedman-Nemenyi statistical test, our proposed INysCK (MINysCK) is a statistical improvement over the previous work CK (NysCK).

Influence of ratio of training samples.

In the previous experiments, for each data set we randomly chose 70% of the samples as the training part and the remaining ones as the test part. Here, we change the ratio of training samples and show its average influence on accuracy in Fig 5. According to this figure, as the ratio of training samples increases, the average accuracy also improves.

Fig 5. Average influence of ratio of training samples on accuracy with INysCK and MINysCK and corresponding classifiers used.

https://doi.org/10.1371/journal.pone.0206798.g005

Conclusions and future work

Conclusions

Traditional nonlinear classifiers are developed to process nonlinearly separable data sets, and they usually use kernel functions to generate several kernel matrices; after the optimization of these matrices, the optimal classifier parameters can be obtained. However, computing and storing these matrices costs high computational and space complexities, so in order to reduce the complexities, INMKMHKS, which adopts the Nyström approximation technique, and NysCK, which changes nonlinearly separable samples to linearly separable ones, were developed. In this work, we combine their ideas to develop INysCK and MINysCK, which reduce the complexities further and process single-view and multi-view data sets respectively. In order to validate their effectiveness, we use CK and NysCK for comparison. We then adopt large-scale, small-scale, single-view, and multi-view data sets and single-view, multi-view, nonlinear, and linear classifiers for targeted experiments. The corresponding experiments on accuracy, time cost, convergence, Rademacher complexity, and so on validate the effectiveness of INysCK and MINysCK.

According to the experimental results, we can draw the following conclusions. (1) INysCK and MINysCK can change nonlinearly separable samples into linearly separable ones with higher linearity, and the accuracies of the corresponding classifiers improve. (2) Compared with NysCK, INysCK and MINysCK both cost more time on average, but the increase is acceptable. (3) With INysCK and MINysCK used, classifiers converge faster and their Rademacher complexities are smaller. (4) INysCK (MINysCK) is a statistical improvement over the previous work CK (NysCK).

Future work

Although our proposed methods perform better for nonlinear classification problems, as stated in [47], the Nyström approximation technique is data-dependent, even though we adopt the version used in INMKMHKS to avoid parameter-setting problems. In [47], on the basis of Hellinger's kernel and the χ2 kernel, the authors use two data-independent mapping functions to enhance classification performance. Thus, in our future work, we will try to introduce the idea of [47] into our work; in other words, we will try to use data-independent mapping functions to change nonlinearly separable samples into linearly separable ones. What's more, besides what we have discussed in this work, there are other pattern recognition fields that attract research attention, for example unsupervised feature selection [48, 49] and multi-label learning [50]. In our future work, we will also try to introduce our methods into these fields.

Acknowledgments

This work is supported by (1) the Natural Science Foundation of Shanghai under grant numbers 16ZR1414500 and 16ZR1414400, (2) the National Natural Science Foundation of China under grant numbers 61602296 and 41701523, and (3) the Shanghai Pujiang talent plan under grant number 16PJ1403700, and the authors would like to thank them for their support.

References

  1. Vapnik V. Statistical Learning Theory. New York: John Wiley & Sons; 1998.
  2. Hassan MF, Abdelqader I. Improving pattern classification by nonlinearly combined classifiers. International Conference on Cognitive Informatics & Cognitive Computing. 2016;489–495.
  3. Zhu XB, Pedrycz W, Li Z. Fuzzy clustering with nonlinearly transformed data. Applied Soft Computing. 2017;61:364–376.
  4. Leski J. Kernel Ho-Kashyap classifier with generalization control. International Journal of Applied Mathematics and Computer Science. 2004;14(1):53–61.
  5. Zhang K, Lan L, Wang Z, Moerchen F. Scaling up kernel SVM on limited resources: A low-rank linearization approach. Conference on Artificial Intelligence and Statistics (AISTATS). 2012;22:1425–1434.
  6. Wang Z, Chen SC. Multi-view kernel machine on single-view data. Neurocomputing. 2009;72:2444–2449.
  7. Wang Z, Chen SC, Gao DQ. A novel multi-view learning developed from single-view patterns. Pattern Recognition. 2011;44(10–11):2395–2413.
  8. Wang Z, Xu J, Chen SC, Gao DQ. Regularized multi-view machine based on response surface technique. Neurocomputing. 2012;97:201–213.
  9. Zhu CM, Wang Z, Gao DQ, Feng X. Double-fold localized multiple matrixized learning machine. Information Sciences. 2015;295:196–220.
  10. Zhu CM. Double-fold localized multiple matrix learning machine with Universum. Pattern Analysis and Applications. 2017;20(4):1091–1118.
  11. Zhu CM, Gao DQ. Improved multi-kernel classification machine with Nyström approximation technique. Pattern Recognition. 2015;48:1490–1509.
  12. Zhou ZL, Wang YL, Wu QMJ, Yang CN, Sun XM. Effective and Efficient Global Context Verification for Image Copy Detection. IEEE Transactions on Information Forensics and Security. 2017;12(1):48–63.
  13. Xia ZH, Wang XH, Zhang LG, Qin Z, Sun XM, Ren K. A Privacy-preserving and Copy-deterrence Content-based Image Retrieval Scheme in Cloud Computing. IEEE Transactions on Information Forensics and Security. 2016;11(11):2594–2608.
  14. Wang Z, Zhu CM, Niu ZX, Gao DQ, Feng X. Multi-kernel classification machine with reduced complexity. Knowledge-Based Systems. 2014;65:83–95.
  15. Hou BJ, Zhang LJ, Zhou ZH. Storage Fit Learning with Unlabeled Data. Twenty-Sixth International Joint Conference on Artificial Intelligence. 2017;1844–1850.
  16. Ye HJ, Zhan DC, Miao Y, Jiang Y, Zhou ZH. Rank consistency based multi-view learning: a privacy-preserving approach. ACM International on Conference on Information and Knowledge Management. 2015;991–1000.
  17. Sharma A, Kumar A, Daume H, Jacobs DW. Generalized multiview analysis: a discriminative latent space. IEEE Conference on Computer Vision and Pattern Recognition. 2012;157:2160–2167.
  18. Wang W, Zhou ZH. Multi-view active learning in the nonrealizable case. Neural Information Processing System. 2010;23:2388–2396.
  19. Williams CKI, Seeger M. Using the Nyström method to speed up kernel machines. Conference on Neural Information Processing Systems. 2000;661–667.
  20. Faraki M, Harandi MT, Porikli FM. Approximate infinite-dimensional region covariance descriptors for image classification. IEEE International Conference on Acoustics, Speech and Signal Processing. 2015;1364–1368.
  21. Iosifidis A, Gabbouj M. Nyström-based approximate kernel subspace learning. Pattern Recognition. 2016;57:190–197.
  22. Li LC, Wang SL, Xu SJ, Yang YQ. Constrained spectral clustering using Nyström method. Procedia Computer Science. 2018;129:9–15.
  23. Frank A, Asuncion A. UCI machine learning repository (http://archive.ics.uci.edu/ml). Irvine: University of California; 2010.
  24. http://lms.comp.nus.edu.sg/research/NUS-WIDE.htm
  25. http://archive.ics.uci.edu/ml/datasets/youtube+multiview+video+games+dataset
  26. http://dblp.uni-trier.de/xml/
  27. Wang SK, Wang EK, Li XT, Ye YM, Lau RYK, Du XL. Multi-view learning via multiple graph regularized generative model. Knowledge-Based Systems. 2017;121:153–162.
  28. http://www.cs.umd.edu/sen/lbc-proj/data/cora.tgz
  29. Jacob Y, Denoyer L, Gallinari P. Classification and annotation in social corpora using multiple relations. Proceedings of the 20th ACM International Conference on Information and Knowledge Management. 2011;1215–1220.
  30. Milgram J, Cheriet M, Sabourin R. “One against one” or “one against all”: which one is better for handwriting recognition with SVMs? Tenth International Workshop on Frontiers in Handwriting Recognition. 2013.
  31. Debnath R, Takahide N, Takahashi H. A decision based one-against-one method for multi-class support vector machine. Pattern Analysis and Applications. 2004;7(2):164–175.
  32. Hsu CW, Lin CJ. A comparison of methods for multiclass support vector machines. IEEE Transactions on Neural Networks. 2002;13(2):415–425. pmid:18244442
  33. Cortes C, Vapnik V. Support vector machine. Machine Learning. 1995;20(3):273–297.
  34. Rice I. Improved data visualisation through nonlinear dissimilarity modelling. Pattern Recognition. 2018;73:76–88.
  35. Liu L, Ma SW, Rui L, Lu J. Locality constrained dictionary learning for non-linear dimensionality reduction and classification. IET Computer Vision. 2017;11(1):60–67.
  36. Wang Z, Zhu ZH. Matrix-pattern-oriented classifier with boundary projection discrimination. Knowledge-Based Systems. 2018;149:1–17.
  37. Yang B, Shao QM, Pan L, Li WB. A study on regularized weighted least square support vector classifier. Pattern Recognition Letters. 2018;108:48–55.
  38. Huang CQ, Chung FL, Wang ST. Multi-view L2-SVM and its multi-view core vector machine. Neural Networks. 2016;75:110–125. pmid:26773824
  39. Brbić M, Kopriva I. Multi-view low-rank sparse subspace clustering. Pattern Recognition. 2018;73:247–258.
  40. Houthuys L, Langone R, Suykens JAK. Multi-view least squares support vector machines classification. Neurocomputing. 2018;282:78–88.
  41. Li JX, Zhang B, Lu GM, Zhang D. Generative multi-view and multi-feature learning for classification. Information Fusion. 2019;45:215–226.
  42. Chao GQ, Sun SL. Semi-supervised multi-view maximum entropy discrimination with expectation Laplacian regularization. Information Fusion. 2019;45:296–306.
  43. Houthuys L, Langone R, Suykens JAK. Multi-view kernel spectral clustering. Information Fusion. 2018;44:46–56.
  44. Zhu CM, Wang Z. Entropy-based matrix learning machine for imbalanced data sets. Pattern Recognition Letters. 2017;88:72–80.
  45. Ye J. Generalized low rank approximations of matrices. Machine Learning. 2005;61(1–3):167–191.
  46. Demsar J. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research. 2006;7:1–30.
  47. Wang QL, Li PH, Zuo WM, Zhang L. RAID-G: robust estimation of approximate infinite dimensional Gaussian with application to material recognition. IEEE Conference on Computer Vision and Pattern Recognition. 2016;4433–4441.
  48. Wang WZ, Zhang HZ, Zhu PF, Zhang D, Zuo WM. Non-convex regularized self-representation for unsupervised feature selection. Proceedings of International Conference on Intelligent Science and Big Data Engineering (part II). 2015;55–65.
  49. Zhu PF, Zhu WC, Hu QH, Zhang CQ, Zuo WM. Subspace clustering guided unsupervised feature selection. Pattern Recognition. 2017;66:364–374.
  50. Zhu PF, Xu Q, Hu QH, Zhang CQ, Zhao H. Multi-label feature selection with missing labels. Pattern Recognition. 2018;74:488–502.