Hilbert–Schmidt component analysis

We propose a feature extraction algorithm based on the Hilbert–Schmidt independence criterion (HSIC) and the maximum dependence – minimum redundancy approach. Experiments with classification data sets demonstrate that the suggested Hilbert–Schmidt component analysis (HSCA) algorithm may, in certain cases, be more efficient than the other approaches considered.


Introduction
In many cases the initial representation of data is inconvenient, or even prohibitive, for further analysis. For example, in image analysis, text analysis and computational genetics, high-dimensional, massive, structural, incomplete, and noisy data sets are common. Therefore, feature extraction, i.e. the discovery of informative features in raw data, is one of the fundamental machine learning problems.
In this article we focus on supervised feature extraction algorithms that use dependence-based criteria of optimality. The article is structured as follows. In Section 2 we briefly formulate the estimators of the Hilbert-Schmidt independence criterion (HSIC) proposed in [5]. In Section 3 we propose a new algorithm, Hilbert-Schmidt component analysis (HSCA). The main idea of HSCA is to find non-redundant features that maximize HSIC with a dependent variable. Finally, in Section 4, we experimentally compare our approach with several alternative feature extraction methods. There we statistically analyze the accuracy of the k-NN classifier based on LDA [4], PCA [6], HBFE [2,10], and HSCA features.

Hilbert-Schmidt independence criterion
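The Hilbert-Schmidt independence criterion (HSIC) is a kernel-based dependence measure proposed and investigated in [5,8]. Let $T := (x_i, y_i)_{i=1}^{m}$ be a supervised training set, where $x_i \in \mathcal{X}$ are inputs, $y_i \in \mathcal{Y}$ are the corresponding desired outputs, and $\mathcal{X}$, $\mathcal{Y}$ are two sets. Let $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ and $l : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$ be two positive definite kernels [7], with corresponding Gram matrices $K$ and $L$. Two empirical estimators of HSIC have been proposed (see [5,8]):
$$\widehat{\mathrm{HSIC}}_0 := (m-1)^{-2}\, \mathrm{Tr}(K H L H) \tag{1}$$
and
$$\widehat{\mathrm{HSIC}}_1 := \frac{1}{m(m-3)} \left[ \mathrm{Tr}(\tilde{K}\tilde{L}) + \frac{\mathbf{1}^T \tilde{K} \mathbf{1}\, \mathbf{1}^T \tilde{L} \mathbf{1}}{(m-1)(m-2)} - \frac{2}{m-2}\, \mathbf{1}^T \tilde{K} \tilde{L} \mathbf{1} \right]. \tag{2}$$
Here $H = I - m^{-1}\mathbf{1}\mathbf{1}^T$ is the centering matrix, $\mathbf{1}$ is the $m$-dimensional vector of ones, and $\tilde{K}$, $\tilde{L}$ are the Gram matrices $K$, $L$ with their diagonals set to zero. Estimator (1) is biased, with an $O(m^{-1})$ bias, and (2) is an unbiased estimator of HSIC [5,8].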
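The two estimators can be illustrated with a short NumPy sketch. It assumes the standard forms from [5,8]: the biased estimator $(m-1)^{-2}\mathrm{Tr}(KHLH)$ with centering matrix $H = I - m^{-1}\mathbf{1}\mathbf{1}^T$, and the unbiased estimator built from Gram matrices with zeroed diagonals. The function names (`linear_gram`, `hsic_biased`, `hsic_unbiased`) are hypothetical and chosen only for this example.

```python
import numpy as np


def linear_gram(Z):
    """Gram matrix of the linear kernel; Z holds one observation per column."""
    return Z.T @ Z


def hsic_biased(K, L):
    """Biased estimator: (m - 1)^{-2} Tr(K H L H)."""
    m = K.shape[0]
    H = np.eye(m) - np.ones((m, m)) / m  # centering matrix
    return np.trace(K @ H @ L @ H) / (m - 1) ** 2


def hsic_unbiased(K, L):
    """Unbiased estimator built from zero-diagonal Gram matrices (needs m >= 4)."""
    m = K.shape[0]
    if m < 4:
        raise ValueError("the unbiased estimator requires at least 4 observations")
    Kt = K - np.diag(np.diag(K))  # K with its diagonal set to zero
    Lt = L - np.diag(np.diag(L))  # L with its diagonal set to zero
    one = np.ones(m)
    term1 = np.trace(Kt @ Lt)
    term2 = (one @ Kt @ one) * (one @ Lt @ one) / ((m - 1) * (m - 2))
    term3 = 2.0 * (one @ Kt @ Lt @ one) / (m - 2)
    return (term1 + term2 - term3) / (m * (m - 3))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(3, 50))                 # inputs, one per column
    Y = X[:1] + 0.1 * rng.normal(size=(1, 50))   # outputs that depend on the inputs
    K, L = linear_gram(X), linear_gram(Y)
    print(hsic_biased(K, L), hsic_unbiased(K, L))
```

Both functions take precomputed Gram matrices, so any positive definite kernel can be substituted for the linear one used in the usage example.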

Hilbert-Schmidt component analysis (HSCA)
In this section we suggest an algorithm for Hilbert-Schmidt component analysis (HSCA), which is based on the HSIC dependence measure.The choice of HSIC is motivated by its neat theoretical properties [5,8], and promising experimental results achieved by various HSIC-based feature extraction algorithms [2,3,5,10].
Suppose we have a supervised training set $T := (x_i, y_i)_{i=1}^{m}$, where $x_i \in \mathbb{R}^{D_x}$ are observations and $y_i \in \mathbb{R}^{D_y}$ are dependent variables. Let us denote the data matrices $X = [x_1, x_2, \ldots, x_m]$ and $Y = [y_1, y_2, \ldots, y_m]$, and assume that the kernel for the inputs is linear (i.e. $K = X^T X$).
In HSCA we iteratively seek $d \le D_x$ linear projections that maximize the dependence with the dependent variable $y$ and simultaneously minimize the dependence with the already computed projections. In other words, for the $t$-th feature we seek a projection vector $p$ that maximizes the ratio
$$\frac{\widehat{\mathrm{HSIC}}(p^T X, Y)}{\widehat{\mathrm{HSIC}}(p^T X, P_t^T X)}, \tag{3}$$
where $P_t = [p_1, \ldots, p_{t-1}]$ are the projection vectors extracted in the previous $t - 1$ steps, and $\widehat{\mathrm{HSIC}}$ is an estimator of HSIC. Note that, at the first step, only $\widehat{\mathrm{HSIC}}(p^T X, Y)$ is maximized. For example, plugging estimator (1) into (3), the Gram matrix of the projected inputs is $X^T p p^T X$, so that $\mathrm{Tr}(X^T p p^T X H L H) = p^T X H L H X^T p$, the factors $(m-1)^{-2}$ cancel, and we have to maximize the following generalized Rayleigh quotient
$$\frac{p^T X H L H X^T p}{p^T X H L_f H X^T p},$$
where $H = I - m^{-1}\mathbf{1}\mathbf{1}^T$ is the centering matrix used in (1), and the kernel matrix of the features is $L_f(i, j) = l(P_t^T x_i, P_t^T x_j)$. The maximizer is the principal eigenvector of the generalized eigenproblem
$$X H L H X^T p = \lambda\, X H L_f H X^T p. \tag{4}$$
The case of the unbiased HSIC estimator (2) may be treated in a similar manner. The well-known kernel trick [7] allows HSCA to be extended to arbitrary kernels; we omit the details due to space restrictions.
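The iteration described above can be sketched as follows for the biased estimator. This is a minimal illustration under several assumptions that are not stated in the text: the first projection is normalized to unit length (the denominator is replaced by the identity), the kernel on the extracted features is linear, a small ridge term keeps the redundancy matrix positive definite, and the generalized eigenproblem is reduced to a standard symmetric one through a Cholesky factorization. All function names and the `eps` parameter are hypothetical.

```python
import numpy as np


def centering(m):
    """Centering matrix H = I - (1/m) 1 1^T."""
    return np.eye(m) - np.ones((m, m)) / m


def delta_gram(labels):
    """Delta kernel on class labels (an assumed choice for classification)."""
    labels = np.asarray(labels)
    return (labels[:, None] == labels[None, :]).astype(float)


def top_generalized_eigvec(A, B):
    """Principal eigenvector of A p = lambda B p, A symmetric, B positive definite."""
    C = np.linalg.cholesky(B)          # B = C C^T
    Ci = np.linalg.inv(C)
    M = Ci @ A @ Ci.T                  # equivalent standard problem M v = lambda v
    _, V = np.linalg.eigh((M + M.T) / 2)
    p = Ci.T @ V[:, -1]                # map back: p = C^{-T} v
    return p / np.linalg.norm(p)


def hsca(X, L, d, eps=1e-3):
    """Extract d projection vectors.

    X : (D_x, m) data matrix, one observation per column.
    L : (m, m) Gram matrix of the dependent variable.
    eps : relative ridge added to the redundancy matrix (assumption).
    """
    D, m = X.shape
    H = centering(m)
    A = X @ H @ L @ H @ X.T            # relevance term X H L H X^T
    A = (A + A.T) / 2
    P = np.zeros((D, 0))
    for t in range(d):
        if t == 0:
            B = np.eye(D)              # first step: relevance only, unit-norm p
        else:
            F = P.T @ X                # features extracted so far
            Lf = F.T @ F               # linear kernel on features (assumption)
            B = X @ H @ Lf @ H @ X.T   # redundancy term X H L_f H X^T
            B = (B + B.T) / 2
            ridge = eps * max(np.trace(B) / D, 1e-12)
            B = B + ridge * np.eye(D)  # keep B positive definite
        P = np.column_stack([P, top_generalized_eigvec(A, B)])
    return P


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    labels = np.arange(60) % 3
    X = rng.normal(size=(5, 60))
    X[0] += labels                     # make the first coordinate informative
    P = hsca(X, delta_gram(labels), d=2)
    print(P.shape)                     # features of new data are P.T @ X_new
```

The ridge term matters because, with a linear feature kernel, the redundancy matrix has rank at most $t - 1$ and the quotient would otherwise be unbounded; other regularization schemes are equally plausible.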

Computer experiments
In this section we analyze twelve classification data sets: eleven are from the UCI machine learning repository [1], and the remaining one, the Ames data set, is from chemometrics. We are interested in the performance dynamics of the k-NN classifier when the inputs are constructed by several feature extraction algorithms: unsupervised PCA [6], supervised LDA [4], HBFE [2,10] and HSCA.
The measure of efficiency we analyze here is the accuracy of the k-NN classifier, calculated over the testing set. The following procedure was adopted when conducting the experiments. Fifty random partitions of the data set into training and testing sets of equal size were generated, and feature extraction was performed using all the above-mentioned methods. The projection matrices of the feature extraction methods were estimated using only the training data. The features generated from the testing set were then classified using the k-NN classifier. The feature dimensionality was selected using the training data and 3-fold cross-validation. Wilcoxon's signed-rank test [9] with the standard p-value threshold of 0.05 was applied to the samples of corresponding classification accuracies. The following comparisons were made, and the statistically significant cases are indicated in the table:
1. HBFE$_1$ with HBFE$_0$, and HSCA$_1$ with HSCA$_0$ (the better one is indicated in bold text);
2. The most efficient method with the remaining ones (statistically significant cases are reported in underlined text);
3. HSCA with HBFE (data sets where HSCA was more efficient are indicated with •, and ◦ means that it turned out to be less efficient);
4. The most efficient HSIC-based algorithm (i.e. HBFE$_0$, HBFE$_1$, HSCA$_0$ or HSCA$_1$) with the remaining ones (⋄ means that the HSIC-based algorithm outperformed the other ones, and ⋆ means that PCA, LDA or unmodified inputs were more efficient).
The results in Table 1 show that the HSCA approach may achieve slightly better classification accuracy for some data sets.
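The evaluation protocol can be sketched in a few lines. The example below is only an illustration on synthetic data, not a reproduction of the reported experiments: it compares k-NN on unmodified inputs with k-NN on PCA features over fifty random equal-size partitions and applies `scipy.stats.wilcoxon` to the paired accuracies. The toy data generator, the fixed feature dimensionality `d` (the cross-validated selection is omitted), the value of `k`, and all function names are assumptions made for this sketch. Unlike the data matrices above, observations are stored as rows here.

```python
import numpy as np
from scipy.stats import wilcoxon


def make_toy_data(seed=0, n=200):
    """Synthetic two-class data: one informative coordinate, four noisy ones."""
    rng = np.random.default_rng(seed)
    y = np.arange(n) % 2
    X = rng.normal(scale=3.0, size=(n, 5))
    X[:, 0] = rng.normal(size=n) + 2.0 * y
    return X, y


def knn_predict(X_train, y_train, X_test, k=3):
    """Plain k-NN with Euclidean distance and majority vote."""
    d2 = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(d2, axis=1)[:, :k]
    return np.array([np.bincount(votes).argmax() for votes in y_train[nearest]])


def pca_fit(X_train, d):
    """Mean and top-d principal directions, estimated on training data only."""
    mu = X_train.mean(axis=0)
    _, _, Vt = np.linalg.svd(X_train - mu, full_matrices=False)
    return mu, Vt[:d].T


def compare_knn_raw_vs_pca(X, y, d=2, k=3, n_splits=50, seed=0):
    """Paired test accuracies over random equal-size splits, plus the p-value."""
    rng = np.random.default_rng(seed)
    n = len(y)
    acc_raw, acc_pca = [], []
    for _ in range(n_splits):
        perm = rng.permutation(n)
        tr, te = perm[: n // 2], perm[n // 2:]
        mu, W = pca_fit(X[tr], d)
        pred_raw = knn_predict(X[tr], y[tr], X[te], k)
        pred_pca = knn_predict((X[tr] - mu) @ W, y[tr], (X[te] - mu) @ W, k)
        acc_raw.append(np.mean(pred_raw == y[te]))
        acc_pca.append(np.mean(pred_pca == y[te]))
    acc_raw, acc_pca = np.array(acc_raw), np.array(acc_pca)
    _, p_value = wilcoxon(acc_raw, acc_pca)  # paired signed-rank test
    return acc_raw, acc_pca, p_value


if __name__ == "__main__":
    X, y = make_toy_data()
    acc_raw, acc_pca, p_value = compare_knn_raw_vs_pca(X, y)
    print(acc_raw.mean(), acc_pca.mean(), p_value < 0.05)
```

Any other feature extractor can be compared in the same way by replacing `pca_fit` with a function that returns a projection estimated from the training half only.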

Conclusions
The suggested HSCA (Hilbert-Schmidt component analysis) algorithm (Section 3) maximizes the ratio of feature relevance and feature redundancy estimates. Both estimates are formulated in terms of the HSIC dependence measure. The optimal features are solutions of the generalized eigenproblem (4). In Section 4 we statistically compared HSCA with several alternative feature extraction methods, using classification performance (accuracy) as the measure of feature relevance. The results of the conducted experiments indicate that the HSCA algorithm can be useful in practice.