Tensor dimensionality reduction via mode product and HSIC

Tensor dimensionality reduction (TDR) is a hot research topic in machine learning, which learns data representations by preserving the original data structure while avoiding convert samples into vectors and solving the problem of the curse of dimensionality of tensor data. In the work, a novel TDR approach based on mode product and Hilbert–Schmidt Independence criterion (HSIC) is proposed. The contributions of authors’ work is described as following: (1) HSIC measures the statistical correlation of two random variables. However, instead of measuring the statistical correlation of two random variables directly, HSIC ﬁrst transforms the two random variables into two reproducing kernel Hilbert spaces (RKHSs), and then measures the statistical correlation of transformed random variables by using Hilbert–Schmidt operators between the two RKHSs. The exploitation of RKHS increases the ﬂexibility and applicability of HSIC. Although HSIC is widely used in machine learning, the authors have not seen its application to dimensionality reduction (DR)(except for authors’ previous work). (2) A novel HSIC-based TDR approach is proposed, which ﬁrst applies HSIC to capture statistical information of tensor data set for DR. The authors give the mathematical derivation of HSIC for tensor data and establish a framework of TDR based on HSIC, named HSIC-TDR for short, which aims to improve the DR results of tensor by exploring and preserving the statistical information of original data set. (3) Fur-thermore, to solve the out-of-sample problem, the authors learn an explicit expression between the dimensionality-reduced tensors and the higher-dimensional tensors by introducing mode product to HSIC-TDR. The experimental results between the proposed method and other state-of-the-art algorithm on various datasets demonstrate the well performance of the proposed method.


INTRODUCTION
With the progress of powerful sensors and computer technologies, unprecedented massive data (multi-dimensional data) are exponentially grown in many areas, such as Internet [1], geophysical data [2] and human genome [3] and so on. The processing technologies of these data have become a hot research topic, especially, dimensionality reduction (DR) is one of the most important technology [4]. In general, most DR methods need to convert the multi-dimensional data into the high-dimensional vector before preforming their algorithms. Obviously, the process of the transformation ignores the underlying structure of original data sets and increases a huge computational cost. Therefore, the researches turn to focus on algorithms are inspired by famous vector-based DR, such as multi-linear PCA (MPCA) [18], tensor maximum margin criterion (TMMC) [19], multi-linear discriminant analysis (MLDA) [20], tensor discriminant locality alignment (TDLA) [21], tensor locality preserving projection (TLPP) [22] and tensor marginal fisher analysis (TMFA) [23]. Manifold learning, as a hot topic in non-linear DR area, also is introduced to TDR. For example, Zhang et al. employed the neighbourhood relationship between data to learn the low-rank factorization model, named low-rank matrix decomposition with manifold regularization (MMF) [24]; graph-Laplace tensor decomposition algorithm (GLTD) [25] applies the properties of the graph and similar information to improve the robustness of the algorithm; [26] employed local homeomorphism of manifold learning algorithm and subspace learning for the DR of tensors. Some other researchers focus on tensor decomposition for TDR. Such as Zhou et al. [27] proposed an effective non-negative Tucker decomposition (NTD) to approximate the tensor by the mode product of a low-dimensional tensor with effective projection matrices; Zhang et al. [28] used the low-rank regularized heterogeneous tensor decomposition (LRR-HTD) to learn a set of orthogonal factor matrices for data; Cai et al. [29] proposed a regularized non-negative matrix factorization (GNMF) algorithm, which introduced an affinity graph to obtain the geometric information for seeking a factor matrix factorization; [30] applied sparse Tucker tensor decomposition to the compression of the hyperspectral image or video sequence. These tensor-based models have well performance in TDR. But they merely consider the statistical information of tensor data, i.e. without taking the essential global structure of the tensor data into account. This paper introduces the statistical information of tensor data to TDR, and proposes a novel DR algorithm based on Hilbert-Schmidt independence criterion (HSIC). HSIC [31,32] measures the statistical correlation between two random vector sets. But instead of directly measuring the statistical correlation of two data sets, HSIC converts the two random vector sets into two regenerative reproducing kernel Hilbert spaces (RKHSs), and then employ the Hilbert-Schmidt (HS) operators of the two RKHSs to measure their statistical correlation [31]. HSIC has been applied in machine learning, such as feature selection [33][34][35], domain subspace learning [36], hyperspectral image classification [37] and so on. Generally, the sizes of two data sets in HSIC are assumed to be equal. However, this constraint usually is not to be meet in practice. Refs. [38] and [39] addressed the problem by the so-called surrogate kernel matrices. When HSIC is applied to feature selection [34,35], it actually measures the statistical correlation between the column vectors of data set matrix (represent feature vectors whose components are to be selected) and the corresponding labels. Gangeh et al. introduced HSIC to supervised dictionary learning [40,41] by first mapping original data set to subspace and then exploring the statistical correlation between mapped data and the corresponding labels. Although HSIC is widely used in machine learning, we have not seen its application to DR (except for our previous work [42]). But same to vector-based DR approaches, these methods cannot directly solve the tensor data. This paper proposes a novel DR method for tensor data based on mode product and HSIC, which has the merits of the HSIC and the mode product. The proposed algorithm has the following contributions.
(1) A novel HSIC-based TDR approach is first proposed to capture statistical information of tensor data set. HSIC gains success in many areas of machine learning since it is proposed. However, we have not seen the reported work about HSIC-based DR, expect for our previous work [42]. Although our team has proved the effectiveness of HSIC for DR, it is only suitable for vector data, cannot be directly applied to tensor data. Thus, the first contribution in the paper is to provide the theoretical description of HSIC for TDR. (2) Based on the theory of HSIC, tenor data is transformed to RKHS by the reproducing kernel of RKHS. Therefore, the flexibility and applicability of HSIC can be increased by choosing different RKHS. Moreover, with it, HSIC-TDR becomes a non-linear DR algorithm. The experimental results prove the effectiveness of our proposed method in TDR. (3) In order to solve the out-of-sample problem, a mode product-based HSIC-TDR algorithm (MPHSIC-TDR) is proposed to offer an explicit expression between lowdimensional representation and high-dimensional tensor data. In tensor algebra, the mode product of tensor and matrix is used to change the dimensionality of tensor, and thus we assume the reduced tensor is the mode product of original tensor and a series of matrices. Therefore, MPHSIC-TDR optimizes these matrices by maximizing HSIC.
The remaining part of the paper is organized as follows. In Section 2, we describe the tensor algebra for better understanding the tensor representation and tensor decomposition, and introduce the basic definition of RKHS and HSIC. Section 3 gives some related work about TDR algorithms for comparison. Section 4 contains two part, the framework of TDR with HSIC, and the proposed algorithm (MPHSIC-TDR). Experimental results are presented in Section 5, including data sets description, clustering experiments and classification experiments, and experiment analysis. Finally, Section 6 gives a conclusion of our work.

Tensor algebra
The tensor  represents a multi-dimensional data set,  = {x i 1 ⋯i N |x i 1 ⋯i N ∈ R, 1 ≤ i n ≤ L n , n = 1, … , N }. N is the order of , and L n is the nth dimension of the tensor data. The tensor  can be simply written as  ∈ R L 1 ×⋯×L N . Obviously, the number of data contained in the tensor  is || = Tensors can be expressed in the form of matrix. For example, the following representation is called the n mode of tensor  and The mode representation of tensor is completely equivalent to the set representation of tensors, and to know any mode of tensors is to know tensors. The mode product is an important operation in tensor algebra. Let  ∈ R L 1 ×⋯×L N , A n ∈ R J n ×L n , then the n mode product of  and A n is a new tensor  ∈ If  is the low-dimensional representation of , the task of the DR of tensor  becomes learn an optimal matrix A n . Furthermore, the n mode product of  can be written as follows: The operation of the mode product satisfies the following characters: (1) Associative Law∶ If  ∈ R L 1 ×L 2 ×⋯×L N , A n ∈ R J n ×L n and B n ∈ R K n ×J n , then × n A n × n B n = × n (B n A n ). then Here, ⊗ represents the Kronecker product between two matrices.

Reproducing kernel Hilbert space
HSIC is developed based on the theory of RKHS. To better understand HSIC and our method, we here introduce some definition of RKHS. RKHS can be generated from kernel function, or from function space with data. Such as, let f ∶ Ω → R be a mapping function from data space Ω to real data space R, then all the f , which meets ∫ x∈Ω | f (x)| 2 dx < +∞, construct a square integrable function space S (Ω). Furthermore, a complete inner product space (Hilbert space) H [43] can be generated with S (Ω) and the inner product ⟨•, •⟩ on S (Ω). For example, for any f , g ∈ S (Ω), the inner product of them can be defined then k is the reproducing kernel of H , and (H , k) is the corresponding RKHS. According to the above definition, we also can utilize the kernel function k to generate RKHS. Following Mason's theorem [43], one kernel function only produces one RKHS, and is defined with the following characters: (1) Symmetry∶ for any data x, y ∈ Ω, k(x, y) = k(y, x), (2) Positive definiteness∶ for any finite number of data∶ x 1 , … , x N ∈ Ω, the kernel matrix is positive definite,

Hilbert-Schmidt independence criterion
HSIC measures the statistical correlation between two random variables X and Y . In fact, HSIC does not directly measure the statistical correlation between X and Y . It converts two random vector sets into two regenerative RKHSs, and then employ the HS operators of the two RKHSs to measure their statistical correlation. Therefore, the HSIC [31] of two random variables X and Y is defined as here,‖ • ‖ HS denotes the norm on HS operators space, E XY denotes the cross-covariance operate of X and Y , X (X ) and Y (Y ) describe that convert the two random variables into two corresponding RKHSs. X and Y are the mean function of X and Y in the corresponding RKHS H X and H Y , respectively. The HS operators space is defined as follows. Definition 1. [32] Let H X and H Y be two separable Hilbert spaces and {e X i |i ∈ I } are the standard orthogonal basis of H X . Let T ∶ H X → H Y is a compact operator, if ∑ i∈I ‖Te X i ‖ 2 Y < +∞, then we can call T is the HS operator.

Theorem 2. Let HS (H X → H Y ) be the linear space composed of all HS operators from H X to H Y . If for any T , S ∈ HS (H
is a Hilbert space, where the determination of inner product ⟨•, •⟩ HS is as follows: Theorem 3. [31] Let H X and H Y be two separable Hilbert spaces, .
Remark: f 0 ⊗ g 0 is called as a tensor product of f 0 and g 0 , and Theorem 3 shows that, f 0 ⊗ g 0 is an HS operator.
The definition of HSIC indicates that HSIC essentially measures of the statistical correlation between X (X ) and Y (Y ). In practice, the probability distributions of random variables X and Y are not known. Therefore, according to probability theory, we can use the sample average instead of statistical average, and assume that the probability of random event {X = x i ; Y = y j } is 0. Then, HSIC (Equation 6) can be derived as the following from [42] HSIC here C N = I N − 1 N Γ N Γ T N is the centralized matrix, K X and K Y are the kernel matrices with data sets X and Y , respectively.

RELATED WORK
Tensor analysis has been widely used in machine learning [8,17,44]. Tensor analysis (TA) [44] and tensor decomposition [45,46] are two important work for TDR. Ref. [17] gives an overview of tensor-based models and methods for multi-sensor signal processing, such as the Tucker decomposition [27,47], the canonical polyadic decomposition [48], the tensor-train decomposition (TTD), the structured TTD and so on. Ma et al. [16] used tensor singular value decomposition(t-SVD) to learn a lowrank tensor from limited coefficients in any ortho-normal basis. Liu et al. [49] proposed a proximal operator for the approximation of tensor nuclear norms based on tensor-train rank-1 decomposition with the SVD. Here, we describe some related TDR work in detail, especially used in our experiments for comparison.
Many TDR approaches are around the Tucker decomposition of tensor. Let  ∈ R L 1 ⋯×L N be a tensor, the Tucker decomposition of tensor  can be expressed as follows: where A n ∈ R L n ×J n , n = 1, … , N and  ∈ R J 1 ×⋯×J N . If L n ≫ J n , then  is the low-dimensional representation of .
Since the matrices can be regarded as the second-order tensors, the famous matrix non-negative factorization can be regarded as special kind of tensor Tucker composition. Tucker composition can be combined with various regularization terms for TDR. For example, the regularized non-negative matrix factorization algorithm (GNMF) [29] is described as where X ∈ R D×N , U ∈ R D×d and Y ∈ R d ×N . Π ∈ R N ×N is the graph regularized Laplacian matrix, and can be calculated by the similarity between the column vectors of X . If the column vectors of X are regarded as the data samples, the column vectors of Y are the compact representation of data, as claimed in [29]. Manifold regularization non-negative Tucker decomposition (MRNTD) [50] transforms the GNMF into the tensor version: where  ∈ R L 1 ×⋯×L N −1 ×M ,  ∈ R J 1 ×⋯×J N −1 ×M and A n ∈ R L n ×J n n = 1, … , N . Π ∈ R M ×M also is a graph Laplacian matrix. If the row vectors of  (N ) are regarded as data, then the row vectors of  (N ) are the compact or dimension-reduced representation of data. Orthographical constraint Tucker decomposition is another research topic of Tucker decomposition. For example, MMF [24] learned the low-rank factorization with manifold constraints, which is described as here, the requirement U T U = I d represents low-rank approximation. The MMF algorithm incorporates manifold regularization into matrix factorization. The new regularization model is better than graph regularization non-negative matrix factorization, and has global optimal solutions and closed-form solutions. It is argued in [24] that MMF achieves better results than GNMF in some applications. Zhou et al. [27] introduced non-negative Tucker decompositions method to approximate the low-dimensional tensor with effective projection matrices, such as This objective function for the last mode matrix A N is a function on Riemannian manifold. At present, most optimizations for objective functions defined on Riemannian manifold only involve the gradients of objective functions (first-order Riemannian geometry). HTD expands the objective function from first-order Riemannian geometry (gradients) up to second-order Riemannian geometry (Hessians). LRR-HTD [28], as a famous TDR approach, is developed from NTD where LRR-HTD combined various constraints in different modes to enhance the robustness of the proposed algorithm. Specifically, due to the noise and redundancy in the original tensor, LRR-HTD found a set of orthogonal factor matrices for all modes to map the tensor to a low-dimensional subspace. Zhang et al. [26] introduced the local homeomorphism characters of manifold data and global subspace mapping to TDR, named LHGPD, where W denotes the projection matrix of subspace. Actually, LHGPD and MRNTD introduced the idea of classical manifold learning algorithms local tangent space alignment (LTSA) [51] and Laplacian eigenmaps (LE) [52] to TDR, respectively. Different from these methods, our proposed algorithm introduce the statistical information of the original data set to learn the low-dimensional representation of tensor for improving the performance of TDR.

THE PROPOSED ALGORITHM
In this section, we propose a novel TDR method that explores the statistical information of tensor data set by HSIC and learns an explicit expression between the dimensionality-reduced tensors and the higher-dimensional tensors via mode product.

The framework of HSIC-TDR
HSCI cannot to be directly applied to tensor DR. Therefore, we first derive the mathematical processes of HSIC for tensor data and establish a framework of TDR based on HSIC, named HSIC-TDR. We denote by  ∈ R L 1 ×L 2 ×⋯×L N −1 ×M the N − 1dimensional multi-dimensional data sets to be used. And M denotes the number of tensor data included in multi-dimensional data set. Then, the N mode of  can be described as The aim of DR for tensor  is to find a low-dimensional ten- Note that the number of  and  is the same. Except it, each dimension of  is smaller than the corresponding dimension of . Therefore,  is the low-dimensional representation of .
With  and , we can establish an objective function for TDR by the criterion of HSIC maximization Here, In Equation (18), k  is the kernel function of dimensionalityreduced tensor data space Ω  , where Ω  ∈ R J 1 ×⋯×J N −1 and k  ∶ Ω  × Ω  → R. k  is a symmetrical positive definite kernel. In theory, if a function meets the condition of symmetrical positive definiteness, it can be used as kernel function. Therefore, the selection of k  is an open problem. The learning problem of HSIC-TDR is an optimization problem of the  with k  . If k  is non-linear, the difficulty of the optimization process of (18) can be increased. Therefore, we choose a linear kernel function for  Substituting Equation (22) into the objective function of HSIC-TDR, then Equation (18) can be transformed into tr (C M K  C M ) has nothing to do with the optimization objective (23), thus the objective function of HSIC-TDR can be simplified as: ) .
In the objective function of HSIC-TDR (23), k  is the kernel function of the high-dimensional tensor data space Ω  , that is, And k  is symmetrical positive definiteness. Different with k  , the choice of k  is open in our method, as long as meets the condition of symmetry positive definite. Therefore, the framework of HSIC-TDR can generate different algorithms by selecting different kernel func-tions or learning different kernel functions for different application occasions.

MPHSIC-TDR
In general, the task of TDR can be expressed as Here, J n ≪ L n , n = 1, … , N − 1.
According to tensor algebra, the mode product of tensor with a matrix can change the dimension of the tensor in certain order. In the section, we first propose an algorithm for TDR based on mode product (MP-TDR): where A n ∈ R L n ×J n , n = 1, … , N − 1. According to the definition of the mode product,  satisfies the requirements of DR for the high-dimensional tensor . MP-TDR gives an explicit expression between the dimensionality-reduced tensor and the high-dimensional Note that the determination of the mode matrix A n is open. Therefore, different strategies under MP-TDR maybe lead to different algorithms.
This paper employs the HSIC maximization criterion to learn the matrix A 1 , … , A N −1 for TDR, that is, Furthermore, according to the theorem of tensor algebra, we have where . Therefore, the objective function of MPHSIC-TDR algorithm based on the model product and HSIC can be expressed as ALGORITHM 1 Procedures of MPHSIC-TDR Input:  ∈ R L 1 ×⋯×L N −1 ×M , the dimensional number of the dimensionality-reduced tensor J n , n = 1, … , N − 1, and kernel function k Calculating the N mode product  (N ) of tensor  by Equation (27), and the centralizing matrix C M by Equation (27); 2. Choosing the most suitable kernel function from different kernel functions for HSIC maximization, i.e. calculate K  with tensor data ; 3. Solving the objective function (28) by Rayleigh entropy, i.e.
is a symmetric semi-positive definite matrix. MPHSIC-TDR not only realizes the DR of tensor data, but also gives the explicit expression between the high-dimensional tensor  and the DR tensor , i.e.  (N ) =  (N ) W .
For the new set of tensor data  ′ ∈ R L 1 ×⋯×L N −1 ×H and the data distribution of  ′ and  is the same, then  ′ can be reduced by using the formula  (N ) =  (N ) W , namely, where  ∈ R J 1 ×⋯×J N −1 ×H is a dimensionality-reduced tensor of  ′ , H ≤ 1 is the number of data  that contains. Furthermore, we solve the learn problem of MPHSIC-TDR by adding constraints A T N A n = I J n , n = 1, … , N − 1. Therefore, there is At the same time, the learning problem of MPHSIC-TDR becomes the problem of Rayleigh quotient. The column vectors of the matrix W are the J 1 ⋯ J N −1 basis of the symmetric semi-positive definite matrix Π, i.e. the orthogonal eigenvectors corresponding to the largest eigenvalues J 1 ⋯ J N −1 of Π.

Complexity analysis
In this section, we analyse the complexity of our algorithm. The computational complexity can be started from four aspects: the calculation of C M , the kernel matrix K  of the original tensor, C M K  C M and the dimensionality-reduced matrix  (N ) . C M can be viewed as the product of a column vector and a row vector, so the complexity of C M is (M 2 ). Kernel function is the mapping of two vectors to a real number, where the inner product of two vectors needs to be constructed, therefore the complex- Since the kernel matrix has M 2 elements, all the above complexity is ( ). The last step is to obtain the matrix  (N ) after DR, which can be obtained by Equation (27). Therefore, the complexity of multiplying between two matrices needs to be calculated that is (M ( Finally, the total computational complexity of the algorithm is (M 2 +

Data sets
COIL20 [53] contains 20 objects taken from different angles, seen from Figure 1, and 72 images per object. Each image is uniformly processed as the size of 32*32. Furthermore, we have transformed the entire data set as a tensor of 32*32*1440. ETH80 (see Figure 2) [54] is a multi-visual image database, which contains 8 categories and 10 objects and 41 different visual images with the size of 128*128 in each object. In the experiments, the size is reset to 32*32 and the all images are expressed as a tensor object of 32*32*3280.
ORL face data set [55] (see Figure 3) consists of 400 facial objects with 40 images under different poses and expressions. The size of each image is 32*32. We restore the data set to the form of 32*32*400 for the experiments.
USPS [56] (see Figure 4), another famous handwritten digit data set, consists of a total of 2000 handwritten digital images. The entire data set is converted to a tensor form of 16*16*2000.    Figure 6).

Parameter selection
In the experiments, we choose the suitable kernel function from multiple kernel functions for different data sets. The kernel function set contains six typical kernel functions: the Gaussian kernel, the wavelet kernel, the GTS (Generalized T-Student), the hyperbolic tangent kernel (Tanh), the sigmoid kernel and linear kernel. In classification experiments, the KNN classifier is implemented to test the classification accuracy of all the compared algorithm and its nearest neighbour number is set to 10. In clustering experiments, the K -means is utilized to collect clustering information of the proposed algorithm. The reduced dimensions of the eight data sets (COIL20, ETH80, ORL, MNIST, USPS, CMU PIE, Faces94 males and Olivetti) are experientially set to be 8×8, 8×8, 6×6, 8×8, 7×8, 8×8 and 10×8, respectively.

Clustering experimental results
In this section, we aim to demonstrate the TDR capability of our proposed algorithm by performing clustering experiments on the reduced tensor data. Clustering accuracy (AC) and normal mutual information (NMI) are used to evaluate the performances of the compared algorithms. AC is defined as where l * j , l j denote true label and clustering label, respectively. When l * j = l j , (•, •) = 1, otherwise, (•, •) = 0. NMI is defined as where H (l * ) and H (l ) denote the entropy of l * and l , respectively. MI denotes the mutual information of l * and l . The clustering method adopts the distance-based K -means for data clustering. It means that the closer the distance between two objects is, the greater the similarity will be. Since the Kmeans algorithm is sensitive to the selection of initial data, we repeat the clustering process 10 times in each compared algorithm experiment, and take the average results as the finally experiment results. On each data set, we randomly select D samples from C classes as the training set for clustering experiments. Then 10 training sets are selected, and the average of the clustering results is calculated at the final.
For the ETH80 data set, we randomly select 41 samples from each class as the training set for clustering experiments. In the following, we first conduct the experiments of selecting the best kernel function for the proposed algorithm based on the AC and NMI, respectively (see Table 2). It shows that under different clustering criterion, the best kernel maybe different. With the decrease of the data size and number of classes, the CA and clustering NMI increase significantly. We randomly select 20 samples from each class of Faces94 data set as the training set for clustering experiments. The experiment results of selecting the best kernel function for MPHSIC-TDR are  shown in Table 3. It can be seen that with different clustering criterion, the best kernel also is different. It means that different data sets and different evaluation criterion need different kernel function. With the decrease of the data size and number of classes, the CA and clustering NMI on Faces94 also increase significantly. The reason may be that single kernel function performs well in less categories. Inspired by it, the research of multiple kernels for HSIC will be one of our further work.

FIGURE 7
The clustering experimental results on ETH80, and Faces94 data sets. The clustering accuracy and NMI performances of eight compared algorithms versus different number of categories (C ) on ETH80, and Faces94 male are shown here. The first row shows the AC and NMI on ETH80, and the second row shows the AC and NMI on Faces94 data set. MPHSIC-TDR is the proposed algorithm Seen from Figures 7(a) and (b), most of algorithms can obtain a good performance when the number of categories is small. Although the clustering results of each algorithm ETH80 have a little different under different computed criteria, their AC and NMI performances are basically consistent with the increasing of the number of categories. In other words, GNMF has the worst performances on ETH80, and GLTD follows it. In most cases, LHGPD and MRNTD have a little better than GLTD. The performance of MMF is better than LHGPD and MRNTD. But MMF is not stable, such as when C = 6, it works worse than LHGPD and MRNTD. LRR-HTD and our MPHSIC-TDR have the best performance when the number of categories is small. But with increasing of the number of categories, the performance of LRR-HTD has deteriorated a lot. Seen from Figures 7(c) and (d), MRNTD and LHGPD almost have the worst performances on Faces94 male data set, and show poor robustness. MMF and NTD have different performance under different clustering criteria. GLTD and GNMF work better on the Faces94 data set. On the whole, our MPHSIC-TDR also has the best performance. The reason is that we choose the best kernel function, i.e. the most suitable kernel, for each data set. Therefore, the corresponding RKHS generated from the best kernel function contains unique information of data set. Then MPHSIC-TDR can explore the global probability distribution information for TDR.
In order to show more direct observation, the average clustering results of eight compared algorithms on ETH80 and Faces94 male data sets are shown in Figure 8 for visualization. It can been seen that the experimental results of our proposed algorithm is higher than other compared algorithms. The clustering experimental results verify the effectiveness of the proposed algorithm.

Classification experimental results
In the classification experiments, we adopt the KNN classifier with k = 10 on reduced-dimensional data sets, and use the cross-validation method for computing the accuracy. For each data set, we first conduct the experiments of selecting the best kernel function for the proposed algorithm based on classification accuracy. We randomly select D samples from C classes as the sample set for classification experiments. For example, we randomly select 72, 41, 10, 200 and 20 samples from each class of the six data sets (COIL20, ETH80, ORL, USPS, Faces94 males and Olivetti) as the sample set, respectively. The compared algorithms contains GNMF, MMF, LRR-HTD, MRNTD, LHGPD, NTD and MPHSIC-TDR, and MPHSIC-TDR is the proposed algorithm. In each experiment, we randomly select 80% of the reduced-dimensional data as the training set and the remaining 20% as the testing set for classification.  Each experiment tests 20 times and the average classification accuracy as the final result. Table 4 lists the classification results versus different number of categories (C ) for the best kernel function selection of MPHSIC-TDR in COIL20. From Table 4, we see that the best kernel is different under different number of categories. The classification results of seven compared methods for COIL20 data set are shown in Figure 9(a). It can be seen that all methods have a good performance when the number of categories is small. GNMF get the worst results, and MMF follows it. The reason may be that they are sensitive to the change of the number of neighbours. The performances of LRR-HTD are still not stable as the number of categories increase. NTD and MRNTD work better than the above three algorithms. LHGPD shows better effectiveness in the task of classification. MPHSIC-TDR has the best performance in the task of classification, which can achieve the separability of distribution information of different categories in COIL20. Table 5 exhibits the classification results versus different number of categories (C ) for the best kernel function selection of MPHSIC-TDR on ETH80. Figure 9(b) shows the classification performances of seven algorithms on ETH80. First, we observed that overall, the compared methods have a good classification results in ETH80. But when the number of categories starts to grow, the classification accuracy of them are decrease. However, MPHSIC-TDR still outperforms the other methods. It proves that the proposed algorithm can effectively explore the global statical information of tensor data set with learning the best kernel function. Table 6 lists the classification results versus different number of categories (C ) for the best kernel function selection of MPHSIC-TDR in ORL. Figure 9(c) shows the classification performances of seven compared algorithms on ORL data set. From this figure, we see that NTD almost has the worst performance, and MRNTD follows it. This may be because that there are too few samples in ORL data to obtain an effective NTD. LHGPD perform better in the task of classification on ORL. Overall, our proposed MPHSIC-TDR has got the best classification results in most case. Table 7 shows the classification results versus different number of categories (C ) for the best kernel function selection of MPHSIC-TDR on USPS data set. Figure 9(d) shows the classification performances versus different number of categories in seven algorithms on USPS. Similar to the classification results on ORL, the performance of NTD and MRNTD on the USPS data set are poor. When the number of class is less than 8, the performance of GNMF and LRR-HTD are similar. The performance of MMF is close to LHGPD. And the classification performances of MPHSIC-TDR on this data set still are satisfactory. Table 8 lists the classification results versus different number of categories (C ) for the best kernel function selection of MPHSIC-TDR in Faces94 males data set. Figure 9(e) and    Table 9 show the classification performances of all compared algorithms on Faces94 males. From them, we can observed that GNMF and MPHSIC-TDR (our method) have the best performance. The reason maybe that matrix-based method  maybe have a better performance than some high-order-tensorbased method on some data set. And the MPHSIC-TDR has a stable and good performance by preserving the global statical information.    Table 10 lists the classification results versus different number of categories (C ) for the best kernel function selection of MPHSIC-TDR on Olivetti data set. Figure 9(f) shows the classification performances of all compared methods on Olivetti. Seen form Figure 9(f), the performance of MRNTD is worst, and MMF and LRR-HTD have a higher classification results than it. And obviously, our proposed MPHSIC-TDR has got the best classification. Moreover, the effectiveness obtained by the proposed MPHSIC-TDR method on Olivetti data set is significant compared with the six state-of-the-art algorithms.
To observe the classification performances of all methods numerically, Table 11 summarizes the average classification results of each algorithm on six data sets. We also provide the standard deviation of the average classification results in the last line of Table 11 to show the robustness of our algorithm. Obviously, the proposed MRHSIC-TDR performs better than other algorithms, which confirms its effectiveness in data-classification task. According to the experiment tables and figures, it is verified that HSIC criterion can be applied to the algorithm of data DR, which enriches the algorithm library of data DR to some certain extent. The validity and reliability of the proposed algorithm can be verified by combining the experimental results of clustering and classification.

CONCLUSION
This paper proposes a novel approach based on mode product and HSIC. HSIC criterion measures the statistical correlation between two data sets. However, instead of directly measuring the statistical correlation between two data sets, HSIC first transforms them to two RKHSs, and then uses the HS operator of the two RKHSs to measure the statistical information of data sets. The RKHS can be generated from different kernel function according to different application situation, increasing the applicability and flexibility of HSIC. In our work, we first establish a novel framework for TDR (named HSIC-TDR), which captures statistical information of tensor data set. According to tensor algebra, the dimensionality of tensor can be changed by the mode product of the tensor and a matrix. Based on it, MP-TDR is generated for learning an explicit expression between the high-dimensional tensor and its lowdimensional representation. In MP-TDR, the determination of the mode product matrix is open. Furthermore, we propose a novel algorithm (MPHSIC-TDR), which has the merits of both HSIC and mode product. MPHSIC-TDR not only explores the global statical information of tensor data set, but also learns an explicit mapping for out-of-samples. Finally, we conduct the clustering and classification experiments on six real-world data sets with some state-of-the-art algorithms to demonstrate the effectiveness of the proposed method.