Statistical Dependence Test with Hilbert-Schmidt Independence Criterion

This paper studies a nonlinear independence test between random variables based on the Hilbert–Schmidt independence criterion (HSIC), one of the most popular kernel methods for testing statistical dependence. Using HSIC has several advantages. It is simpler than other kernel measures such as kernel canonical correlation, because it requires no extra regularization terms to obtain appropriate finite-sample behavior. In addition, HSIC enjoys faster learning rates because its estimate converges to a well-defined population quantity. The paper studies the detailed concepts and criteria associated with HSIC, its p-value, and its sensitivity maps. Experimental results on the wine data from the UCI repository show the good performance of the approach.


Introduction
Estimating dependencies among random variables is a well-studied problem in statistics. Traditional methods such as Pearson's correlation [1] and Spearman's rank correlation [2] are widely used; most researchers are familiar with them, and they are available in many statistical software packages [3]. However, the former can only capture linear associations between random variables, and both methods share a further drawback: they apply only to pairs of scalar variables. With these two methods, multidimensional dependence testing is limited to repeating the test for all pairwise variables and combining the results into a dependence matrix.
To address the problems of nonlinear dependence measures, researchers have found various advantages in kernel dependence methods [4]. These include stable behavior with higher-dimensional variables, applicability to small sample sizes, the ability to capture higher-order relations between variables, and simple computation of empirical estimates [5]. For pairwise independence with multivariate components, several approaches have been introduced. Székely et al. [6] and Székely and Rizzo [7] proposed a test based on distance covariance with fixed probabilities and infinite sample size. Székely and Rizzo [8] also proposed a t-test based on a modified distance covariance method. Furthermore, Gretton et al. [9] introduced the Hilbert–Schmidt independence criterion (HSIC), a kernel method for evaluating statistical dependence between random variables [10], which is the topic developed in this paper. In Section 2, the paper introduces linear dependence, the HSIC estimate, its decision threshold, an appropriate kernel selection method, the concept of sensitivity maps, and the setup of an experiment using wine data [11]. Section 3 discusses the process and results of the experiment. Finally, Section 4 concludes with some commentary and suggestions for further research.

Experimental Section
This section presents an experiment with a criterion for the HSIC method. It has two main parts: the first defines the estimator for HSIC, its decision threshold, and its corresponding sensitivity maps; the second introduces the experimental setup for the wine data [11].

Linear Dependence
To fix notation, let $\mathcal{X} \subseteq \mathbb{R}^{d_x}$ and $\mathcal{Y} \subseteq \mathbb{R}^{d_y}$ be two spaces from which the sample pairs $(\mathbf{x}, \mathbf{y})$ are drawn according to the joint distribution $P_{xy}$. Then, the covariance matrix is defined as

$$C_{xy} = \mathbb{E}_{xy}[\mathbf{x}\mathbf{y}^\top] - \mathbb{E}_x[\mathbf{x}]\,\mathbb{E}_y[\mathbf{y}]^\top, \qquad (1)$$

where $\mathbb{E}_{xy}$ is the expectation with respect to $P_{xy}$, and $\mathbb{E}_x$, $\mathbb{E}_y$ are expectations with respect to the marginal distributions $P_x$ and $P_y$. All variables and quantities are assumed to exist so that the following expressions are well defined. The covariance matrix contains the first-order dependences between the variables $\mathbf{x}$ and $\mathbf{y}$. A well-studied statistic that summarizes this matrix is the Hilbert–Schmidt norm [9]. Its square equals the sum of the squared singular values $\gamma_i$ (or eigenvalues, in the symmetric case): $\|C_{xy}\|_{\mathrm{HS}}^2 = \sum_i \gamma_i^2$. If there is no first-order dependence between $\mathbf{x}$ and $\mathbf{y}$, this quantity is zero.
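As an illustration of the linear case, the following minimal NumPy sketch (not part of the original experiment; the toy data and dimensions are illustrative assumptions) builds the empirical cross-covariance matrix and checks that its squared Hilbert–Schmidt norm equals the sum of its squared singular values, which for a matrix coincides with the squared Frobenius norm:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=(n, 3))
# y depends linearly on the first two features of x
y = 2.0 * x[:, :2] + 0.1 * rng.normal(size=(n, 2))

# Empirical cross-covariance matrix C_xy = E[x y^T] - E[x] E[y]^T
C = (x - x.mean(axis=0)).T @ (y - y.mean(axis=0)) / n

# Squared Hilbert-Schmidt norm = sum of squared singular values,
# which equals the squared Frobenius norm of the matrix
sv = np.linalg.svd(C, compute_uv=False)
hs_sq = float(np.sum(sv ** 2))
print(hs_sq)
```

For linearly dependent data as above the statistic is clearly nonzero, while for independent data it concentrates near zero as the sample grows.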

Kernel Dependence
A nonlinear extension of the covariance notion was introduced in [9]. Define a mapping $\phi: \mathcal{X} \to \mathcal{F}$ such that $K(\mathbf{x}, \mathbf{x}') = \langle \phi(\mathbf{x}), \phi(\mathbf{x}') \rangle$ is a positive definite kernel function giving the inner product between features. The feature space $\mathcal{F}$ has the structure of a reproducing kernel Hilbert space (RKHS). Analogously, define a second map $\psi: \mathcal{Y} \to \mathcal{G}$ with positive definite kernel $L(\mathbf{y}, \mathbf{y}') = \langle \psi(\mathbf{y}), \psi(\mathbf{y}') \rangle$. Then there exists a cross-covariance operator between the two maps, analogous to the covariance matrix in (1): a linear operator $C_{xy}: \mathcal{G} \to \mathcal{F}$ such that

$$C_{xy} = \mathbb{E}_{xy}\big[(\phi(\mathbf{x}) - \mu_x) \otimes (\psi(\mathbf{y}) - \mu_y)\big], \qquad (2)$$

where $\otimes$ is the tensor product between the two vector spaces, $\mu_x = \mathbb{E}_x[\phi(\mathbf{x})]$, and $\mu_y = \mathbb{E}_y[\psi(\mathbf{y})]$. Interested readers can find detailed treatments by Baker [12] and Fukumizu et al. [13]. The squared norm of the cross-covariance operator, $\|C_{xy}\|_{\mathrm{HS}}^2$, is named the HSIC and can be written as

$$\mathrm{HSIC}(\mathcal{F}, \mathcal{G}, P_{xy}) = \mathbb{E}_{xx'yy'}[K(\mathbf{x}, \mathbf{x}')\,L(\mathbf{y}, \mathbf{y}')] + \mathbb{E}_{xx'}[K(\mathbf{x}, \mathbf{x}')]\,\mathbb{E}_{yy'}[L(\mathbf{y}, \mathbf{y}')] - 2\,\mathbb{E}_{xy}\big[\mathbb{E}_{x'}[K(\mathbf{x}, \mathbf{x}')]\,\mathbb{E}_{y'}[L(\mathbf{y}, \mathbf{y}')]\big].$$

In this equation, $\mathbb{E}_{xx'yy'}$ denotes the expectation over both $(\mathbf{x}, \mathbf{y}) \sim P_{xy}$ and $(\mathbf{x}', \mathbf{y}') \sim P_{xy}$, where $(\mathbf{x}', \mathbf{y}')$ is a second pair drawn independently from the same distribution. More details are given by Gretton et al. [9].
Given samples $X \in \mathbb{R}^{n \times d_x}$ and $Y \in \mathbb{R}^{n \times d_y}$ with $n$ pairs $(\mathbf{x}_1, \mathbf{y}_1), \ldots, (\mathbf{x}_n, \mathbf{y}_n)$ drawn from $P_{xy}$, the empirical estimator of HSIC is

$$\mathrm{HSIC}(X, Y) = \frac{1}{(n-1)^2}\,\mathrm{Tr}(K H L H), \qquad (3)$$

where $\mathrm{Tr}$ is the trace of a matrix, $K$ and $L$ are the kernel matrices for $X$ and $Y$ with entries $K_{ij} = K(\mathbf{x}_i, \mathbf{x}_j)$ and $L_{ij} = L(\mathbf{y}_i, \mathbf{y}_j)$, and $H = I - \frac{1}{n}\mathbf{1}\mathbf{1}^\top$ centers the features in $\mathcal{F}$ and $\mathcal{G}$. Here $n$ denotes the sample size used in the empirical HSIC.

Decision Threshold
The next important issue is to set the decision threshold θ for kernel-based independence tests. Song et al. [14] proposed statistical independence tests based on the empirical HSIC estimator (3). The test is between the null hypothesis $H_0: P_{xy} = P_x P_y$ and the alternative hypothesis $H_1: P_{xy} \neq P_x P_y$, decided by comparing a test statistic with a threshold. Among the possible definitions of the threshold for the HSIC estimate, one approach approximates the null distribution by a two-parameter gamma distribution, with shape and scale parameters estimated as

$$\hat{\alpha} = \frac{\big(\mathbb{E}[\mathrm{HSIC}(X, Y)]\big)^2}{\mathrm{var}\big(\mathrm{HSIC}(X, Y)\big)}, \qquad \hat{\beta} = \frac{n\,\mathrm{var}\big(\mathrm{HSIC}(X, Y)\big)}{\mathbb{E}[\mathrm{HSIC}(X, Y)]}.$$

The threshold θ can then be calculated from the inverse cumulative distribution function at $1 - \alpha$, where α is a chosen significance level. Usually α is taken to be 0.05, so a 95% confidence level about the true value is obtained. The two random variables are considered dependent if $\mathrm{HSIC}(X, Y) > \theta$, in which case the null hypothesis is rejected, and independent otherwise. An alternative method is to compute HSIC's p-value directly from the corresponding HSIC estimate and the cumulative distribution function obtained through a permutation test.
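The permutation-test alternative can be sketched as follows (an illustrative implementation, assuming Gaussian kernels with unit bandwidth and toy data; 200 permutations is an arbitrary choice):

```python
import numpy as np

def rbf(a, sigma=1.0):
    # Gaussian kernel matrix for row-vector samples in a
    sq = np.sum(a ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * a @ a.T, 0.0)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def hsic_stat(K, L):
    # Biased HSIC statistic Tr(K H L H) / (n - 1)^2
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

def hsic_pvalue(x, y, n_perm=200, seed=0):
    # Permutation test: shuffling y simulates the null of independence;
    # the p-value is the fraction of permuted statistics >= the observed one.
    rng = np.random.default_rng(seed)
    K, L = rbf(x), rbf(y)
    obs = hsic_stat(K, L)
    count = 0
    for _ in range(n_perm):
        idx = rng.permutation(K.shape[0])
        if hsic_stat(K, L[np.ix_(idx, idx)]) >= obs:
            count += 1
    return (count + 1) / (n_perm + 1)

rng = np.random.default_rng(2)
x = rng.normal(size=(100, 1))
p_dep = hsic_pvalue(x, x + 0.1 * rng.normal(size=(100, 1)))  # dependent pair
p_ind = hsic_pvalue(x, rng.normal(size=(100, 1)))            # independent pair
print(p_dep, p_ind)
```

The "+1" correction in the p-value avoids reporting exactly zero with a finite number of permutations.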

Kernel Selection
The method involves two kernel functions, $K$ and $L$. Common kernels, such as radial basis functions or polynomial functions, are usually used. Here the radial basis function is expressed as $K(\mathbf{x}, \mathbf{x}') = \exp\big(-\|\mathbf{x} - \mathbf{x}'\|^2 / (2\sigma^2)\big)$, with $\sigma \in \mathbb{R}^+$. For dependence estimation, if $\mathcal{F}$ and $\mathcal{G}$ are RKHSs with universal kernels $K$ and $L$, as studied by Steinwart [15], then $\mathrm{HSIC}(\mathcal{F}, \mathcal{G}, P_{xy}) = 0$ if and only if $\mathbf{x}$ and $\mathbf{y}$ are independent. Radial basis functions and Laplacian kernels are universal, whereas polynomial kernels are not and may be preferred only in particular cases.
Another approach is to select the kernel functions so as to maximize the power of the test, i.e., the probability of rejecting the null hypothesis $H_0$ when $\mathbf{x}$ and $\mathbf{y}$ are actually dependent. One method is to use the test statistic $\mathrm{Tr}(KHLH)$ itself as a surrogate for power. However, selecting the kernel on the same data used for the test is invalid, because the selected statistic then depends on the data. An alternative is therefore to use half of the data, $\{(\mathbf{x}_i, \mathbf{y}_i)\}_{i=1}^{n/2}$, to select the kernel functions and to run the test on the remaining half, $\{(\mathbf{x}_i, \mathbf{y}_i)\}_{i=n/2+1}^{n}$.
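The data-splitting heuristic can be sketched as follows (illustrative only: the candidate bandwidth grid and the sinusoidal toy relation are assumptions, not the paper's setup):

```python
import numpy as np

def rbf(a, sigma):
    sq = np.sum(a ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * a @ a.T, 0.0)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def hsic_stat(x, y, sigma):
    # Biased HSIC with a shared RBF bandwidth for both kernels
    n = x.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(rbf(x, sigma) @ H @ rbf(y, sigma) @ H) / (n - 1) ** 2

rng = np.random.default_rng(3)
x = rng.normal(size=(200, 1))
y = np.sin(3.0 * x) + 0.1 * rng.normal(size=(200, 1))

# Select the bandwidth on the first half (maximizing the statistic),
# then evaluate the test statistic on the held-out second half
x1, y1, x2, y2 = x[:100], y[:100], x[100:], y[100:]
candidates = [0.1, 0.5, 1.0, 2.0, 5.0]
best = max(candidates, key=lambda s: hsic_stat(x1, y1, s))
test_value = hsic_stat(x2, y2, best)
print(best, test_value)
```

Because the bandwidth is chosen on held-out data, the statistic on the second half remains a valid test statistic.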

Sensitivity Maps
The relations learned by any kernel method depend on the kernel feature mapping, and studying this mapping helps researchers build a visual interpretation of the dependence estimate. Sensitivity maps were proposed by Kjems et al. [16] and Camps-Valls [5]; the concept is related to leverage and influential points. Consider a function $f: \mathbb{R}^d \to \mathbb{R}$ parametrized by $\boldsymbol{\theta} = (\theta_1, \ldots, \theta_p)$. The sensitivity map of feature $j$ is the expected squared derivative of $f$ with respect to that argument:

$$s_j = \int \left(\frac{\partial f(\mathbf{x})}{\partial x_j}\right)^2 p(\mathbf{x})\, d\mathbf{x}, \qquad (5)$$

where $p(\mathbf{x})$ is the probability density function of the input $\mathbf{x} \in \mathbb{R}^d$. The goal of the sensitivity map is therefore to measure how much $f$ varies along each direction of its input space. The derivatives are squared to avoid cancellation of terms with opposite signs; other transformations, such as taking absolute values, could also be used. The empirical sensitivity map is calculated by replacing the integral in (5) with a sum over $m$ samples, $s_j \approx \frac{1}{m}\sum_{i=1}^m \big(\partial f(\mathbf{x}_i)/\partial x_j\big)^2$, and stacking the $s_j$ into a vector gives the sensitivity vector $\mathbf{s} := (s_1, \ldots, s_d)$.
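A minimal numerical sketch of the empirical sensitivity map (using central finite differences in place of analytic derivatives; the toy function is an assumption chosen for illustration):

```python
import numpy as np

def sensitivity_map(f, X, eps=1e-5):
    # Empirical sensitivity of each feature j:
    # s_j = (1/m) * sum_i (df(x_i)/dx_j)^2, via central finite differences
    m, d = X.shape
    s = np.zeros(d)
    for j in range(d):
        e = np.zeros(d)
        e[j] = eps
        grad = (f(X + e) - f(X - e)) / (2.0 * eps)
        s[j] = np.mean(grad ** 2)
    return s

# Toy function: depends strongly on feature 0, weakly on 1, not at all on 2
f = lambda X: np.sin(X[:, 0]) + 0.2 * X[:, 1]
rng = np.random.default_rng(4)
X = rng.normal(size=(500, 3))
s = sensitivity_map(f, X)
print(s)
```

As expected, the sensitivity ranking recovers the relevance of the features: the ignored third feature receives a sensitivity of (numerically) zero.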

HSIC Sensitivity Maps
The previous section gives the general definition of sensitivity maps. However, that definition is constrained in two ways. First, it cannot be applied directly to general functions that depend on matrices. Second, it estimates feature relevance only, ignoring the individual samples. The former limitation affects HSIC, which is computed from data matrices. In what follows, HSIC is differentiated with respect to its input samples and features to remove both limitations. The pairwise HSIC is parametrized by the data matrices $X$ and $Y$ through their kernel matrices $K$ and $L$. Deriving HSIC with respect to the input data matrix entries $X_{ij}$ and $Y_{ij}$ yields the sensitivity maps of HSIC. Applying first-order derivatives and the chain rule, the derivative with respect to $X$ can be written as $S_X := \partial\,\mathrm{HSIC}/\partial X$, whose entries involve the Hadamard product $\circ$ between the centered kernel matrices and the derivative of the kernel $K$ with respect to each entry $X_{ij}$. By a similar process, the corresponding expression $S_Y := \partial\,\mathrm{HSIC}/\partial Y$ is obtained from a trace of Hadamard products involving $L$. Note that although each sensitivity map can be used independently, the results are vector-valued, so they should be treated jointly; the general sensitivity map over all features and samples is $S := (S_X, S_Y)$. The empirical sensitivity maps for features and samples are computed by summing over the corresponding domain: $s_j = \sum_{i=1}^{n} \big(\partial\,\mathrm{HSIC}/\partial X_{ij}\big)^2$ for feature $j$ and $s_i = \sum_{j=1}^{d} \big(\partial\,\mathrm{HSIC}/\partial X_{ij}\big)^2$ for sample $i$. This reveals the directions with the most influence on the estimate and provides a quantitative geometric evaluation of the test result.
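The feature and sample sensitivities of HSIC can be illustrated numerically (a sketch that replaces the closed-form Hadamard-product derivatives with finite differences; the toy data and unit kernel bandwidth are assumptions):

```python
import numpy as np

def rbf(a, sigma=1.0):
    sq = np.sum(a ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * a @ a.T, 0.0)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def hsic(X, Y):
    # Biased empirical HSIC with Gaussian kernels of unit bandwidth
    n = X.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(rbf(X) @ H @ rbf(Y) @ H) / (n - 1) ** 2

def hsic_grad_x(X, Y, eps=1e-5):
    # Numerical gradient dHSIC/dX_ij; central differences stand in for the
    # closed-form Hadamard-product derivatives of the kernel matrix
    G = np.zeros_like(X)
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            Xp, Xm = X.copy(), X.copy()
            Xp[i, j] += eps
            Xm[i, j] -= eps
            G[i, j] = (hsic(Xp, Y) - hsic(Xm, Y)) / (2.0 * eps)
    return G

rng = np.random.default_rng(5)
X = rng.normal(size=(40, 2))
Y = X[:, :1] + 0.1 * rng.normal(size=(40, 1))  # Y depends on feature 0 only
G = hsic_grad_x(X, Y)
feat_sens = np.sum(G ** 2, axis=0)  # sensitivity per feature
samp_sens = np.sum(G ** 2, axis=1)  # sensitivity per sample
print(feat_sens)
```

The per-feature sums recover which input feature drives the dependence, while the per-sample sums flag the observations with the most influence on the estimate.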

Proposed Criterion
This is a study of an alternative criterion for dependence based on HSIC's sensitivity in two directions, forward and backward, where the forward direction uses the sensitivities of the corresponding observations $\mathbf{x}$ and $\mathbf{y}$, and the backward direction uses the sensitivities of the corresponding residuals.

Experimental Setup
The HSIC method is used to analyze the dependence among 13 attributes of a set of red wine data [11]. The red wine data is divided into three groups A, B, and C, and the data in each group is standardized. Under the null hypothesis, the statistic $n\,\mathrm{HSIC}(X, Y)$ is approximated by a two-parameter gamma distribution, where $K$ is the kernel matrix of the kernel function $K$ and $n$ is the size of the data set. Since the 13 attributes lie in the same space, any two attributes and their kernel functions can be selected arbitrarily, and the correspondence between the two attributes and the kernel functions changes with the selection. Rearranging the correspondence between the two attributes yields a new data set that is theoretically independent. Through distribution fitting and testing, the p-value for a pair of attributes is $1 - F(\mathrm{HSIC})$, where $F$ is the cumulative distribution function of the fitted null distribution. By combining different kernel functions $K$ and $L$, we finally obtained the independence test results for the 13 attributes.

Table 2 shows the p-values between the 13 attributes in group B.

Table 3 shows the p-values between the 13 attributes in group C.
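The pairwise testing loop can be sketched as follows (a hedged illustration with a small synthetic stand-in, since the UCI wine data is not bundled here; the permutation count and unit bandwidth are arbitrary choices, and a permutation test replaces the gamma fit):

```python
import numpy as np

def rbf_1d(col, sigma=1.0):
    # Gaussian kernel matrix for a single standardized attribute
    a = col.reshape(-1, 1)
    return np.exp(-(a - a.T) ** 2 / (2.0 * sigma ** 2))

def hsic_stat(K, L):
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

def pairwise_pvalues(data, n_perm=100, seed=0):
    # p-value matrix over all attribute pairs via HSIC permutation tests;
    # columns are standardized first, as in the experiment described above
    rng = np.random.default_rng(seed)
    data = (data - data.mean(axis=0)) / data.std(axis=0)
    n, d = data.shape
    kernels = [rbf_1d(data[:, j]) for j in range(d)]
    P = np.ones((d, d))
    for a in range(d):
        for b in range(a + 1, d):
            obs = hsic_stat(kernels[a], kernels[b])
            count = 0
            for _ in range(n_perm):
                p = rng.permutation(n)
                if hsic_stat(kernels[a], kernels[b][np.ix_(p, p)]) >= obs:
                    count += 1
            P[a, b] = P[b, a] = (count + 1) / (n_perm + 1)
    return P

# Synthetic stand-in for a few wine attributes (the UCI data is not bundled)
rng = np.random.default_rng(6)
cols = rng.normal(size=(80, 3))
cols[:, 1] = cols[:, 0] + 0.1 * rng.normal(size=80)  # attributes 0 and 1 dependent
P = pairwise_pvalues(cols)
print(np.round(P, 3))
```

Note that this sketch yields a symmetric p-value matrix by construction; the asymmetry discussed below arises only when the two attributes of a pair are paired with different kernel functions.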

Experiment Results and Problems
In theory, different kernel functions are selected in the HSIC method in order to measure the attributes in different spaces, and the choice of kernel functions should not change the test results. The values at symmetric positions in the tables above should therefore be equal, since they indicate the independence of the same pair of attributes. However, the tables shown above are not symmetric. To study whether the combination of kernel functions increases or decreases the difference between p-values at symmetric positions, we divide the kernel-function combinations into three groups: 1) Both kernels are of the same kind, for example two Gaussian kernels with different parameters. The p-values at symmetric positions do not differ much.
2) The two kernels are of different kinds but both are bounded, for example a Gaussian kernel and a rational quadratic kernel. There is still no significant difference between the p-values at symmetric positions.
3) At least one unbounded kernel is included in the combination; again, there is no significant difference between the p-values at symmetric positions. Since data normalization is very important for unbounded kernels, we also experimented on the data without normalization. Even then, there is still no significant difference between the p-values at symmetric positions.

Analysis and Discussion of Problems
Since the calculation of the p-value depends heavily on the random sorting, different random orderings will indeed result in different fitted distributions. In fact, although the p-values at symmetric positions in the tables are not equal, they are of the same order of magnitude. At the same time, the experiment shows that the choice of kernel functions does not significantly affect the results, so the experimental process can be simplified, for example by reducing the amount of computation. Although data normalization is generally important, the experiment shows that removing normalization in the HSIC method does not have much impact. It can also be seen that some p-values lie near 0.05 or 0.01, the significance levels commonly used in hypothesis testing. Therefore, in HSIC applications, some pairs of attributes may yield different decisions when the same experiment is repeated, because the selected significance level is too close to their p-values.

Conclusion
This report has presented one of the kernel methods for measuring nonlinear dependence between variables. It studied the HSIC independence criterion in the context of regression and dependence estimation and discussed the sensitivity maps of the estimates. The method was tested on real wine data, and its performance was reviewed and interpreted. The empirical estimate was relatively easy to calculate and useful for data inference. Due to limitations of time and resources, the sensitivity-maps part admits further applications, which would be a subject for future studies.