Revisiting the predictive power of kernel principal components

In this short note, recent results on the predictive power of kernel principal components in a regression setting are extended in two ways: (1) in the model-free setting, we relax a conditional independence model assumption to obtain a stronger result; and (2) the model-free setting is also extended to infinite-dimensional kernels. © 2020 The Author(s). Published by Elsevier B.V. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).


Introduction
The efficient statistical analysis of high-dimensional datasets is considered one of the most challenging problems of recent years. The ability to collect and store massive amounts of data cheaply has allowed scientists to gather large volumes of high-dimensional data. To make such datasets amenable to efficient analysis, researchers often resort to a dimension-reduction preprocessing step. In a regression setting, when the predictors are high dimensional and the dimension of the dataset must be reduced for an efficient analysis, a number of preprocessing steps have been proposed in the literature. One of these approaches is principal component analysis (PCA), which effectively reduces the dimensionality of the predictors (see for example Chiaromonte and Martinelli, 2002).
During the 20th century, a long debate in the statistics community evolved around the effectiveness of using PCA to reduce the dimensionality in a regression setting (and, more generally, in a supervised setting). The debate on this topic had some prominent statisticians taking opposing sides (see for example Mosteller and Tukey, 1977; Cox, 1968). Cook (2007) gives a very detailed overview of this debate.
Following the discussion of this topic in Cook's Fisher Lecture (Cook, 2007) by Li (2007), a number of researchers tried to give a probabilistic answer on the predictive potential of principal components, that is, on the probability that the higher-order principal components will be more correlated with the response than the lower-order principal components. Artemiou and Li (2009) discussed this in a linear model under the assumption of an orientationally uniform covariance matrix Σ = var(X), and they proved that the probability of a higher-order principal component having higher correlation with the response than a lower-order one is greater than 1/2. Ni (2011) extended the result by showing that the exact probability is (2/π)E(arctan √(λ_i/λ_j)), where λ_i is the ith eigenvalue of Σ and i < j. Jones and Artemiou (2020) discussed this for the kernel principal components. The more general result in Jones and Artemiou (2020) essentially shows that, in a model-free setting, if someone randomly chooses a measure for the conditional distribution of Y | X, results similar to the ones proved by Artemiou and Li (2009) and Ni (2011) hold. In this paper we generalize some of the results in Jones and Artemiou (2020) on the predictive potential of kernel principal components in two directions. First, we generalize the results in the model-free setting; that is, we extend the results to the case where the conditional independence Y ⊥⊥ Σ | X is relaxed to g(Y) ⊥⊥ Σ | X. Second, we propose a way to extend the results to infinite-dimensional kernels. In Jones and Artemiou (2020) the model-free setting was discussed only for finite-dimensional kernels; in this work, we incorporate an extra assumption to demonstrate that the results can be extended to infinite-dimensional kernels.
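Ni's exact probability can be checked numerically. In the Artemiou and Li (2009) linear-model setting, once Σ is diagonalized the squared correlation of the response with the ith predictor coordinate is proportional to λ_i β_i², and the coordinates of a coefficient vector β drawn uniformly from the unit sphere are proportional to iid standard Gaussians, so only the two relevant coordinates are needed. A minimal Monte Carlo sketch (the eigenvalues, sample size, and seed are our own arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
lam_i, lam_j = 4.0, 1.0   # eigenvalues of Sigma with i < j, so lam_i > lam_j

# coordinates of a uniform unit vector are proportional to iid N(0, 1) draws,
# so the event {lam_i * beta_i**2 > lam_j * beta_j**2} needs only two Gaussians
g = rng.standard_normal((200_000, 2))
empirical = np.mean(lam_i * g[:, 0] ** 2 > lam_j * g[:, 1] ** 2)

# Ni (2011): exact probability is (2/pi) * arctan(sqrt(lam_i / lam_j))
exact = (2 / np.pi) * np.arctan(np.sqrt(lam_i / lam_j))
```

With λ_i/λ_j = 4 the exact value is about 0.705, comfortably above 1/2, and the empirical frequency agrees to Monte Carlo accuracy.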
The rest of the paper is structured as follows. In Section 2 we revisit some key results and notation from Jones and Artemiou (2020). In Section 3 we discuss the extensions of these results. We close with a short discussion in Section 4. We emphasize that most of the proofs are very similar to the ones presented in Jones and Artemiou (2020) and are therefore omitted. We do, however, provide the proof of the most general result in this work, which is the most general result to date on the predictive potential of kernel principal components. (Essentially, this is the most general result on the predictive potential of any form of principal component analysis.)

Predictive power of kernel principal components
In this section we revisit the most general results from Jones and Artemiou (2020), namely those on the predictive potential of kernel principal components in the model-free setting. The key assumption in their results is Y ⊥⊥ Σ | X, which we relax in the next section.
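For concreteness, empirical kernel principal components are obtained by double-centering the Gram matrix and eigendecomposing it; the component scores are the scaled eigenvectors. The sketch below is our own illustration (an RBF kernel with bandwidth 1 and a toy sinusoidal response, not code from the paper) computing the leading component scores and their squared correlations with the response:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 300
X = rng.uniform(-2, 2, size=(n, 2))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(n)

# Gaussian (RBF) kernel Gram matrix; its feature space is infinite dimensional
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq / 2.0)

# double-center the Gram matrix, as in empirical kernel PCA
J = np.eye(n) - np.ones((n, n)) / n
Kc = J @ K @ J

# eigendecompose; np.linalg.eigh returns eigenvalues in ascending order
w, V = np.linalg.eigh(Kc)
idx = np.argsort(w)[::-1]
w, V = w[idx], V[:, idx]
scores = V * np.sqrt(np.clip(w, 0, None))   # kernel principal component scores

# squared correlation of the response with each of the five leading components
r2 = [np.corrcoef(y, scores[:, k])[0, 1] ** 2 for k in range(5)]
```

The probabilistic results below concern how often the higher-order components (larger eigenvalues) achieve the larger entries of such an r² list.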

Theorem 1. Suppose that: 1. H is a finite-dimensional Hilbert space; 2. Σ is a random covariance operator whose distribution is invariant under unitary transformations; in other words, Σ has the same distribution as UΣU⁻¹ for any unitary U : H → H. It is assumed that, almost surely, the non-zero eigenvalues have unit multiplicity.
Then, for any i < j, with probability greater than 1/2 the ith kernel principal component has higher squared correlation with the response than the jth.
The next theorem gives the more general result proved in Jones and Artemiou (2020), which states that we can arbitrarily choose a conditional distribution for Y | X and the result still holds. In Section 3 we will extend this, as well as the above result, to allow the Hilbert space H to be infinite dimensional.
Theorem 2. Suppose that, in addition to the conditions of Theorem 1, ν is a random conditional distribution for Y | X and g is a real-valued measurable function of Y such that the random function m_ν(·) = ∫ g ν(dω, ·) belongs to H almost surely and, with probability 1, m_ν ≠ 0. Then, for any i < j, the probability that the ith kernel principal component has higher squared correlation with g(Y) than the jth is greater than 1/2.
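The unitary-invariance condition on Σ in Theorems 1 and 2 can be simulated by drawing the eigenbasis from the Haar (rotation-invariant) measure on the orthogonal group while keeping a fixed spectrum; a minimal sketch of one such draw (our construction, finite-dimensional real case, spectrum chosen arbitrarily with unit multiplicities):

```python
import numpy as np

def haar_orthogonal(d, rng):
    """Draw a d x d orthogonal matrix from the Haar measure."""
    z = rng.standard_normal((d, d))
    q, r = np.linalg.qr(z)
    # fixing the signs of R's diagonal makes the QR output exactly Haar-distributed
    return q * np.sign(np.diag(r))

rng = np.random.default_rng(1)
lam = np.array([5.0, 4.0, 3.0, 2.0, 1.0])   # fixed spectrum, unit multiplicities
u = haar_orthogonal(5, rng)
sigma = u @ np.diag(lam) @ u.T               # one draw of Sigma = U Lam U^{-1}
```

Because the eigenbasis U is Haar-distributed, Σ and VΣV⁻¹ have the same distribution for any fixed orthogonal V, which is exactly the invariance required above.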

Model free setting
In this section we present the most important results of this paper. We first demonstrate how one can relax the assumption in Theorems 1 and 2 to extend the results to a more general setting. We then demonstrate how the model-free results can be extended to an infinite-dimensional Hilbert space H. We emphasize here that the results in this section are the most general to date on the predictive power of kernel principal components. More importantly, the results on the infinite-dimensional Hilbert space H were not addressed at all in Jones and Artemiou (2020).
Before we outline the result, we explain the assumption we use in the model-free setting. In Jones and Artemiou (2020) the assumption Y ⊥⊥ Σ | X was used. To extend Theorem 1, a random conditional distribution can be chosen for g(Y) | X rather than for Y | X. The remark below gives an example of why such a relaxation is important.

Remark 1.
An example of a model for which g(Y ) ⊥ ⊥ Σ|X holds but Y ⊥ ⊥ Σ|X fails is given by:
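One construction with this property (a sketch of our own, for illustration, not necessarily the model intended here): let S be a ±1-valued random sign that is measurable with respect to Σ, and set

```latex
Y = S\,\{f(X) + \varepsilon\}, \qquad g(y) = y^{2},
\qquad \varepsilon \perp\!\!\!\perp (X,\Sigma), \qquad S \perp\!\!\!\perp (X,\varepsilon).
```

Then g(Y) = {f(X) + ε}² does not involve S, so g(Y) ⊥⊥ Σ | X holds; but given X, the sign of Y carries information about Σ through S, so Y ⊥⊥ Σ | X fails.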

Finite dimensional setting
Although both Theorems 1 and 2 can be extended in this setting, we focus on the extension of Theorem 2, which covers the more general setting; the extension of Theorem 1 is straightforward.
Theorem 3. Suppose that: 1. H is a finite-dimensional Hilbert space; 2. g is a real-valued measurable function of Y such that the random function m_ν(·) = ∫ g ν(dω, ·) belongs to H almost surely and, with probability 1, m_ν ≠ 0; 3. Σ is a random covariance operator whose distribution is invariant under unitary transformations; in other words, Σ has the same distribution as UΣU⁻¹ for any unitary U : H → H. It is assumed that, almost surely, the non-zero eigenvalues have unit multiplicity; 4. ν is a random conditional distribution for g(Y) | X such that P(ν ∈ K_0) = 0, where K_0 denotes the set of conditional distributions for which X and g(Y) are independent.
Then, for any i < j, the probability that the ith kernel principal component has higher squared correlation with g(Y) than the jth is greater than 1/2.
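The flavor of this conclusion can be illustrated by simulation: draw Σ with a Haar-random eigenbasis and a fixed spectrum, generate data from a linear model, and apply a nonlinear g. The sketch below is entirely our own construction (the linear kernel, so that kernel principal components reduce to ordinary ones, g(y) = y³, and the sample sizes and noise level are all arbitrary choices); it estimates how often the first principal component beats the fifth in squared sample correlation with g(Y):

```python
import numpy as np

def haar_orthogonal(d, rng):
    # QR of a Gaussian matrix with sign-fixed R diagonal is Haar-distributed
    z = rng.standard_normal((d, d))
    q, r = np.linalg.qr(z)
    return q * np.sign(np.diag(r))

rng = np.random.default_rng(2)
d, n, reps = 5, 500, 2000
lam = np.array([5.0, 4.0, 3.0, 2.0, 1.0])   # ordered eigenvalues of Sigma
beta = np.ones(d) / np.sqrt(d)
wins = 0
for _ in range(reps):
    u = haar_orthogonal(d, rng)              # random eigenbasis -> invariant Sigma
    X = rng.standard_normal((n, d)) * np.sqrt(lam) @ u.T   # X ~ N(0, U Lam U')
    y = X @ beta + 0.5 * rng.standard_normal(n)
    gy = y ** 3                               # a nonlinear transform g(Y)
    r2_hi = np.corrcoef(gy, X @ u[:, 0])[0, 1] ** 2   # 1st principal component
    r2_lo = np.corrcoef(gy, X @ u[:, 4])[0, 1] ** 2   # 5th principal component
    wins += r2_hi > r2_lo
frac = wins / reps
```

In runs of this sketch the winning fraction sits well above 1/2, consistent with the theorem.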

Infinite dimensional kernels
In this section we show how one can address the model-free setting in the infinite-dimensional kernel case, which was not considered in Jones and Artemiou (2020). By making a uniformity assumption on a "restriction" of Σ, the results can be extended.

Assumption 1.
Suppose that Σ is a random compact covariance operator. There exists a set of integers V = {v_1, . . . , v_l}, for some l ∈ N, such that the distribution of the restriction of Σ to the span of the eigenvectors associated with the eigenvalues indexed by V is invariant under unitary transformations. Without loss of generality, it will be assumed that v_1 < v_2 < · · · < v_l.
The following theorem says that one can choose any measure to define the relationship between X and Y as long as the two are not independent. This is the most general theorem on the predictive power of kernel principal components; for this reason, we provide its proof, although it is very similar to the one in Jones and Artemiou (2020). (As in the previous section, we show the extension of Theorem 2 under the new assumption. One can adjust Theorem 1 similarly.)
Theorem 4. Suppose that: 1. g is a real-valued measurable function of Y such that the random function m_ν(·) = ∫ g ν(dω, ·) belongs to H almost surely and, with probability 1, m_ν ≠ 0; 2. Σ is a random compact covariance operator satisfying Assumption 1. It is assumed that, almost surely, the non-zero eigenvalues have unit multiplicity; 3. ν is a random conditional distribution for g(Y) | X such that P(ν ∈ K_0) = 0, where K_0 denotes the set of conditional distributions for which X and g(Y) are independent.
Then, for any i < j with i, j ∈ V, the probability that the ith kernel principal component has higher squared correlation with g(Y) than the jth is greater than 1/2.
Proof. We begin similarly to the proof of Theorem 1 by noting that, for any i, the squared correlation between g(Y) and the ith kernel principal component is determined by (m_ν, Σ). Also note that g(Y) ⊥⊥ Σ | (X, ν) by assumption. We see that ν ⊥⊥ (X, Σ) implies m_ν ⊥⊥ (X, Σ). Thus, for any κ ∈ K, the conditional probability, given ν = κ and the eigenvalues of Σ, that the ith kernel principal component has higher squared correlation with g(Y) than the jth can be computed with m_κ treated as a fixed element of H. Using the results used to prove Theorem 3 in Jones and Artemiou (2020), this conditional probability equals (2/π) arctan √(λ_i/λ_j) > 1/2, where λ_i > λ_j are the corresponding eigenvalues of Σ. Taking the conditional expectation on both sides of this equality, we get the desired result. □

Discussion
In this paper we extend recently proposed results in the literature on the predictive potential of kernel principal components. There are two important contributions in this work. The most important is the relaxation of the conditional independence assumption in the model-free case: the new assumption is more general than the previous one, and therefore the result holds in a much broader range of cases. The second important contribution is the extension of the model-free approach to the infinite-dimensional Hilbert space setting.
The results in this work enhance the discussion around a topic that troubled statisticians in the 20th century and has received renewed interest lately due to the volume of high-dimensional data collected nowadays, which forces researchers to perform variable screening in supervised settings using PCA approaches (which are unsupervised). In the last decade, a series of papers provided evidence of the predictive potential of principal components in various settings, and in this work we expand the results on the predictive potential of kernel principal components, which was first addressed in Jones and Artemiou (2020).
This discussion is still open and there are a lot of interesting questions one can try to address. One of the most obvious is that we are measuring the relationship between Y and X in nonlinear regression models using correlation, which measures linear relationships. It will be interesting to extend this to a different measure of association which is more appropriate for nonlinear relationships. One such approach is given in Artemiou (2021).