Deep Kernel Principal Component Analysis for Multi-level Feature Learning

Principal Component Analysis (PCA) and its nonlinear extension Kernel PCA (KPCA) are widely used across science and industry for data analysis and dimensionality reduction. Modern deep learning tools have achieved great empirical success, but a framework for deep principal component analysis is still lacking. Here we develop a deep kernel PCA methodology (DKPCA) to extract multiple levels of the most informative components of the data. Our scheme can effectively identify new hierarchical variables, called deep principal components, capturing the main characteristics of high-dimensional data through a simple and interpretable numerical optimization. We couple the principal components of multiple KPCA levels, theoretically showing that DKPCA creates both forward and backward dependency across levels, which has not been explored in kernel methods and yet is crucial to extract more informative features. Various experimental evaluations on multiple data types show that DKPCA finds more efficient and disentangled representations with higher explained variance in fewer principal components, compared to the shallow KPCA. We demonstrate that our method allows for effective hierarchical data exploration, with the ability to separate the key generative factors of the input data both for large datasets and when few training samples are available. Overall, DKPCA can facilitate the extraction of useful patterns from high-dimensional data by learning more informative features organized in different levels, giving diversified aspects to explore the variation factors in the data, while maintaining a simple mathematical formulation.


Introduction
Principal Component Analysis (PCA) is a popular technique for dimensionality reduction [1] and has been widely used in many fields [2]. In fact, high-dimensional data are very common in data science when multiple variables are used to describe one sample; e.g., in biology, PCA has been applied to mass spectrometry, where thousands of proteins can be quantitatively profiled [3]. PCA learns the most effective principal components to successfully reduce the dimensionality of the data while retaining most of the trends and patterns. This relies on the assumption that the given observations lie in a lower-dimensional linear subspace. Under this assumption, PCA seeks the best low-rank representation of the given data. PCA can be efficiently computed using the Singular Value Decomposition (SVD) and is optimal when data are corrupted by small Gaussian noise [4]. Real-world data commonly show nonlinear relationships, so, for nonlinear problems, PCA can be extended to Kernel PCA (KPCA) [5], which manages to simplify such complexity and high dimensionality to extract useful patterns in nonlinear subspaces. KPCA first maps the inputs to a high-dimensional feature space and then applies PCA to the mapped features, either through nonlinear feature mappings in the primal or, equivalently, kernel functions in the dual. In the Lagrange dual formulation of KPCA, the feature map does not need to be explicitly defined; positive-definite kernel functions are instead used by Mercer's theorem [6].
In deep learning, dimensionality reduction and learning informative features are also widely studied through latent space models, such as Variational Autoencoders (VAEs) [7], which have become popular tools to extract latent features describing the factors of variation in the given training distribution. These models assume that there exists a prior distribution p(z) over a small number of ground-truth factors of variation, such that an observation x is obtained by first sampling z from p(z) and then sampling from a conditional distribution p(x|z). In this setting, the goal is to find a representation of the data that learns the factors of variation in z independently, i.e., that disentangles the factors of variation. State-of-the-art models for disentangled feature learning include InfoGAN [8], Restricted Boltzmann Machines [9,10], β-VAE [11], and its variants [12,13]. For instance, in β-VAE, p(z) = N(0, I) and the encoder q(z|x) is matched to the prior p(z) by minimizing the Kullback-Leibler divergence D_KL(q(z|x)||p(z)). Neural networks are used to model the generative model with probabilistic encoder q(z|x) and decoder p(x|z) [7]. A recent large-scale experimental study has shown that the performance of VAE-based models varies greatly with random initialization, hyperparameters, and dataset, so reliable extraction of independent components describing the variation factors of data remains challenging [14].
While (K)PCA has been widely used in science and industry, the modelling flexibility of a single feature mapping or kernel function can be insufficient, and (K)PCA also cannot learn well-disentangled representations [11]. For such feature-learning tools, disentanglement of the variation factors (components) in the data is highly desirable [15,16], and it has been suggested that disentangled representations can benefit interpretation analysis, e.g., in the medical domain [17,18]. For instance, a model trained on gene expression data may learn components such as the cell type or the cell state. In addition, because (K)PCA is a shallow model employing a single feature mapping, it learns only one flat level of components. On the other hand, deep learning has achieved pervasive empirical success with great modelling flexibility [19], but a framework combining deep architectures and principal component analysis remains lacking.
Deep kernel learning tackles multiple latent spaces for greater flexibility, more informative hierarchical investigation of the data, and kernel-based interpretations. Many works in deep kernel learning consider supervised learning (see [20] and references therein), but little attention has been devoted to unsupervised settings, though a concatenation of operator-valued kernel layers was considered for data autoencoding in [21]. In [22], it is proposed to conduct shallow PCA to extract principal components, which are then fed to another KPCA, where each KPCA independently and sequentially optimizes its own variance maximization objective. Importantly, when extending to deep architectures, [23] warns that simply performing sequential kernel learning is not enough to achieve good accuracy due to the lack of backward feature correction, meaning that shallow layers need to use the information from deeper layers to boost their own learned representation. In [23], it is proved that hierarchical learning cannot be efficiently achieved without backward feature correction.
In this paper, we establish a novel Deep Kernel Principal Component Analysis (DKPCA) framework with the following main aspects.
• DKPCA presents multiple levels of principal components associated with the key properties of the data for more informative feature learning in multiple subspaces. The objective of each level is attained as an upper bound of a shallow KPCA problem, and multiple levels are constructed by coupling the latent space of level j − 1 (j ≥ 2) with the input space of level j, where the depth is given by the learned spaces directly relating to the principal components, as shown in Fig. 1. We derive that the optimization problem of our method explicitly formulates a set of nonlinear equations for each level resembling an eigenvalue problem of some matrix M_j, in contrast with black-box optimization in deep learning.

• Interestingly, M_j fuses the hidden features of the previous and subsequent levels. This means that the proposed deep architecture introduces not only forward couplings between the levels, but also backward couplings, which thus far have not been explored in kernel methods and yet are crucial for effective hierarchical representation learning according to the theoretical analysis in [23]. As the levels are coupled together, we formulate a multi-level constrained optimization problem with an eigenvalue problem at each level with hidden features as optimization variables, also facilitating deep approximation analysis of the given data.

• The solution of the proposed optimization process gives both the deep eigenvectors and the deep eigenvalues of the DKPCA: they correspond to the solution of the eigenvalue problem of each level. Within the considered deep architecture, we then construct a generative procedure for the DKPCA by defining both an out-of-sample encoding scheme and a decoding procedure, discussing connections with Autoencoders. The generative procedure generates new samples from multiple latent spaces in different levels, makes it possible to explore the role of the deep eigenvectors of each level through latent space traversals, and gives diversified aspects to explore the variation factors of data. Our method can also be implemented with out-of-sample extensions, which allow to efficiently tackle large-scale cases.
Extensive numerical experiments demonstrate the efficacy and advantages of the proposed DKPCA from different aspects and in different tasks on multiple data types. 1) DKPCA gives higher explained variance than shallow KPCA, indicating that more information is captured in fewer components. We also provide a strategy for practitioners to select the numbers of components and levels, which is in contrast with typical deep learning tools that use trial-and-error strategies in determining the network structure. 2) DKPCA effectively facilitates hierarchical data exploration, as the role of each principal component in each level can be investigated through the generation of new data. In images of 3D objects with different generative factors (i.e., colors, size, etc.), our deep method creates a learning hierarchy in the components in each level. Prevailing features are typically learned in the shallower levels, e.g., colors, while the deeper levels capture more subtle features, e.g., the specific object shape. 3) Quantitative performances are evaluated by comparing to state-of-the-art methods in disentangled feature learning [11,13,12] when few training samples are available, which is of particular interest in many real-world problems where data are difficult or expensive to collect. 4) We show that the more informative feature extraction by DKPCA can be applied to multiple data types, benefiting various downstream tasks in data science, such as regression and classification.

Figure 1: Topology of the RKM-based deep KPCA with n_L levels. An input vector x is mapped to the feature space of the first level using a feature map ϕ_1, with hidden features h^(1) in the latent space of the first level. Subsequently, the input of level j, with feature map ϕ_j, is the hidden features of level j − 1.

Background
This section describes the shallow KPCA problem, introducing its formulation in the RKM framework through the Fenchel-Young inequality. The RKM formulation of KPCA gives another expression of the Least-Squares Support Vector Machine (LS-SVM) KPCA problem [24] with visible and hidden units, similar to the energy of Restricted Boltzmann Machines (RBMs) [15,25,10,26]. In this formulation, contrary to RBMs, both the visible units and the hidden units can be continuous. To derive this formulation, consider training data D = {x_i}_{i=1}^N with x_i ∈ R^d, a feature map ϕ : R^d → R^{d_F}, and let s be the number of selected principal components. In the LS-SVM setting, the KPCA problem can be written as minimizing a regularization term and finding directions of maximum variance [27]:

J_kpca = (η/2) Tr(W⊤W) − (1/2) Σ_{i=1}^N e_i⊤ Λ^{−1} e_i,   s.t.   e_i = W⊤ϕ(x_i),   (1)

where W ∈ R^{d_F×s} is the interconnection matrix, e_i ∈ R^s are the score variables along the selected s projection directions, and Λ = diag{λ_1, ..., λ_s} ≻ 0, η > 0 are regularization hyperparameters.
The RKM formulation of KPCA [28] is given by an upper bound of J_kpca obtained component-wise with the Fenchel-Young inequality

(1/(2λ)) e² + (λ/2) h² ≥ e h,   ∀e, h ∈ R,

which introduces the hidden features h and leads to the following objective with conjugate feature duality:

J_kpca(W, h_i) = Σ_{i=1}^N ( −ϕ(x_i)⊤ W h_i + (1/2) h_i⊤ Λ h_i ) + (η/2) Tr(W⊤W),   (2)

where h_i ∈ R^s are the conjugated hidden features corresponding to each training sample x_i; in representation learning, h_i is also known as the latent representation of x_i, consisting of s latent variables or hidden features. Note that the first term of (2) is similar to the energy of an RBM, with connections between visible units x_i in the input space and hidden units h_i in the latent space. The stationary point conditions of J_kpca(W, h_i) are given by:

∂J_kpca/∂h_i = 0  ⇒  W⊤ϕ(x_i) = Λ h_i,    ∂J_kpca/∂W = 0  ⇒  W = (1/η) Σ_{i=1}^N ϕ(x_i) h_i⊤.   (3)

Eliminating W and considering a positive definite kernel function k : R^d × R^d → R, the stationary points of J_kpca(W, h_i) are given in the dual by the following eigenvalue problem:

(1/η) K H = H Λ,   (4)

where K ∈ R^{N×N} denotes the kernel matrix induced by k(·, ·) and the matrix H = [h_1, ..., h_N]⊤ ∈ R^{N×s} incorporates the conjugate hidden features for all N data points. In (4), the hidden features H conjugated along s projection directions now correspond to the first s eigenvectors, with the first s eigenvalues corresponding to the hyperparameter Λ in (2). Meanwhile, η becomes a scaling coefficient that does not change the solution space, and thus can simply be set to 1.
Note that, in the conjugate feature duality of RKMs, the dual variables h correspond to the latent variables playing the role of hidden features living in the latent space.
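As a concrete illustration, the dual eigenvalue problem (1/η) K H = H Λ can be solved with an off-the-shelf symmetric eigendecomposition. The sketch below is a minimal, assumption-laden implementation (RBF kernel, η = 1, no kernel centering, all choices illustrative), not the paper's code:

```python
import numpy as np

def rbf_kernel(X, Y, sigma=1.0):
    # k(x, y) = exp(-||x - y||^2 / (2 sigma^2)); an illustrative kernel choice.
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def shallow_kpca(X, s, eta=1.0, sigma=1.0):
    """Solve the RKM-KPCA dual (1/eta) K H = H Lambda for the top-s components."""
    K = rbf_kernel(X, X, sigma)
    w, V = np.linalg.eigh(K / eta)        # eigh returns ascending eigenvalues
    idx = np.argsort(w)[::-1][:s]         # indices of the s largest
    H, lam = V[:, idx], w[idx]            # hidden features and eigenvalues
    return H, lam, K

X = np.random.default_rng(0).normal(size=(50, 5))
H, lam, K = shallow_kpca(X, s=3)
# The eigenvectors are orthonormal: H^T H = I_s.
print(np.allclose(H.T @ H, np.eye(3)))    # True
# H solves the eigenvalue problem K H = H diag(lam) (with eta = 1).
print(np.allclose(K @ H, H * lam))        # True
```

The s largest eigenvalues play the role of Λ, consistent with the correspondence described above.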
The dual problem (4) corresponds to the kernel PCA problem as defined in [5]. While (4) is regularized by normalizing the eigenvectors to the unit ball in feature space, the primal problem (2) is explicitly regularized with coefficients Λ, η chosen at the hyperparameter selection level. Each eigenvalue/eigenvector pair corresponds to a principal component in KPCA. Therefore, for the first s principal components, one can solve the dual problem (4) by considering the s largest eigenvalues and their eigenvectors, which lead to J_kpca = 0. Since J_kpca is unbounded below regarding its optimization in the primal, [28] proposed to instead minimize a stabilized version, J_kpca,stab = J_kpca + (c_stab/2) J_kpca², where c_stab > 0 is a hyperparameter, to make the objective suitable for minimization. It can be shown that J_kpca and J_kpca,stab share the same stationary points [29]. Deep kernel methods based on the RKM framework were considered in [28,30]. In [28], the KPCA levels are used as feature extractors for regression and classification. For these supervised learning tasks, [28] described a heuristic algorithm for the case of linear kernels with a level-wise forward phase, while the backward phase is only considered from the last level to the first one, discarding backward connections of all intermediate levels. Furthermore, [28] did not deal with the interpretation of the induced eigenvalues in the deep RKM; in this work, we detail the role of different eigenvalues in relation to the importance of each level and its principal components. In [30], a two-level architecture for unsupervised learning was considered with orthogonality constraints on the latent variables within each level and between the levels. By formulating the constraints as a penalty term in the objective, a straightforward numerical approach was employed to solve the resulting unconstrained optimization problem, where the backward couplings between the hidden features were omitted. Though cast in the RKM
framework, in this paper we consider more general deep KPCA architectures with multiple levels and latent spaces through the lens of a set of level-wise shallow KPCA problems; importantly, both forward and backward dependencies between levels are involved. Thanks to this new problem formulation, novel training schemes are proposed together with theoretical error bounds, a generative model, and the out-of-sample extension, demonstrating empirical evidence of the advantages of our deep architectures and facilitating interpretation of the obtained deep principal components.
Another form of deep KPCA was proposed in [22], where PCA was first conducted to extract principal components of the data, and further dimensionality reduction was then sequentially applied to the features extracted by the previous (K)PCA layer. This serial approach lets each layer straightforwardly optimize its own variance maximization objective, independently of the other layers.
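To make the contrast concrete, the serial approach of [22] can be sketched as a purely forward stack of independent KPCA levels; the kernel choice, level sizes, and score scaling below are illustrative assumptions, not the original authors' implementation:

```python
import numpy as np

def kernel_pca_level(Z, s, sigma=1.0):
    # One independent KPCA level: top-s eigenpairs of the RBF kernel matrix of Z.
    d2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    K = np.exp(-d2 / (2 * sigma ** 2))
    w, V = np.linalg.eigh(K)
    idx = np.argsort(w)[::-1][:s]
    return V[:, idx] * np.sqrt(w[idx])    # scores of the s principal components

def serial_deep_kpca(X, sizes):
    # Purely forward stacking: level j sees only level j-1's output,
    # with no backward coupling between the levels.
    Z, levels = X, []
    for s in sizes:
        Z = kernel_pca_level(Z, s)
        levels.append(Z)
    return levels

X = np.random.default_rng(1).normal(size=(40, 8))
levels = serial_deep_kpca(X, sizes=[5, 3])
print([Z.shape for Z in levels])          # [(40, 5), (40, 3)]
```

Each call to `kernel_pca_level` is a self-contained eigendecomposition, which is exactly the lack of backward feature correction that [23] identifies as a limitation.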

Deep Kernel Principal Component Analysis
In this section, we present the proposed DKPCA. We start by describing the model formulation of DKPCA. Next, we derive the optimization algorithm. Finally, the generative DKPCA model is introduced.

DKPCA Model Formulation
We construct the objective function of DKPCA by joining the KPCA objectives of multiple levels in the Restricted Kernel Machine (RKM) framework [28], which combines the flexibility of deep architectures and the interpretations rooted in kernel methods. DKPCA considers general architectures consisting of n_levels (n_levels ≥ 2) KPCA levels stacked in the corresponding latent spaces, i.e., the hidden features of level j are the input of level j + 1, inducing inter-level couplings, similar to stacked Autoencoders [15]. Correspondingly, the objective for the proposed DKPCA is formulated in the primal model representation:

J = Σ_{i=1}^N ( −ϕ_1(x_i)⊤ W_1 h_i^(1) + (1/2) h_i^(1)⊤ Λ_1 h_i^(1) ) + (η_1/2) Tr(W_1⊤W_1) + Σ_{j=2}^{n_levels} [ Σ_{i=1}^N ( −ϕ_j(h_i^(j−1))⊤ W_j h_i^(j) + (1/2) h_i^(j)⊤ Λ_j h_i^(j) ) + (η_j/2) Tr(W_j⊤W_j) ].   (5)

The feature map ϕ_1 : R^d → R^{d_{F_1}} of the first level takes the original data as input, while ϕ_j : R^{s_{j−1}} → R^{d_{F_j}} is the feature map of level j = 2, ..., n_levels, which takes the hidden features h_i^(j−1) of level j − 1 as input, and W_j ∈ R^{d_{F_j}×s_j} is the interconnection matrix of level j. Here, the matrix H_j = [h_1^(j), ..., h_N^(j)]⊤ ∈ R^{N×s_j} incorporates the hidden features conjugated along s_j projection directions for all N data points, where s_j is the number of principal components selected by the j-th level of our DKPCA. In the primal formulation, Λ_j = diag{λ_1^(j), ..., λ_{s_j}^(j)} and η_j ≠ 0 both serve as hyperparameters of level j. While η > 0 is required in the shallow KPCA case for variance maximization in (1), this constraint is not required in the deep objective (5), which has complex inter-level couplings. Note that, in our DKPCA formulation, the visible units x_i in the input space are conjugated with the multi-level hidden features h_i^(j) in the latent space of each level j, giving an energy function that resembles the deep Boltzmann machine [31]. The DKPCA topology in its primal formulation is visualized in Fig. 1.

Figure 2: Graphical illustration of the DKPCA dual problem (6) with n_L levels. Each arrow goes from the level that is characterized by the corresponding hidden features to the level where it is used as input. DKPCA introduces not only forward couplings (green arrows), but also backward couplings (blue arrows) between the levels. For simplicity, η_j = 1 in the diagram.
The projection directions of shallow (K)PCA are uncorrelated due to the orthogonality of different principal components, as in (4). Similarly, for DKPCA we impose intra-level orthogonality on H_j, i.e., H_j⊤ H_j = I. From the stationary points of (5), the formulation of DKPCA in the dual variables is:

Level j = 1, ..., n_levels − 1:   (1/η_j) [ K_j H_j + G_j(H_j, H_{j+1}) ] = H_j Λ_j,
Level n_levels:   (1/η_{n_levels}) K_{n_levels} H_{n_levels} = H_{n_levels} Λ_{n_levels}.   (6)

A graphical illustration of (6) is given in Fig. 2. The kernel matrices are obtained as follows: (K_1)_{il} = k_1(x_i, x_l) for the first level, and (K_j)_{il} = k_j(h_i^(j−1), h_l^(j−1)), where k_j is the kernel function of level j = 2, ..., n_levels, by the kernel trick. Instead of first defining a feature map ϕ_j, one can simply choose a positive definite kernel k_j: by Mercer's theorem [6], there is guaranteed to exist a feature map ϕ_j such that k_j(y, z) = ϕ_j(y)⊤ ϕ_j(z). The stationary points of (5) can be found in A.1.
In (6), G_j(H_j, H_{j+1}) ∈ R^{N×s_j} are matrices jointly depending on the conjugated hidden features H_j and H_{j+1}; their explicit expression is derived in A.1.
Below, two examples illustrate the derivation of G_j for different choices of the kernel function.
Example 3.1 (Linear kernel). In the case of a linear kernel k_j(z, y) = z⊤y, we obtain J_{κ_j,i} = H_{j−1}. In two-level architectures with k_2(z, y) = z⊤y, G_1 has a linear dependency on H_1, and the eigendecomposition for the first level can be written in the form (1/η_1)(K_1 + (1/η_2) H_2 H_2⊤) H_1 = H_1 Λ_1. For nonlinear kernels, the partial derivatives of K_j with respect to the hidden features enter G_j; they involve the all-ones vector 1 and K_j:i, the i-th column of K_j (see A.1).
The derivations of the dual formulations show that Λ_j relates to the first s_j eigenvalues corresponding to the s_j eigenvectors H_j in the optimization of DKPCA, indicating that all the pairs (Λ_j, H_j), j = 1, ..., n_levels, solving the dual problem constitute a pool of candidate solutions that lead to J = 0 in the primal objective (5). Thus, the regularization hyperparameters Λ_j in the primal are automatically determined in the dual by the solutions of (6). The obtained H_j and Λ_j, j = 1, ..., n_levels, are named the deep eigenvectors and deep eigenvalues of DKPCA, respectively. The dual problem of DKPCA in each level is interpreted as an eigenvalue problem, giving the conjugated hidden features (principal components) H_j solved by the deep eigenvectors corresponding to level j. The existing (shallow) KPCA is a special case of DKPCA with n_levels = 1, where Λ_1 degenerates to the first s_1 eigenvalues corresponding to the s_1 eigenvectors (principal components) H_1 of the kernel matrix K_1.

Optimization Algorithm
For general positive definite kernels k_j, (6) is interpreted as a set of eigendecompositions with optimization variables H_j coupled with the previous and subsequent levels. On the algorithmic side, we propose to train DKPCA by residual minimization of (6), which considers the orthogonality constraints on the intra-level hidden features and results in the following constrained optimization problem:

min_{H_1, ..., H_{n_levels}}  J := Σ_{j=1}^{n_levels} ‖R_j‖_F²   s.t.   H_j⊤ H_j = I,  j = 1, ..., n_levels,   (9)

where J denotes the optimization objective, R_j is the residual of the j-th equation in (6), and the residual error is measured in the Frobenius norm ‖·‖_F. During training, the hidden features not only flow forward from the previous level, but also backward from the subsequent level, as H_j comes from an eigendecomposition depending on H_{j−1} and H_{j+1} in a level-wise fashion.
The constraint set for the hidden features of level j is the Stiefel manifold St(s_j, N) = {H ∈ R^{N×s_j} : H⊤H = I}. Optimization of (9) can be tackled by the Projected Gradient Descent (PGD) algorithm, where the iterates for H_j are specified by

H_j^{k+1} = Π_{St(s_j,N)} ( H_j^k − α_k ∇_{H_j} J(H_j^k) ),

with Π_{St(s_j,N)} the Euclidean projection onto the Stiefel manifold and α_k the stepsize selected via backtracking. The projection is computed using the compact SVD of H_j^k. This algorithm is detailed in Algorithm 1. Since PGD requires the SVD of H_j at each iteration for the projection, it can be computationally expensive for large N and s_j. In this setting, the Riemannian Adam algorithm [32] can be an alternative for this constrained optimization, with a computationally cheaper iteration.
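The projection step can be sketched as follows. This is a minimal illustration of the Stiefel projection via compact SVD and a single PGD iterate, with a random stand-in for the gradient rather than the actual gradient of the residual objective (which the paper's Algorithm 1 computes):

```python
import numpy as np

def proj_stiefel(H):
    # Euclidean projection onto the Stiefel manifold via the compact SVD:
    # if H = U S V^T, the closest matrix with orthonormal columns is U V^T.
    U, _, Vt = np.linalg.svd(H, full_matrices=False)
    return U @ Vt

def pgd_step(H, grad, alpha):
    # One projected gradient descent iterate: gradient step, then projection.
    return proj_stiefel(H - alpha * grad)

rng = np.random.default_rng(2)
H = rng.normal(size=(30, 4))
G = rng.normal(size=(30, 4))    # stand-in gradient, for illustration only
Hp = pgd_step(H, G, alpha=0.1)
# The projection keeps the iterate feasible: Hp^T Hp = I.
print(np.allclose(Hp.T @ Hp, np.eye(4)))    # True
```

The SVD here is on an N × s_j matrix, which is the per-iteration cost that motivates the Riemannian Adam alternative for large N.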

Generative DKPCA
In linear PCA, reconstruction is straightforward via a linear basis transformation, while nonlinear KPCA faces the well-known pre-image problem in reconstruction [33]. The proposed DKPCA employs multiple nonlinear feature maps and consists of multiple latent spaces, posing even greater challenges for reconstruction. We propose a procedure for generative DKPCA from the sampled hidden features h^(j) in the latent spaces with parametric feature maps ϕ_j of each level, which also induce positive definite kernel matrices [27,29]. We also describe how the proposed generative model can facilitate the exploration of the role of the deep eigenvectors of each level.
Given the learned h^(j), we consider a generative objective introducing one term per level to the objective (5) for a point x: (1/2) ϕ_1(x)⊤ ϕ_1(x) for the first level and (1/2) ϕ_j(h^(j−1))⊤ ϕ_j(h^(j−1)) for level j = 2, ..., n_levels. By the characterization of the stationary points given in A.2, a new point x is generated through the inverse maps of the multiple levels:

x = ϕ_1^{−1}( W_1 a^(2) ),   (10)

such that a^(j) = ϕ_j^{−1}( W_j a^(j+1) ), j = 2, ..., n_levels, and a^(n_levels+1) = h^(n_levels), where ϕ_j is invertible with inverse map denoted ϕ_j^{−1}. Note that (10) has a similar structure to the decoder of an Autoencoder architecture. This process is visualized in Fig. 3.
In practice, it is particularly useful to employ parametric feature maps, as they can learn to map high-dimensional complex data from the unknown training distribution well. For instance, in computer vision tasks one can define a convolutional neural network as the feature map ϕ_1. A transposed convolutional network ψ_1 is used in the generation formula (10) to approximate the inverse map ϕ_1^{−1}. In such cases, when the inverse map ϕ_1^{−1} is not explicitly known in advance, one can employ a learnable pre-image map to approximate the inverse map, which resembles the decoder part of an Autoencoder architecture. Thus, we add the reconstruction error, e.g., the squared ℓ2 loss, to the optimization objective J in (9) for the learning of the inverse feature map ψ_1. The full objective is thereby cast as J + γ (1/N) Σ_{i=1}^N L_i, where ψ_1 is the learnable pre-image map that approximates the inverse map ϕ_1^{−1}, L_i is the reconstruction error of sample x_i, and γ > 0 balances the reconstruction error and the residual minimization. Besides H_j and Λ_j in J, the network parameters of ϕ_1 and ψ_1 also need to be learned. In this optimization problem, an alternating update scheme is adopted: the Adam optimizer [34] is used to update the parameters of ϕ_1 and ψ_1 while keeping the deep eigenvectors and eigenvalues fixed; the hidden features H_j and the corresponding eigenvalues Λ_j are updated using the DKPCA training algorithm described in Section 3.2 with ϕ_1 and ψ_1 fixed.
In this case, the optimization of the proposed generative model includes both the latent variables h_i^(j) in the dual and the explicit feature map ϕ_1 in the primal. This combination allows both the coupling of the levels in the latent variables of each level and deep, powerful parametric feature maps better suited for more complex tasks. The deep architecture of DKPCA consists of feature maps over multiple levels, where depth is given both by multiple KPCA levels and by feature maps possibly consisting of multi-layered neural networks. This generative model resolves the pre-image problem in reconstruction and also enables obtaining new data corresponding to any sampling in the multiple latent spaces. For reconstruction, given any input, its hidden features (principal components) in the latent spaces are first computed and then fed to the inverse feature maps for reconstruction in the original input space. For generation, given any sampling in the latent spaces, the corresponding samples in the input space can be obtained through the inverse feature maps using (10). This makes it viable to explore the role of the deep eigenvectors relating to the principal components of each level, i.e., generation from newly sampled latent variables can be investigated by changing only one latent variable (principal component) at a time, performing traversals over these latent variables.
Besides, DKPCA also admits an out-of-sample extension, which allows predicting on unseen input data without retraining. This property is of particular interest in large-scale cases for such unsupervised settings [27], as a subset of M ≪ N samples can be used for efficient training and the remaining N − M samples can be predicted through the out-of-sample extension, as detailed in A.2. In this way, the storage complexities for the kernel matrices and the hidden feature matrices of level j decrease from O(N²) and O(N s_j) to O(M²) and O(M s_j), respectively. One approach to the subset selection is to take a random subsample of M data points for training, which balances efficiency and accuracy well, as evaluated in Fig. 8. One can also use more sophisticated selection schemes, such as quadratic Rényi entropy [35] or leverage score sampling [36]. The optimal selection strategy is nevertheless data-dependent in practical applications [37,38].
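For the first level with η = 1, the stationarity conditions imply an out-of-sample score of the form h* = Λ^{−1} H⊤ k(X_M, x*); the general multi-level scheme is in A.2. The following numpy sketch of subset training plus first-level out-of-sample prediction is illustrative (RBF kernel, random subset, all parameters assumed):

```python
import numpy as np

def rbf(X, Y, sigma=1.0):
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 6))
M = 50                                  # train on a random subset of M << N points
sub = rng.choice(len(X), size=M, replace=False)
Xm = X[sub]

K = rbf(Xm, Xm)                         # only an M x M kernel matrix is stored
w, V = np.linalg.eigh(K)
idx = np.argsort(w)[::-1][:3]
H, lam = V[:, idx], w[idx]

# Out-of-sample score for a new point x*: from W = sum_i phi(x_i) h_i^T (eta = 1)
# and Lambda h* = W^T phi(x*), we get h* = Lambda^{-1} H^T k(X_M, x*).
Kx = rbf(X, Xm)                         # kernels between all points and the subset
H_all = (Kx @ H) / lam
# On the training subset this reproduces the eigenvectors exactly.
print(np.allclose(H_all[sub], H))       # True
```

Only the M × M kernel block is ever formed, matching the O(M²) storage complexity noted above.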

Analytical Findings
In this section, we first show that the optimization problem of our method explicitly formulates a set of nonlinear equations for each level resembling an eigenvalue problem of some matrix M_j fusing the principal components of the previous and subsequent levels, i.e., DKPCA introduces not only forward couplings, but also backward couplings between the levels. Further, we illustrate that the additional levels act as a regularization on the first level. Then, we apply the Eckart-Young theorem to the deep kernel machine to obtain approximation error bounds on the kernel matrix of the given data. Finally, we show conditions under which the explained variance of DKPCA is strictly greater than that of KPCA.

Forward and Backward Couplings between Levels
The equations in (6) give the level-wise eigendecomposition interpretation of DKPCA, in which the forward and backward couplings between levels are embodied. The first level resembles the eigendecomposition of the regularized kernel matrix M_1 of the given data; the last level is the eigendecomposition of the symmetric matrix M_{n_levels} = (1/η_{n_levels}) K_{n_levels}; the intermediate levels j = 2, ..., n_levels − 1 are related to the eigendecomposition of the corresponding M_j, with deep eigenvectors H_j and deep eigenvalues Λ_j. Fig. 2 visualizes this process.
The optimization of DKPCA discussed in Section 3.2 is interpreted as a set of n_levels eigendecomposition problems, each of which (in H_j) depends on the hidden features of both the previous (H_{j−1}) and subsequent (H_{j+1}) levels. In this way, information flows not only forward but also backward in the learning process, as M_j depends on both H_{j−1} and H_{j+1}. This is an important property, as previous theoretical works in deep learning such as [23] stressed that forward propagation alone in a level-wise fashion is not enough to learn efficient deep architectures; the levels also need to be coupled in backward directions so that the more abstract representations of subsequent levels can be utilized to improve the learning of the current level. With the forward and backward couplings between levels, the eigenvalue problems in (6) cannot be solved independently in series, which motivates the DKPCA training algorithm by residual minimization of the set of nonlinear equations (6) described in Section 3.2.

Deep Approximation Analysis
For theoretical analysis, we consider the two-level DKPCA with k_2(z, y) = z⊤y, as the optimization can then be simplified. In this case, the set of nonlinear equations (6) reduces to

Level 1:   (1/η_1) ( K_1 + (1/η_2) H_2 H_2⊤ ) H_1 = H_1 Λ_1,
Level 2:   (1/η_2) H_1 H_1⊤ H_2 = H_2 Λ_2,   (13)

where H_1 and H_2 are implemented as the eigenvectors in Level 1 and Level 2, respectively, and the first level performs KPCA of the regularized kernel matrix K_1 + (1/η_2) H_2 H_2⊤. Here, the second level can be regarded as playing a regularization role: it leads to a regularized K_1 with regularization constant 1/η_2. Note that H_2 is unknown a priori, so one has to solve the set of nonlinear equations (13) for both levels, rather than first solving the eigenvalue problem for level 1 and then for level 2, reflecting the forward and backward dependency.
We analyze approximation error bounds for the conceived two-level architectures through the Eckart-Young theorem [39], as both of the matrices to be factorized are symmetric, providing additional insights into DKPCA.

Lemma 4.1 (Error bounds). Applying the Eckart-Young theorem to both levels in (13) with orthonormality constraints, the Frobenius error of the deep rank-s_1 approximation of K_1 is bounded below by the ℓ2 norm of the remaining eigenvalues, ( Σ_{i=s_1+1}^{r_1} λ̃_i² )^{1/2}, for s_1 ≤ r_1, where r_1 = rank( K_1 + (1/η_2) H_2 H_2⊤ ) and λ̃_i is the i-th largest eigenvalue of K_1 + (1/η_2) H_2 H_2⊤.

Lemma 4.1 gives the error of approximating the data kernel matrix K_1 with the low-rank matrix of hidden features H_1 of the first level as a lower bound depending on the remaining eigenvalues of K_1 regularized with the matrix of hidden features H_2 of the second level. The smaller η_2, the greater the effect of the second level. On the other hand, a very large η_2 indicates high regularization on the second level, reducing its effect, in which case the deep architecture behaves like a shallow low-rank approximation. If the number of columns s_1 of the approximating matrix is greater than the rank r_1 of the matrix to be approximated, one can choose s_1 = r_1, achieving an error-free approximation. See Fig. 4 for numerical evaluation and A.3.1 for the proof.
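The flavor of Lemma 4.1 can be checked numerically: for a symmetric positive semidefinite matrix, the Frobenius error of the best rank-s_1 truncated eigendecomposition equals the ℓ2 norm of the remaining eigenvalues. The random H_2 and RBF K_1 below are illustrative stand-ins, not a trained model:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(60, 5))
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K1 = np.exp(-d2 / 2)                        # RBF kernel matrix of the data

H2, _ = np.linalg.qr(rng.normal(size=(60, 3)))  # orthonormal second-level features
eta2 = 0.5
Kreg = K1 + H2 @ H2.T / eta2                # regularized kernel matrix of level 1

w, V = np.linalg.eigh(Kreg)
w, V = w[::-1], V[:, ::-1]                  # sort eigenpairs in descending order
s1 = 4
approx = V[:, :s1] @ np.diag(w[:s1]) @ V[:, :s1].T

# Eckart-Young: the Frobenius error of the best rank-s1 approximation
# equals the l2 norm of the remaining eigenvalues.
err = np.linalg.norm(Kreg - approx, "fro")
tail = np.sqrt((w[s1:] ** 2).sum())
print(np.isclose(err, tail))                # True
```

Decreasing `eta2` inflates the H_2-dependent term, mirroring the observation that a smaller η_2 increases the effect of the second level.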
In the next Lemma, we study the cumulative explained variance given by the principal components of the considered two-level DKPCA with comparisons to shallow KPCA, analytically showing the higher explained variance of DKPCA.
where λ_i > 0 is the i-th largest eigenvalue of the kernel matrix K_1, which is taken positive-definite, and λ̄_i is the i-th largest eigenvalue of K_1 + (1/η_2) H_2 H_2^T, for all 1 ≤ n < N. The above Lemma gives conditions on η_2 under which the considered two-level DKPCA is advantageous compared to shallow KPCA in terms of the explained variance of the first n principal components. When choosing η_2 < −1/λ_N, where λ_N is the smallest eigenvalue of the data kernel matrix, the cumulative variance explained by the first n components of the first DKPCA level is strictly greater than the variance explained by the first n components of shallow KPCA. In other words, DKPCA can capture more information in fewer components. See the next Section for the associated numerical experiments and A.3.2 for the proof.
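A quick numerical illustration of this Lemma is possible in the full-decomposition case, where the orthogonality of H_2 gives H_2 H_2^T = I, so the deep eigenvalues are simply λ_i + 1/η_2. The following numpy sketch (with a randomly generated positive-definite K_1, purely for illustration) checks that a choice η_2 < −1/λ_N strictly increases the cumulative explained variance ratio of the top n components for every n:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((10, 30))
K1 = A @ A.T                        # random positive-definite kernel matrix
lam = np.linalg.eigvalsh(K1)[::-1]  # eigenvalues, descending
eta2 = -1.5 / lam[-1]               # satisfies eta2 < -1/lambda_N
lam_deep = lam + 1.0 / eta2         # spectrum of K1 + (1/eta2) H2 H2^T with H2 H2^T = I
for n in range(1, len(lam)):
    shallow = lam[:n].sum() / lam.sum()
    deep = lam_deep[:n].sum() / lam_deep.sum()
    assert deep > shallow           # level 1 of DKPCA explains more variance for every n
```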

Numerical Experiments
We present a series of experiments to assess and explore DKPCA, showing the efficacy and advantages of the proposed deep method from different aspects in the following subsections. DKPCA is implemented in Python using the PyTorch library. The code is available at https://github.com/taralloc/deepkpca, where all datasets used in this study and the setup details are publicly available and described in the repository.
Datasets  Both synthetic and real-world data are used to assess the proposed method with empirical evidence. Three synthetic datasets are presented: a 2D square dataset (Synth 1), a complex 2D dataset consisting of one square, two spirals, and one ring (Synth 2), and a 140-dimensional multivariate Normal dataset (Synth 3), whose samples are drawn randomly from mixed Gaussian distributions. For real-world data, we consider MNIST [40], 3DShapes [12], Cars3D [41], and SmallNORB [42]. In particular, we evaluate disentanglement on 3DShapes, Cars3D, and SmallNORB, which are popular benchmarks for evaluating variation factors.
Evaluation metrics and compared methods  Different related unsupervised learning methods are adopted to comprehensively evaluate DKPCA. A comparison to shallow KPCA is presented on multiple aspects of the learned principal components. We also consider the state-of-the-art methods β-VAE [11], FactorVAE [12], and β-TCVAE [13] for general disentangled feature learning. We keep the same encoder ϕ_1 and decoder ψ_1 architecture for all methods. For quantitative evaluations, we employ the IRS metric [43], where a higher value indicates better robustness to changes in the variation factors. The hyperparameters shared among all methods are fixed to the same values. For the model-specific hyperparameters, we use the values suggested in the respective papers. It is worth mentioning that the compared methods are sensitive to hyperparameter selection, as shown in [14]. Our method does not suffer from this issue, as Λ_j is automatically determined by the solution of the deep KPCA problem and η_j is a scaling factor fixed to 1. More details of the setups are given in B.1.

DKPCA Provides Interpretable Deep Principal Components
This part examines the roles of each individual deep principal component and of the components in each level. Contrary to shallow KPCA, which has a single set of eigenvectors/eigenvalues, DKPCA has one set of eigenvectors/eigenvalues per level. Thus, the features can be represented in a more hierarchical way that benefits interpretability. In fact, via the proposed deep generative procedure (10), hidden features can be sampled and their pre-image mappings to the input space computed. By traversing the latent space along specific dimensions, i.e., varying a single deep principal component while keeping the others fixed and generating the corresponding sample in input space, one can observe what each component learns. In DKPCA, with the extracted deep eigenvectors H_j, the model can well disentangle the factors of variation in the data. This is verified quantitatively and qualitatively by comparing the traversals on the learned principal components with the state-of-the-art FactorVAE.
Notably, we show that DKPCA effectively facilitates hierarchical data exploration, as the role of each principal component in each level can be investigated through the generation of new data. Specifically, we consider images of 3D objects with different generative factors, e.g., colors and sizes. For individual components, our method can find new principal components such that, when sampling along one of them, only one generative factor changes, e.g., only the object scale changes, while its color and other factors remain fixed. For the components in each level, our deep method creates a learning hierarchy: prevailing features are typically learned in the shallower levels, e.g., colors, while the deeper levels capture more subtle features, e.g., the specific object shape. Fig. 5 summarizes the main results for 3DShapes. A detailed analysis is given in the following for each principal component in all levels and for each level separately.
Individual principal components  In Fig. 5a and 6, we show the traversals in the latent spaces of a DKPCA with explicit feature maps. Aside from the high visual reconstruction quality in the second row, the other rows show the generated images while traversing along the individual principal component of the first level (h^(1)) or of the second level (h^(2)) of the proposed DKPCA that explains the corresponding generative factor. In FactorVAE, a single latent space is obtained, and the images are generated by traversing along each dimension of that latent space. For instance, in 3DShapes (Fig. 5a), the component in Row 3 captures the factor of wall hue, as both the floor and object hue remain almost constant. In 3DShapes, DKPCA better disentangles the scale of the object, which only slightly varies in FactorVAE. In Cars3D (Fig. 6a), the three factors of elevation, car type, and azimuth are well captured and disentangled by DKPCA, while FactorVAE entangles the learning of azimuth with the two other components of elevation and car type. A similar analysis holds for the other rows, showing that the deep components well capture the factors of variation of the data. Besides, thanks to the eigenvalues Λ_j obtained in the optimization, DKPCA can identify an ordering of the components, providing a way to reflect their relative importance. This cannot be done with the considered VAE-based methods [11, 12, 13].
Principal components in each level  Besides individual components, we further explore the level-wise interpretation of the learned deep principal components in DKPCA. In Fig. 5a, the two components of the first level capture the background, which corresponds to the factors of highest variation, i.e., the wall and floor hue, as they involve the most pixels in the images. The two components of the second level capture subtle characteristics of the object, e.g., scale and orientation, as the deeper components capture generative factors for more detailed information with less variation among samples. In other words, DKPCA learns a hierarchy of abstraction in its deep components, from less abstract, i.e., background, to more abstract, i.e., object. Similar conclusions hold for Cars3D (Fig. 6a): the first level learns the car type, which is the factor of highest variation, while the second level learns more sophisticated factors capturing the elevation and azimuth of the car.
Disentanglement learning  A quantitative evaluation of disentangled feature learning is performed by comparing with the state-of-the-art methods β-VAE [11], FactorVAE [12], and β-TCVAE [13] on the commonly used IRS metric [43]. The studied DKPCA architecture has n_levels = 2, s_1 = s_2 set to the number of generative factors, and the latent representation of a data point x_i is given by the concatenation of h_i^(1) and h_i^(2). The dimension of the latent space of the compared methods is set to s_1 + s_2. Fig. 7 gives the performance evaluation with models trained on a subset of N = 200 samples. The proposed DKPCA shows overall favorable performance for disentanglement learning on the tested datasets, notably outperforming the state-of-the-art VAE-based methods on Cars3D. These advantageous results of DKPCA reflect better sample efficiency in this set of experiments: from only hundreds of data points, DKPCA can learn more disentangled representations than the compared data-hungry deep learning methods. In real-life scenarios, this property can be of particular interest, as training examples might be available in limited quantity or expensive to collect, so models that better capture the true generative factors from a limited number of data are desirable.
DKPCA can be implemented with out-of-sample extensions for large-scale cases by selecting a subset of M ≪ N samples for training and then obtaining the latent representations of the remaining data. To evaluate the performance of DKPCA on the full dataset, in Fig. 8 we evaluate the entire corresponding datasets through out-of-sample extensions using M = 200 samples for training. The results show that a higher mean IRS is attained over all compared methods, which are trained on the full dataset. This comparison further verifies the disentanglement of the hidden features learned by our method, as well as its sample efficiency: only hundreds of samples are needed by DKPCA to effectively learn disentangled representations and outperform the deep learning methods trained on thousands of data points.
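For out-of-sample points, the supplementary material mentions a kernel smoother approach [47] to obtain the hidden features of test points. A minimal sketch of such a smoother, assuming an RBF kernel and weights normalized over the M training samples (the function names and the exact weighting are illustrative assumptions, not the paper's exact estimator):

```python
import numpy as np

def rbf_kernel(X, Y, sigma2=1.0):
    # Pairwise RBF similarities between the rows of X and Y
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma2))

def out_of_sample_features(X_train, H_train, X_test, sigma2=1.0):
    """Kernel-smoother extension: each test point's hidden features are a
    similarity-weighted average of the M training hidden features."""
    K = rbf_kernel(X_test, X_train, sigma2)
    return (K / K.sum(axis=1, keepdims=True)) @ H_train
```

With this scheme only the M × M training kernel matrix is ever stored, matching the O(M^2) storage argument made in the Discussion.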

DKPCA Learns More Informative Features
In this section, we further investigate the features learned by DKPCA. DKPCA gives higher explained variance than shallow KPCA, indicating that more information is captured in fewer components. We therefore show the superiority of DKPCA as a feature extractor for downstream supervised tasks on multiple data types. We also investigate the problem of selecting the number of principal components in each level and the number of levels, providing a selection strategy in an unsupervised setting, in contrast with the typical trial-and-error tuning in deep learning.

Deep eigenvalues  As presented in Section 3.2, deep eigenvalues Λ_j are learned by DKPCA in the different levels j = 1, . . ., n_levels, compared to the single-level Λ of shallow KPCA. We now investigate the learned deep eigenvalues in terms of the percentage of variance explained and compare with shallow KPCA, where the nonlinear case is considered by using the RBF kernel in all levels. Fig. 5b, 9a, and 9b plot the variance explained by each component of DKPCA (orange bars) and of shallow KPCA (blue bars), as well as the cumulative variance explained by DKPCA (orange line) and by shallow KPCA (blue line).
In Fig. 9a for Synth 2 with 30 components in each level, the cumulative explained variance reaches almost 100% after around 10 deep principal components, while the explained variance grows much more slowly in the shallow case. Even though both methods use the same kernels, the first principal component of DKPCA explains around 20% of the variance, compared to only around 8% for KPCA. This experiment shows that our method leads to more informative principal components, ultimately resulting in a more powerful representation in fewer components with the deep architecture. Comparing the deep eigenvalues Λ_1 (solid orange line) of the first level with those Λ_2 (dotted orange line) of the second level, the former shows faster initial growth, while the latter gives a flatter cumulative explained variance. A similar analysis is conducted for 3DShapes in Fig. 5b, while Fig. 9b presents the numerical evaluation of Lemma 4.2 in the two-level DKPCA with an RBF first level and linear second level on Synth 3, with η_2 chosen as the largest value satisfying the conditions of Lemma 4.2.
Additionally, a 4-level DKPCA with 10 principal components in each level is trained on the handwritten digit images of MNIST [44] in Fig. 9c: the first and second levels follow a similar pattern, and each subsequent level shows a flatter curve with increasingly higher explained variance in the top components. The fourth level explains almost the entire variance in the first few components, indicating that four levels are sufficient. In this way, the minimum number of levels needed to fully explain a given dataset can be determined. This observation can also be a useful suggestion for tuning the kernel settings in the different levels: the kernel settings might need to be better tuned when introducing additional levels does not lead to a sufficient increase in explained variance.
Selection of principal components in each level  Contrary to shallow KPCA, different numbers of principal components can be selected for each level in the deep architectures of DKPCA. In this experiment, we train a two-level DKPCA, introduced in Eq. (13), with linear kernels on the synthetic datasets to investigate the influence of the numbers of selected principal components s_1 and s_2 of the first and second levels, respectively. In practice, one would like to select the smallest number of principal components that achieves a sufficiently small reconstruction error, which depends on the specific application, so a general method for selecting s_j is needed for practitioners. This selection can be performed by analyzing the relative importance of each deep component through its explained variance.
A two-dimensional synthetic dataset shaped as a noisy square (Synth 1) serves as an example. As shown in Fig. 10a, the eigenvalues of the first level drop distinctively after the second principal component, and the percentages of explained variance of the first and second components are similar. This is consistent with the ground-truth properties of this two-dimensional dataset. In Fig. 10b, the reconstruction error decreases as s_2 increases and shows its largest drop after the first two components in the second level, where the MSE reaches 0 with s_1 = s_2 = 6. In fact, our method can always achieve 0 reconstruction error in the case of the full decomposition with s_1 = s_2 = N, as also verified on the real-world 3DShapes in Fig. 11. For 3DShapes, the ground-truth number of variation factors is 6, so the cumulative explained variance climbs quickly, as most variance is captured by only a few components. The reconstruction error shows the opposite behavior, dropping sharply after around 10 principal components and reaching 0 for the full decomposition. Such evaluations are conducted in an unsupervised setting, and practitioners can accordingly use them to determine the s_j of the DKPCA architecture for faithful reconstructions.
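The selection rule described above — take the smallest s_j whose top components explain enough variance — can be sketched in a few lines; the 95% default threshold is an assumption for illustration, not a value prescribed by the paper.

```python
import numpy as np

def select_num_components(eigvals, threshold=0.95):
    """Smallest s such that the top-s deep eigenvalues of a level explain
    at least `threshold` of the total variance at that level."""
    lam = np.sort(np.asarray(eigvals, dtype=float))[::-1]
    ratio = np.cumsum(lam) / lam.sum()
    return int(np.searchsorted(ratio, threshold) + 1)

# e.g. a spectrum with two dominant directions, as in the noisy-square example
assert select_num_components([5.0, 4.8, 0.1, 0.05, 0.05], 0.9) == 2
```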
Extracted principal components for downstream tasks  KPCA is often used as a feature extraction step for downstream supervised tasks. Similarly, DKPCA can extract multiple levels of disentangled features that can facilitate different tasks. Specifically, it has been suggested that disentangled features could be useful for supervised downstream problems due to the compact structure of the representation of the input distribution [14]. The following experiments show that DKPCA extracts more informative features that improve the performance of supervised learning compared to shallow KPCA. We feed the concatenation of the deep representations learned by an unsupervised two-level DKPCA to a linear classifier/regressor and compare with shallow (K)PCA with s principal components using the same overall number of components, i.e., s_1 + s_2 = s, on UCI datasets [45].
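A minimal sketch of this pipeline: concatenate the level-wise hidden features and fit a linear model on top. Here a closed-form ridge regressor stands in for the linear classifier/regressor; the function names and the regularization constant are illustrative assumptions.

```python
import numpy as np

def dkpca_features_for_downstream(H1, H2):
    # Concatenate the level-wise deep principal components into one
    # feature vector per sample, as input to a linear model.
    return np.concatenate([H1, H2], axis=1)

def ridge_fit_predict(Z_train, y_train, Z_test, reg=1e-3):
    # Closed-form ridge regression with a bias column.
    Z = np.hstack([Z_train, np.ones((len(Z_train), 1))])
    w = np.linalg.solve(Z.T @ Z + reg * np.eye(Z.shape[1]), Z.T @ y_train)
    Zt = np.hstack([Z_test, np.ones((len(Z_test), 1))])
    return Zt @ w
```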
Results are shown in Table 1. DKPCA outperforms shallow KPCA on all datasets in terms of both accuracy (ACC) and root mean squared error (RMSE). The WINDIN metric [46] evaluates the disentanglement of a representation z when the ground-truth factors of variation are not known: it measures both the informativeness and the separability of the representation through the conditional mutual information between the input x and its latent representation z. DKPCA produces significantly more disentangled representations than KPCA; for instance, on the Liver dataset DKPCA improves the WINDIN by approximately 10 times over KPCA. Overall, DKPCA leads to better supervised performance than KPCA while using the same number of components, showing the improved informativeness of the deep representation, which can more efficiently capture the trends of the data that are most relevant for supervised prediction.
6 Discussion and Conclusion

Discussion
Our proposed DKPCA establishes a novel framework for deep nonlinear principal component analysis by leveraging the RKM formulation. DKPCA exploits the Fenchel-Young inequality to introduce conjugate feature duality, and extends the classical shallow KPCA to multiple levels, where both neural network feature mappings and kernel functions can be adopted in different levels for flexible modelling. In contrast to shallow KPCA, which involves a single eigendecomposition of the kernel matrix, DKPCA gives different eigenvalue problems across levels and yields the so-called deep eigenvectors and deep eigenvalues, as characterized by the stationarity conditions. DKPCA can be applied to general feature learning tasks in place of classical KPCA or VAE-based methods in various applications. Conventional KPCA may need many components to attain a high explained variance, while DKPCA can capture information more efficiently in fewer components. Compared to the black-box optimization of deep learning-based methods, the optimization problem of DKPCA explicitly formulates a set of nonlinear equations for each level resembling an eigenvalue problem.
DKPCA formalizes the couplings between levels in terms of the conjugated hidden features, which play the role of principal components in the latent spaces of the dual formulations. The proposed deep kernel method is not a simple forward level-wise algorithm: the optimization of features also flows backward through the deep architecture, so that components in levels of lower abstraction can benefit from the representations learned in levels of higher abstraction. This property has been theoretically identified as essential for effective hierarchical learning, and yet had not been explored in deep kernel methods. We then devise a multi-level optimization algorithm for DKPCA, where the deep eigenvectors and deep eigenvalues of the level-wise principal components are taken as optimization variables. For the specific case of two-level architectures, the optimization simplifies, with solutions given by the singular vectors in each level, which facilitates theoretical analysis for greater insight: the Eckart-Young theorem is applied to establish approximation bounds, interpreting the role of the second level as a regularizer, and the variance explained by DKPCA is analytically compared with shallow KPCA.
We also develop the generative DKPCA, so that hidden features in multiple levels can be sampled from the latent spaces and the corresponding newly generated data can be obtained. The role of each component or each level can be explored by traversing it in the latent space while keeping the others fixed, providing diversified aspects to explore the meaning of the principal components and the variation factors of the data. The pre-image problem is a well-known challenge in KPCA, and its solution for general cases of multi-level KPCA had not been investigated before. In DKPCA, we incorporate reconstruction errors, minimized to approximate the pre-image feature mappings, so that the reconstruction procedures can be conducted. Compared to the generation and reconstruction in VAE-based methods, DKPCA creates multiple latent spaces, which not only enhances modelling flexibility with deep architectures but also provides multi-level feature learning. Out-of-sample extensions are also possible in DKPCA to predict unseen data. Scalability is a common issue in kernel-based methods, but it can be addressed through the out-of-sample extensions of DKPCA: when a small subset of M samples is used in training and the remaining N − M samples are predicted via out-of-sample extensions, the maximal storage complexity of level j drops from O(N²) to O(M²).

Conclusion
In this paper, the proposed DKPCA introduces a novel deep architecture for unsupervised multi-level feature learning, where deep kernel machines and neural networks can both be exploited. DKPCA realizes forward and backward learning and provides more informative features, enabling the exploration and interpretation of hierarchical feature abstractions. Both theoretical derivations and numerical evaluations verify the effectiveness of DKPCA. The data representations learned by DKPCA can be utilized in various tasks and on different types of data, with promising practical value. In future work, variants of KPCA can be extended to deep architectures for greater efficiency or reliability, such as sparse KPCA and robust KPCA.

A Proofs and Derivations
In this section, mathematical derivations for the modelling, optimization, and analytical properties of the proposed DKPCA are elaborated. DKPCA establishes a novel deep architecture of KPCA, which has long been an important unsupervised feature learning methodology. A.1 provides the formulations leading to the optimization interpreted as a set of eigendecompositions. It demonstrates how DKPCA leverages the RKM formulations bridging neural networks and kernels, enjoying the merits of flexible deep architectures and more interpretable kernel methods. Technical details of the generative modelling are then presented in A.2, showing promising potential for versatile real-world scenarios. Proofs for the lemmas in Section 4.2 are given in A.3, providing further details and insights into the analytical properties of the proposed DKPCA under the considered settings.

A.1 Derivation of DKPCA
The objective (5) of DKPCA in the primal formulation is given by the compositions of latent spaces of multiple levels, and its dual formulation is attained by characterizing the stationary points of (5). By eliminating the weight matrices W_j, one obtains the nonlinear equations (17) in the hidden features h_i^(j), with j = 2, . . ., n_levels − 1 for the middle levels.
By organizing the above (17) into matrix form, the dual formulation of DKPCA in (6) is obtained equivalently.

A.2 Derivation of Generative DKPCA
For the challenging pre-image problem of multi-level nonlinear PCA, we propose a procedure for generative DKPCA from the sampled hidden features h^(j) in the latent spaces with explicit feature maps: the feature map ϕ_j of each level is known and can also be parametric with learnable parameters.
Assume that ϕ_j is invertible, with inverse map denoted as ϕ_j^{-1}, and that h^(n_levels) is given, which can be the hidden feature vector of a training or test point, or newly sampled from the latent space. First, given the learned h_i^(j) from training, we introduce an additional term per level to the objective (5) for a point x: (1/2) ϕ_1(x)^T ϕ_1(x) for the first level and (1/2) ϕ_j(h^(j−1))^T ϕ_j(h^(j−1)) for levels j = 2, . . ., n_levels.
Characterizing the stationary points w.r.t. ϕ_1(x) and ϕ_j(h^(j−1)), we obtain

∂J/∂ϕ_1(x) = 0 ⇒ ϕ_1(x) = W_1 h^(1),
∂J/∂ϕ_j(h^(j−1)) = 0 ⇒ ϕ_j(h^(j−1)) = W_j h^(j), ∀j = 2, . . ., n_levels, (18)

so that the feature map ϕ_j(·) of each level can be calculated from the given hidden features h^(j). DKPCA then generates new samples through the inverse maps of the multiple levels; accordingly, a generated sample x is attained through ϕ_1^{-1} in the first level, which maps W_1 h^(1) back to the input space, as shown in (10). In case the inverse map ϕ_1^{-1} is not known explicitly, one can learn a pre-image map by minimizing the AutoEncoder reconstruction error as described in Section 3.3.
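The generative procedure in (18) amounts to walking the levels backwards: each level maps its hidden features down one level through W_j and the inverse feature map, until the first level returns to input space. A minimal sketch under the invertibility assumption, with user-supplied inverse maps (the identity maps in the usage below make the chain purely linear and are illustrative only):

```python
import numpy as np

def generate_sample(h_top, W_list, inv_maps):
    """Walk the levels backwards: at level j, phi_j(h^(j-1)) = W_j h^(j) and
    the (assumed known) inverse feature map recovers h^(j-1); level 1 maps
    W_1 h^(1) back to input space through phi_1^{-1}."""
    h = h_top
    for W, inv in zip(reversed(W_list), reversed(inv_maps)):
        h = inv(W @ h)
    return h

identity = lambda v: v                               # trivial invertible map
W1 = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])  # level-1 weights (3x2)
W2 = np.array([[0.5, 0.0], [0.0, 0.5]])              # level-2 weights (2x2)
h2 = np.array([2.0, 4.0])                            # sampled top-level features
x = generate_sample(h2, [W1, W2], [identity, identity])
```

With identity feature maps the chain reduces to x = W_1 (W_2 h^(2)), which makes the backward walk easy to verify.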
We also developed an extension to attain the hidden features in each level corresponding to an out-of-sample point x from the first, third, and fifth equations in (16). For the two-level case with linear k_2, where the out-of-sample extension is more straightforward, the hidden features h^(1) and h^(2) of x are obtained by eliminating the interconnection matrices.

A.3 Proof of Deep Approximation Analysis
In this section, we give the proofs of Lemmas 4.1 and 4.2 in the two-level case of (13).

A.3.1 Proof of approximation bounds
With the level-wise SVD interpretation of the discussed two-level case in (13), the Eckart-Young theorem can be applied to both levels, deriving the approximation errors with r_1 = rank(K_1 + (1/η_2) H_2 H_2^T) and r_2 = rank((1/η_2) H_1 H_1^T). We fix η_1 = 1 and vary the regularization factor η_2. With the orthonormality constraints in the second level, ‖H_2 H_2^T‖_F = √s_2, and the lower bound in Lemma 4.1 is obtained.
Using the orthonormality constraints of the second level, we square (19) and rewrite it as (20). By the Cauchy-Schwarz inequality, we further obtain (21). Recalling that K_1 is positive semi-definite, the inequality for the upper bound when η_2 > 0 in Lemma 4.1 is obtained by using (21) in (20). When η_2 < 0 in Lemma 4.1, with symmetric K_1, note that max_{H_2^T H_2 = I} Tr(H_2^T K_1 H_2) = Σ_{i=1}^{s_2} λ_i, which gives the upper bound by combining with (20). This completes the proof of the bounds for the approximation analysis in Section 4.2.

B.1 Setup Details

The IRS metric [43] measures the robustness of the inferred latent variables to interventions on the generative factors. In other words, if a latent variable is associated with some generative factor, the inferred value of this latent variable shows little change when that factor remains the same, regardless of interventions to the other generative factors. Other metrics for disentanglement evaluation have been proposed, but it has been shown that they are closely correlated with each other [14].
Hyperparameter selection  In the unsupervised learning experiments, for consistent evaluations, the hyperparameters shared among all methods are fixed to the same values, e.g., the RBF bandwidth in KPCA. We fix η_j = 1, j = 1, . . ., n_levels, and γ = 1 in (11) to equally balance the AE and deep KPCA errors. For the more challenging SmallNORB, we set γ = 100. In the qualitative disentanglement experiments, we use the Riemannian Adam algorithm [32] with a maximum of 80000 epochs; concerning the principal components, s_1 = s_2 is set to the true number of generative factors, and the factors of variation involving the fewest pixels are trained on a subset where the factors of highest variation are kept fixed, since the latter would otherwise dominate the principal components because they involve the largest number of pixels. In the quantitative disentanglement experiments, we employ the two-level architecture of (13) with linear kernels. In the explained variance experiments, subsampling of 3DShapes and MNIST is performed with N = 50 and N = 100, respectively. In the supervised experiments, the RBF kernel is used for all datasets. Tuning is carried out through grid search based on validation performance. The σ² of the RBF kernels is tuned between exp(−2) and exp(7). For DKPCA, we tune η_2 between −10 and 10. The hidden features of the test points are obtained through a kernel smoother approach [47] for the supervised and the large-scale disentanglement experiments. The shared hyperparameters of the compared methods are tuned under the same settings, e.g., the kernel parameters are tuned in the same range.

B.2 Additional Results
Table B.3 gives the test reconstruction errors on a 140-dimensional synthetic dataset with different numbers of principal components in the two-level DKPCA of Eq. (13) with linear kernels. In Table B.3, for a fixed s_1, the best test error is obtained with s_2 = s_1: the test error does not further decrease for s_2 > s_1. In fact, the rank of H_1 H_1^T is at most s_1, so an H_2 with rank higher than s_1 cannot lead to a lower approximation error. Therefore, for a fixed s_1 in this two-level architecture with linear kernels, s_2 should be chosen with s_2 ≤ s_1 in terms of reconstruction error, while increasing s_1 leads to lower reconstruction error as more principal components are incorporated.
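The rank argument can be verified directly: H_1 H_1^T has rank at most s_1, so its trailing eigenvalues vanish and no second-level rank beyond s_1 can improve its approximation. A small numpy check with illustrative dimensions:

```python
import numpy as np

rng = np.random.default_rng(4)
N, s1 = 50, 4
H1 = rng.standard_normal((N, s1))   # N x s1 matrix of level-1 hidden features
G = H1 @ H1.T
# rank(H1 H1^T) is at most s1, so a second level with s2 > s1
# cannot reduce the approximation error of H1 H1^T any further
assert np.linalg.matrix_rank(G) == s1
# equivalently, all eigenvalues beyond the s1 largest vanish
w = np.linalg.eigvalsh(G)[::-1]
assert np.allclose(w[s1:], 0.0, atol=1e-8)
```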

Figure 3 :
Figure 3: Overview of generative DKPCA with n_L levels. Multiple latent spaces are considered with multi-level hidden features h^(j), ∀j = 1, . . ., n_L. The feature maps ϕ_j are indicated with arrows going from left to right. The generative model employs the pre-image maps ψ_j, represented by the arrows going from right to left. The dashed line in input space represents the reconstruction error. The projecting vector in the latent spaces indicates the projections onto the corresponding s_j-dimensional latent subspace.

Lemma 4 . 2 (
Explained variance of deep KPCA). In the full decomposition case (s_1 = s_2 = N), when η_2 < −1/λ_N, the explained variance of the top n principal components of DKPCA in (13) is strictly greater than the variance explained by the top n principal components of shallow kernel PCA, i.e.,

Figure 5 :
Figure 5: Results on the 3DShapes dataset. (a) Role of the deep principal components. The ground truth on the 1st row, reconstructions on the 2nd row, and traversals in the latent spaces induced by DKPCA and FactorVAE on the other rows. The factors extracted by DKPCA are better disentangled than those of FactorVAE. Unlike FactorVAE, DKPCA shows a hierarchy of details, where the second level learns more complex factors of variation than the first level. (b) Explained variance (%) of both DKPCA and shallow KPCA using the same kernel. DKPCA captures considerably greater explained variance (more informative features) in the first principal components than KPCA, where the lines denote the cumulative explained variance and the bars denote the variance explained by each component. (c)(d)(e) Scatter plots of the latent variable distributions, where DKPCA learns one latent space per level. The FactorVAE distribution shows partial irregularity, while the distributions learned by DKPCA follow a more compact Gaussian profile, centered around the origin in the second level.

Figure 6 :
Figure 6: Role of the deep principal components. First row: ground truth. Second row: reconstructions. Other rows: traversals in the latent spaces induced by DKPCA. DKPCA shows a hierarchy of details, where the second level learns more complex factors of variation than the first level.

Figure 9 :
Figure 9: Interpretation of the deep eigenvalues. Explained variance (%) of DKPCA. (a) The compared method is kernel PCA with an RBF kernel of the same bandwidth. Our method captures considerably greater explained variance in the first principal component than shallow KPCA, showing that the proposed deeper architecture outputs more informative principal components even with the same kernel function as shallow KPCA. (b) Illustration of Lemma 4.2: the first DKPCA level maintains higher cumulative explained variance than KPCA for all n, capturing more information in fewer components. (c) Four-level DKPCA with RBF kernels on MNIST. In all plots, bars: explained variance; lines: cumulative explained variance.

Figure 10 :
Figure 10: Explained variance of the first level and reconstruction error (training MSE) for the Synth 1 dataset with s_1 = 6. This experiment shows the minimum number of components such that the approximation error is small enough, giving practitioners a guarantee on the faithfulness of the representation learned by the proposed model. For this dataset, the reconstruction error is 0 with s_1 = s_2 = 6.

Figure 11 :
Figure 11: Full decomposition for a subset of 3DShapes (N = 480). Cumulative explained variance (%) from the deep eigenvalues of the first level of DKPCA and reconstruction error. A sharp increase in the explained variance corresponds to a distinctive drop in reconstruction error, which reaches 0 for the full decomposition.

Table 1 :
Comparison of test performance for classification/regression and disentangled feature learning by DKPCA on real-world datasets of various data types. Higher scores (↑) are better for ACC (%) and WINDIN; lower scores (↓) are better for RMSE. The best performance is in bold. For all datasets s_1 = 3, s_2 = 2, with s_1 + s_2 = s. Both KPCA and DKPCA employ RBF kernels; hyperparameters are tuned on a validation set using a 60/20/20 split for training/validation/test. All datasets are UCI datasets from [45].

Table B .
3: Test reconstruction error (MSE) on the 140D Synth 3 dataset for different numbers of principal components of the two levels in the proposed deep KPCA. All numbers are ×10².