Unsupervised learning of disentangled representations in deep restricted kernel machines with orthogonality constraints

We introduce Constr-DRKM, a deep kernel method for the unsupervised learning of disentangled data representations. We propose augmenting the original deep restricted kernel machine formulation for kernel PCA by orthogonality constraints on the latent variables to promote disentanglement and to make it possible to carry out optimization without first defining a stabilized objective. After illustrating an end-to-end training procedure based on a quadratic penalty optimization algorithm with warm start, we quantitatively evaluate the proposed method's effectiveness in disentangled feature learning. We demonstrate on four benchmark datasets that this approach performs similarly overall to $\beta$-VAE on a number of disentanglement metrics when few training points are available, while being less sensitive to randomness and hyperparameter selection than $\beta$-VAE. We also present a deterministic initialization of Constr-DRKM's training algorithm that significantly improves the reproducibility of the results. Finally, we empirically evaluate and discuss the role of the number of layers in the proposed methodology, examining the influence of each principal component in every layer and showing that components in lower layers act as local feature detectors capturing the broad trends of the data distribution, while components in deeper layers use the representation learned by previous layers and more accurately reproduce higher-level features.


Introduction
The choice of the features on which a machine learning method is trained on is crucial to achieve good performance. Finding representations of the data that make it easier to train classifiers or other predictors is the goal of representation learning (Bengio et al., 2013).
One desirable characteristic of good representations is disentanglement, which means that the learned representation separates the factors of variations in the data (Bengio, 2009;Bengio et al., 2013). In fact, in representation learning it is common to assume a low dimensional multivariate random variable z representing the meaningful factors of variations: a high dimensional observation x is then sampled from the conditional distribution P (x|z). For example, a model trained on faces may learn latent generative factors such as hair style, emotion or the presence of glasses. In this respect, the empirical success of deep learning in supervised learning tasks is often linked to the ability of deep networks to learn meaningful intermediate representations (Vincent et al., 2010;Zeiler & Fergus, 2014).
However, unsupervised learning of disentangled representations is still a key challenge in artificial intelligence research (Bengio et al., 2013;LeCun et al., 2015;Lake et al., 2017). State-of-the-art models are mostly based on the generative adversarial network (GAN) framework, such as InfoGAN (Chen et al., 2016), or on variational autoencoders (VAEs) (Kingma & Welling, 2014), including β-VAEs (Higgins et al., 2017), FactorVAE (Kim & Mnih, 2018) and β-TCVAE (Chen et al., 2018). Concerning the former approaches, it has been observed that training InfoGANs is difficult because of the training instability of the GAN framework (Higgins et al., 2017). On the other hand, approaches based on VAEs offer training stability (Higgins et al., 2017), but a recent extensive empirical evaluation has found that their disentangling performance was not reliable as it varied widely with hyperparameter selection, random seeds and among datasets (Locatello et al., 2019).
With respect to unsupervised feature extraction, principal component analysis (PCA) is a standard methodology. More in general, it is well established that kernel methods are stable to train and give reliable performance, which contrasts sharply with the issues outlined in the above paragraph. However, in contrast to deep networks, they are shallow methods, so they cannot take advantage of depth to hierarchically decompose a difficult target function into a composition of simpler functions (Allen- Zhu & Li, 2020). Finally, currently there is no widely accepted technique to promote disentanglement in kernel methods.
This paper introduces Constr-DRKM, a deep kernel method for the unsupervised learning of disentangled representations. Our approach is based on deep restricted kernel machines (deep RKMs) (Suykens, 2017), which provide a framework rooted in well-understood and reliable kernel methods. There has been little investigation on the use of deep RKMs for unsupervised learning. Moreover, while recent evidence suggests that shallow restricted kernel machines can achieve good disentangling performance (Pandey et al., 2020a), no previous study has investigated the potential for unsupervised learning of disentangled representations in deep RKMs.
We propose employing a deep restricted kernel machine consisting of multiple kernel PCA layers augmented by orthogonality constraints on the latent variables to perform unsupervised extraction of disentangled features. We explain that the orthogonality constraints have two effects, called intraorthogonality and interorthogonality effects, which encourage the deep RKM to learn a disentangled representation of the data. As a result of the introduction of the orthogonality constraints, we are able to avoid defining a stabilized version of the deep RKM objective, which is instead needed in the original formulation of deep RKMs (Suykens, 2017). Furthermore, we show how to train the proposed model end-to-end, so that the components of lower layers can use the representations learned by higher layers. To evaluate the proposed methodology, we conduct experiments in the task of disentangled feature learning. We quantitatively show that RKMs can benefit from depth and that a 2-layer Constr-DRKM architecture performs similarly to the state of the art on a number of disentanglement metrics when only a fraction of training points are available, while being more reliable in terms of sensitivity to hyperparameter selection than β-VAE.
Our main contributions are as follows.
• We propose a novel deep kernel method called Constr-DRKM by reformulating the deep restricted kernel machine framework for kernel PCA (Suykens, 2017) into a constrained optimization problem with orthogonality constraints on the latent variables so that disentanglement is encouraged and optimization can be carried out without first defining a stabilized objective.
• We propose an end-to-end training algorithm, allowing lower layers to exploit the representation learned by higher layers, by employing a quadratic penalty optimization algorithm with warm start, additionally introducing a deterministic initialization scheme. Furthermore, we present a denoising procedure for deep RKMs.
• We show how to apply our proposed method to the unsupervised learning of disentangled representations without any prior knowledge on the generative factors, demonstrating empirically that its learned representations perform similarly overall compared to β-VAE in terms of multiple disentanglement metrics on four benchmark datasets when few training points are available. We also evaluate and discuss the influence of hyperparameters on the performance of the proposed method, demonstrating that our approach is less sensitive to hyperparameter choice than β-VAE: Constr-DRKM's disentangling performance tends to remain steady as its hyperparameter γ varies, in contrast to the strong influence of the hyperparameter β on β-VAE's performance. In particular, we show that deterministic initialization of Constr-DRKM's training algorithm considerably improves the reproducibility of the results.
• Finally, we show the benefit of Constr-DRKM over kernel PCA in denoising complex 2D data distributions. In addition, we study the influence of each principal component by visualizing the concept learned by each component in every layer. In this way, we illustrate the role of the different layers: the first layer performs lower-level feature detection and focuses on, for example, edges and corners, while the second layer employs those lower-level features and captures more global features, which in turn allow for more accurate reproduction of the details of the original data distribution.

Related work
A conventional method to extract features in an unsupervised manner is PCA (Jolliffe, 1986). The underlying assumption of PCA is that the input high-dimensional observations lie in a lower-dimensional linear subspace, which is indeed similar to the fundamental assumption of representation learning introduced in the previous section. PCA works through a linear transformation that projects the input observations in a new coordinate system, where the direction of maximum variance is called the first principal component, the orthogonal direction with the second highest variance is called the second principal component, and so on. The dimensionality reduction is achieved by considering only the first s principal components. Kernel PCA is an extension of PCA that introduces nonlinearities by mapping the input observations to a higher dimensional feature space using a kernel function (Schölkopf et al., 1998). Linear PCA is then performed in that space. Our proposed method builds on multiple layers of kernel PCA and introduces orthogonality constraints on the latent variables to promote disentanglement. In contrast to our method, neither PCA nor kernel PCA exploits depth or deals with disentanglement.
In deep learning, pre-training each layer of a deep network in an unsupervised fashion before fine tuning the entire network is a common technique. In this context, several works have proposed unsupervised methodologies that learn data representations, including (Ranzato et al., 2007;Vincent et al., 2008Vincent et al., , 2010. For instance, (Ranzato et al., 2007) proposed an unsupervised encoder-decoder architecture that employed multiple convolutions, sparsity constraints and pooling to build a hierarchical representation of the data and boost invariance; in fact, pooling was shown to represent an information bottleneck that can promote invariance (Achille & Soatto, 2018). Another example is (Vincent et al., 2010), where invariant representations were obtained using stacked denoising autoencoders. In that work it was shown that using those representations to train a classifier led to improved accuracy not only for deep networks, but also for support vector machines with radial basis function kernel, supporting our hypothesis that kernel methods can benefit from deep representations. With the mentioned works, our method shares the idea of exploiting depth, but it applies that idea in the framework of kernel methods instead of neural networks. In particular, differently to (Vincent et al., 2010), our method stacks kernel PCA layers instead of autoencoders. Finally, in contrast with the works mentioned in this paragraph, our method takes disentanglement of the learned features into account.
One early approach to build disentangled representations was independent component analysis (ICA), which is a method to extract independent components from a signal (Comon, 1994). It is based on the assumption that the observed signal is a linear mixture of unknown latent signals that are non-Gaussian and mutually independent. However, it was shown empirically that ICA has a poor disentangling performance compared to the state of the art (Higgins et al., 2017).
A more recent approach to the unsupervised learning of disentangled representations is InfoGAN (Chen et al., 2016). Building upon the GAN framework (Goodfellow et al., 2014), InfoGAN uses a generator G that aims to produce a realistic observation from the concatenation of a noise vector n with a latent representation z. Disentanglement in z is encouraged by adding a regularizer to the typical GAN objective.
The regularizer aims to maximize a lower bound of the mutual information between z and the generated observation G(n, z). In contrast to InfoGAN, our method is based on kernel methods, which provide more stable training procedures compared to the GAN framework. Finally, InfoGAN's disentangling performance has been empirically shown to be significantly lower than VAE-based methods (Higgins et al., 2017), which represent the state of the art.
Most VAE-based methods are based on the β-VAE approach (Higgins et al., 2017). In this setting, one assumes that the prior over the latent variables P (z) is Gaussian. Then, the posterior q φ (z|x), which is approximated using a deep neural network parametrized by φ, is learned by maximizing where the likelihood p θ (x|z) is approximated by a deep neural network parametrized by θ, β is a hyperparameter controlling the degree of disentanglement and D KL is the Kullbach-Leibler (KL) divergence defined as D KL (r(x)|s(x)) = E r [log r/s]. In (Higgins et al., 2017) it is claimed that fixing β > 1 promotes disentanglement because it imposes a stricter information bottleneck on the latent factors z. (Chen et al., 2018) extends this explanation by identifying in Equation (1) a total correlation term that was already known to be related with the disentanglement of a representation (Ver Steeg & Galstyan, 2015;Achille & Soatto, 2018). Similarly to (Chen et al., 2018), FactorVAE (Kim & Mnih, 2018) augments the VAE objective (Eq. (1) with β = 1) with an approximation of the total correlation. However, (Locatello et al., 2019) performed a largescale experimental study analyzing six VAE-bases methods, including β-VAE, β-TCVAE and FactorVAE, and concluded that they cannot be used to reliably learn disentangled representations in the unsupervised setting, as their disentangling performance was shown to vary widely with hyperparameter selection, random seeds and among datasets.
Regarding methods based on restricted kernel machines (RKMs) (Suykens, 2017), which is the framework our work is built upon, (Pandey et al., 2020a) has recently proposed a generative model called Gen-RKM, which was experimentally shown to be able to produce disentangled representations. Gen-RKM makes use of a single level of kernel PCA expressed in the RKM framework to derive the latent representations, while the method proposed in this paper introduces a deep architecture made up of multiple kernel PCA objectives. In fact, while (Pandey et al., 2020a) considered a single feature map consisting of a deep convolutional neural network, in this paper we consider several feature maps over multiple levels. In other words, following the terminology used in (Suykens, 2017), in the former case deep learning is only performed over layers, while in the latter case depth is given by multiple levels, each associated with a different feature map possibly consisting of multiple layers. For the sake of simplicity, in this paper the term "layer" is used instead of "level" when there is no ambiguity. Moreover, Constr-DRKM introduces new disentanglement constraints on the latent variables, defining in this way a constrained optimization problem that eliminates the need for the stabilization of the loss function used in (Suykens, 2017) and (Pandey et al., 2020a) and that can handle explicit and implicit feature maps in the same manner. Before introducing our proposed architecture, it is useful to first briefly illustrate the deep RKM framework.

Background: deep restricted kernel machines
This section reviews the framework of deep restricted kernel machines, as Constr-DRKM builds upon it. First, the restricted kernel machine formulation of a single layer of kernel PCA is explained. Then, deep restricted kernel machines are introduced by means of an example. This section ends with the description of an open problem in deep restricted kernel machines: performing end-to-end training and promoting disentanglement at the same time.

Restricted kernel machine formulation of kernel PCA
The restricted kernel machine formulation of kernel PCA gives another expression of the Least-Squares Support Vector Machine (LS-SVM) kernel PCA problem (Suykens et al., 2003) as an energy with visible and hidden units that is similar to the energy of restricted Boltzmann machines (RBMs) (Bengio, 2009;Fischer & Igel, 2014;Hinton et al., 2006;Salakhutdinov, 2015). In this new formulation, contrary to RBMs, both the visible units v i and the hidden units h i can be continuous. To derive this formulation, consider a training set of N data points of dimension d, a feature map ϕ : R d → R d F and let s be the number of selected principal components. In the LS-SVM setting, the kernel PCA problem can be written as: where W ∈ R d F ×s is an unknown interconnection matrix, e i ∈ R s are the error variables and η and λ are hyperparameters. The restricted kernel machine formulation of kernel PCA (Suykens, 2017) is given by an upper bound of J kpca obtained by introducing the latent representations h i ∈ R s , i = 1, . . . , N using: This leads to the following objective: which is the formulation of kernel PCA in the restricted kernel machine framework. The v i are called visible units because their states are observed, while the hidden units h i correspond to feature detectors (Hinton, 2012). In the representation learning literature, h i is also known as the (latent) representation or code of v i ; we say that h i consists of s latent or code variables or of s hidden features or units. Note that the first term of J t is similar to the energy of an RBM.
As in other energy-based models, the RKM energy for kernel PCA associates a scalar value to each configuration of the variables. As a consequence, given training points x i , i = 1, . . . , N , training means clamping the visible units v i to the training points x i and finding an energy function for which the observed configurations of the visible units are given lower energies than unobserved configurations (LeCun et al., 2006). To do so, one characterizes the stationary points of J t , which results in an eigenvalue problem of the kernel matrix K ∈ R N ×N , with K ij = ϕ(x i ) T ϕ(x j ) (Suykens, 2017).

Deep restricted kernel machines
The theory of deep restricted kernel machines was initially proposed in (Suykens, 2017) with the aim of introducing a new perspective in the connection between deep learning and kernel machines. Such deep RKMs are obtained by coupling several RKMs in sequence, resulting in a deep architecture. An example of a deep RKM is now given.
A possible configuration of a deep RKM that extracts features of some data consists of two kernel PCA layers in sequence. Each kernel PCA layer follows the description given in Subsection 3.1. The architecture can be summarized in the following manner.
• Layer 1 consists of kernel PCA using as input the observation x i from the given data. The features extracted by this layer are characterized by its hidden features h i .
• Layer 2 consists of kernel PCA using as input the hidden features h  i . Note that this architecture has a similar structure to stacked autoencoders (Bengio, 2009): each layer performs unsupervised learning and the hidden features produced by each layer serve as input to the next layer. The deep RKM is trained by considering an objective function that joins the objectives of each kernel PCA layer. To formalize the training objective, let s 1 be the number of selected principal components by the first layer of kernel PCA and s 2 be the number of selected principal components by the second layer of kernel PCA. Then, let ϕ 1 : R d → R d F 1 be the feature map and of layer 1 and let ϕ 2 : R s1 → R d F 2 be the feature map of layer 2. Also, let λ 1 , λ 2 , η 1 , η 2 > 0 be hyperparameters. The training objective of the above defined deep RKM is: where J 1 and J 2 are the objective of a single layer of kernel PCA in the RKM framework as defined in Eq.
(2) using the suitable input. That means: are the interconnection matrices of layer 1 and layer 2, respectively, and h (1) i ∈ R s1 and h (2) i ∈ R s2 are the hidden features of layer 1 and layer 2, respectively.
Training the above defined deep RKM means finding the interconnection matrices and the hidden features minimizing the energy J t,deep . Since J t,deep is unbounded below, to make the energy suitable for minimization, (Suykens, 2017) proposed to instead minimize a stabilized version of J t,deep . Following (Pandey et al., 2020a), this version is defined as: where c stab > 0 is a hyperparameter. It can be shown that J t,deep and J t,deep stab share the same stationary points (Pandey et al., 2020a).

Effective algorithms for deep RKMs: an open problem
Deriving effective algorithms for training deep RKMs is an open problem. In (Suykens, 2017) a layer-wise training procedure was proposed for the case of linear kernels. However, previous research (Allen-Zhu & Li, 2020) has stressed the importance of end-to-end training in deep architectures to produce representations able to efficiently approximate the target function. The ability of deep learning methods to produce such efficient representations is often linked to hierarchical learning, because deep networks can hierarchically decompose a difficult target function into a composition of simpler functions (Allen-Zhu & Li, 2020). This ability has been recently explained by a mechanism called "backward feature correction" (Allen-Zhu & Li, 2020), which means that layers of lower abstraction can use the representations learned by layers of higher abstraction to improve the quality of their representation. "Backward feature correction" and thus hierarchical learning cannot be achieved using layer-wise training alone (Allen-Zhu & Li, 2020).
On the other hand, it is interesting to note that training a deep RKM as the one described in Subsection 3.2 in a layer-wise manner yields mutually orthogonal hidden features, as they are obtained solving an eigenvalue problem of the symmetric kernel matrix K. This is an advantage when it comes to the disentanglement of the produced representations, as experimentally shown in (Pandey et al., 2020a) in the single layer case. However, as it was explained in the previous paragraph, layer-wise training does not take full advantage of the deep architecture, so one would like to perform end-to-end training instead. In the end-to-end training case, one cannot simply solve two eigenvalue problems in sequence, so the mutual orthogonality of the hidden features is lost. Likely, this loss would affect negatively the disentanglement of the learned representations. In this context, the following section proposes Constr-DRKM, which is a method that allows to perform end-to-end training and to promote disentanglement at the same time.

The Constr-DRKM method
In the previous section, two key issues in the development of effective training algorithms for deep RKMs were identified: performing end-to-end instead of layer-wise training and promoting disentanglement in the hidden features at the same time. Accordingly, the aim of this section is to propose a method, based on deep RKMs, for the unsupervised learning of latent representations of some given data so that • it promotes disentanglement in the learned hidden features, and • it carries out end-to-end training.
To address both aspects, we propose augmenting the original deep RKM formulation of two layers of kernel PCA, as described in Subsection 3.2, by orthogonality constraints on the latent variables of both layers.
Let s 1 be the number of selected principal components of the first layer of kernel PCA and s 2 be the number of selected principal components of the second layer of kernel PCA. Then, let ϕ 1 : R d → R d F 1 be the feature map and of layer 1 and let ϕ 2 : R s1 → R d F 2 be the feature map of layer 2. The proposed optimization problem is: minimize where W 1 ∈ R d F 1 ×s1 and W 2 ∈ R d F 2 ×s2 are, respectively, the interconnection matrices of layer 1 and layer 2, h (1) i ∈ R s1 and h (2) i ∈ R s2 are, respectively, the latent variables of layer 1 and layer 2, N ] ∈ R s2×N , I s1+s2 denotes the (s 1 + s 2 ) × (s 1 + s 2 ) identity matrix and λ 1 , λ 2 , η 1 , η 2 > 0 are regularization constants.
The orthogonality constraints have two effects. The first effect, called the intraorthogonality effect, enforces the mutual orthogonality of the hidden features learned by the first layer, as well as the mutual orthogonality of the hidden features learned by the second layer. The second effect, called an interorthogonality effect, enforces the orthogonality between the hidden features learned by the first layer and the hidden features learned by the second layer, so that the two layers are encouraged to learn new features of the data instead of repeating the same features in both layers. Both effects aim to push the deep RKM to learn a more disentangled representation of the data.
By employing end-to-end instead of layer-wise training in deep restricted kernel machines, lower layers can improve their representation by exploiting the representation learned by higher layers. This architecture may also be beneficial to disentangled feature learning, as it has previously been observed that a two-level hierarchical structure can promote disentanglement (Esmaeili et al., 2019). Note that the formulation shown in (3) can be easily extended to more than two kernel PCA layers. One reason for making use of more layers is that it might improve the invariance of the learned representation, as invariance is promoted by stacking layers (Achille & Soatto, 2018). Another manner of boosting invariance would be to increase the information bottleneck between each layer by selecting fewer principal components (Achille & Soatto, 2018).
After having trained the model on all training data points x i and having learned the hidden features h (1) i and h (2) i of each x i , one can encode an out-of-sample data point x in the manner proposed in (Pandey et al., 2020b) extended to the 2-layer case. The latent representation of x is computed by projecting it on the latent space using: where k 0 (x, y) = ϕ 1 (x) T ϕ 1 (y) and k 1 (x, y) = ϕ 2 (x) T ϕ 2 (y) are the kernel functions for the first and second layer, respectively, obtained by means of the kernel trick. Instead of first defining a feature map ϕ i and then deriving the kernel, one can simply choose a positive definite kernel k i−1 due to Mercer's theorem (Mercer & Forsyth, 1909), which guarantees the existence of a feature map φ such that

A training algorithm for Constr-DRKM
In the energy-based interpretation of deep RKMs, the training phase of Constr-DRKM consists of finding the interconnection matrices W 1 and W 2 and the hidden units h (1) i and h (2) i so that the observed configurations x i are given lower energies than unobserved configurations. Training of Constr-DRKM can be therefore performed by solving the equality constrained nonlinear optimization problem (3). However, the number of variables of the optimization problem (3) is large and depends on d F1 and d F2 , which are the dimensionalities of the feature spaces used by ϕ 1 and ϕ 2 . If, for instance, the Gaussian kernel is used, the dimensionality of the feature space would be infinite and therefore it would not be possible to directly solve the optimization problem (3). To address this issue, we show how to rewrite J t,Constr-DRKM to eliminate the interconnection matrices W 1 and W 2 . This can be done considering each layer separately. The following shows how to eliminate the interconnection matrix for the first layer; eliminating W 2 follows the same procedure.
First, recall that, when training RKMs, each visible unit v i is fixed to the training point x i . Therefore, in training, the stationary point of J t,Constr-DRKM with respect to W 1 is given by: Then, in order to eliminate the interconnection matrix W 1 , two terms in J t,Constr-DRKM (1) have to be rewritten: and η1 2 Tr W T 1 W 1 . This is accomplished using (6). The latter term can be rewritten as follows: The former term can be rewritten as follows: . Note that here W 1 was replaced using (6) by its expression in terms of the training points x i , but the term ϕ(v i ) was not expressed in terms of the training points. Keeping ϕ(v i ) expressed in v i is useful when, after training, the visible units are taken as unknowns. For instance, in RBMs this is the case in data generation.
Combining the above rewrites, J t,Constr-DRKM (1) becomes: The optimization problem of Constr-DRKM in (3) can finally be rewritten by combining the expressions for the first and second layer after weight elimination. Given that in training K Compared to (3), the above optimization problem does not have the interconnection matrices as variables. This means that not only training is more efficient but also that Constr-DRKM can deal with explicit and implicit feature maps in the same manner, even if their feature space has infinite dimensionality. In addition, as a consequence of the introduced constraints and of the elimination of the interconnection matrices, it is not necessary to define a stabilized version of the objective to make it suitable for minimization, as it was instead needed in the original formulation of deep RKMs as explained in Subsection 3.2.
End-to-end training of the proposed deep RKM can be performed by solving the equality constrained nonlinear optimization problem (7). The constraint set of (7) is a Stiefel manifold St(s 1 + s 2 , N ), so one of the algorithms that have been proposed for optimization on the Stiefel manifold could be employed. For example, one could use the Newton's method on the Stiefel manifold developed in (Smith, 1993(Smith, , 1994Edelman et al., 1998). This algorithm generates points along the geodesic, which is expensive to compute because it uses matrix exponentials. Alternatively, one could avoid computing the geodesic by instead exploiting the Cayley transform to determine the search curve, such as in the algorithms proposed in (Wen & Yin, 2013;Zhu, 2017). However, these methods could also be computationally heavy because the complexity of determining the search curve at each iteration is dominated by a matrix inversion.
To avoid expensive matrix computations, we propose applying a quadratic penalty optimization algorithm with warm start based on the description given in (Nocedal & Wright, 2006). First, define a function that combines the objective of (7) with an additional term penalizing solutions violating the orthogonality constraints. One such function is called the quadratic penalty function Q(h i ; µ) and is defined as where µ is a penalty parameter penalizing violations of the orthogonality constraints. The constrained optimization problem is then replaced by a sequence of unconstrained ones with increasing µ. In the k-th unconstrained problem, the h i ; µ k ) are sought, with starting point set to the minimizers found in the (k − 1)-th problem. Initialization can be done with random values drawn from the standard normal distribution or with deterministic layer-wise kernel PCA initialization, such that the initial hidden features of each layer are computed locally using kernel PCA. The unconstrained problems can be then solved by some unconstrained minimization algorithm; in our experiments, we used Adam (Kingma & Ba, 2015). If these subproblems are solved inexactly, in general the quadratic penalty optimization algorithm will not converge to the global solution of (7). In practice, we halt the outer loop of Algorithm 1 after a fixed number of iterations that we choose so that it is larger when the number of variables of the optimization problem is higher. This choice translates to running more outer iterations when the number of training point N is higher, as the number of variables depends on N .
Algorithm 1 Algorithm for training a 2-layer Constr-DRKM using a quadratic penalty optimization algorithm with warm start. (h (1) i ) s k denotes the starting point h (1) i at iteration k. Note that Q also depends on the hyperparameters λ 1 , λ 2 , η 1 , η 2 and on the kernel functions k 0 and k 1 , which need to be chosen before optimization.

14: end function
Regarding kernels' hyperparameters, they might also be optimized together with the latent variables by adding it as a variable of the optimization problem (7). Alternatively, kernels' hyperparameter selection can be carried out by fixing a validation set and selecting the hyperparameter values that perform best on it. When it comes to the scalability of solving (7), the size of both K (0) and K (1) grows quadratically in N . On the other hand, their size does not depend on the dimensionality d of the input space. Regarding the unknown matrices H (1) and H (2) , their size does not depend on d, but it depends on N , as well as on the number of selected principal components s 1 and s 2 , respectively. In addition, K (1) is computed from H (1) , so optimization might suffer from the increased non-linearity.

Denoising and Constr-DRKM
This section presents a reconstruction procedure that can denoise a test point x . As in PCA, denoising is carried out by keeping only the first s principal hidden features because one can assume that noise is concentrated in the components of lower variance. Performing reconstruction is straightforward in PCA because it is just a basis transformation, but the deep architecture and non-linear feature maps of Constr-DRKM pose a greater challenge. In fact, it is possible that a point in the feature space used by ϕ 1 does not have a pre-image in the input space. In addition, multiple non-linear mappings have to be taken into account during reconstruction.
We adapt the approach proposed in Schölkopf et al., 1999) in the context of kernel PCA to a 2-layer Constr-DRKM. Extending the following procedure to more than two layers is straightforward. Call F 1 the feature space used by ϕ 1 and assume that the mapped data points are centered in F 1 . First, h (2) , the latent representation of x characterized by the second layer, is computed following Eq. (5). Similarly, h (1) is then computed following Eq. (4). Employing this representation, only s 1 components are kept, discarding the components that are noisier. The reconstruction of x from its projections z k , k = 1, . . . , s 1 onto the first s 1 principal components in F 1 is where v k ∈ F 1 is the k-th principal component. As explained in (Schölkopf et al., 1998), v k lies in the span of ϕ 1 (x 1 ) . . . ϕ 1 (x N ), so (8) can be written as The denoised point in the input space is computed by finding a pointx such that ϕ 1 (x ) approximates P s1 ϕ 1 (x ). To this aim, we minimize Using (9), the above can be rewritten as Given that, by Eq. (17) of (Schölkopf et al., 1998), that, in the LS-SVM formulation of kernel PCA, α i = 1/λ 1 e i (Suykens et al., 2003(Suykens et al., , 2002 and that e i = λ 1 h (1) i in restricted kernel machines (Suykens, 2017), one can rewrite (10) in terms of the hidden units as where In general, standard gradient descent methods can be employed to minimize (11); in this case, note that the last term of (11) is independent ofx . In the case of the RBF kernel,  proposed the following iteration scheme forx : .
In denoising, the starting point is set to the noisy observation x .

Experimental evaluation
The goal of the experimental evaluation is to test the feasibility of the proposed method for denoising and to assess the advantage of the Constr-DRKM architecture in the task of unsupervised disentangled feature learning.

Denoising
We applied the denoising procedure explained in Section 4.2 to complex 2D synthetic datasets. Each dataset is generated by selecting 3000 points as a training set and 750 additional points as a validation set. In this set of experiments, the noise n is white Gaussian with zero mean and standard deviation σ n varying among different values. The number of selected components is s 1 = 2 for the first layer and s 2 = 1 for the second layer; the hidden units are initialized in a layer-wise manner with kernel PCA. All η and λ are set to 1. The kernel function employed in both layers is the RBF kernel and its bandwidth is selected employing the validation set. Constr-DRKM is first trained on the noisy points to find their latent representations and then each x i is denoised by computing its pre-image minimizing the expression in Eq. (11).
First, a half circle and a square, depicted in Figure 1, are considered. It can be seen that Constr-DRKM successfully captures the structure of the data distributions for both shapes. Note that denoising was effective even though the chosen overall number of principal components was higher than the dimensionality of the datasets. This would not have been the case with linear PCA, which performs perfect reconstruction, hence does not denoise, when using as many components as the dimensionality of the input data.  Secondly, two additional more complicated datasets were analyzed to study the role of each layer. The result of the influence of each component in every layer can be shown by denoising using only that component. For a dataset with a square and a spiral next to it, three plots, shown in Figure 2, were produced. The first plot is the result of denoising using only the first component of the first layer: it learns the shapes, but has a few outliers and is still noisy around the center of the spiral. The second plot shows the denoised dataset using only the second component of the first layer: it does not show outliers and better reproduces the higher frequency details around the center of the spiral, but loses part of the square. Finally, the third plot is the result of denoising using only the first component of the second layer: it keeps the higher frequency details and reconstructs the square completely. Overall, the principal component of the first layer captures the broad trend but has outliers, while the second component learns the details of the shapes but loses the general trend in some regions. The component of the second layer, on the other hand, both picks up the general trend and reproduces the details of the shapes.
In addition to the experiment above, a similar analysis for a more complicated data distribution, consisting of two squares, a spiral and a ring, is shown in Figure 3. Using only the first component of the first layer results in some artifacts: two distinct loops of the spiral intersect and two sides of the two squares are joined. On the other hand, the second component reconstructs those shapes correctly, but does not well reproduce the inner circle of the ring. This circle is better reproduced by the first component of the second layer. The findings of the previous two experiments suggest that the lower layer in the deep architecture of Constr-DRKM functions as lower-level feature detector focusing on the broad trends of the data distribution, while the higher layer exploits the representation learned by the lower layer and represents higher-level features which lead to better denoising and more accurate reproduction of the original data distribution. These results are  consistent with previous findings in deep Boltzmann machines (Le Roux & Bengio, 2008) and in convolutional neural networks (Zeiler & Fergus, 2014) that lower layers focus on local features, such as edges and corners, and higher layers capture progressively more global and complex patterns with increasing invariance.  Finally, denoising performance was compared to kernel PCA with the same number of overall components. Table 1 reports the ratio of the reconstruction error, computed for all denoised points, between Constr-DRKM and kernel PCA, where ratios larger than 1 mean that the deep architecture resulted in better denoising than the shallow one. In the considered set of experiments, Constr-DRKM outperforms the shallow kernel PCA in denoising. More complicated data distributions, such as the ones in Figure 2 and 3, show greater performance gain compared to the shallow architecture. This was expected because kernel PCA is known to be able to effectively denoise simple data distributions , but its denoising performance degrades as the datasets become more complex. The superiority of the deep architecture is strongly marked for small σ n , which, together with the observations on the influence of the second layer made in the previous sets of experiments, seems to indicate that the additional layer has a crucial role in the reconstruction of the finer details of the data distribution, as this effect might become less noticeable with higher noise levels.
On the whole, our proposed architecture's representational efficiency benefited from depth, since it attained improved denoising of the data distributions in the same number of principal components as the shallow architecture.   Table 1: Reconstruction error ratios between Constr-DRKM and kernel PCA in denoising complex 2D synthetic data distributions for different noise levels. Ratios larger than 1 mean that the deep architecture resulted in better denoising than the shallow one and the larger than 1 the better. In the deep architecture, the number of selected principal components is s 1 = 2 for the first layer and s 2 = 1 for the second layer, while the number of principal components used by kernel PCA is 3.

Disentanglement
This section aims to quantitatively assess that Constr-DRKM is able to learn a disentangled representation of the factors of variation of the data. In addition, the role of the hyperparameters, of the number of selected principal components and of the number of layers is studied. We applied our method on the Cars3D dataset (Reed et al., 2015), on the dSprites dataset (Higgins et al., 2017) and on the SmallNORB dataset (LeCun et al., 2004), as well as on a noisy version of dSprites introduced in (Locatello et al., 2019) obtained by replacing the background pixels with Gaussian noise with zero mean and unit variance. In all datasets, each data point is generated according to a deterministic function of its ground-truth latent representation. All data points are 64 × 64 images: Cars3D and noisy dSprites contain RGB images, while dSprites and SmallNORB contain grayscale images. For details of the datasets, see Appendix A.1.
In the experiments, the dimension of the learned latent representation is fixed to 10: this choice was also made in the large experimental evaluation of (Locatello et al., 2019) and, furthermore, this number is greater than but close to the ground-truth number of factor of variations. Computing the hidden features of some data point in Constr-DRKM translates to selecting 10 principal components. We chose to do so in the deep architecture with n layers ≥ 1 of Constr-DRKM by either fixing s i to 10 for some i such that 1 ≤ i ≤ n layers or by having n layers i=1 s i = 10 and concatenating all h (i) . In all experiments, the learned models are evaluated on a subset of N eval = 4000 data points chosen randomly from the relevant dataset.
Given that there is no single widely accepted measure to quantify disentanglement, we use three metrics that have been proposed in the literature, namely the IRS score (Suter et al., 2019), the mutual information gap (MIG) Chen et al. (2018) and the SAP score (Kumar et al., 2018). The IRS metric measures robust disentanglement, which means that, if a latent variable is associated with some generative factor G, the inferred value of that latent variable shows little change when G remains the same, regardless of changes in the other generative factors. The MIG metric is computed as the average over all generative factors of the difference between the two latent variables with highest mutual information with each generative factor. The SAP score is formulated by first building a score matrix S such that S ij is the classification score of predicting the j-th generative factor using only the i-th latent variable; the final score is the mean of the differences between the top two entries for each column, which corresponds to averaging over the generative factors. For all considered metrics, a higher score indicates better disentangling performance.
In the first set of experiments, the role of the number of selected principal components is investigated. The studied architecture has n layers = 3. All η and λ are set to 1. The chosen kernel function is the RBF kernel and its bandwidth is added as a variable to the optimization problem; this is the case for all experiments in this section. For each dataset, the training set is a random sample of N = 100 data points. Eight different choices for the number of selected components s 1 , s 2 and s 3 are considered. Some representative results are presented in Figure 4. For additional plots, see Appendix A.3. Figure 4 shows the IRS score attained by Constr-DRKM models with varying number of selected principal components in its three layers. The variance is due to five different random seeds. It is clear that the number of selected components has an important role in the disentangling performance of Constr-DRKM. Its influence is particularly evident in the SmallNORB dataset, where a bad choice of s 1 , s 2 and s 3 led to considerably worse scores than the other choices. None of the evaluated choices of s 1 , s 2 and s 3 consistently resulted in poor performance on all datasets: for instance, (20, 20, 10) was not the best performer on the  (10,20,20) (10,3,2) (2,2,6) (2,3,5) (20,20,10) (3,4,3) (5,5,10)  SmallNORB dataset, but it was the choice with the best median score on the dSprites dataset. In a similar manner, no choice of the number of selected components was found to always give higher disentanglement score than any other choice for the datasets considered in the experiments. However, some choices led to better scores more consistently than others, while also showing smaller variance. For instance, models with s 1 = 2, s 2 = 2 and s 3 = 6 never resulted in significantly poorer performance than the other models and always had modest variance with respect to random seeds in the experiments. In general, the variance due to randomness varied depending on the dataset and on the disentanglement metric considered. For instance, the variance for the IRS metric on SmallNORB was small, but it was large for the SAP score metric on the same dataset. Interestingly, one can see that, in general, randomness affects certain combinations of dataset and disentanglement metric more than others. For instance, in Figure 4 most models on SmallNORB have small variance, whereas most models have higher variance on the other datasets.
In the second set of experiments, the role of the number n layers of layers on Constr-DRKM's disentangling performance is studied as the number N of training points grows from 50 to 800. In the experiments, n layers is taken from {1, 2, 3}. The studied architectures are as follows: the 1-layer architecture has s 1 = 10, the 2-layer one has s 1 = 10 and s 2 = 5 and for the 3-layer architecture, s 1 = 2, s 2 = 2 and s 3 = 6. For the 2-layer and 3-layer architectures, we chose those configurations because they were among the best ones that were empirically evaluated. On top of varying N , the hyperparameters η and λ are varied as well. Given that they have the role of weights in the objective function, we can consider a single hyperparameter γ = η λ . In the experiments, γ is taken from {0.01, 0.1, 1, 5, 25}. All experiments are repeated over five random seeds. The results for the Cars3D dataset are shown in Figure 5. Figure 5 plots the disentanglement score attained by Constr-DRKM against the number N of training data points according to n layers . The results indicate that the extent of the influence of n layers on the disentangling performance of Constr-DRKM greatly depends on the disentanglement metric. In particular, on Cars3D all considered n layers had similar SAP score, while varying the number of layers generally greatly affected the MIG and IRS scores. For example, models with 3 layers resulted in approximately double the MIG score on Cars3D compared to the 2-layer and 1-layer models when N = 400. From the experiments it can be noted that the most consistent choice in terms of disentangling performance is n layers = 2, as it is in most cases the best or close to the best choice for any considered combination of metric, dataset and N . This observation accords with our hypothesis that introducing an additional layer to RKMs can be beneficial in terms of the disentanglement of the learned representation. For the datasets considered in these experiments, adding a  Figure 5: Line chart of the mean disentanglement score of different Constr-DRKM architectures according to the number N of training data points and the number n layers of layers for the Cars3D dataset. For the 1-layer architecture, s 1 = 10, for the 2-layer one, s 1 = 10 and s 2 = 5 and for the 3-layer one, s 1 = 2, s 2 = 2 and s 3 = 6. The size of each error band is set to the value of standard error, extending from the mean. The variance is due to five different random seeds and different γ. The higher the curve, the better. third layer did not consistently increase the disentanglement scores, but this may not be the case on more difficult datasets with, for instance, multiple more realistic objects and a complex background.

Number of layers
In the third set of experiments, the performance of a 2-layer Constr-DRKM is studied as the number N of training points grows from 50 to 800 and is compared against β-VAE (Higgins et al., 2017). The studied architecture has s 1 = 10 and s 2 = 5. The hyperparameter γ is varied in the same range as in the previous set of experiments. We repeat the same experiments using a β-VAE model in place of Constr-DRKM where, instead of the hyperparameter γ, we vary β in the set {2, 3, 4, 5, 6}. Some key results are presented in Figure 6. Figure 6 plots the disentanglement score attained by both Constr-DRKM and β-VAE against the number of training data points, for a selection of datasets and metrics. The variance is due to different hyperparameters (γ for Constr-DRKM and β for β-VAE) and five random seeds. Overall, Constr-DRKM showed good disentangling performance compared to β-VAE across datasets and metrics. Remarkably, on Cars3D Constr-DRKM significantly outperformed β-VAE in the IRS score. Turning now to the SmallNORB dataset, Constr-DRKM and β-VAE produced similar MIG scores. On the other hand, the latter method resulted in considerably better median IRS scores than the former method. If we now turn to the analysis of variance, in accordance with (Locatello et al., 2019), β-VAE's performance varied greatly with random seed and hyperparameter. For example, on SmallNORB in Figure 6b the attained score varies from about 0.5 up to almost 0.9 with considerable interquartile range for all N but the smallest. Comparing β-VAE's variance to Constr-DRKM's, it can be seen that our method shows a significantly more limited variance for all N . Even on other datasets, Constr-DRKM shows small variance on the IRS score that is similar or lower than the one shown by β-VAE. This is not always the case for other metrics, as exemplified in Figure 6c. Overall, the observations made in this set of experiments suggest that Constr-DRKM is able to consistently learn a disentangled representation of the input data, competitively with the state of the art, while being less affected by randomness and hyperparameter selection than β-VAE for the IRS metric.
We now focus on the influence of the hyperparameter γ for Constr-DRKM and β for β-VAE. Figure 7 plots the disentanglement score attained by both Constr-DRKM and β-VAE as the hyperparameter γ for Constr-DRKM and β for β-VAE varies, for a selection of datasets and metrics. The variance is due to different N , which is in the same range as in Figure 6, and five random seeds. From Figure 7c it can be observed that γ does not have appreciable influence on the IRS score on SmallNORB, whereas β plays an important role in β-VAE's median performance, which dropped when setting β to 5 and 6, producing greater variance as well. A similar but less sudden trend can be noted in Figure 7a on Cars3D: increasing β leads to increased score, but the median score remains approximately steady when increasing γ. In general, therefore, it seems that Constr-DRKM is less sensitive to γ than β-VAE is to β when it comes to the considered datasets and disentanglement metrics.
The influence of γ is also studied separately for each N ∈ {50, 100, 200, 400, 800}. Figure  points. The variance is due to five random seeds. It can be noted from the plot that most lines are roughly horizontal, meaning that varying γ does not significantly affect the IRS score and that this behavior is shared by all considered number of training points. These results corroborate the findings of the previous set of experiments: Constr-DRKM's disentangling performance tends to remain steady as its γ hyperparameter varies, contrary to the behavior of β-VAE with respect to its hyperparameter β, which greatly influences its performance.
Finally, we compare random initialization to layer-wise kernel PCA initialization. Table 2 shows the mean disentanglement scores attained on Cars3D over 5 random seeds. In this experiment, initializing the hidden units using kernel PCA in a layer-wise manner outperforms β-VAE and it provides the additional benefit of increased reliability as standard deviation is zero because no random seed is employed. In particular, layer-
wise kernel PCA initialization on average achieves slightly lower IRS score compared to random initialization and it significantly outperforms both random initialization and β-VAE on the MIG metric. Overall, when it comes to the variance of the results, Constr-DRKM with layer-wise kernel PCA initialization compares favorably to β-VAE, successfully addressing the core issue of reliability of VAE-based methods brought up by (Locatello et al., 2019).

Conclusion
In this work we have proposed to reformulate the deep restricted kernel machine framework for kernel PCA (Suykens, 2017) into a constrained optimization problem with orthogonality constraints on the latent variables. At the same time, we have described a training algorithm that learns the hidden features in an end-to-end manner instead of layer-wise by employing a quadratic penalty optimization algorithm with warm start. We have then showed how the proposed method can be applied to denoising and to the problem of learning disentangled features in an unsupervised manner without any prior knowledge on the generative factors. In the former task, we studied the role of each principal component in every layer showing that components in the first layer perform lower-level feature detection, while components in the second layer employ the representation learned by lower layers and extract more global features, more accurately reproducing the original data distribution. In our experiments in the task of disentangled factor learning, the proposed Constr-DRKM method quantitatively performed similarly overall compared to β-VAE (Higgins et al., 2017) on four benchmark datasets with respect to a number of different disentanglement metrics when few training points are available. In addition, regarding the issue raised in (Locatello et al., 2019) that performance of state-of-the-art approaches to disentangled factor learning based on VAEs greatly varies when changing random seed or hyperparameter, Constr-DRKM was less sensitive to randomness and hyperparameter choice compared to β-VAE. In particular, the variance due to Constr-DRKM's hyperparameter γ was smaller than the variance due to the hyperparameter β in β-VAE and it was shown that Constr-DRKM with deterministic layer-wise kernel PCA initialization attained favorable scores without the need of a random seed, considerably improving the reproducibility of the results. Finally, the experimental analysis of the number of layers of Constr-DRKM indicates that adding a layer can increase the disentangling performance, as it was observed that a 2-layer architecture is a better choice than a single layer one. Nevertheless, 3-layer models did not consistently perform better than 2-layer models. This result does not rule out the influence of other factors, as, for example, more challenging datasets may benefit from additional layers. In future work, applying Constr-DRKM to more complicated datasets may be useful to better understand the role of the number of layers. Furthermore, it would also be interesting to investigate more advanced constrained optimization algorithms that could be useful to boost training efficiency. For the 1-layer architecture, s 1 = 10, for the 2-layer one, s 1 = 10 and s 2 = 5 and for the 3-layer one, s 1 = 2, s 2 = 2 and s 3 = 6. The size of each error band is set to the value of standard error, extending from the mean. The variance is due to five different random seeds and different γ. .9: Line chart of the mean disentanglement score of a 2-layer Constr-DRKM architecture, with s 1 = 10 and s 2 = 5, according to the hyperparameter γ for each dataset and for each number N of training points. The size of each error band is set to the value of standard error, extending from the mean. The variance is due to five different random seeds.