Generative Restricted Kernel Machines

We propose a novel method for estimating generative models based on the Restricted Kernel Machine (RKM) framework. This mechanism uses the shared representation of data from various sources where training only involves solving an eigenvalue problem. By defining an explicit feature map, we show that neural networks could be incorporated in the current framework. Experiments on various datasets demonstrate the potential of the model through qualitative evaluation of generated samples.


Introduction
Generative modeling is a rapidly advancing area of machine learning research with applications in many fields, such as generated art, on-demand video, image denoising [1], exploration in reinforcement learning [2], collaborative filtering [3], inpainting [4] and many more.
In general, three approaches have been used in generative modeling tasks. First, graphical models based on a probabilistic framework with latent variables, such as variational auto-encoders [5] and Restricted Boltzmann Machines (RBMs) [6,7]. Second, more recently proposed models based on adversarial training, such as Generative Adversarial Networks (GANs) [8] and their many variants. Third, autoregressive models such as Pixel Recurrent Neural Networks (PixelRNNs) [9], which model the conditional distribution of every individual pixel given the previous pixels; generation involves sequentially predicting the pixels in an image along the two spatial dimensions. All these approaches have their own advantages and disadvantages. For example, RBMs allow us to perform both learning and Bayesian inference in graphical models with latent variables. However, such probabilistic models must be properly normalized, which requires evaluating intractable integrals over the space of all possible variable configurations, hence posing difficulties in maximum likelihood computations [7]. Currently, GANs are state-of-the-art for generative modeling and produce the sharpest images, but they are more difficult to train due to unstable training dynamics, unless more sophisticated variants are applied. Few multi-view generative models exist, and those that do combine the above generative mechanisms [10,11].
In this work, we propose an alternative generative mechanism based on the framework of RKMs [12], which yields a representation of kernel methods with visible and hidden units, establishing links between Kernel Principal Component Analysis (Kernel PCA), Least-Squares Support Vector Machines (LS-SVM) [13] and RBMs. This framework renders the energy form of RBMs into a non-probabilistic setting, thereby having no requirement for proper normalization. This allows their immediate applicability to regression and classification problems. Recently, [14] used this framework to develop tensor-based multi-view classification models and [15] showed that Kernel PCA could be used to generate new data.
The contributions of this paper further extend the RKM framework to multi-view generative models, where multiple data could be generated simultaneously using a single instance of a sampled latent variable. Learning data representations is based on common-subspace learning, and training only involves solving an eigenvalue problem. The generative mechanism involves computing the pre-image of the feature vectors. Two methods are proposed: one for when the feature map is explicitly known and one for when it is unknown. We show that the mechanism is flexible enough to incorporate both kernel-based and (deep) neural network based models in the same setting. Lastly, RKMs can be viewed as a form of non-probabilistic graphical model, and they provide high flexibility in the design of architectures and training criteria, as will be shown in the following sections.
This paper is organized as follows. In Section 2 we discuss the training mechanism of Generative RKMs when multiple data sources are available. In Section 3 we explain the generative mechanism when implicit or explicit feature maps are used during training. In Section 4, we show experimental results of our mechanism on the MNIST and Small-NORB datasets. Section 5 concludes the paper and outlines future work.

Generative Restricted Kernel Machines
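We assume a dataset $\mathcal{D} = \{x_i, y_i\}_{i=1}^{N}$, with $x_i \in \mathbb{R}^{d}$, $y_i \in \mathbb{R}^{p}$, comprising $N$ data points. Here $y_i$ may represent an additional view of $x_i$, e.g., an additional image of the same subject from a different angle, or label information, as in the case of MNIST digits. Starting from the RKM interpretation of Kernel PCA, which gives an upper bound on the equality-constrained least-squares Kernel PCA objective function [12], and applying the feature maps $\phi_1 : \mathbb{R}^{d} \to \mathbb{R}^{d_f}$ and $\phi_2 : \mathbb{R}^{p} \to \mathbb{R}^{p_f}$ to the input data points, the training objective function $J_t$ for the generative RKM is given by:

$$J_t = \sum_{i=1}^{N} \left( -\phi_1(x_i)^\top V h_i - \phi_2(y_i)^\top U h_i + \frac{\lambda_i}{2} h_i^\top h_i \right) + \frac{\eta_1}{2} \mathrm{Tr}(V^\top V) + \frac{\eta_2}{2} \mathrm{Tr}(U^\top U) \qquad (1)$$

where $V \in \mathbb{R}^{d_f \times s}$ and $U \in \mathbb{R}^{p_f \times s}$ are the unknown interaction matrices, and $h_i \in \mathbb{R}^{s}$ are the latent variables modeling a common subspace $\mathcal{H}$ between the two input spaces $\mathcal{X}$ and $\mathcal{Y}$ (see Fig. 1). Similar to Energy-Based Models (EBMs), RKM objective functions capture dependencies between variables by associating a scalar energy to each configuration of the variables. Learning consists of finding an energy function in which observed configurations of the variables are given lower energies than unobserved ones. Note that the schematic representation is similar to discriminative RBMs [16] and the objective function $J_t$ has an energy form similar to RBMs with additional regularization terms. Given $\eta_1 > 0$ and $\eta_2 > 0$ as regularization parameters, the stationary points of $J_t$ are given by:

$$\begin{aligned} \frac{\partial J_t}{\partial h_i} = 0 &\;\Rightarrow\; \lambda_i h_i = V^\top \phi_1(x_i) + U^\top \phi_2(y_i), \quad \forall i \\ \frac{\partial J_t}{\partial V} = 0 &\;\Rightarrow\; V = \frac{1}{\eta_1} \sum_{i=1}^{N} \phi_1(x_i) h_i^\top \\ \frac{\partial J_t}{\partial U} = 0 &\;\Rightarrow\; U = \frac{1}{\eta_2} \sum_{i=1}^{N} \phi_2(y_i) h_i^\top \end{aligned} \qquad (2)$$

where we have used vector and matrix derivatives [17]. Substituting $V$ and $U$ in the first equation above, and denoting $\Lambda = \mathrm{diag}\{\lambda_1, \dots, \lambda_s\} \in \mathbb{R}^{s \times s}$, yields the following eigenvalue problem:

$$\left[ \frac{1}{\eta_1} K_x + \frac{1}{\eta_2} K_y \right] H^\top = H^\top \Lambda \qquad (3)$$

where $H = [h_1, \dots, h_N] \in \mathbb{R}^{s \times N}$ and $K_x, K_y \in \mathbb{R}^{N \times N}$ are the kernel matrices of the two views. Based on Mercer's theorem [18], the positive-definite kernel functions $k_1(x_i, x_j) = \langle \phi_1(x_i), \phi_1(x_j) \rangle$ and $k_2(y_i, y_j) = \langle \phi_2(y_i), \phi_2(y_j) \rangle$, $\forall i, j = 1, \dots, N$, form the elements of the corresponding kernel matrices. The feature maps $\phi_1$ and $\phi_2$, mapping the input data to a high-dimensional (possibly infinite-dimensional) feature space, are implicitly defined by kernel functions such as Gaussian, polynomial and convolution kernels, to name a few [19]. However, one can also define explicit feature maps, which preserve the positive-definiteness of the kernel function by construction [13].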
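Remark on centering: For notational convenience, it is assumed that all the feature vectors defined above and in the following sections are centered, i.e. $\tilde{\phi}_1(x) := \phi_1(x) - \frac{1}{N} \sum_{i=1}^{N} \phi_1(x_i)$ and $\tilde{\phi}_2(y) := \phi_2(y) - \frac{1}{N} \sum_{i=1}^{N} \phi_2(y_i)$. A kernel matrix that is not centered can be centered by replacing it as follows:

$$K_c = K - \frac{1}{N} 1 1^\top K - \frac{1}{N} K 1 1^\top + \frac{1}{N^2} 1 1^\top K 1 1^\top \qquad (4)$$

where $1$ denotes an $N$-dimensional vector of ones and $K$ is either $K_x$ or $K_y$, the kernel matrix of $\mathcal{X}$ or $\mathcal{Y}$.
Remark on multiple data sources: While in the above section we have assumed that only two data sources (namely $\mathcal{X}$ and $\mathcal{Y}$) are available for learning, the procedure can easily be extended to multiple data sources. Following the same arguments, this yields the following training problem:

$$\left[ \sum_{\ell=1}^{M} \frac{1}{\eta_\ell} K_\ell \right] H^\top = H^\top \Lambda \qquad (5)$$

where $M$ is the number of views or data sources, and $K_\ell$ and $\eta_\ell$ are the kernel matrix and regularization parameter of the $\ell$-th view.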
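The training step described in this section (build one kernel matrix per view, center it, and solve a single eigenvalue problem on the regularized sum) can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions, not the authors' implementation: the helper names, the Gaussian kernel bandwidths `sigmas`, the synthetic data, and the unit-norm eigenvector scaling returned by `numpy.linalg.eigh` are all choices made here for illustration.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    """Gaussian (RBF) kernel matrix between the rows of A and the rows of B."""
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def center_kernel(K):
    """Center a square kernel matrix in feature space: C K C with C = I - (1/N) 1 1^T."""
    N = K.shape[0]
    C = np.eye(N) - np.ones((N, N)) / N
    return C @ K @ C

def train_gen_rkm(views, etas, sigmas, s):
    """Solve [sum_l K_l / eta_l] H^T = H^T Lambda and keep the s leading components.

    Returns H with shape (s, N), the s eigenvalues, and the centered kernel matrices.
    """
    Ks = [center_kernel(gaussian_kernel(X, X, sig)) for X, sig in zip(views, sigmas)]
    M = sum(K / eta for K, eta in zip(Ks, etas))
    eigvals, eigvecs = np.linalg.eigh(M)      # symmetric solver, ascending eigenvalues
    idx = np.argsort(eigvals)[::-1][:s]       # indices of the s largest eigenvalues
    return eigvecs[:, idx].T, eigvals[idx], Ks

# Synthetic two-view data (hypothetical stand-in for images and labels).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
Y = rng.normal(size=(50, 3))
H, lam, (Kx, Ky) = train_gen_rkm([X, Y], etas=[1.0, 1.0], sigmas=[1.0, 1.0], s=5)
```

Adding a further view only means appending another array to `views` together with its own regularization parameter and bandwidth, which mirrors the multi-view remark.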

Generating Data
In this section, we derive the equations for the generative mechanism. Since the RKM is an Energy-Based Model, inference consists of clamping the values of the observed variables and finding configurations of the remaining variables that minimize the energy.
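For the learned interconnection matrices $U$ and $V$, and a given latent variable $h^\star$, consider the following objective function:

$$J_g = -\phi_1(x^\star)^\top V h^\star - \phi_2(y^\star)^\top U h^\star + \frac{1}{2} \phi_1(x^\star)^\top \phi_1(x^\star) + \frac{1}{2} \phi_2(y^\star)^\top \phi_2(y^\star) \qquad (6)$$

with an additional regularization term on data sources. Here $J_g$ denotes the objective function for generation. The stationary points of $J_g$ are characterized by:

$$\frac{\partial J_g}{\partial \phi_1(x^\star)} = 0 \;\Rightarrow\; \phi_1(x^\star) = V h^\star, \qquad \frac{\partial J_g}{\partial \phi_2(y^\star)} = 0 \;\Rightarrow\; \phi_2(y^\star) = U h^\star$$

Using $V$ and $U$ from Eq. (2), we obtain the generated feature vectors:

$$\phi_1(x^\star) = \frac{1}{\eta_1} \sum_{i=1}^{N} \phi_1(x_i) h_i^\top h^\star, \qquad \phi_2(y^\star) = \frac{1}{\eta_2} \sum_{i=1}^{N} \phi_2(y_i) h_i^\top h^\star \qquad (7)$$

Figure 2: After learning, latent variables $h^\star$ could be sampled from the fitted probability distribution $p(h)$ for generating $x^\star$ and $y^\star$ simultaneously.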
Algorithm (excerpt):
    if feature map = implicit then
        keep the eigenvectors corresponding to the s largest eigenvalues
        compute $k_{x^\star}$, $k_{y^\star}$ using Eq. (8)
    end if

One now needs to find the inverse images of the feature maps $\phi_1$ and $\phi_2$ in the respective input spaces, i.e. solve the pre-image problem. When using kernel methods, feature maps are not known explicitly. Commonly used kernels, such as radial-basis function and polynomial kernels, map the input data to a very high-dimensional feature space. Hence finding the pre-image is, in general, known to be an ill-conditioned problem [20]. However, various approximation techniques have been proposed [21,22,23,24], which can be used to obtain an approximate pre-image $\hat{x}$ of $\phi_1(x^\star)$. This problem can be avoided by using an explicit, invertible feature map during training and then using its explicit expression to deduce $x^\star$ from $\phi_1(x^\star)$. In the remainder of this section, we illustrate pre-image methods for both cases, when the feature map is explicitly known or unknown, and demonstrate corresponding experimental results in the following section.
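Implicit Feature Map. Since $x^\star$ may not exist, we find an approximation $\hat{x}$. As shown in [15], left-multiplying Eqs. (7) by $\phi_1(x_i)^\top$ and $\phi_2(y_i)^\top$, $\forall i = 1, \dots, N$, we obtain:

$$k_{x^\star} = \frac{1}{\eta_1} K_x H^\top h^\star, \qquad k_{y^\star} = \frac{1}{\eta_2} K_y H^\top h^\star \qquad (8)$$

where $H = [h_1, \dots, h_N]$ collects the latent variables, $k_{x^\star} = [k(x_1, x^\star), \dots, k(x_N, x^\star)]^\top$ represents the similarities between $x^\star$ and the data points in the feature space, and $K_x$ represents the $N \times N$ centered kernel matrix of $\mathcal{X}$. Similar conventions follow for $\mathcal{Y}$. Using the kernel-smoother method, the pre-images are given by:

$$\hat{x} = \frac{\sum_{j=1}^{n_r} k(x_j, x^\star)\, x_j}{\sum_{j=1}^{n_r} k(x_j, x^\star)}, \qquad \hat{y} = \frac{\sum_{j=1}^{n_r} k(y_j, y^\star)\, y_j}{\sum_{j=1}^{n_r} k(y_j, y^\star)} \qquad (9)$$

where $n_r$ is the number of nearest neighbors, i.e. the training points with the largest similarities, over which the sums run.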
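The implicit-feature-map generation step can be sketched as follows: compute the similarity vector between the new point and the training points from the kernel matrix and the latent variables, then average the most similar training points. This is a minimal single-view sketch, not the authors' code. The function names and synthetic data are hypothetical, and the min-max rescaling of the similarities to [0, 1] is an assumption added here so that the smoother weights stay non-negative when a centered kernel yields negative similarities.

```python
import numpy as np

def similarities_to_new_point(K, H, h_star, eta):
    """k_star = (1/eta) K H^T h_star: similarities between the new point and the N training points."""
    return K @ H.T @ h_star / eta

def kernel_smoother_preimage(k_star, data, n_r=3):
    """Weighted average of the n_r training points most similar to the new point."""
    # Assumption: rescale similarities to [0, 1] so that all weights are non-negative.
    span = k_star.max() - k_star.min()
    w = (k_star - k_star.min()) / span if span > 0 else np.ones_like(k_star)
    nearest = np.argsort(w)[::-1][:n_r]       # indices of the n_r largest similarities
    return (w[nearest] @ data[nearest]) / w[nearest].sum()

# Synthetic single-view example (hypothetical data).
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
C = np.eye(30) - np.ones((30, 30)) / 30
K = C @ np.exp(-sq / 2.0) @ C                 # centered Gaussian kernel with sigma = 1
eigvals, eigvecs = np.linalg.eigh(K)
H = eigvecs[:, ::-1][:, :10].T                # 10 leading eigenvectors, shape (s, N)
h_star = H[:, 0]                              # reuse the latent variable of training point 0
k_star = similarities_to_new_point(K, H, h_star, eta=1.0)
x_hat = kernel_smoother_preimage(k_star, X, n_r=3)
```

Because the result is a convex combination of training points, the pre-image always lies inside the region spanned by the selected neighbors; with `n_r=1` it reduces to returning the single most similar training point.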
Explicit Feature Map. With an appropriate explicit feature map, Mercer's theorem still holds because the kernel function is positive-definite by construction, thereby allowing the derivation of Eq. (3). In our experiments, we have used feed-forward neural networks with Parametric Rectified Linear Unit (PReLU) activations in the hidden layers as an explicit feature map [25]. Since such networks are simply compositions of activation functions with matrix multiplication and bias addition as arguments, we only require the activation functions to be invertible and the weight matrices to be non-singular.
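The invertibility requirement can be illustrated with a small network that is inverted layer by layer. The sketch below is an assumption-laden illustration rather than the architecture used in the experiments: it takes square weight matrices (input, hidden and output dimension all equal to 6) so that an exact inverse exists, and it draws well-conditioned weights near the identity purely to keep the round trip numerically stable. All function names are hypothetical.

```python
import numpy as np

def prelu(z, a):
    """PReLU activation: identity for z >= 0, slope a for z < 0."""
    return np.where(z >= 0, z, a * z)

def prelu_inv(z, a):
    """Inverse of PReLU for a > 0: negative values are divided by the slope."""
    return np.where(z >= 0, z, z / a)

def feature_map(x, W1, b1, W2, b2, a):
    """phi(x): one PReLU hidden layer followed by a linear output layer (rows of x are samples)."""
    return prelu(x @ W1.T + b1, a) @ W2.T + b2

def feature_map_inverse(f, W1, b1, W2, b2, a):
    """Undo the layers in reverse order; requires square, non-singular W1 and W2."""
    hidden = np.linalg.solve(W2, (f - b2).T).T
    return np.linalg.solve(W1, (prelu_inv(hidden, a) - b1).T).T

rng = np.random.default_rng(0)
d, a = 6, 0.001                               # slope 0.001 as quoted for the experiments
W1 = np.eye(d) + 0.1 * rng.normal(size=(d, d))
W2 = np.eye(d) + 0.1 * rng.normal(size=(d, d))
b1, b2 = rng.normal(size=d), rng.normal(size=d)
X = rng.normal(size=(5, d))
X_rec = feature_map_inverse(feature_map(X, W1, b1, W2, b2, a), W1, b1, W2, b2, a)
```

With a small slope such as 0.001 the inverse activation amplifies negative values by a factor of 1000, so poorly conditioned weights can make the reconstruction numerically fragile; this is one reason the sketch uses weights close to the identity.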

Experiments
To demonstrate the applicability of our framework, we trained the Generative RKM model on the MNIST and Small-NORB datasets using both an implicit feature map given by a Gaussian kernel and an explicit feature map given by a feed-forward neural network. In the case of a kernel method, training only involves constructing the kernel matrix and solving the eigenvalue problem (3), which yields the latent variables $H$. In principle, one could also use the latent variables directly for generation. However, in our experiments, we fit a normal distribution to the latent variables and randomly sample a new point $h^\star$ for generating views using the kernel smoother technique ($n_r = 3$) as explained above. The neural network architecture for constructing an explicit feature map consists of a hidden layer of 6 neurons with PReLU activation functions (parameter value = 0.001) and an output layer of 6 neurons with a linear activation function. We chose a rather basic architecture since our aim is to show the applicability of the method. The training procedure in the case of neural networks consisted of minimizing $J_t$ using the fmincon function in MATLAB 2018b. The weights and biases were initialized randomly from a Gaussian distribution $\mathcal{N}(0, 1)$, and within each iteration of the minimizer, Eq. (3) is solved to update the value of $H$.
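Fitting a normal distribution to the latent variables and drawing a new latent point can be sketched as below. The text does not state whether a full or a diagonal covariance was fitted; this sketch assumes a full covariance matrix, and the array `H` is a synthetic stand-in for the learned latent variables.

```python
import numpy as np

def fit_gaussian(H):
    """Fit a multivariate normal to latent variables H of shape (s, N), one latent vector per column."""
    mean = H.mean(axis=1)
    cov = np.atleast_2d(np.cov(H))            # rows are variables, columns are observations
    return mean, cov

def sample_latent(mean, cov, n_samples, rng):
    """Draw n_samples new latent vectors, returned with shape (n_samples, s)."""
    return rng.multivariate_normal(mean, cov, size=n_samples)

rng = np.random.default_rng(0)
H = rng.normal(size=(3, 200))                 # hypothetical latent variables, s = 3, N = 200
mean, cov = fit_gaussian(H)
h_new = sample_latent(mean, cov, n_samples=4, rng=rng)
```

Each row of `h_new` plays the role of a sampled latent point that is then passed to the pre-image step to produce all views at once.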

MNIST
The MNIST dataset contains 60,000 training and 10,000 testing images (28 × 28 pixels) of ten hand-written digits (0-9) along with the label information. We pre-processed the images by standardization, and the original labels were transformed into 10-dimensional arrays with 0-1 encoding. In our experiments, we use the images as view X and the label information as view Y. We trained the model on 5000 samples, with 500 samples from each class. Figure 3 shows the image and label generation using the kernel smoother method. We can see that the generated image and the generated label match in most cases. Multiple digits on top of some images indicate mixing. For example, the label 13 indicates a mixture of the images 1 and 3. Such label generation is made possible by the 0-1 encoding scheme. Figure 4a shows the images generated when the feature map was a feed-forward neural network with the architecture and parameters described above.
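To see why the 0-1 encoding makes mixed labels such as 13 readable, consider averaging the encoded labels of neighbors from two classes, as a kernel smoother would. The sketch below is illustrative only: the text does not say how generated label vectors were converted into the digits printed above the images, so the thresholding rule and the value 0.3 are hypothetical.

```python
import numpy as np

def one_hot(labels, n_classes=10):
    """0-1 encoding: row i has a single 1 in column labels[i]."""
    Y = np.zeros((len(labels), n_classes))
    Y[np.arange(len(labels)), labels] = 1.0
    return Y

def decode_label(y_hat, threshold=0.3):
    """Concatenate every digit whose generated score exceeds the (hypothetical) threshold."""
    digits = np.flatnonzero(y_hat > threshold)
    return "".join(str(d) for d in digits)

Y = one_hot(np.array([1, 3]))
y_mix = Y.mean(axis=0)                        # equal-weight average of a "1" label and a "3" label
mixed = decode_label(y_mix)                   # "13"
```

A generated label vector with mass on a single entry decodes to one digit, while a vector averaged over neighbors from two classes decodes to both.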

Small-NORB
This dataset contains images and labels of 3D toys belonging to 5 generic categories: four-legged animals, human figures, airplanes, trucks, and cars. The images were taken by two cameras under 6 lighting conditions, 9 elevations (30 to 70 degrees every 5 degrees), and 18 azimuths (0 to 340 degrees every 20 degrees) [26]. Each image is 96 × 96 pixels with integer greyscale values in the range [0, 255]. We trained our models without any pre-processing on the original images of 2 classes, i.e. cars and airplanes. Figure 4b shows the images generated when the neural network was used as the feature map.

Targeted Generation
Since every latent variable $h_i$ corresponds to a data point, it can be selected to generate the corresponding data in the input space. This could be useful in critical applications where data needs to be generated based on some prior knowledge. Additionally, noise could be added to the latent variable to generate new data similar to the desired one. We demonstrate the targeted generation on the MNIST dataset using the kernel smoother technique (see Fig. 5). Latent variables corresponding to 0, 5, 4 and 3 were selected, and the generated images are shown in the yellow bounding boxes. Then Gaussian noise with mean 0 and variance 0.005, 0.01, 0.05 and 0.1 was added to the latent variables, and the corresponding generated images are shown in the following columns. One can observe that as the noise variance is gradually increased, the generated images change from similar images to completely new ones. The only exception is the case of 0, where the image generation is more robust to noise.
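The perturbation step of targeted generation can be sketched as follows. The latent matrix `H` and the selected index are hypothetical stand-ins, and isotropic noise (the same variance in every latent dimension) is an assumption. Note that NumPy's `scale` argument is a standard deviation, so the quoted variances are passed through a square root.

```python
import numpy as np

def perturb_latent(h, variance, rng):
    """Add zero-mean isotropic Gaussian noise with the given variance to a latent vector."""
    return h + rng.normal(loc=0.0, scale=np.sqrt(variance), size=h.shape)

rng = np.random.default_rng(0)
H = rng.normal(size=(4, 100))                 # stand-in for learned latent variables, shape (s, N)
h_target = H[:, 17]                           # latent variable of the selected data point
variances = [0.005, 0.01, 0.05, 0.1]
h_noisy = [perturb_latent(h_target, v, rng) for v in variances]
```

Each perturbed vector in `h_noisy` would then be passed through the same pre-image step as the unperturbed `h_target` to produce one column of generated images.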

Conclusion and future work
This paper presents a non-probabilistic framework that extends the RKM mechanism to multiple data sources. This allows for a generative mechanism where the feature map can be defined using kernel functions or neural network based methods. When using kernel functions, training consists only of solving an eigenvalue problem. In the case of a neural network based model, training involves an alternating minimization scheme. Experiments on the MNIST and Small-NORB datasets show that the model is capable of generating new images. Furthermore, a targeted generation mechanism is demonstrated using the kernel-smoother method, where adding noise to the latent variable generates similar images. Extensions of this work include training with more advanced models such as Convolutional Neural Networks and investigating the effect of other pre-image methods. This paper has demonstrated the applicability of the Generative Restricted Kernel Machine framework, suggesting that these new research directions are worth exploring.

Figure 1: Restricted Kernel Machine modeling a common subspace between two data sources

Figure 3: Image and label generation using the kernel smoother technique to construct the pre-image. The numbers above images show the generated label, where multiple labels above some images show the corresponding mixing of images.

Figure 4: Image generation using neural networks as feature map.