Efficient training of energy-based models via spin-glass control

We present an efficient method for unsupervised learning using Boltzmann machines. The method is rooted in the control of the spin-glass properties of the Ising model described by the Boltzmann machine's weights. This allows for very easy access to low-energy configurations. We apply RAPID, the combination of Restricting the Axons (RA) of the model and training via Pattern-InDuced correlations (PID), to learn the Bars and Stripes dataset of various sizes and the MNIST dataset. We show how, in these tasks, RAPID quickly outperforms standard techniques for unsupervised learning in generalization ability. Indeed, both the number of epochs needed for effective learning and the computation time per training step are greatly reduced. In its simplest form, PID allows to compute the negative phase of the log-likelihood gradient with no Markov chain Monte Carlo sampling costs at all.


I. INTRODUCTION
Machine learning has emerged as a disruptive technology transforming industries, society and science. Its perhaps most remarkable recent developments are based on supervised learning in deep neural networks. Yet unsupervised learning is expected to be much more important in the long term [1]. Energy-based models, with their ability of unsupervised learning of probability distributions for generative purposes, are promising building blocks of future machine learning systems. Among them, Boltzmann machines (BMs) have especially prospective properties. Their latent variables-called hidden neurons-allow for deep neural network architectures while the learning algorithm is remarkably simple [2].
Yet training of BMs, as a particular instance of probabilistic graphical models [3], is hard due to the need of obtaining samples from the model built. Specifically, a set of averages with respect to training data and the defined model needs to be determined for every learning step. In general, such averages cannot be computed exactly for large networks because of the large dimensionality of the vector spaces involved, but they can be estimated by sampling through Markov chain Monte Carlo (MCMC) methods. Unfortunately, the sampling of BMs is hard and MCMC algorithms are numerically costly. At present, the most popular BMs are restricted Boltzmann machines (RBMs) with only one layer of hidden neurons and no intra-layer connections. This architecture usually allows for acceptable learning by means of simple MCMC-based algorithms such as Contrastive Divergence (CD). Improving sampling offers huge learning benefits and more numerically demanding methods, like Persistent Contrastive Divergence (PCD) [4] and Population Annealing [5], have been developed to this end.
Energy-based models are closely related to problems of statistical physics. In fact, the Boltzmann machine is a neural network with stochastic dynamics that resemble the thermal fluctuations of a physical system. Therefore, the powerful methods developed for statistical physics are among the most promising for dealing with the sampling problem of Boltzmann machines. These include modern MCMC algorithms for physical systems like Parallel Tempering [6], Simulated Annealing [7] or the Wang-Landau algorithm [8], which inspire the sampling of deep Boltzmann machines (DBMs) [9]. Other methods include Langevin dynamics [10], or the use of mean-field equations like TAP [11].
The problem of sampling of BMs is so relevant and challenging that special hardware systems exploiting some physical processes have been developed to deal with such task. These include systems operating in the regime of classical physics [12,13], as well as based on purely quantum or hybrid classical-quantum machines. The two latter are examples of quantum-assisted machine learning. For example a purely quantum Boltzmann machine has been proposed based on the learning of a Boltzmann distribution of a quantum Hamiltonian [14]. In the hybrid approach, quantum devices are used to sample from both classical and quantum Gibbs distributions as an alternative to MCMC methods [15]. Both routes are promising, but have important drawbacks for practical applications. In the former, the applications are limited to small size cases due to the small number of qubits available in current quantum computers. Several experimental implementations of generative models have been realized in different quantum computing platforms, such as superconducting quantum integrated circuits [16,17] or quantum computers based in trapped ions [18]. For the sampling problem, one has to overcome the challenge posed by the sparse connectivity among qubits existing in current devices (see Ref. [19], where efforts in this direc-tion are discussed and experimentally tested in a D-Wave quantum annealer, and Ref. [20] for a thorough review in quantum annealers). Importantly, there are classicalquantum hybrid approaches based in trapped-ion analog quantum systems, see e.g. [21].
Perhaps the most profound result stemming from the statistical physics perspective of BMs is the understanding of the origin of the sampling hardness of such networks. The Boltzmann probability distribution is dominated by contributions from low energy configurations and efficient sampling must probe such configurations well. However, at the beginning of training, when typically the couplings between neurons are drawn at random, a Boltzmann machine is equivalent to the Ising model with long-range random couplings-known as the Sherrington-Kirkpatrick (SK) spin-glass model [22]. Determining the lowest energy configuration of the Ising spin-glass model defined on a non-planar graph is an NP-complete problem [23]. Furthermore, Parisi's replica symmetry breaking solution [24] of the SK spin-glass model reveals that at low temperatures, the free energy landscape is composed of local minima separated by energy barriers which lead to ergodicity breaking. As the temperature is lowered, more minima and barriers arise and "ergodic components" further split in a stochastic branching process. The resulting structure of such thermodynamic states is ultrametric [25,26]. Simple MCMC sampling algorithms which imitate thermal fluctuations, like Gibbs sampling, get trapped in the phase space (i.e. they present poor "mixing") while global algorithms have to deal with an exponential number of local minima. Note that this is not a deficiency of the algorithms but a manifestation of the glassy nature of such systems. Indeed, real spin-glass materials exhibit glassy properties such as exponentially long relaxation times.
The first key point we rise in this paper is that BMs reproducing typical training data probability distributions are not in the true SK spin-glass phase. After the successful training of a BM, the low energy states representing training data cannot be very difficult to reach and Gibbs sampling should not get stuck in one of exponentially many local energy minima. Furthermore, if the training data is not ultrametrically distributed, the distribution of sampled outputs of properly trained BMs should neither be ultrametric. The fact that present, standard methods for training BMs lead to outputs that are more ultrametrically distributed than the training data [27] points actually to a deficiency of such methods. On the other hand, such outputs are still substantially less ultrametrically distributed than in SK spin-glass phase. In this paper, we explicitly show on a simple example that during training of RBMs, the accessibility of the low energy states via Gibbs sampling rises dramatically and that low energy states form only a few minima. This seems to indicate that the SK spin-glass phase of untrained Boltzmann machines constitutes an unnecessary bottleneck in the process of training.
Therefore, instead of pursuing the paramount prob-lem of efficient sampling in the SK spin-glass phase, we take a radically different approach: we regularize the couplings in the Boltzmann machine in order to avoid a spin-glass behavior at any point of training. The way we impose such regularization while allowing for efficient training is the second key point of this paper and is presented in detail in Section II. This regularization suggests a new algorithm for estimating the gradient of the loglikelihood function, which we employ to train RBMs in various datasets, showing that the proposed technique allows for efficient learning and generalization. Furthermore, we show an improvement in training speed of orders of magnitude, both in the epoch time and in the number of epochs needed for learning, over traditional training methods. Indeed, we show how MCMC sampling is not a necessary requirement for training, reducing the numerical effort to a minimum.

II. RAPID: REGULARIZED AXONS AND PATTERN-INDUCED CORRELATIONS
We begin by recalling the standard BM, which consists of N binary neurons σ (here we use values σ j = ±1, which are standard in the physics of spin systems), separated into two disjoint sets of V visible and H hidden neurons. The energy of a given configuration of visible and hidden neurons σ = (v, h) is defined as: where weights W ij describe connections (axons) between neurons, while b i are local biases. Alternatively, such BM setup describes spin systems where the weights describe interactions of pairs of spins and the biases are local magnetic fields. Different architectures of connections (i.e. different graphs whose vertices are neurons and edges denote non-zero weights) can be considered. For example, in RBMs, there are only connections between visible and hidden neurons, and no intra-layer connections are considered. However, in the most general case the neural network is fully connected. In the following we will neglect biases, as the main issues we discuss are related to the distribution of weights. The probability of having a visible configuration v is given by a Boltzmann distribution The goal of the training is to determine the parameters of the energy function (1) such that P v represents as close as possible the distribution P data underlying some training dataset T . This is usually done by minimizing the negative log-likelihood, with respect to the weights of the W ij . As P data is independent on these variables, the minimization is only performed to log P v (v). The derivative of this term takes the form of where the bracket · denotes the expectation value with respect to the probability distributions P data v or P v (v) for the data and model averages, respectively. Sampling from such distributions is the main challenge of BMs, as discussed previously. In fact, RBMs were introduced in order to facilitate sampling from σ i σ j data . However, even for RBMs, efficient sampling from σ i σ j model is still very difficult if the weights are random and thus the system is in the SK spin-glass phase.

A. Regularized Axons
The SK spin-glass phase is related to the so-called spin frustration, which occurs when there is no configuration that minimizes the energy of all pairwise interactions at the same time. For any closed loop of weights, at least one of the interaction energies cannot be minimized if the product of weight signs is negative. The exponential number of low energy minima characteristic of the SK spin-glass phase occurs with strong frustration. However, not all models with random weights exhibit such phenomena. In Ref. [28], Mattis introduced a model with random weights but no frustration: consider a set of N variables ξ j taking values ±1. Defining the interaction between spins as given by W ij = ξ i ξ j , for any loop of weights the product of their signs is positive, since every ξ j enters such product squared. As a result the model has no frustration and its ground state is easily reachable. In fact, the configuration ξ corresponds to the desired ground state. Mattis' construction suggests a way to regularize weights with the aim of avoiding the SK spin-glass phase. However, in a learning scenario, a single configuration ξ it is not flexible enough to allow for a wide range of values for W ij (i.e., the model does not have enough plasticity to learn typical data distributions).
Here we employ a generalization of Mattis' approach, with an arbitrary number of K configurations-we will call these configurations patterns-, each described by a set of variables ξ (k) . The BM weights are constructed from the patterns via Such form of the weights is well known in machine learning from the Hopfield model of associative memory [29,30], which implements the Hebbian rule that "neurons wire together if they fire together" [31]. Although the main focus in the Hopfield model is the retrieval of fixed patterns from the dynamics of a neural network instead of generalizing data probability distribution as in BMs, these models are closely related [32]. One of the standard figures of merit of the Hopfield model is the capacity, i.e., the number K of independent random patterns that can be faithfully retrieved from a neural network of N neurons. This quantity has been thoroughly studied [33,34], and in general, there exists a constant α c such that if the ratio K/N > α c the model is in a spin-glass phase, and patterns cannot be faithfully retrieved. However, if K/N < α c , the model is a desired "retrieval phase", in which it is able to recover the K patterns "stored" per as Eq. (5). For instance, for the Hopfield model at very low temperatures the value of this critical parameter is α c ≈ 0.14.
We employ Eq. (5) as a way of regularizing the weights of a BM in order to avoid the SK spin-glass phase at any moment of training. In a typical instance of training a BM with Regularized Axons (RA), one would proceed to first choose a number of patterns K high enough to faithfully learn the data, and only then the choice of the number of hidden neurons would be performed, taking into account the desire of being outside the SK spinglass phase by making the ratio K/N small enough. In such regime, frustration in the system should be reduced and sampling of the low energy configurations greatly enabled. Furthermore, the patterns ξ (k) themselves are explicit low-energy states of the BM. This property was elaborated in [35] for independent patterns of the Hopfield model in deep retrieval phase.
Let us emphasize that we use Eq. (5) in conjunction with a low K/N ratio in order to regularize the weights of the BM, not merely to alternatively represent a given set of weights. While the latter is possible through the mapping described in Ref. [32], if the weights of the BM describe a model in the SK spin-glass phase the associated Hopfield network will have a high K/N ratio, thus being in the undesired non-retrieval phase.

B. Training via Pattern-InDuced correlations
As mentioned above, for weights of the form of Eq. (5), when K N , the patterns ξ (k) are themselves lowenergy configurations of the spin model of Eq. (1). Recalling that Boltzmann distributions of the form (2) give exponentially larger weights to low-energy configurations with respect to higher-energy ones, a consequence is that the model averages required for training, σ i σ j model , can be well approximated by This suggests a natural procedure for minimizing the loss function in Eq.
where λ is some learning rate. The exact form of ∂ ξ (k) i L can be found in Appendix A.
To perform the last step, we relax the binary constraint on the values of ξ-the fact that they take values in {−1, +1}-since the gradients in Eq. (7), and therefore the updates to the patterns, take continuous values. This will lead to ξ (k) not representing real spin configurations. Different procedures, which we describe in Appendix C, can be implemented in order to solve such problem and to return to ξ (k) representing real spin configurations. Our results (depicted in Section III) show the prominence of two methods: (a) after a given number of updates, replace the value of a given ξ Both methods show great results in simple datasets. The latter, despite losing the significance of ξ (k) as spin configurations, is better suited for more complex problems since it allows for continuous-valued weights.
Note that the presented procedure of estimating model averages, Eq. (6), does not require any MCMC sampling. Still, as is demonstrated in Section III, it already gives good results in learning various datasets. Avoiding MCMC decreases the complexity of the method, as the computational time of a single epoch of PID scales as O(N 2 ) while for CD it scales as O(N 3 ). More importantly, by avoiding the frustrated SK spin-glass phase, the sampling from the model is very effective, and al-lows much faster training also in terms of the number of epochs.
In summary, the novelty of the combination of Regularized Axons and training via Pattern-InDuced correlations (RAPID) comes from: (i) avoiding the SK spin-glass phase at any moment of training by utilizing weights constructed via Eq. (5) while scaling H to keep a low K/N ratio (ii) exploiting the patterns introduced in Eq. (5) for approximating the low-energy space of the associated spin model in an efficient way, and (iii) relaxing the constraint of having binary pattern variables, in conjunction with a subsequent discretization/regularization, in order to easily implement updates while keeping patterns reasonably close to real configurations of binary spins. As we shown next, this recipe is sufficient for employing BMs to learn typical probability distributions in a practical way.

III. RESULTS
We proceed now to analyze the performance of RAPID in learning different datasets. To compare it with BMs trained through standard methods we will focus on the restricted architectures which are typically used. The models employed in this section, which can be found in [36], are implemented in PyTorch [37] via the ebmtorch module [38].
A. Benchmark with exact training: 4x4 Bars As a first example, we trained an RBM architecture with a small number of visible neurons, V = 16, and relative to that, a large number of hidden neurons, H = 1000. The small V , along with the restricted architecture, allows for the exact calculation of the denominator in Eq.
(2) in a reasonable time. This enables the exact computation of the loss function, Eq. (3), and thus to employ the exact gradients in Eq. (7) for unbiased training. Furthermore, the ground state energy can be exactly determined at any moment of training, irrespective of whether the system is in a SK spin-glass phase or not. Therefore, we can meaningfully contrast RAPID with training of exactly solved RBMs. Moreover, we also contrast it against standard methods employed for training larger RBMs when training with the exact gradients is intractable. In particular, we will perform comparisons against RBMs trained with CD and PCD with 10 Gibbs steps and-in the case of PCD-2048 fantasy particles. An appropriate benchmark problem is learning the Bars dataset, consisting of 4 × 4 images with full vertical bars, containing a total of 14 inequivalent images. The results of the comparisons are shown in Figs. 1 and 2. Let us begin by discussing how the machines themselves change during successful learning. From the training perspective, the most important aspect of the machine being in the SK spin-glass regime or not is how hard it is to obtain a faithful distribution of states via sampling. To estimate this, we assess the ease of reaching the ground state via Gibbs sampling starting from random visible configurations. For doing so, after every epoch of training we initialize the visible neurons in a random configuration and we use Gibbs sampling to extract a representative configuration of the model. We perform n = 10 Gibbs steps, after which we calculate the energy of the resulting configuration in the visible and hidden layers. We define the ratio of such energy to the true ground state energy as the Gibbs sampling GS accessibility.
In Fig. 1a we show how the Gibbs sampling GS accessibility changes during learning. For RBMs trained with CD, PCD and exact gradients, one can see that: (i) they are initialized in a regime where Gibbs sampling does not reach low energy configurations, which is typical for SK spin-glasses; (ii) after training, they end up in a regime where Gibbs sampling is efficiently reaching the low energy sector (especially for machines trained with PCD and exact gradients). This strongly suggests that, after training, the machines are not in the SK spinglass regime. Furthermore, it is clearly apparent that the regime of efficient Gibbs sampling is reached significantly faster with a more accurate gradient estimation (note the logarithmic scale of the horizontal axis). On the other hand, the RBM with RA presents efficient Gibbs sampling from the beginning. This proves that the weight regularization proposed is an effective method of avoidance of SK spin-glass regime.
In Fig. 1b we consider a quantity more relevant during the training process: the proximity of the configurations employed by each method to compute the negative phase, σ i σ j model , to the respective ground states. For the various training methods, we look for the lowest-energy configuration employed for the computation of the negative phase and calculate the ratio of that energy to the ground state energy, obtaining the method sampling accessibility. Note that, when employing the exact gradients, we have an explicit expression for P v (v) and therefore there is no need of taking any samples from the model. Therefore, there is no curve in Fig. 1b for the exact training method. For CD, the Gibbs sampling and method accessibilities are very similar, since the method for computing both is in eesence the same. The method accessibility is slightly better due to the fact that, in this case, the initial configurations before sampling are set to be images to be learnt. A similar penomenon can be observed in the curves for PCD. In this case, the method accessibility is better than Gibbs accessibility due to the fact that the fantasy particles employed in the sampling are always close to the ground state. For PID, the most dramatic improvement is noticed in the initial moments of training. The samples employed for computing the negative phase are much closer to the ground state than in any other method. This leads, as will be explicitly apparent from Fig. 2, to a much faster learning of the dataset. However, in later stages of training, other methods appear to give a better characterization of the ground state. This may, in fact, constitute a problem when learning more complex datasets. However, since the ground state is still accessible via standard Gibbs sampling (recall Fig. 1), one can abdicate Eq. (6) and go back to standard sampling techniques in RA models. In Section III B we elaborate in this matter.
In order to quantify the performance on the Bars learning task, we ask the models to reconstruct corrupted images from the dataset. Given an image A, we fix the top row of pixels (perpendicular to the direction of the bars), which implies fixing a set of neurons in the visible layer of the RBM. We perform Gibbs sampling, allowing for the rest of the visible neurons to be updated. Then, the Hamming distance (HD) between the machine generated image B and the given one A is calculated. We normalize such distance by dividing over the number of pixels of the image, this is, the number of visible neurons V . In this way, for a random reconstruction, a pixel has a probability of 0.5 of coinciding with the desired one, hence being the upper bound of our measure. Following the standard procedure of unsupervised learning, we divide the Bars images into two sets: a training set consisting of 10 images, and a test set containing the remaining 4 images. Note that, in the case of spin values and no biases, the energies of configurations v and −v are the same. Therefore, in order to ensure that there is no "information leakage" from the training set to the test set, we design them in such a way that the negative of every configuration in the training set is also in the training set, and the negative of every configuration in the test set is also in the test set.
In Figs. 2a and 2b we plot how the reconstruction of the training and test sets, respectively, change during training. One observes that: (i) the different training methods for standard RBMs lead to identical memorization (the reduction of the HD in the training set) while generalization (the reduction of the HD in the test set, which is never used in training) is best when training with exact gradients, then with PCD, and finally with CD. (ii) The RAPID approach leads to orders of magnitude faster training when measured in epochs (recall the logarithmic scale of the horizontal axis). In fact, even after the first epoch of training the HD is already notably below the random-reconstruction limit. (iii) There is almost no difference between the performance on memorization and generalization when employing RAPID. This consti- tutes a proof that the corresponding model not only truly learns, but also does it very efficiently. It is important to note that, in this example, PID is not employing any sort of MCMC sampling, so the actual speed up of training time measured in terms of processor operations is even greater.
Finally, one may wonder how similar are RA-and standard RBMs after training. For doing so, we perform a Singular Value Decomposition (SVD) of the weight matrices of trained models. Such results are shown in Fig. 3. Clearly, in all cases, four large singular values stand out. Interestingly, the form of the singular value decomposition of weight matrix (see, for instance, [39,Eq. (10)]) invokes Eq. (5), such that a clear analogy between the patterns employed in RA and the SVD eigenvectors can be drawn. The four large singular values observed in Fig. 3 suggest that only ≈ 4 patterns are needed to describe the weight matrix of a standard machine trained in 4x4 Bars problem, in all cases of training methods studied. We note that SVD analysis of standard RBMs trained in MNIST has been already performed in Ref. [39], where it was reported that the SVD spectrum develops a "tail" of relatively few but large singular values. Taking also into account that the standard RBMs presented on Fig. 1a evolved towards an easy-sampling regime, we can interpret that the RA-RBM may be regarded as an actual general model of trained standard RBMs.

B. Complex datasets: 12x12 BAS and MNIST
We now proceed to apply RAPID for unsupervised learning of more complex datasets. First, we consider the 12 ×12-pixel Bars and Stripes (BAS) dataset, which consists of 8188 images containing only vertical bars or horizontal stripes. As the complexity of the problem to solve increases, one needs to increase the number of auxiliary patterns K and the number of hidden neurons H.
In Fig. 4a we show the results for the HD of reconstructed images. The HD for the training and test set decrease parallel to each other, proving that the model trained is not just memorizing the images of the training set, but learning its fundamental features and being able to generalize the results to the test set. The inset shows images generated from sampling the model starting from an unbiased blank image where v = 0 (recall that the allowed values for the neurons are σ i = ±1). As a powerful print of the generalization power of the model, we see that not only it reconstructs satisfactorily corrupted unknown images (the results on the HD), but moreover it is able to generate images that were not contained in the training set. Remarkably, while standard RBMs were learning 12x12 BAS much slower than 4x4 Bars (note this is standard behaviour), RAPID allowed to learn bigger datasets much faster than smaller ones, as measured by the number of epochs. The explanation for this phenomenon lies in both the larger plasticity of the weights allowed by a larger K and a larger training data set, as each epoch contains much more updating iterations in larger datasets. As a final example, we employ RAPID to train an RBM in the MNIST dataset. Here, the complexity of the dataset is much higher than in the BAS example, and so it requires to increase the size of the machine both in terms of K and H. However, due to its RA construction and PID training, scaling up these parameters does not cause an important impact in terms of computation speed.
This is the first case in which we observe that the lowenergy space characterization of PID is not satisfactory to perform proper learning (recall Fig. 1b). In fact, employing just Eq. (6) for computing the negative phase led to a strong overfitting. In order to avoid this and to achieve a good approximation of σ i σ j model , we employ the patterns ξ (k) as initial seeds for k = 10 steps of Gibbs sampling. The resulting configurations after the sampling are those employed for computing the average under the model. This produces two things: first, Gibbs sampling enhances the proximity of the patterns to the ground state of the model; second, due to the stochastic nature of the sampling, it introduces fluctuations which are shown to help to overcome overfitting. While this procedure brings the cost of computing thje negative phase back to O(N 3 ), the reduction in training epochs consequence of using RA is still present.
We show the results of generating MNIST images in Fig. 4b. In this case it is considerably more difficult to produce a quantitative assessment of the effectiveness of training, since the ground state can not be exactly computed efficiently (therefore, one cannot reliably compute the accessibilities of Fig. 1), and a discrepancy in Hamming distance does not necessarily imply a bad generation of instances. Therefore, in this case we restrict ourselves to present qualitative images obtained by sampling from the model.

IV. DISCUSSION AND REMARKS
We have argued that the SK spin-glass phase occurring in the initial instances of training BMs is an unnecessary complication for learning typical probability distributions. Indeed, we have shown in simple examples that during the training of typical problems, BMs leave the SK spin-glass regime they begin at. We substantiated this claim by studying the evolution of standard RBMs trained with CD, PCD and exact gradient methods for the simple 4x4 Bars dataset. We measured the Gibbs sampling accessibility of low-energy states for such machines and observed radical changes during learning.
Our main result is the RAPID approach, which consists of: (i) Regularizing the Axons (RA) on the model by utilizing the Hebbian rule to construct the weights of a Boltzmann machine by means of K random patterns. The number of hidden neurons in the model is chosen in order to keep the ratio K/N low enough so as to avoid an SK spin-glass regime in the first place; (ii) training via Pattern-InDuced correlations (PID), Eq. (6), to approximate the negative phase in the log-likelihood gradient. Together, RAPID solves in practice the sampling problem of BM for modeling typical probability distributions. The speedup is shown to be large in terms of epochs and even greater measured in number of operations, even when MCMC is used to boost the PID method. Finally, we have proven with several examples that RAPID leads to models that learn very efficiently and generalize training data, outperforming existing training techniques.
Although the cases presented are significant examples, the question on how restrictive the RA construction is for learning general probability distributions still remains. Based on the training evolution of the Gibbs sampling accessibility, and the singular value decomposition of weight matrices of trained RBMs shown in this article and in Ref. [39], we conjecture that trained RBMs are well approximated by RA-RBMs. In that view, the possibility of deficient training of RA-RBMs may only come from a low number of patterns K in the model.
This brings up the question about the interpretation of the patterns ξ (k) . Contrary to the patterns in the Hopfield model, those in RA certainly do not contain just memorized training data. First, because the typical number of patterns in our experiments (8 for the 4 × 4 Bars dataset, 30 for the 12 ×12 BAS dataset, and 200 for MNIST) is much smaller than the size of the training data set or than the number of neurons in the model. Second, because we showed that RAPID effectively learns to generalize beyond the training data. Third, because each pattern has visible and hidden parts-the latter usually much bigger-whereas training images are inputs only for the visible part. An interesting line of future work is therefore analyzing how the information is generalized and actually stored in the patterns, and how to employ this information for discrimination or assisted discovery [40].
In practical implementations of RAPID, we observed that the number of patterns plays a major role in the performance of the model. We observed that a too small K for complex problems may lead to overfitting. On the other hand, increasing K results in the need of increasing H accordingly to avoid the spin-glass phase. This implies an increase on the RAM that, nevertheless, is heavily countered by the speedup in learning time offered by RAPID.
In this paper, we focused on experiments with restricted Boltzmann machines, in order to carefully compare RAPID with standard machines and training algorithms. However, RAPID is by no means restricted to RBMs. In fact we expect that RAPID, or variations of it, will bring the long-sought-after effective sampling and training of deep Boltzmann machines, as the principles behind RAPID can be applied to any energy-based model.
In a model with RA, the weights are not the ultimate parameters to be fixed by training. These are, rather, the values of the auxiliary patterns ξ (k) i . In this appendix we detail the calculation of the update rule for the auxiliary patterns. We focus here in the training of an RBM, just as explained in the main text. We start by recalling that the probability of observing a state v of the visible variables is given by where the free energy is defined from the expression e −F (v) = h e −E(v,h) . As stated in the main text, we will consider here a RBM with no biases. For the case of a binary hidden layer where h ∈ {−1, 1} H , one can give a closed-form expression to it: Our goal is to find the set of parameters (which we call θ for simplicity) such that P v (v) becomes as close as possible to the P data underlying some training dataset T . To compare them we employ the loglikelihood L = − v (i) ∈T P data (v (i) ) log P v (v (i) ). Intro-ducing Eq. (A1) to the previous we find that Expanding this expression and writing it in terms of the free energy, we obtain where for simplicity we have introduced the partition function Z = v,h e −E(v,h) , and |T | denotes the cardinality of T .
Once the loss function is defined, we can update the weights using, e.g., the gradient descent method ∆θ = −λ∂ θ L, which in our case means One can distinguish clearly here the positive and negative phases. The positive phase is the first term, evaluated only on the instances of the training set, while the negative phase is the negative term, that is evaluated on every possible configuration of the visible nodes. In the case of standard RBMs, the ultimate parameter that one desires to fix are the weights W ij . For these, the derivative of the free energy function is where we have assumed the requirements of the models in the main text, namely that we have spin variables (i.e., σ i = ±1) and that all biases are zero. When considering RAPID, the ultimate parameters to be determined are the auxiliary patterns ξ (k) , with which the weights are later computed by using Eq. (5). The gradient of the free energy with respect to the individual pattern neurons ξ Note that, in this case, we employ roman indices for denoting the visible neurons in a pattern, and greek indices for the hidden neurons.
Appendix B: Computational cost of CD vs. PID In this appendix we analytically calculate the computational cost of training RBMs following both PID and CD, the most-used sampler in this machines. The main difference in complexity is the calculation of the negative phase, namely the second term of Eq. (4). In the case of PID, computing this term is trivial as it is the averaged outer product of the auxiliary patterns. However, one needs to take into account the cost of calculating the weights after each update following Eq. (5). This implies multiplying K times a vector of size V +H (i.e. each auxiliary pattern). The complexity of such procedure scales as O(K(V + H) 2 ).
In the case of CD, the most basic algorithm consists on doing k Gibbs steps from an initial random visible configuration. From here, one calculates the activation probability of each hidden neuron h j as where σ(x) = (1 + e −x ) −1 is the sigmoid function. This calculation has a computational cost of O(V 2 H). Note that here we consider the general case where the biases b j can take nonzero values. To complete a Gibbs step, one needs to calculate the value of the visible layer given the hidden vector obtained from Eq. (B1) by which again has a computational cost of O(H 2 V ). Summing up both contributions and taking into account that these procedure is performed k times to approximate unbiased sampling, the total complexity of CD procedure scales as O(k(V 2 H + H 2 V )). Considering that K V, H, this shows the superiority of PID in terms of computational cost.
Appendix C: From continuous updates to discrete patterns Due to the gradient calculated in Eq. (7), ∂ ξ (k) i L, being continuous, the values of the neurons in the auxiliary patterns will also be continuous after the updates are applied. In the following we describe two methods to bring the continuous-valued, updated parameters ξ (k) i back into discrete, real spin configurations: 1. Sign discretization: The first method amounts to simply substitute the value of each of the continuous variable by its sign, i.e., This not only ensures that the auxiliary neurons are binary, but also acts as a regularizer, avoiding the divergence of the ξ (k) i . 2. Value restriction: When training in more complex datasets, the expressivity of the machine can be enhanced by considering that the auxiliary neurons are continuous. In such case, we no longer discretize them. In order to prevent the divergence of the weights, we restrict ξ In both procedures it is crucial to choose when to perform either the discretization or the restriction in the given range. For the former, discretizing too often may result in the erasure of the information learnt by the machine, as the cumulant of the updates before the discretization may have not be as big as to change the sign of a given ξ. Not discretizing often enough may result on a similar result. For example, if the first updates of a ξ (k) i = 1 are positive, subsequent negative updates will have no effect in ξ (k) i , as its value may be very different from zero. This method was applied to the BAS examples shown below, both the 4 × 4 and 12 × 12 datasets, by discretizing the patterns after every epoch of training, this is, after the end of every pass of the full dataset. Note that an alternative solution may be to employ a learning rate large enough so as to permit an appropriate size of the cumulants.
For the latter, i.e. the restriction of ξ (k) i ∈ [−x, x], a similar approach holds. However, given that the effect of such procedure on the value of the patterns is not as dramatic as in the previous case, it can be applied much often. We show the validity of this procedure in the MNIST example. There, the value of the patterns is checked and bound after every update.