Unlearning regularization for Boltzmann machines

Boltzmann machines (BMs) are graphical models with interconnected binary units, employed for the unsupervised modeling of data distributions. When trained on real data, BMs show the tendency to behave like critical systems, displaying a high susceptibility of the model under a small rescaling of the inferred parameters. This behavior is not convenient for the purpose of generating data, because it slows down the sampling process, and induces the model to overfit the training-data. In this study, we introduce a regularization method for BMs to improve the robustness of the model under rescaling of the parameters. The new technique shares formal similarities with the unlearning algorithm, an iterative procedure used to improve memory associativity in Hopfield-like neural networks. We test our unlearning regularization on synthetic data generated by two simple models, the Curie–Weiss ferromagnetic model and the Sherrington–Kirkpatrick spin glass model. We show that it outperforms Lp -norm schemes and discuss the role of parameter initialization. Eventually, the method is applied to learn the activity of real neuronal cells, confirming its efficacy at shifting the inferred model away from criticality and coming out as a powerful candidate for actual scientific implementations.


I. INTRODUCTION
Boltzmann Machines (BMs) [1][2][3] are a class of graphical models, capable of modelling data distributions through the learning of effective pairwise interactions between variables.Once properly trained e.g. through gradient ascent of the likelihood, BMs can be used to generate new data configurations, which hopefully are indistinguishable from the ones in the dataset [4].Good generative performances are, in practice, hindered by different limitations of the inferential problem.
While in theory BMs (with a sufficient number of hidden units) are universal approximators, in practical applications one is limited by the incompatibility between the true unknown distribution of the training data and the energy-based distribution that is employed to fit them.As a consequence, the training process cannot converge to the best possible generative performance, and only selects a good approximate model among many possible similarly good choices [5].Hence, regularization can be used to push the training process to yield specific desidered properties, useful for particular applications of the data analysis.Standard schemes for regularization include L 2 -and L 1 -norm penalties imposed on the pairwise interactions.However, these regularization schemes have a tendency to decrease the values of the interactions, and introduce systematic biases in the model.
Another limitation, which is intrinsic in inferential problems, is the limited availability of training data, possibly leading to the overfitting phenomenon.Informally speaking, overfitting is an over-specialization of the model, which is too focused on the details of the specific training set, rather than capturing the broader statistical structure highlighted by the data.Overfitting is particularly concerning in the so-called high dimensional setting, in which the size of the model (the dimension of the data configuration) is comparable to the amount of available data.Regularization is also supposed to help in avoiding overfitting.
Due to regularization, in a statistical physics language, the inferred interactions become smaller, and the inferred model is implicitly at higher temperature.Correcting this bias requires introducing a temperature parameter smaller than unity when generating new data.This procedure was used in Ref. [6] to generate new viable protein sequences with BMs inferred from evolutionary sequence data.However, this approach is purely empirical and somewhat uncontrolled.It is known from the statistical physics literature that rescaling the energy function by a temperature factor, even close to one, can drastically alter the distribution properties if the system is near a phase transition point [7,8].Hence, inferring models that are robust -i.e.generate similar data -when the energy function is rescaled might be extremely advantageous.
The goal of this study, motivated by the practical issues described above, is to propose a new regularization technique for BM learning that explicitly aims at increasing the robustness of the model under rescaling of its parameters, i.e. the coupling matrix J = {J ij } and the local fields h = {h i }.This new tool, that we named unlearning regularization, appears to be more effective than other techniques, approaching a higher robustness performance of the neural network.The performance of the unlearning regularization is evaluated from artificial data generated by both a Curie-Weiss (CW) and a Sherrington-Kirkpatrick (SK) models.We initially consider a small number of variables: in this way the original parameters are known, the fixed points of the learning equations are reached precisely and the useful quantities can be computed exactly.In addition, higherdimensional and more realistic cases are also considered to validate the results.
A particular limit of this type of regularization coincides with a thermal version of the traditional Hebbian Unlearning algorithm (HU) [9][10][11][12].We thus specifically consider this limit and conclude that HU is able to in-fer the original data distribution with a very good accuracy.The performance shows an optimum in time that scales with the control parameters, as it has been observed when the algorithm is implemented in associative memory tasks [13].We conclude that HU can be interpreted as a two-steps BM learning procedure.
The structure of the article is the following.We first introduce generative modeling and Boltzmann Machines in section II.Some extensively employed regularization techniques (i.e. the L p methods) are then described in section II B. Section III briefly describes the HU algorithm and its traditional use in associative memory models.The new unlearning regularization is then defined in section III B. Its performance is analyzed for two different data sources in section IV: a CW model (section IV A) and a SK model (sections IV B to IV E).In sections IV C to IV E we compare the new regularization method to the standard L p regularization scheme, showing that it substantially improves the robustness under rescaling of the parameters while preserving a high similarity to the original model.The analysis is performed both at convergence of the regularization algorithm and in the condition of highest similarity with the groundtruth model, obtained via early-stopping in the training, either in the small and large N cases.The study of a particular limit of the regularization technique leading to HU follows in section V, which provides some insight on its behavior as a generic inference tool.Eventually, the unlearning method is tested on real biological data in section VI, confirming in practice the validity of our new regularization prescription.

II. GENERATIVE MODELING
Generative models are a class of specific neural networks that are able to learn the probability distribution of a data-set and generate brand new data that exhibit maximum coherence with the same statistics [3,4].Consider a collection of N variables in a vector ⃗ S = (S 1 , ..., S N ).Data are M realizations of such a vector grouped into a set { ⃗ S µ } M µ=1 and we assume that they are sampled independently from a joint distribution P true ( ⃗ S).Given M data, we have access to the frequency of occurrence of a certain variable, i.e. the empirical distribution Generative modeling looks for a model distribution P mod ( ⃗ S| θ), where the parameters θ are inferred from the training data such that P mod is the closest possible to P data .However, P data might still differ from the groundtruth distribution P true .As a remedy, it is useful to reduce the number of degrees of freedom of the problem by designing the model taking into account some guess we might have about P true .For instance one might choose a graphical model, where variables are nodes of a graph [3] with interactions to be inferred.Alternatively, one might add a prior distribution for the variables, such as a mixture of Gaussians of unknown means and unit variances [14].The generative approach is largely implemented across several disciplines such as computational neuroscience [15,16], bio-informatics [17], animal behaviour [18], physical simulations [19], image and text synthesis [20][21][22].

A. Boltzmann Machines
We now describe a specific graphical model of generative neural networks, which will be particularly relevant to the present work.It is inspired by equilibrium statistical mechanics and it is called Boltzmann Machine (BM) [1][2][3]23].
Consider a fully connected network of N binary Ising variables ⃗ S ∈ {−1, +1} N with the following energy function where J ij are symmetric couplings and h i are the fields acting on each neuron site.Even if several types of architectures are collected under the title of BMs, we will stick to the simplest model, the one endowed exclusively with visible input neurons that interact reciprocally and are subjected to external fields.In fact, our goal will be analyzing the evolution of the interactions between real neurons, which are not necessarily present in other types of machines.We know, from statistical mechanics, that such a system at equilibrium at a temperature β −1 obeys the following Gibbs-Boltzmann joint probability distribution with Training a Boltzmann Machine means finding the parameters J and h that minimize the distance between P data given by Eq. ( 1) and P mod given by Eq. (3).Note that as a consequence, β can be set to one in the training without loss of generality.In practice, training can be achieved by minimizing the Kullback-Leibler divergence between P data and P mod , i.e.

L(J
which is equivalent to maximizing the cross-entropy of P mod with respect to the empirical distribution P data . Differentiating eq. ( 5) with respect to J ij and h i , we obtain the gradient of the loss, i.e.
where ⟨ • ⟩ data and ⟨ • ⟩ mod are the averages over the respective probability distributions.Therefore, the parameters can be found by iterating the following gradient descent equations with λ being a small positive learning rate.Note that the mean and covariance over the data can be computed upstream, because they only depend on the training data-set; on the other hand, the moments of P mod must be re-sampled at each step of the training process, because their exact calculation would involve a sum over all possible ⃗ S. Sampling can be performed by a sufficient number of Monte Carlo chains at equilibrium at β = 1, which requires an algorithm time that is long enough to ensure ergodicity for each chain.Usually the number of chains should be of the same order of magnitude of M , the number of training data-points, in such a way that the statistical errors on the two averages are comparable.
Once the process converges to the fixed points of Eq. ( 8) and Eq. ( 9) we obtain a moment matching condition, i.e. the first and the second moments of the two probability distributions coincide.While, in principle, the training of a BM is a convex problem that admits a unique solution [2], in practice there are many flat directions such that some initial conditions can push the parameters closer to their optimal configuration.A common choice is, for instance In order to measure the quality of the training of a BM one can measure whether the moment matching condition is met or not.This can be quantified by the Pearson coefficient between the moments of the P mod and P data distributions at the end of the training.Let us collect the two-points correlation matrices ⟨S i S j ⟩ in a vector ⃗ c where each entry runs over the indices i, j > i.Let us group the means ⟨S i ⟩ in a similar vector ⃗ µ where each entry runs over the index i.Then the Pearson coefficients are defined as Altimetric sketch of the typical loss function L minimized by BM learning, as a function of two parameters of the model.Level lines are relative to constant values of the loss.The bottom of the landscape, pointed out by the red star, is found by the unique solution (here chosen to be such that L = 0, i.e. a perfect fit of the data).The dashed line signals all the solutions that differ by a small error ε from the ground truth.Given the same distance from the bottom, colored circles indicate models found by different regularization methods.
Since the loss in Eq. ( 5) is a convex function of the parameters [2], there is a unique Gibbs-Boltzmann distribution that minimizes it.However, in most practical applications the data are limited and as a consequences the moments ⟨•⟩ data are only noisy estimates of the correct ones.Furthermore, the data are not generated by a Gibbs-Boltzmann distribution themselves.Because of this, the loss landscape features many almost flat directions, corresponding to the fact that many different models can fit the data equally well, i.e. reach a loss L = L min + ε with ε a small KL distance from the ground-truth.This idea is sketched in fig. 1. Regularization methods can be used to select specific models in this region of comparable loss, thus providing different inferred systems with different properties, depending on additional requirements imposed on the inferential problem.For instance, some of these models can be sparse or even contain null parameters (e.g.L 0 , L 1 -norm and information based regularization schemes [24]), others are fully connected (e.g.L 2 -norm method).Past literature [7,8,25] highlighted that critical models, i.e. inferred systems being highly susceptible to small changes in the parameters, can be attractive for BM learning.In fact, if we consider training as an homogeneous sampling of the models that minimize L, critical models have a large basin of attraction and are thus sampled most often [8].Criticality can be problematic for data generation: it implies a long correlation time in the Monte Carlo dynamics, slowing down the sampling process; it makes the model susceptible under rescaling of the parameters, reducing its predictivity in real data applications.Hence, avoiding criticality might help increasing generalization.
To summarize, we define the performance of the model according to two properties: accuracy, i.e. the capability of the model to be as similar as possible to the distribution that generated the data; robustness, i.e. the tendency of the model not to change its typical configurations when slightly changing (e.g., rescaling by a common temperature factor) the parameters.The goal of this work is to introduce a new type of regularization in order to find a better compromise between these two requirements.
We now discuss some widely used regularization methods for BMs and the way we can benchmark their accuracy and robustness.As mentioned above, L p regularizations are certainly the most famous regularization methods, also due to their intuitive interpretation.This approach penalises high values of the parameters by adding a L p norm of the same variables in the expression of the loss function.As a consequence, the sparsity of the graph is promoted: the redundant parameters are weakened with respect to the relevant ones.The expression of the loss of a BM with L 1 and L 2 regularizations is given by the following equations where L BM is defined in Eq. ( 5) and γ 1 , γ 2 are two regularization rates.This translates into new updating rules for the parameters, i.e. δh Accuracy can be quantified by the Kullback-Leibler divergence D KL (true|mod) between the inferred model and the original one.Though, this quantity is not symmetric under the exchange of the two distributions, which can be inconvenient to define a distance.We can then adopt a symmetric version of the divergence as sD KL (true, mod) = = D KL (true|mod) + D KL (mod|true).(20) Even if numerical results do not show a significant difference between these two quantities, we will mainly adopt sD KL in our analysis.Sometimes, the KL divergence is not sufficient to quantify accuracy, so one must compare the model correlations and the data correlations.A detailed analysis of these will be provided in the following.
Regarding robustness, it is known that the susceptibility of the system to small variations of temperature can be measured through the specific heat [26], defined as where S β is the entropy of the model at a given inverse temperature β.The fact that inferred neural networks typically display a peak in C v around the value of β used for training implies the vicinity of a critical point (the finite size of the system impedes C v to properly diverge and to show a real criticality).Therefore, since all parameters naturally scale with β in Eq. ( 3), C v is a measure of the sensitivity of the model to a small perturbation of the parameters: high values of C v imply a strong variation of the model entropy under a small variation of the parameters, which suggests strong variations of the inferred statistics as well.Note that other kind of susceptibilities can be defined by taking derivatives of observables with respect to model parameters, but we will focus on the specific heat in the rest of our study.

III. UNLEARNING REGULARIZATION A. Unlearning algorithm
We will now describe an unsupervised algorithm employed in the realm of associative memory models that will be our main focus in the following.Inspired by the brain functioning during REM sleep [27], the Hebbian Unlearning algorithm (HU) [9][10][11][27][28][29][30] is a training procedure for simple associative models.Consider a neural network having the same energy defined in Eq. ( 2).We assume the external local fields to be null, i.e. h i = 0, ∀i.The associative memory task consists in retrieving a set of random neural states { ⃗ ξ µ } αN µ=1 , called patterns, through a zero temperature Monte Carlo dynamics [31] initialized on some corrupted versions of them.Such a rule for the dynamics reads The control parameter α is the load of the model.Associativity is improved by achieving larger basins of attraction of the patterns.
The training procedure starts by initializing the connectivity matrix in the Hebb's fashion as in Eq. ( 10), i.e.J 0 ij = ⟨S i S j ⟩ data , ∀i, j, where ⟨•⟩ data is the empirical average over the patterns.Then, the following routine is iterated at each step t: 1. Initialize the network on a random neural state.
2. Run the zero temperature Monte Carlo dynamics, updating a randomly picked neuron at each time step according to Eq. ( 22), until convergence to a stable fixed point ⃗ S * .
3. Update couplings according to: This algorithm was first introduced in Ref. [9] to prune the landscape of attractors from proliferating spurious states, i.e. fixed points of the neural dynamics not coinciding with the patterns [32,33].This pruning action stabilizes the patterns and increases the size of their basins of attraction.The numerical analysis of Ref. [11] shows that HU approaches the memory performance of a maximally stable symmetric perceptron, which is considered to be a very effective model, up to a critical capacity α c ≃ 0.6.Another important aspect to underline is that HU performs at its best at a given amount of algorithmic iterations that scales with N, α and λ [10,11].After this amount of steps the performance of the model, in terms of perfect retrieval of the patterns and size of the basins of attraction, deteriorates.This implies the necessity of applying an early-stopping criterion for the HU algorithm to perform optimally.We also note that HU is an unsupervised algorithm, in the sense that it does not need to be provided explicitly with the patterns, and only exploits the information encoded in the Hebbian initialization of the couplings.

B. Unlearning regularization
We now propose a new type of regularization that has the objective of imposing robustness under rescaling of the parameters, i.e.
For the model to be robust under a redefinition of the parameters, we can add a regularization term to the traditional BM loss function that shifts the peak of the specific heat (i.e. the critical temperature) away from β = 1, where the data are generated.Hence, let us define the following loss function with being the Kullback-Leibler divergences between the data and the model inferred at an inverse temperature β.Both the cases of positive and negative regularization factor (i.e. for a ≶ 1) will be evaluated in this work.The motivation for using the asymmetric KL distance will become clear in a few lines.The gradient descent equations for this loss function become When a = 1, Eqs. ( 27) and ( 28) coincide with the original BM learning.In the limit a → 0 one has ⟨S i S j ⟩ a→0 → ⟨S i ⟩ 0 ⟨S j ⟩ 0 = 0 and Eq. ( 27) tends to a thermal version of the HU rule performed at β = 1, i.e.
Eq. ( 29) is similar to previous attempts in the associative memory literature [13].On the other hand, when a → ∞, and the learning rate is redefined as λ = λ/a → 0, the algorithm becomes which also resembles the unlearning procedure: the sampling at a = ∞ produces fixed points of the zerotemperature dynamics, which gives back Eq. ( 23).At variance with the standard routine, there is an additional Hebbian input term ⟨S i S j ⟩ data which impedes the couplings to vanish at convergence.At this point, the motivation for our choice of the loss function L(J , h|a) should be clear: we search for an algorithm that interpolates between the BM learning algorithm and HU.In this way HU emerges as a particular limit of a regularization method on BMs.

IV. RESULTS
In order to benchmark our proposed regularization scheme, we first consider a small sized network, i.e.N = {18, 20}, which allows us to consider the limit of infinite number of generated data, i.e.M → ∞, and compute exactly the moments ⟨•⟩ data .We can then reach the fixed points of Eqs. ( 27) and ( 28) precisely, whenever such points are admitted, with no errors due to the finite sampling.We can then compute the observables C v or sD KL more easily, in order to determine the accuracy and robustness of the inference.It must be underlined that our purpose in this framework is not to maximize the generalization properties of the model, because this goal would be trivially reached by the standard BM learning: it is rather to find the model that achieves the best compromise between accuracy and robustness.In the second instance, we generalize our analysis to the case of larger N , where quantities cannot be computed exactly.
For simplicity of notation we rename where we will mainly deal with β = a and β = 1.
The inverse temperature of the data generating model is β = β d .For given β d , we can compute c d ij exactly in the limit M → ∞ by exact enumeration of the model configurations.We consider two kind of data generating models: a mean field fully connected ferromagnetic Ising network (i.e. the Curie-Weiss model) and a fully disordered spin glass network (i.e. the Sherrington-Kirkpatrick model [34]), which both display a critical point in the thermodynamic limit, manifested by a singularity of the specific heat at a given critical temperature.This singularity is smoothed in finite size systems, thus appearing as a peak in specific heat.For both models we will choose β d to be smaller than the real position of the peak of C v (β) because sampling is more efficient in the paramagnetic phase.

A. Curie-Weiss model
We first consider the problem of inferring a Curie-Weiss model (CW), that is defined by the following energy function This model can be fully treated analytically in the finite size case, and the procedure to compute the main quantities is reported in the following.Since fields are zero, by construction of the model, we are interested in the evolution equation for the couplings, i.e.
where we used the fact that all elements of the matrices are identical, because each node receives the same fields from its neighbours.The fixed point equation for the correlation functions from Eq. ( 34) is where the star indicates the value of the correlations at convergence.Moreover we know that where , +1] and we used From now on, we will make use of the following notation for the correlation functions: The second moment of the magnetization with respect to the Gibbs-Boltzmann measure ⟨m 2 ⟩ β can be written as Thus The fourth moment of the magnetization ⟨m 4 ⟩ β can be written as where k(βJ) is the fourth-order correlation among spins.Eq. ( 35) then becomes or equivalently that has be to solved for J, given a and β d .
In the following, we first choose β d = 1 in order to generate data from the paramagnetic phase of a network of N = 20 spins slightly above the critical point, since the specific heat peak of the model at N = 20 appears to be at β > 1.Then, we solve Eq. ( 42) to obtain the coupling J(a) of the inferred model in presence of the unlearning regularization.Once this is done, we analyze the resulting model by computing quantities from the Gibbs-Boltzmann distribution with parameter βJ(a) rescaled by an additional temperature β.The specific heat of the inferred model is computed as Results are reported in fig.2a and show that the specific heat has a peak slightly after β = 1 that shifts progressively when a is decreased as well as increased away from unity.In fig.2b we compare c 1 and c a with the data correlations c d as functions of a.There is no value of a where c 1 = c d or c a = c d , signaling that the inferred model never reproduces exactly the original system, except for the trivial case of a = 1.Moreover, c 1 /c d and c a /c d swap when crossing a = 1, as one could guess by the fact that the regularization factor in eq. ( 25) passes from being negative to positive.Furthermore, in fig.2c we compute J(a) for different values of β d and plot the quantity aJ(a)/β d as a function of a: when a lies in between 0 and 1 the lines are approximately linear, signaling that J(a) is nearly constant in a.The inferred model thus changes slowly upon varying a and this property improves when N is increased.

B. Sherrington-Kirkpatrick model: training
Next, we consider the problem of inferring a Sherrington-Kirkpatrick model (SK), i.e.
(44) We consider a network of N = 18 spins with a data generation temperature β d = 0.4.For clarity of the results, we will use only one realisation of the parameters for the SK, having a peak in the specific heat at a β slightly larger than β d .The parameters are initialised as J (0) = c d and h (0) = 0 ∀i.
The fixed points of the gradient descent Eqs. ( 27) and ( 28) is such that the inferred model does not exactly reproduce the original system, due to the regularization.In fact, when a ≪ 1 the stable fixed point of the equations is c 1 ≃ ac d and c a ≃ 0; when a ≫ 1, one has c a = c d and c 1 ≃ 0. In both scenarios the inferred model (at a = 1) does not show a good accuracy.Note that even if for a ≫ 1 the model scaled by a fits well the data, we do not consider this as a good inferred model because it is not robust upon further lowering the temperature.
For this reason, we also consider an early-stopping criterion for the training, such that the unlearning regularization reaches the best compromise between accuracy and robustness.Let us compute the symmetric KL divergence between the original SK model and the inferred one at different number of training steps.Fig. 3a shows the resulting curve for a model regularized via the unlearning scheme with a = 0.3 and compares the standard Hebbian initialization of the couplings with two other initializations: cost, in fact at each algorithm time step we have where µ d is the average over the off-diagonal elements of c d and µ 1 (t), s 2 1 (t) are, respectively, the average of the off-diagonal elements and the average of the off-diagonal squared elements of c 1 at time t.
Figures 3b and 3c report the evolution of ω and ρ during training.As one can notice, there is a clear correspondence between the local minimum of the symmetric KL divergence and the point where ω = 1 and ρ approaches a local maximum.It is therefore interesting to consider stopping the training at this point (defined in practice by ω = 1), in order to achieve the best accuracy.We note that this is only possible with a Hebbian initialization of the parameters; while a null inizialization never reaches ω = 1, a random one reaches it but with a much lower Pearson coefficient ρ, resulting in poor accuracy.Note that the correspondence between the local minimum of sD KL and the best match of c d and c 1 is not valid for the entire range of a. Specifically, for the rest of our analysis we will avoid values of a that are close to unity, because for this value the correspondence does not hold and we cannot define a good early-stopping criterion.
We also observe that L p regularizations do not converge to the best network in terms of accuracy: even though ρ ≃ 1 at convergence, we find ω ̸ = 1.Nevertheless, we found that L 1 and L 2 regularization schemes with a Hebbian initialization also display the same correspondence between a minimum in sD KL and the point where ω approaches unity and ρ admits a local maximum in time, as shown in Fig. 4.

C. SK model: results at convergence
Before discussing the early-stopping results, we now compare the generalization performance of the unlearning regularization with the L 1 and L 2 regularizations described in section II B, at convergence.Data are generated by the same SK model with β d = 0.4, as in section IV B.
The couplings are optimized until convergence of the training for the three types of regularizations.Three quantities are measured after training: the symmetric Kullback-Leibler divergence sD KL (1, d) between the model at unitary temperature and the original SK model, the specific heat at unitary temperature C v (1) and the slope ω defined as in eq. ( 45).One should keep in mind that ω = 1 is reached only by the trivial choice of the regularization parameters a = 1 and γ 1 = 0, γ 2 = 0; any other choice of these parameters leads to a reduced accuracy, by definition of regularization.Measures are repeated for different choices of the regularization parameters {a, γ 1 , γ 2 } and results are reported in fig. 5. Specifically, fig.5a is a colour plot showing the performance of the network in the plane (sD KL , C v (1)), where each point is coloured according to the slope ω.The three lines correspond to the three regularizations: the unlearning line is generally closer to the origin than the L 1 and L 2 ones, suggesting a better compromise between accuracy and robustness.All the lines converge to the ground-truth when sD KL = 0, corresponding to absence of regularization (a = 1 or γ 1 = 0 or γ 2 = 0).
To better compare the three regularization methods, fig.5b reports the distance of the points reported in fig.5a from the origin directly as a function of the slope ω.Consistently with fig.5a, the plot shows that the set of points relative to unlearning always stands below the L 1 and L 2 ones, while all regularizations tend to the ground truth when ω approaches unity.Different experiments with different data and temperatures β d show very similar results.
One can repeat the analysis by comparing the three regularization schemes at constant standard deviation of the couplings σ J instead of constant slope ω.The results are reported in fig.6, and the unlearning regularization once again reaches a better compromise between accuracy and robustness compared to the other regularization methods.Note that the U and L 2 curves tend to coincide when σ J → 0, i.e. when all inferred couplings are small.This can be explained through a weak coupling expansion of the gradient equations for U and L 2 .When σ J ≪ 1 the two-point correlation functions can be Taylor expanded as The expansion of Eq. ( 27) leads to Since σ J is small when a ≪ 1, in this limit we obtain which at the fixed point becomes where σ d is the standard deviation of the empirical correlation matrix.By repeating the same procedure with Eq. ( 18) one obtains which at the fixed point becomes for γ 2 > 1, which is justified by the fact that σ J is small when γ 2 is sufficiently large.Numerics support the fact that, when σ J ≪ 1 and σ J is the same in fig. 5 for both U and L 2 , then a(γ 2 − 1) = 1.Our results show that the HU algorithm, which derives from the unlearning regularization when a → 0, can also be obtained via a L 2 regularization with a large γ 2 .

D. SK model with early-stopping
The early-stopping criterion described in section IV B is now employed to improve the performance of the inferred model.The training is stopped when ω ≃ 1.For the unlearning regularization, we consider a range of values for a such that there is a good correspondence among the minimum of sD KL and ω, ρ reaching unity: outside the considered interval these three quantities might not behave consistently, and we could not define a good earlystopping criterion.Measures of the symmetric Kullback-Leibler divergence sD KL (1, d), the specific heat at unitary temperature C v (1) and the standard deviation of  higher robustness at comparable accuracy.We observe a range of values of γ 1 and γ 2 for which the L 1 and L 2 regularizations reduce the specific heat in β = 1, while the remaining choices for the regularizers do not allow to obtain a higher robustness with respect to the original SK model, i.e.C v (1) is increased.Different calculations with different data and temperatures β d produce very similar results.Finally, we show that the unlearning regularization helps generating new data.Since the inferred model is less critical than the original one, the correlation between network configurations sampled at different time steps of a Monte Carlo chain at unitary temperature are smaller.Fig. 8 shows a comparison between the time-correlation functions corresponding to the inferred system at a = 0.4 and the original SK model.The inset displays the behaviour of the specific heat C v for the same two models.

E. SK model with N ≫ 1
In this section, we test the method on an instance with larger number N of variables, showing a good efficiency of the Unlearning method, both run until convergence and early-stopped.A number M = 10 4 of data-points are sampled from a SK model with N = 200 spins in equilibrium at a temperature β d = 0.4 (i.e. in the paramagnetic phase).The network is then initialized in Hebbian fashion, i.e.J (0) = c d and h (0) = 0.The model is trained via Unlearning, L 1 and L 2 regularization methods and the specific heat C v (1), the slope ω(c 1 , c d ) and the Pearson coefficient ρ(c 1 , c d ) are measured through MCMC sampling at unitary temeperature.Two different training prescriptions have been adopted for the study: the algorithm is executed until convergence of equations ( 27), (28); training is early-stopped as soon as ω ≃ 1.
Figure 9 depicts the values of ρ as a function of ω and C v (1) as achieved through the three regularization methods for various values of the control parameters a, γ 1 , γ 2 .At convergence (see fig. 9a), the Unlearning algorithm appears to approach a better accuracy performance (i.e. in terms of both ρ and ω) together with a lower value of the specific heat.Even after early-stopping (see fig. 9b) the same behaviour emerges: there is a range in a where, given the same degree of accuracy, the model achieves a smaller specific heat with respect of the other regularization methods.We also stress that, by contrast with the rest of our analysis, this experiment provides for a specific heat C v (1) obtained at early-stopping which is lower than the one reached at convergence.This aspect might be due to the Hebbian initialization of the parameters in a regime which is very far from the perfect-sampling one that we considered before.
It should be noted that the standard deviation σ J of the couplings is chosen to have the same order of magnitude, across all three regularization methods.Both in the converged and early-stopped scenario, the Unlearning algorithm is superior to the L p regularizations, confirming the results obtained at small N (i.e. the perfect sampling framework).Moreover, the early-stopping technique used in this work, which relies on the Hebbian initialization of the network, results handily applicable for larger dimensions due to its reduced computational cost, hence providing an effective training recipe in terms of both accuracy and robustness.

V. THE HEBBIAN UNLEARNING LIMIT
We will now study the specific limit of the unlearning regularization that leads to an algorithm which strongly resembles the HU routine.The inferential performance of the learning procedure is examined by measuring the distance between the inferred model and the original one at different training steps.Results show, for the first time, that HU can be employed as an inferential tool, due to the beneficial effect of the Hebbian initialization of the couplings.We advance an explanation for its perfor-mance that is supported by further numerical evidence.In the limit a → 0 the updating rules for the parameters transform into Eqs.( 29) and (30).As mentioned before, Eq. ( 29) strongly resembles the traditional HU algorithm of Ref. [9].By contrast with the original rule, the new regularization samples configurations at β = 1 instead of stable fixed points of the neural dynamics, and makes use of a thermal average, rather than summing each contribution each time.We will keep dealing with small networks so that, given the initial conditions for the parameters, the loss function of the problem can be minimized exactly.As a numerical experiment, we train a BM with N = 18 neurons to learn a SK model at β d = 0.4.The unlearning regularization is applied with a = 0 and the usual initial conditions for the parameters, i.e.J (0) = c d and h (0) = 0.
Even if training is performed over both the couplings and the fields, according to the rules ( 29) and ( 30), the fields h do not appear in the energy of the original model, hence we will focus exclusively on the evolution of the interactions J .Fig. 10a displays the standard deviation of the couplings: as we can see the general trend shows an exponential decay of the interactions, as one typically observes in the HU algorithm [10].The decay to zero is implied by the fact that the fixed point of the gradient descent equations for Eq. ( 29) corresponds to c 1 ≡ 0. Fig. 10b depicts the symmetric Kullback-Leibler divergence between the Gibbs-Boltzmann distribution of the original SK model, from which data were sampled, and the inferred model with β = 1.The sD KL is high at the beginning, then decreases until reaching a global minimum signaling an optimum in the inferential performance, then it increases again and stabilizes at a plateau.The minimum is sufficiently low to indicate a good statistical consistency between the model and data.The Pearson coefficient between the correlation matrix of the data c d and the one of the inferred model at unitary temperature c 1 is measured step-by-step in the learning process and reported in fig.10c: as one can notice, there is a global maximum near the position of the global minimum of the sD KL .The inset in fig.10c displays the good agreement between the two matrices, the inferred c 1 and c d , at the position of the peak.This analysis suggests that the unlearning regularization with a = 0 displays two algorithmic regimes: a transient regime where the model at β = 1 shows a good statistical agreement with the generating model; another regime where c 1 → 0, and the total performance deteriorates.The first transient regime is the most important one, because it suggests that HU can be considered as an inferential tool, aside of its associative memory use.We know from section II A that in BM learning data are shown to the model each time-step of the algorithm and this imposes the moment matching.The co-existence of a positive Hebbian term and a negative unlearning one in the traditional BM learning has already been pointed out in Refs.[1,35].In fact, the variation of each pair of couplings J ij is given by where we neglected the dependence on time, and We can interpret each step in the minimization of L as the result of two contributions, one constant quantity deriving from the data, and another one deriving from the evolving model.While a standard BM can start wherever in the space of the parameters and ends up at the fixed point of eq. ( 52), in the HU case the update is performed by using only the negative unlearning contribution.Nevertheless, the data have been seen by the model, specifically through the Hebbian choice of the initial conditions.This observation suggests that HU approaches the minimum of the loss function in two temporally separated steps: by first descending along the data direction and then, progressively, along the model one.As a consequence, we can assume that such a two-step minimization is similar to standard BM learning as long as for most of the pairs i, j.To test whether this condition holds while training a BM regularized with a = 0, and initialized with J (0) = c d , we measure the standard deviation of the ratio between the elements of c 1 and c d as a function of time.
Results are presented in fig.11a for the usual data-set sampled by a SK model.The standard deviation of the ratio of the elements of the two matrices c 1 and c d is plotted as a function of tλ for different choices of λ.The curves collapse well from λ smaller than O(10 −2 ).The system starts with c 1 generally being much larger than c d , which satisfies the condition (55).Around t ∼ 0.2λ −1 the curves reach a local minimum, then increase until a local maximum at t ∼ 0.29λ −1 , to start a slow decay right after.Both these two stationary points are located where c 1 and c d share the same order of magnitude, i.e.where the condition for the two-steps minimization of the loss ceases to hold.However, only the second stationary As one can notice, the generalization of the model increases when λ is small: we can imagine that the smaller is λ the stronger is the effect of the initial overshoot along the data direction.Different parameter initializations can now be compared.Specifically, we run the standard BM learning (i.e. the unlearning regularization with a = 1) after a Hebbian, random, and all-to-zero initialization of the couplings.The Hebbian and the random initializations have the same initial standard deviation of the couplings σ J .At this stage we measure the evolution in time of the standard deviation of the ratio c 1 /c d and the value of the loss.In addition to these two quantities, we introduce another observable that quantifies the degree of alignment of the gradient ⃗ δJ(t) = − ⃗ ∇L(t) to the direction connecting the initial state of the couplings, i.e.J (0) , and the convergence state J d = β d J SK .Such quantity is defined as The evolution of these observables are reported in fig.12.Note that, for the random initialization, measures are averaged over fifteen realizations of the initial matrix.As one can observe from fig. 12a, the highest dispersion of the elements of c 1 /c d is obtained through the Hebbian initialization; the random initialization also begins with the elements of c 1 being on average larger than the empirical data correlations.Yet, as one can see from fig. 12b, the loss decays generally faster in the Hebbian case, with respect to the random case.Figure 12c allows to interpret the behavior of the learning dynamics in terms of the gradient descent: the Hebbian initialization points the gradient ⃗ δJ towards the minimum of the loss, located in J d ; on the other hand, the random initialization does not start well in its descent, and progressively adjusts the trajectory.The wrong start of training after random initialization can be also visualized in the inset of fig.12b, displaying a projection of the trajectory over the first two components of the J matrix, with arrows indicating the direction of the gradient descent.Only one experiment is reported for the random initialization since the others showed very similar behavior.Conversely, when the system learns from zero initial couplings, both the positive and negative contributions to the learning appear to be small and comparable in value, which implies that the implementation of pure unlearning is ineffective in this case; the descent of the loss is slower with respect to the Hebbian initialization; the trajectory is already sufficiently pointing to the bottom of the landscape, and progressively adjusts itself.We conclude by stating another important remark.The unlearning trajectory and the standard BM learning look very similar in the first stage of training, in terms of the gradient descent, as predicted by Eq. (55).After a certain amount of iterations that we have identified above, the unlearning trajectory deviates towards J = 0, while the BM finally converges to the final target.The abrupt change of direction that is visible from both the two-dimensional projection of the trajectory and the evolution of the gradient might be related to the choice of the data-set.We have thus showed that, when the network is initialized in the Hebbian fashion, the dominant contribution to the BM learning is the unlearning one, and the gradient of J points towards the minimum of the loss function.Hence we showed that the HU algorithm works as well as an inferential device than as a training procedure for associative memory models.

VI. A REAL-WORLD APPLICATION
This section is dedicated to the testing of the performance of the Unlearning regularization on real data.We train a BM on a data-set available from Ref. [16], reporting the activity of N = 40 retinal ganglion neurons reacting to visual stimuli.We firstly binarize each data vector by discretizing a time interval of 0.1s into bins ∆τ = 20ms, as done in [7,16].As a consequence, each data-point is a sequence of S i units, representing the instance where the i − th neuron has fired (S i = +1) or not fired (S i = −1) in given time bin.Each time bin is thus associated to a data-point of our binary data-set.We are provided with several repetitions of the experiment over the same population of neurons, so that we can train the model with M = 10 5 data-points in total.
The model is trained as a BM and regularized over couplings and fields through the Unlearning method with different choices of the parameter a.The profile of the specific heat C v is measured as a function of β and reported in Fig. 13.Lines with dots represent a machine trained until convergence of the algorithm.The curve obtained at a = 1, corresponding to pure BM learning, appears consistent with the one reported in [7].For each a < 1 the peak of the specific heat shifts away from unit, and its value in β = 1 is lowered by the regularization, suggesting a higher stability of the inferred network under a rescaling of couplings and fields.The small dots report the specific heat measured under the usual earlystopped prescription when couplings and fields are initialized as in equations ( 10), (11): even when N becomes large there is an optimum amount of iterations where the Pearson coefficient between c d and c 1 is high and the slope ω reaches unity.As underlined by the figure, at this state of the network the specific heat is still significantly lower than the one obtained with no regularization, corroborating the importance of a Hebbian initialization for reaching the optimal level of accuracy and robustness of the model.

VII. CONCLUSIONS
In this study we have shown that: • In the context of Boltzmann Machines, and specifically for what concerns the robustness under rescal- ing of the parameters of the inferred network, the standard L 1 and L 2 regularization techniques are outperformed by the unlearning regularization.This result is valid for both cases where training is stopped earlier, to maximize the similarity between the inferred model and the ground-truth, and at convergence of the algorithms.
• A particular case of the unlearning regularization (i.e. for a vanishing parameter a) reproduces a thermally averaged Hebbian Unlearning rule, that shows good inferential capabilities when initialized in the standard Hebbian fashion.Hebbian Unlearning can be interpreted as a two-steps Boltzmann-Machine learning and it is well reproduced by a L 2 regularization with a high regularization rate.
In this case, our study stresses the necessity of early-stopping the training procedure after a specific amount of iterations where the performance reaches its optimum.
The goal of this article was thus to show the effectiveness of unlearning as a form of regularization in Boltzmann Machines.The analysis has been carried out on both small networks, that allowed to compute all quantities without error by exact enumeration, and large networks, generally affected by statistical noise.For the most part of our work the ground-truth distribution was defined by one of two models, either the Curie-Weiss or the Sherrington-Kirkpatrick models, which are known to show a phase transition at a critical temperature.As a result, we could not only test the proximity of the inferred model to the original one, but also the susceptibility of the network under variations of the parameters, by comparing the trend of the specific heat of the ground truth model with respect to the inverse temperature, with the trend obtained after inference with regularization.In addition to this controlled framework, a real-world case, with data sampled from real neuronal activity, has been considered to confirm the results.Both heuristic and empirical arguments support the idea that models inferred via Boltzmann Machine learning tend to be pathologically critical [7,8,16], i.e. less effective in sampling new examples.On the other hand, the regularized model is much less critical than the one originally learned in [16] from the same data-set.Furthermore, we have displayed that even in the high-dimensional case one can characterize the performance of the algorithm in time by tracking two quantities: the Pearson coefficient and the slope of the empirical covariances versus the thermal correlations.These two observables are fast to compute and allow to establish an early stopping criterion that, combined with a Hebbian initialization of the parameters, reaches a greater generalization and robustness performance.We conclude that, given a high consistency between the model and the ground-truth distribution, the stability of the model under rescaling of the parameters is enhanced by the unlearning regularization.As an alternative strategy to the unlearning regularization presented here, we also evaluated the minimization of the KL distance between the models at β = 1 and β = a.Even if the gradient descent equations look similar to the current case, they contain the contribution of four-point correlations among spins, resulting in a significant increase of the computational cost of the training.We also analyzed the particular limit of the regularization that gives as learning rules the ones expressed in equations ( 29) and (30).The resulting procedure strongly resembles the traditional Hebbian Unlearning routine, but with a thermal two-point correlation computed at β = 1 replacing that computed on the zerotemperature fixed points of the dynamics.Previous work [13] supports the fact that the associative memory performance of such a thermal unlearning does not deviate much from the original rule.Unlearning is not new to be mentioned in the statistical inference realm [35][36][37], yet we are the first to consider the original rule proposed by Hopfield in [9].As a new contribution to the characterization of unlearning, we tested its inferential power, i.e. the capability of learning the original model.So far, to the best of our knowledge, the original unlearning algorithm was exclusively tested in an associative memory framework, where data are actually patterns, that are sub-dominant in number with respect to the entire set of possibile configurations.Our results show a good inferential performance of the algorithm.The explanation for this success relies on the choice of the initialization of the parameters.A Hebbian initialization of the network has two implications.First, the contribution to the gradient of the loss deriving from the data (i.e. the positive Hebbian contribution in the Boltzmann Machine learning in eq. ( 8)) is much smaller than the contribution deriving from the cross-entropy of the model given the data (i.e. the negative unlearning contribution in the Boltzmann Machine learning).This condition holds until an optimal amount of iterations that can be estimated from the numerical simulations.Second, we showed that a Hebbian initialization of the couplings points the gradient of the parameters towards the correct direction, i.e. the one of the ground-truth of the inferential problem, at variance with other kinds of parameter initialization.The beneficial effects of a Hebbian initialization for these type of machines has already emerged in the very recent literature [12,38].Yet, the topological properties of the Hebbian free energy landscape that make them such a good starting point for training have not received due attention.
We conclude that Boltzmann Machine learning can be performed in two steps: an initial overshooting along the direction of the data and then a gradual adjustment along the direction of the model with increments being updated at each step of the algorithm.The effectiveness of a twostep training for a Boltzmann Machine encourages a biological interpretation of this kind of training, in agree-ment with the hypothesis that synapses in the brain could be fine tuned through the alternation of daily online experience and offline sleep [39,40].Such a conjecture is inspiring members of the A.I. community in the development of training procedures for deep neural networks that might substitute the slow and non-biological backpropagation techniques [35,41].These observations, together with the importance of the Hebbian initialization of the network, appear to be consistent with previous results in literature concerning the unlearning algorithm and its implementation in associative memory modeling [10][11][12]38].
FIG. 2. Thermodynamic quantities computed for a small network trained on the configurations of the Curie-Weiss model with N = 20 spins at a given β d .(a) The specific heat of the inferred model for β d = 1 is reported as a function of the inverse temperature β, at different values of the regularization parameter a. Recall that a = 1 corresponds to standard BM learning that reproduces the original model exactly.(b) A comparison between c1, ca and the empirical data correlation c d when inferred at different values of a.The only point where c1 = ca = c d , i.e. a = 1, is marked by a red circle.(c) Inferred coupling multiplied by a/β d as a function of a at different values of β d .The point a = 1 where c1 = ca = c d is marked by a red circle.

FIG. 3 .FIG. 4 .
FIG. 3. (a) Symmetric KL divergence (in log scale), as a function of time tλ, between the inferred Gibbs-Boltzmann distribution at unitary temperature and the one describing a SK model at β d = 0.4.(b) Slope ω obtained from the linear fit of the elements of c d versus the elements of c1, as a function of time t.(c) Pearson coefficient ρ between the correlation matrix c1 and the data correlation matrix c d as a function of time t.The full line reports the case of Hebbian initialization of the couplings; the dotted line corresponds to a random initialization; the dashed line represents the initialization to J (0) = 0.The gray vertical line signs the location of the minimum of sDKL for the Hebbian inizialization.Choice of the parameters: N = 18, a = 0.3, λ = 0.06.

FIG. 8 .
FIG. 8. Time correlation between network configurations sampled by a MCMC at unitary temperature when a = 0.4 and a = 1 (i.e. the original SK model), in semi-log scale.Errorbars are the propagated standard deviations of the measures.The parameters are found by the early-stopping procedure described in the text.The sub-panel reports the specific heat obtained for the same choices of a. Choice of the parameters: N = 18, β d = 0.4, λ = 0.01.

FIG. 10 .
FIG.10.Three relevant observables to benchmark the inferential performance of the unlearning regularization in time with a = 0: the standard deviation of the couplings σJ , the Kullback-Leibler divergence between the original SK model and the inferred one at β = 1, the Pearson coefficient between the data correlation matrix c d and c1, i.e. the correlation matrix for the model at β = 1.Choice of the parameters: N = 18, λ = 0.06, β d = 0.4.

FIG. 11 .FIG. 12 .
FIG. 11.(a) Standard deviation of the ratio between the elements of c1 and c d as a function of the normalized time tλ; the subplot reports the values of tλ associated to global minima of the symmetric Kullback-Leibler divergence (sDKL) between the generating model and the inferred one for different choices of λ; both the vertical and horizontal gray dotted lines report the mean of the points in the inset.(b) Global minimum of the symmetric Kullback-Leibler divergence between the generating model and the inferred one as a function of λ in log-scale.Choice of the parameters: N = 18, a = 0, β d = 0.4.

FIG. 13 .
FIG. 13. Specific heat of a model inferred from the activity of N = 40 real neurons as a function of the inverse temperature β.The circles report the measure at convergence of the Unlearning regularization; the dots report measures when the algorithm is initialized in a Hebbian fashion and early-stopped when ω ≃ 1. Error bars are neglected for clarity purposes.Choice of the parameters: N = 40, M = 10 5 , λ = 10 −3 .