Regularization, early-stopping and dreaming: a Hopfield-like setup to address generalization and overfitting

In this work we approach attractor neural networks from a machine learning perspective: we look for optimal network parameters by applying gradient descent over a regularized loss function. Within this framework, the optimal neuron-interaction matrices turn out to be a class of matrices corresponding to Hebbian kernels revised by a reiterated unlearning protocol. Remarkably, the extent of such unlearning is proved to be related to the regularization hyperparameter of the loss function and to the training time. Thus, we can design strategies to avoid overfitting that are formulated in terms of regularization and early-stopping tuning. The generalization capabilities of these attractor networks are also investigated: analytical results are obtained for random synthetic datasets; next, the emerging picture is corroborated by numerical experiments that highlight the existence of several regimes (i.e., overfitting, failure and success) as the dataset parameters are varied.


Introduction
The Hopfield model is probably the best-known example of attractor neural network [1][2][3]: it consists of a set of N binary units, referred to as neurons, that interact pairwise and whose state is iteratively updated by a non-linear activation function, in such a way that the new state of neuron i depends on the signal acting on i and stemming from the neighbouring neurons. A suitable choice of the neuron interaction matrix, denoted as J ∈ R^{N×N}, can ensure the attractivity of a number of patterns that we want to store and retrieve. More precisely, one initializes the neural configuration by setting the values of the units close to a pattern ξ^1 ∈ {−1, +1}^N; this configuration represents the input supplied to the machine and may correspond to a corrupted version of ξ^1. Repeated updates of the neurons are then performed until convergence to a fixed point and, if this matches ξ^1, the final state is interpreted as the retrieval of the information encoded by ξ^1. The same is expected to occur for any target pattern, say ξ^μ, with μ = 1, ..., P.
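To fix ideas, the following is a minimal sketch of this retrieval mechanism, assuming the classical Hebbian coupling matrix (introduced formally below) and random Rademacher patterns; sizes, corruption level and update schedule are illustrative choices of ours, not prescriptions from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
N, P = 200, 10

# Rademacher patterns (columns) and the classical Hebbian coupling matrix
xi = rng.choice([-1, 1], size=(N, P))
J = (xi @ xi.T) / N
np.fill_diagonal(J, 0.0)

def retrieve(J, sigma, rng, max_sweeps=100):
    """Sequential sign updates until a fixed point is reached."""
    sigma = sigma.copy()
    for _ in range(max_sweeps):
        changed = False
        for i in rng.permutation(len(sigma)):
            s = 1 if J[i] @ sigma >= 0 else -1
            if s != sigma[i]:
                sigma[i], changed = s, True
        if not changed:   # fixed point: no unit wants to flip
            break
    return sigma

# Supply a corrupted version of pattern 0 (10% of entries flipped) as input
sigma0 = np.where(rng.random(N) < 0.1, -xi[:, 0], xi[:, 0])
sigma_inf = retrieve(J, sigma0, rng)
print("overlap with target:", sigma_inf @ xi[:, 0] / N)
```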
The popularity of the Hopfield model is also due to the fact that it is amenable to analytical treatment and, in particular, it can be recognized as a spin glass, in such a way that it can benefit from a broad collection of techniques developed to address disordered systems. Indeed, in the last decades, the model has been intensively investigated, and countless variations on the theme have also been considered. Remarkably, a significant fraction of these works spotlighted the structure of the neural interaction matrix: in the standard Hopfield model this is based on the so-called Hebb's rule [19], J_ij = (1/N) Σ_{μ=1}^P ξ^μ_i ξ^μ_j, and suitable revisions of this rule can give rise to better performance of the model in terms of the number of storable and retrievable patterns [20]. A successful class of these revisions applies unlearning protocols (see e.g., [21][22][23][24][25][26][27][28][29][30]), whose aim is to impair the attractiveness of configurations that do not correspond to any of the stored patterns (the convergence to those configurations, sometimes referred to as spurious states, is interpreted as a mistake of the machine). More recently, Hebb's rule has also been revised to make it closer to a learning algorithm: the "reality" that we want to retrieve is now not accessible; instead, a corrupted sample, say {ξ^{μ,1}, ξ^{μ,2}, ...}, is available and used to build J (see e.g., [31][32][33][34]). This way, the available sample of data can be interpreted as a training set and the Hopfield model can be employed for generalization tasks. The bridge between a retrieval scenario and a machine learning setting has also been strengthened by leveraging the equivalence between the Hopfield model and Boltzmann machines (see e.g., [35][36][37][38][39][40]). However, a full mapping allowing for the role of regularization parameters and the emergence of overfitting or underfitting phenomena is still under construction (see e.g., [41][42][43][44][45][46]).
In this paper, we try to contribute to filling this gap, focusing on an unsupervised reconstruction problem: the dataset is made of a set of items belonging to different classes (the label being veiled) and we introduce a loss function for the neuron interaction matrix. The solution of our problem corresponds to a Hebbian kernel subjected to a certain amount t_d of unlearning iterations and we prove that t_d is related to the regularization hyperparameter, which, in turn, can be related to the training time in the un-regularized version of the problem. This framework allows us to inspect the emergence of overfitting phenomena and therefore to conceive recipes for an optimal training time. Specifically, the system stores each pattern as a minimum of the Lyapunov function associated to the neural dynamics (this can be interpreted as a cost function or as an energy function); minima corresponding to the same class form a cluster and, when the number of examples per class is large enough, these minima coalesce into a single minimum. In this scenario, both intra-class and inter-class correlations emerge, and we find that the role of t_d (or, equivalently, of the regularization or of the training time) is to disentangle such correlations starting from the lowest ones: as t_d is increased, the minima corresponding to different classes are shifted, their overlap is reduced and the system becomes able to generalize from examples; by further increasing t_d, minima corresponding to patterns belonging to the same class get shifted too and the system starts to overfit.
In what follows, we detail these results by first introducing the loss function associated with our problem and showing that the neural interaction matrix that minimizes the loss function corresponds to the revised Hebbian kernel studied in [27,47] (Sec. 2). Subsequently, we find a relation between regularization, training time and unlearning time, and we present numerical experiments on structureless and structured datasets (Sec. 3). These results constitute the premises for a thorough discussion on the emergence of overfitting phenomena (Sec. 4) and a corroboration by numerics (Sec. 5). Lastly, we conclude our paper by offering a concise outlook and final remarks (Sec. 6). Technical details are collected in the Appendices.

From the stability condition to the minimization problem
The basic principle underlying an attractor neural network is that each pattern making up the set ξ = {ξ^μ}_{μ=1}^P and encoding relevant information is associated to an attracting fixed point for the network dynamics. We assume that patterns are N-dimensional binary vectors and that the network units σ = (σ_1, ..., σ_N) ∈ {−1, +1}^N interact pairwise as specified by the symmetric coupling matrix J; then, we set up the network evolution as

σ(t + 1) = sgn(φ(σ(t))),    (2.1)

where φ_i(σ(t)) ≡ Σ_{j≠i} J_ij σ_j(t) represents the signal reaching unit i at time t and the sign function acts component-wise. This dynamics is applied sequentially and exhibits the following Lyapunov function (see e.g., [48])

L(σ) = −(1/2) Σ_{i≠j} σ_i J_ij σ_j.    (2.2)

In this work, we will retain a deterministic evolution and, for the moment, we take J as quenched and we exclude the presence of biases. In this framework, the stability condition for a retrieval configuration, e.g., σ(t) = ξ^μ without loss of generality, reads as

ξ^μ_i φ_i(ξ^μ) > 0, for i = 1, ..., N,    (2.3)

in such a way that, if at time t the system is prepared or happens to be precisely in that state, it will remain trapped there for all t′ ≥ t. Of course, the fulfilment of (2.3) implies that J must be a suitable functional of ξ. More generally, one is interested in assessing the convergence to a retrieval configuration even when the input quality is relatively low, namely, even when the initial configuration is relatively far (in the Hamming sense) from the target pattern. In this case, one asks for a stronger condition, that is

ξ^μ_i φ_i(ξ^μ) ≥ κ > 0, for i = 1, ..., N,    (2.4)

which means that, although σ(t) displays some discrepancies with respect to ξ^μ, the dynamics (2.1) is still ensured to converge to ξ^μ. In this inequality (corresponding to the basic requirement in Gardner's theory [49][50][51]), κ controls the width of the attraction basins, that is, if σ belongs to a Hamming ball B(ξ^μ, R(κ)) centered in ξ^μ with radius R(κ), the network response will be f(σ) = ξ^μ, with f being the transfer function f(σ) = lim_{n→∞} T^n(σ) and T(σ) = sgn(φ(σ)) the 1-step dynamics. Increasing κ, the stability criterion will be satisfied in a ball with larger and larger radius R(κ) surrounding the patterns, but this comes at the cost of a smaller amount of storable information vectors, resulting in a lower critical storage capacity. In particular, for symmetric networks (J = J^T) the largest number of patterns that can be retrieved is N [49]. In order to satisfy the inequality constraint (2.4), we can impose a (stronger) equality condition requiring that, given γ ≥ κ,

ξ^μ_i φ_i(ξ^μ) = γ, for i = 1, ..., N and μ = 1, ..., P,    (2.5)

with γ being the same for all the patterns. The latter point could appear a rather strong assumption but, at least in the random theory, where patterns are all equivalent as their entries are i.i.d., it is reasonable. The requirement (2.5) has important technical consequences. First, if the patterns are Boolean, it can be rewritten in a more transparent form as Σ_{j≠i} J_ij ξ^μ_j = γ ξ^μ_i, which is nothing but Personnaz et al.'s stability criterion [22,51]. Further, we can remove the constraint on the absence of self-interactions and allow for j = i in the last sum, thus recasting the previous expression as an eigenvalue problem, J ξ^μ = γ ξ^μ; this means that the coupling matrix is designed so that the patterns are eigenvectors with degenerate eigenvalue γ.
In the general case, we can add an external field by replacing the local internal field in (2.4) with the total field, i.e., φ_i(σ) → Σ_j J_ij σ_j + h_i. The previous arguments still hold and, for μ = 1, ..., P, our problem takes the form

Σ_j J_ij ξ^μ_j + h_i = γ ξ^μ_i, for i = 1, ..., N.    (2.6)

Therefore, in this context, training a network implies finding an arrangement for J and h such that (2.6) holds; this can be recast into the minimization of a Mean-Squared Error (MSE), in such a way that we can set up the minimization procedure for a loss function reading as

L_ξ(J, h) = (1/2P) Σ_{μ,i} (γ ξ^μ_i − Σ_j J_ij ξ^μ_j − h_i)^2 + (1/2P) Σ_{μ,i} (γ ξ^μ_i − Σ_j J_ji ξ^μ_j − h_i)^2 + ϵ_J Σ_{i,j} J_ij^2 + ϵ_h Σ_i h_i^2,    (2.7)

with ϵ_J, ϵ_h ∈ [0, +∞]. The second term on the r.h.s. is obtained starting from the first one and reverting the roles of i and j; these two contributions account for the stability criterion and the symmetry constraint in a non-rigid way. The two successive contributions are L_2-regularization terms for, respectively, J and h, which protect the norms of these parameters from divergence during training, as they are confined by the quadratic potentials.
The explicit form of the solution of the constrained system (2.6) can be achieved via the gradient descent method, J̇ = −∇_J L_ξ and ḣ = −∇_h L_ξ, which yields

J̇ = 2γΩ − (JΩ + ΩJ) − (h ξ̄^T + ξ̄ h^T) − 2ϵ_J J,    (2.8)
ḣ = 2γ ξ̄ − (J + J^T) ξ̄ − 2(1 + ϵ_h) h,    (2.9)

where

Ω_ij = (1/P) Σ_μ ξ^μ_i ξ^μ_j  and  ξ̄_i = (1/P) Σ_μ ξ^μ_i

are, respectively, the Hebbian kernel (with respect to the standard notation, here the prefactor 1/N is replaced with 1/P) and the mean value of the i-th entry over patterns. Notice that the matrices J and J^T satisfy the same differential equation (2.8) since Ω is symmetric; thus, if we consider initial conditions such that J(0) = J(0)^T, the symmetry is preserved for any t > 0 by uniqueness arguments. Hence, we can safely replace J with its symmetric part in (2.9). The convergence condition of the discrete form of these dynamical equations is discussed in App. A.
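As an illustration, the following is a minimal numerical sketch of this training dynamics, assuming the gradient-flow equations as reconstructed above; the learning rate and iteration count are illustrative, and the function name is ours.

```python
import numpy as np

def train_couplings(xi, gamma=1.0, eps_J=0.1, eps_h=0.0, lr=0.01, n_steps=5000):
    """Discretized gradient descent on the regularized MSE loss (2.7).

    xi : (N, P) array whose columns are the binary patterns.
    Returns the coupling matrix J and the external field h.
    """
    N, P = xi.shape
    Omega = (xi @ xi.T) / P        # Hebbian kernel with 1/P prefactor
    xbar = xi.mean(axis=1)         # pattern-averaged entries
    J = np.zeros((N, N))           # tabula rasa initial condition
    h = np.zeros(N)
    for _ in range(n_steps):
        dJ = (2 * gamma * Omega - (J @ Omega + Omega @ J)
              - (np.outer(h, xbar) + np.outer(xbar, h)) - 2 * eps_J * J)
        dh = 2 * gamma * xbar - (J + J.T) @ xbar - 2 * (1 + eps_h) * h
        J += lr * dJ               # Eq. (2.8)
        h += lr * dh               # Eq. (2.9)
    return J, h
```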
Before concluding this Section, it is worth highlighting that, as standard, the neural relaxation (2.1) and the training dynamics (2.8)-(2.9) operate on different time scales, the former being much faster. Biologically, this is motivated by the fact that synaptic plasticity is much slower than neural activation and, in artificial neural networks, by the fact that the machine is first trained and later used for task accomplishment. For consistency with (2.1) one should therefore write J̇ = −τ_J^{-1} ∇_J L_ξ and ḣ = −τ_h^{-1} ∇_h L_ξ, with τ_{J,h} ≫ 1. However, under this adiabatic hypothesis, we can consider the network parameters J, h as planted during the neural dynamics and separate the two dynamical problems in such a way that, when focusing on the synaptic evolution, τ_{J,h} can be set as unitary without ambiguity.

Dreaming as regularization, regularization as early-stopping
The global minimum for L_ξ(J, h) in (2.7) can be obtained by requiring the stationarity conditions J̇ = 0 and ḣ = 0 in Eqs. (2.8)-(2.9), that give

γΩ = (1/2)(JΩ + ΩJ) + (1/2)(h ξ̄^T + ξ̄ h^T) + ϵ_J J,    (3.1)
γ ξ̄ = J ξ̄ + (1 + ϵ_h) h,    (3.2)

whose solution reads as

h^(D) = (1 + ϵ_h)^{-1} (γ I − J^(D)) ξ̄,    (3.3)
J^(D) = γ Ω (Ω + ϵ_J I)^{-1},    (3.4)

where, by the push-through identity, the latter can be rewritten as

J^(D) = (γ/P) ξ (ϵ_J I + C)^{-1} ξ^T,  with C_{μν} = (1/P) Σ_i ξ^μ_i ξ^ν_i.    (3.5)

By inspecting Eq. (3.3) one can see that the external field stems from the presence of biases in the input data, i.e., ξ̄_i ≠ 0, in such a way that re-centering the patterns by ξ^μ_i → ξ^μ_i − ξ̄_i results in h = 0. Thus, as long as data are pre-processed in this way, external fields are not needed. Further, by looking at Eq. (3.4), one can see that our solution recovers the interaction matrix of the "Dreaming Hopfield model" (DHM) [27,39,47] upon setting γ = 1 and identifying t_d as the inverse of the hyperparameter ϵ_J, that is, t_d = ϵ_J^{-1}.
As a side remark, the minimization of the regularized MSE as defined in (2.7) is a special setting of the usual setup of ridge regression theory [52,53], with the target response of the network being the multiplication of the input vectors ξ^μ by the constant γ. Neglecting the bias vector, the ridge estimator minimizing the loss function does coincide with the coupling matrix J^(D), where ϵ_J = t_d^{-1} plays the role of the Tikhonov regularization parameter [54]. Ridge regression, together with its generalization to non-linear regression problems with kernel techniques [55][56][57], is a central topic in statistical learning theory, focusing in particular on the role of the corresponding hyperparameter (see e.g., [58][59][60]) as well as on the implicit regularization phenomena emerging in high-dimensional statistics [61,62].
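As a computational aside, the closed form (3.5) can be evaluated through a P × P linear solve rather than an N × N inversion, which is convenient when P < N. The following is a minimal sketch under the notation above (the function name is ours):

```python
import numpy as np

def dreaming_kernel(xi, t_d, gamma=1.0):
    """Fixed-point coupling matrix (3.5): J = (gamma/P) xi (eps_J I + C)^{-1} xi^T,
    with eps_J = 1/t_d and C = xi^T xi / P the pattern correlation matrix."""
    N, P = xi.shape
    C = (xi.T @ xi) / P
    X = np.linalg.solve(np.eye(P) / t_d + C, xi.T)   # (eps_J I + C)^{-1} xi^T
    return gamma * (xi @ X) / P
```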
In the original DHM, the kernel reads as

J̃^(D) = (1/N) ξ (1 + t_d)(I + t_d C̃)^{-1} ξ^T,  with C̃_{μν} = (1/N) Σ_i ξ^μ_i ξ^ν_i,

and it was obtained from the standard Hebbian kernel by iteratively applying an unlearning protocol based on an interplay of remotion and consolidation mechanisms inspired by those occurring during sleep in mammals' brains [21,63]; because of this analogy the time t_d, which measures the number of unlearning iterations, is referred to as "dreaming time". The critical storage of the DHM has been shown to increase monotonically with t_d, reaching, in the t_d → ∞ limit, the theoretical upper bound known for symmetric networks and corresponding to a number of retrievable patterns equal to the number of neurons, i.e., P = N. Also, the DHM has been proved to outperform the standard Hopfield model as for generalization abilities [64]. The difference between J^(D) and J̃^(D) just lies in a prefactor P/N and in a shift t_d → t_d + 1 which, as long as they are finite and non-vanishing, yield only a quantitative correction; in particular, the two models retain statistical mechanical equivalence, as the partition function of the present model can be turned into that of the original DHM by a suitable rescaling of the parameter β, which tunes the degree of stochasticity in the system, that is, the broadness of the distribution of the neural configurations or, in physical jargon, the inverse temperature. Moreover, for t_d → ∞ (or, equivalently, for ϵ_J → 0), we recover Kohonen's projector matrix [65], J^(P) ≡ (1/P) ξ C^{-1} ξ^T. Let us now move forward and notice that the solution J^(D), obtained by a fully-trained (t → ∞) L_2-regularized (ϵ_J ≠ 0) process, can be related to the solution of an unregularized (ϵ_J = 0) process which is run up to a finite time t*; as we will see, this relation allows us to map the dreaming time t_d into the training time and therefore interpret the dreaming mechanism as a training. In order to establish this relation, we resume the dynamical problem with recentered patterns, in such a way that the inferred field is vanishing (see the remark below) and Eq. (2.8) simply reads as

J̇ = 2γΩ − (JΩ + ΩJ) − 2ϵ_J J.    (3.6)

This can be recast in the basis of the eigenvectors of Ω (denoting with a, b = 1, ..., N the corresponding indices) as

J̇_ab = 2γ λ_a δ_ab − (λ_a + λ_b + 2ϵ_J) J_ab,    (3.7)

where J_ab is the element of J in the current basis and σ(Ω) = {λ_a}_{a=1}^N is the Ω spectrum. By solving Eq. (3.7) we find that the non-diagonal terms asymptotically go to zero as J_ab ∼ exp[−t(λ_a + λ_b + 2ϵ_J)] for any initial condition, in such a way that, at the equilibrium point, the coupling matrix is diagonal. Here, we choose to prepare the system in a configuration where no information is stored, i.e., a tabula rasa setting J(t = 0) = 0; in this way the off-diagonal entries remain stuck at zero at any time t. Remarkably, the diagonal structure of J(t), when expressed in the basis of the eigenvectors of Ω, implies that the two matrices share the same eigenvectors.
As a remark, we can equivalently choose to work without recentering the patterns, thus also including the external fields. In this case, we can simplify the analysis of the dynamical problem by requiring that τ_J J̇ = −∇_J L and τ_h ḣ = −∇_h L, and consider the case τ_h ≪ τ_J. Under this assumption, the variation of the external fields is much faster than the typical evolution of the coupling matrix, so, when dealing with the temporal behavior of the latter, the fields relax instantaneously towards their fixed point at fixed J(t):

h(t) = (1 + ϵ_h)^{-1} [γ I − J(t)] ξ̄.

As a consequence, the synaptic dynamics is described by an equation of the same form as (3.6), with the Hebbian kernel Ω replaced by its re-centered counterpart Ω̃ = Ω − (1 + ϵ_h)^{-1} ξ̄ ξ̄^T. When dealing with structured patterns, we will preserve the inferred field in order not to alter the graphical appearance of the data; this is of course completely irrelevant when dealing with a zero-mean dataset.
As for the entries on the principal diagonal, given the above-mentioned initial condition, the solution of the associated differential equation is

J_aa(t) = γ λ_a (λ_a + ϵ_J)^{-1} {1 − exp[−2t(λ_a + ϵ_J)]}.    (3.8)

As expected, in the limit of large training time, we recover J^(D), that is,

lim_{t→∞} J_aa(t) = γ λ_a (λ_a + ϵ_J)^{-1} = J^(D)_aa,    (3.9)

while, expanding at small t, we get

J_aa(t) ≈ 2 t γ λ_a,    (3.10)

which corresponds to J ≈ 2tγΩ: this means that, despite the blank initial condition, at the very start of the training the kernel J is close to a Hebbian structure.
On the other hand, by setting ϵ_J = 0 in Eq. (3.8), we find that the diagonal terms evolve as

J_aa(t) = γ [1 − exp(−2 t λ_a)].    (3.11)

Now, we compare the two explicit forms of the coupling matrix, i.e., the regularized one at t → ∞ (3.9) and the t-dependent one with ϵ_J = 0 (3.11), and search for the characteristic time t* at which the latter is as close as possible to the former. To do this, let us consider the quantity

δ(λ, t, ϵ_J) = {γ λ (λ + ϵ_J)^{-1} − γ [1 − exp(−2 t λ)]}^2,

measuring the squared difference between the components in the two realizations at fixed eigenvalue λ. Then, we take the average over the Ω spectrum, i.e.,

δ̄(t, ϵ_J) = ∫ dλ ρ_E(λ) δ(λ, t, ϵ_J),

where

ρ_E(λ) = (1/N) Σ_{a=1}^N δ(λ − λ_a)

is the empirical spectral distribution of Ω. Notice that δ̄ is nothing but the squared Frobenius distance between the dreaming kernel and the unregularized coupling matrix. This quantity is minimized for

t*(ϵ_J) = argmin_t δ̄(t, ϵ_J).    (3.12)

This relation provides an expression for the time t* at which the unregularized gradient descent over L_ξ(J, h) should be interrupted if we want a coupling matrix close to J^(D), corresponding to the fully-relaxed, regularized gradient descent. The equivalence between the two scenarios is validated for synthetic, MNIST [66] and Fashion-MNIST [67] datasets, as reported in Fig. 1.
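A minimal numerical sketch of the prescription (3.12), assuming the expressions reconstructed above and using SciPy's bounded scalar minimizer (the search bounds are illustrative):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def early_stopping_time(Omega, t_d, gamma=1.0):
    """Estimate t* from Eq. (3.12): minimize the spectrum-averaged squared
    difference between the regularized fixed point (3.9) and the
    unregularized trajectory (3.11)."""
    eps_J = 1.0 / t_d
    lam = np.linalg.eigvalsh(Omega)      # empirical spectrum of Omega

    def delta_bar(t):
        target = gamma * lam / (lam + eps_J)                # Eq. (3.9)
        trajectory = gamma * (1.0 - np.exp(-2 * lam * t))   # Eq. (3.11)
        return np.mean((target - trajectory) ** 2)

    res = minimize_scalar(delta_bar, bounds=(1e-6, 1e3), method="bounded")
    return res.x
```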
The functional relation between t_d = ϵ_J^{-1} and t* highlighted in Eq. (3.12) is depicted in Fig. 2. The logarithmic behavior is justified analytically in App. C, where, by expanding (3.12) around the mean eigenvalue, we obtain a first-order approximation of the early-stopping time t*(ϵ_J) which depends on t_d and on the trace of Ω.
Another way to see the equivalence between regularization and early-stopping is the following. Starting from Eq. (3.8), we notice that ϵ_J provides a natural scale on the spectrum of Ω: at the fixed point, J^(D)_aa = γ λ_a (λ_a + ϵ_J)^{-1}, hence entries associated to eigenvalues λ_a ≫ ϵ_J saturate close to γ, while entries associated to λ_a ≪ ϵ_J remain close to zero.
Therefore, in the regularization approach, the parameter ϵ_J prevents the saturation of all the diagonal entries of J^(D) to the value γ (corresponding to Kohonen's projector J^(P) [22,51,65,69]); in fact, as long as ϵ_J > 0 (or t_d is finite), only the entries corresponding to the top eigenvalues of Ω get close to this limiting value, while the others remain close to the initial condition (i.e., J_aa = 0 for any a = 1, ..., N). On the other hand, the early-stop dynamically accounts for such a filtering, since the time t* at which we stop the training is chosen so that only a subset of the diagonal entries saturate to the fixed point J_aa = γ of the dynamical system (3.6) with ϵ_J = 0, while all the others are not changed in a substantial way w.r.t. the initial condition; in fact, as highlighted by (3.11), the characteristic time for saturation is entry-dependent and given by (2λ_a)^{-1}, thus entries corresponding to large eigenvalues of Ω are faster.
What we presented so far applies to a general set of Boolean vectors, as the coupling matrix J naturally arises as the fixed point of a gradient descent algorithm; the only hypotheses that we made on the vectors ξ, that we want to store as attractors for the neural dynamics, are that they are of the same length and that they share the same "importance" γ. In particular, the dataset items could represent (noisy) realizations of some unknown ground-truth patterns to which we have no direct access. In this context, regularization, by preventing the network parameters from acquiring large norms during learning, also allows for a reduction of the model specialization on the training set. The relation t_d = ϵ_J^{-1} therefore suggests that overfitting issues may arise for too large t_d, as we are going to discuss in more detail in Sec. 4. Further, the dreaming time t_d can be related, through Eq. (3.12), to the stopping time t* in unregularized versions of the gradient descent algorithm. This relation is consistent with the previous remark, since early-stop techniques are indeed designed for avoiding overfitting and, again, recalling the monotonic relation between t_d and t*, we expect that overfitting issues may arise for too large t_d. In the next section we will make use of the framework outlined above and, specifically, of the optimal interaction matrix J^(D), in order to address the generalization capabilities of such models or, conversely, the emergence of overfitting.
4 Emergence of generalization and overfitting in Hopfield-like models

A synthetic dataset
The results derived in Secs. 2 and 3 were obtained without making specific assumptions on the binary vectors {ξ^μ}_{μ=1}^P; however, in order to go further in the analytical investigations, some additional hypotheses are in order. In fact, in theoretical studies one usually assumes that pattern entries are extracted according to a prescribed probability distribution that allows working out a controllable theory. For instance, when dealing with the Hopfield model, a common choice is to take pattern entries as i.i.d. Rademacher random variables, and thus treat the patterns as ground truths to be reconstructed starting from a corrupted version of them. However, in practical applications, one has no direct access to the ground-truth patterns, but only to empirical realizations constituting the dataset from which we want to extract information. In a supervised scenario, one knows a priori how the different items making up the dataset are partitioned between classes, so that it is possible to define class archetypes (for example, taking the average of examples belonging to the same category) which are taken as representatives of the ground vectors, see e.g., [31,33,71]. In the unsupervised case, this clearly cannot be done, and the simplest way to proceed is to include all the examples, homogeneously, in the treatment. This is the path that shall be pursued in this section and hereafter we detail this unsupervised setting by considering a synthetic dataset.
Let {ζ^μ}_{μ=1}^K ∈ {−1, +1}^{N×K} be the ground patterns, to which we have access only through empirical realizations referred to as ξ^{μ,A}, with A = 1, ..., M for each μ. We assume that the examples in this training dataset are obtained from the ground patterns with a multiplicative noise, that is, ξ^{μ,A} = χ^{μ,A} ζ^μ (with entry-wise multiplication), with

P(χ^{μ,A}_i = +1) = (1 + r)/2,  P(χ^{μ,A}_i = −1) = (1 − r)/2,

where r ∈ [0, 1] is the parameter quantifying the quality of the example (i.e., it measures the correlation between the example and the corresponding ground pattern). Our training dataset is therefore constituted by the set S = {ξ^{μ,A}}_{μ=1,...,K}^{A=1,...,M} and we distinguish two loads: α := KM/N, i.e., the ratio between the number of examples in the training set and the network size, and η := K/N ≤ 1, i.e., the ratio between the number of classes in the dataset (which is in principle unknown) and the network size. With this kind of available information, we want to train the system in order to make it able to generalize, namely to reconstruct the hidden ground-truths ζ^μ, starting from an input σ^(0) that is a corrupted representation of ζ^μ. Since we have no direct access to the ground-truths, a direct error minimization procedure is not feasible in this case. However, we can include each single item in our loss function and take advantage of emergent phenomena in Hopfield-like models: as we will see, for a sufficiently large dataset, a plethora of spurious states appears and, depending on the control parameters of the system, these can favour the appearance of a generalization phase. In this scenario, the regularization mechanism plays a key role, preventing the solution of the system from trivializing or overspecializing on the training set. In this unsupervised setting, the interaction matrix is therefore obtained by plugging into (3.5), which is still a solution of the gradient descent for (2.6), the empirical realization of the Hebbian kernel, with entries Ω_ij = (1/P) Σ_{μ,A} ξ^{μ,A}_i ξ^{μ,A}_j, and the empirical correlation matrix, whose size is (KM) × (KM) and whose entries are

C_{(μ,A),(ν,B)} = (1/P) Σ_i ξ^{μ,A}_i ξ^{ν,B}_i,  with P = KM.

To evaluate the performance of the network, we generate a test set S̃ = {ξ̃^{μ,A}}_{μ=1,...,K}^{A=1,...,M}, sampled in the same way as the training set, and initialize the network with the configurations of the test set, say σ^(0) = ξ̃^{μ,A}, the latter being, by construction, a noisy version of ζ^μ with quality r. Next, we check whether the network response is f(σ^(0)) = ζ^μ, an outcome that we interpret as a correct generalization; conversely, the retrieval of one of the training items, say f(σ^(0)) = ξ^{μ,A}, is interpreted as overfitting. In the following subsection, we discuss the role of spurious states in the emergence of generalization and overfitting.
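A minimal sketch of this data-generation process, under the assumptions above (sizes and seed are illustrative):

```python
import numpy as np

def make_dataset(N, K, M, r, rng):
    """Synthetic dataset: K Rademacher ground patterns, M noisy examples each.
    Each example is xi = chi * zeta entry-wise, with chi = +1 w.p. (1+r)/2."""
    zeta = rng.choice([-1, 1], size=(K, N))               # ground patterns
    chi = np.where(rng.random((K, M, N)) < (1 + r) / 2, 1, -1)
    xi = chi * zeta[:, None, :]                           # (K, M, N) examples
    return zeta, xi

rng = np.random.default_rng(0)
zeta, xi = make_dataset(N=200, K=10, M=20, r=0.8, rng=rng)
print(np.mean(xi[0] * zeta[0]))   # ~ r: example / ground-truth correlation
```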

Spurious states of training data enable generalization
In the classical Hopfield setup, spurious states (i.e., configurations that are combinations of stored data) are known to impair the retrieval capabilities of the model and should be suitably treated in order to reduce their attractiveness, see e.g., [21, 25-27, 39, 47, 63, 72]. In fact, the dreaming mechanisms mentioned in Sec. 3 are precisely aimed at this purpose and their implementation improves the retrieval capabilities of the network. On the other hand, when dealing with a dataset made of unlabelled examples, the situation is quite different, and spurious attractors are crucial for the emergence of generalization capabilities of the network, as we are going to discuss. Let us suppose that the training set is made of a large number M of data for each class, and let us consider a spurious configuration given by a symmetric combination of L examples pertaining to the same class, that is,

ξ^μ_L = sgn(Σ_{l=1}^L ξ^{μ,A_l}),    (4.1)

being A_1, ..., A_L ∈ {1, ..., M} the indices of the examples that we are mixing. Denoting with c_L(r) := ξ^μ_L · ζ^μ / N the correlation between ξ^μ_L and the related ground truth, we notice that, as long as L is relatively large, ξ^μ_L displays a correlation with the ground pattern ζ^μ that is larger than the correlation r displayed by any training item, that is, r < c_L(r) → 1 for L ≫ 1; see App. D for more details. Now, spurious configurations of the form (4.1) in Hopfield-like models can be stable attractors, so that running the dynamics (2.1) we could end in one of these minima and reach a relatively fair retrieval. Indeed, each of these configurations originates as a combination of attractors associated to stored vectors. In our scenario, increasing M would result in an increasing number of intra-class spurious configurations which, as L is increased, exhibit larger and larger correlation with the corresponding ground-truths, as sketched in Fig. 3. For sufficiently large M, it is then reasonable to expect that the minima of training examples and spurious configurations coalesce, so that the resulting landscape consists of a wide minimum centered in the ground-truth ζ^μ, favoring the reconstruction of the hidden patterns.
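A quick numerical illustration of the mixtures (4.1), reusing the zeta and xi arrays produced by the make_dataset sketch above; it shows c_L(r) growing towards 1 with L.

```python
import numpy as np

def spurious_mixture(xi_class, indices):
    """Symmetric L-mixture of intra-class examples, Eq. (4.1)."""
    s = np.sign(xi_class[indices].sum(axis=0))
    return np.where(s == 0, 1, s).astype(int)   # resolve ties (even L)

rng = np.random.default_rng(1)
for L in (1, 3, 5, 15):
    idx = rng.choice(xi.shape[1], size=L, replace=False)
    xi_L = spurious_mixture(xi[0], idx)
    print(L, np.mean(xi_L * zeta[0]))   # c_L(r), increasing with L
```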
Given such a landscape, what is the role of t_d (or, equivalently, of ϵ_J)? As recalled at the beginning of this subsection, in a Hopfield model where we store ground patterns, dreaming mechanisms reduce (and ultimately remove, if the load is not too high) the stability of spurious mixtures between independent patterns, and this is obtained by shrinking and lifting the attraction basins associated to the patterns. Moving to an unsupervised setting, we realize that there are two kinds of mixtures, according to whether they involve examples belonging to different classes or examples belonging to the same class; the former, just like in the Hopfield model, impair retrieval and should be removed, while the latter, as mentioned above, can be beneficial for generalization. Therefore, in this case, the dreaming mechanism should operate by removing only the first type of correlation. In fact, by increasing t_d we are disentangling the minima corresponding to the stored patterns, and this process affects progressively minima with larger and larger overlap. Inter-class correlation is typically smaller than intra-class correlation (their extents being related to, respectively, K/N and r) and relatively small values of t_d can be sufficient to detach the attraction basins related to different ground patterns. Yet, if we let the dreaming mechanism operate for too long a time, the mélange of intra-class minima can be separated as well, and it gets fragmented into many energetic minima, each associated to a single example. As a consequence, we cannot retrieve spontaneously-formed archetypes, but only the single examples: the system is specialized over the training set, thus ending in an overfitting regime. This picture is corroborated by the numerical results reported in Fig. 4. Here, we focus on the behavior of the system when prepared in the neighborhood of a training example or of an intra-class spurious configuration. First, we notice that, for low enough t_d, even when the system is initialized in a configuration consisting of a mild perturbation of one of the stored examples (L = 1), the neural dynamics will drive it towards a final configuration whose relative distance w.r.t. the reference example is (1 − r)/2 (precisely the distance between the stored examples and the corresponding ground-truth). For spurious states, the situation is similar, with the relative distance between the final state and the reference mixture getting lower and lower as L is increased (as the correlation c_L(r) with the ground-truth increases monotonically with L). A moderately larger t_d yields a larger attractivity of the ground patterns. This picture is consistent with our claim about the coalescence of the population of attraction basins into a wide minimum centered at ζ^μ. By further increasing t_d, the training points become attractive, with a basin width depending on the number of examples per class. This signals that, according to the setting, the dreaming mechanism can either enhance generalization or favour overfitting.

Numerical experiments
In this Section, we provide numerical evidence for our theoretical findings. We first inspect the regions in the space of parameters (α, t_d, r) where the system equipped with the interaction matrix J^(D) can successfully generalize, that is, when tested with examples not included in the training set (but sharing with them the same underlying statistics), it is able to fairly reconstruct the ground pattern. Next, we corroborate this picture by applying a clustering algorithm to the network outputs and showing that, in the region of the parameter space where the system is expected to generalize (resp. overfit), the number of classes is nicely estimated (resp. overestimated). Details on numerics are collected in App. B.

Generalization diagrams
In the first part of the numerical experiments we consider both structureless and structured datasets, to confirm and check the robustness of our theoretical results. The structureless datasets are built synthetically as follows: we initially generate a set of K Rademacher ground patterns G = {ζ^μ}_{μ=1,...,K}, whence we obtain a set of training examples S = {ξ^{μ,A}}_{μ=1,...,K}^{A=1,...,M} (characterized by a quality r, as specified in Sec. 4.1), which are used to build J^(D) according to Eq. (3.5). Next, we generate a test set S̃ = {ξ̃^{μ,A}}_{μ=1,...,K}^{A=1,...,M} by applying the same procedure used for the training set, that is, each item ξ̃^{μ,A} exhibits a quality r w.r.t. the related ground pattern ζ^μ. For structured datasets, we consider the MNIST [66] and the Fashion-MNIST [67] benchmarks and define the ground patterns as the class averages; then, the training and the test sets are made of M items, drawn from the whole datasets (overall made of, respectively, 60000 and 10000 instances), in such a way that the two sets have null intersection.
Whatever the dataset, we initialize the system in a configuration σ^(0) belonging to the test set, we run the dynamics (2.1) and collect the final configuration σ^(∞) = f(σ^(0)). In other words, σ^(0) and σ^(∞) represent, respectively, the input and the output of the system. Next, we evaluate the distances to the nearest training example and to the nearest ground pattern,

d_ξ = min_{μ,A} d(σ^(∞), ξ^{μ,A}),    (5.1)
d_ζ = min_μ d(σ^(∞), ζ^μ),    (5.2)

where

d(σ^1, σ^2) = (1/2N) Σ_{i=1}^N |σ^1_i − σ^2_i|

is the normalized Hamming distance, measuring the fraction of misaligned entries among the two configurations that are compared. We stress that, although not highlighted in Eqs. (5.1)-(5.2), these quantities depend on the initial configuration, as σ^(∞) does depend on σ^(0). These distances are then averaged over the sample S̃ to get, respectively, d̄_ξ and d̄_ζ.
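A short sketch of these metrics, assuming the training examples are flattened into a list of vectors (names are ours):

```python
import numpy as np

def hamming(a, b):
    """Normalized Hamming distance between two +/-1 configurations."""
    return np.mean(a != b)

def nearest_distances(sigma_inf, train_examples, ground_patterns):
    """d_xi (5.1): distance to the nearest training example;
       d_zeta (5.2): distance to the nearest ground pattern."""
    d_xi = min(hamming(sigma_inf, x) for x in train_examples)
    d_zeta = min(hamming(sigma_inf, z) for z in ground_patterns)
    return d_xi, d_zeta
```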
In Fig. 5, we compare the behaviour of d̄_ξ and d̄_ζ versus t_d, for different choices of M, while K is fixed. We find that, in any case, d̄_ξ and d̄_ζ are monotonically decreasing with t_d as long as t_d is relatively small; beyond that, their behavior depends on M. In particular, for the random dataset, when M is small, we always have d̄_ξ < d̄_ζ, which evidences poor generalization capabilities; when M is larger we can leverage t_d to enhance the generalization capabilities and get d̄_ζ < d̄_ξ, yet when t_d is too large the curves cross; finally, when M is large, at intermediate dreaming time d̄_ξ and d̄_ζ exhibit a plateau and at large values of t_d they grow; the heights of the plateaus (respectively, ≈ (1 − r)/2 and ≈ 0) suggest that there the final configuration is close to the ground pattern, while the final growth suggests a possible harmful effect of a large dreaming time. As for structured datasets, d̄_ζ exhibits a minimum at intermediate values of t_d, corresponding to an optimal generalization performance and, again, for large dreaming times, d̄_ζ grows. In particular, for the MNIST dataset, for relatively small (resp. large) values of t_d, we get d̄_ζ < d̄_ξ (resp. d̄_ζ > d̄_ξ), suggesting good (resp. impoverished) generalization capabilities. Before proceeding, we also emphasize that, for all the datasets considered, when t_d ≫ 1 and when M is relatively large, both d̄_ζ and d̄_ξ grow. This is due to the fact that the number of examples is large enough to give rise to spurious attractors, but not large enough to make these attractors close to the ground truths; this point is further examined in the following.
We now focus on the synthetic dataset and summarize the information processing capabilities of the system into "phase diagrams". To this aim, we distinguish between different outcomes as follows (a minimal sketch of this classification is given right after the list):
• Success: this corresponds to d̄_ζ < d̄_ξ and d̄_ζ < (1 − r)/2. The first requirement ensures that the system relaxes in a configuration which is more correlated with the ground pattern than with the training points; the second condition, instead, guarantees that the dynamics ends up within the Hamming ball centered in the ground-truth with radius (1 − r)/2 and therefore that the system has moved closer to the ground-truth.
• Overfitting: this corresponds to d̄_ζ ≥ d̄_ξ and d̄_ξ < (1 − r)/2. The first condition states that the final configuration is closer to one of the training points than to the ground-truth; the second condition guarantees that, this time, the dynamics ends up within the Hamming ball centered in the nearest training point with radius (1 − r)/2 and therefore that the system has moved closer to a specific training item.
• Failure: otherwise. In this case the system is neither sufficiently close to a ground pattern nor to an example.
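As a compact reference, the criteria above can be written as a small helper; this is our own sketch, with names not taken from the paper:

```python
def classify_regime(d_zeta, d_xi, r):
    """Label the network outcome according to the three criteria above."""
    ball = (1 - r) / 2
    if d_zeta < d_xi and d_zeta < ball:
        return "success"
    if d_zeta >= d_xi and d_xi < ball:
        return "overfitting"
    return "failure"
```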
For a given choice of K and r, moving within the (α, t d ) plane, we thus depict the generalization diagrams for synthetic datasets.The results are reported in Fig. 6 and discussed hereafter.
Starting from panel a (K = 10 and r = 0.7), we find that, at very low t_d, the system is always in a failure regime. This is not surprising since, there, the interaction matrix is very close to the Hebbian prescription and, for this choice of the dataset parameters, interferences among examples are significant enough for the system to be likely to end up in inter-class spurious states; thus we expect that equilibrium configurations are non-retrieval states. Increasing t_d, such interferences are (possibly) removed. If the load parameter α is low (meaning that the number of examples is low), the system enters an overfitting regime because the minima corresponding to examples are sparse enough to be easily disentangled. On the other hand, by increasing α (namely by increasing the number of examples per class), the attractors coalesce and configurations corresponding to the ground patterns become more and more attractive, in such a way that the system starts to generalize well. When t_d ≫ 1, as α is increased, the transition from overfitting to success is no longer direct as, for intermediate values of α, we can end up in spurious states that are still too sparse to ensure a sound generalization; this region can be shrunk by enhancing the dataset quality. In fact, by increasing the dataset quality, as in panel b (K = 10 and r = 0.8), the qualitative picture is the same, with just an expansion of the success region due to the fact that intra-class examples are now closer to each other. Next, we move to panels c and d, where a larger η (namely, a larger number of classes K = 30, with fixed N) favours the failure region, since in this case the clusters of minima associated to each class are closer, and clusters of minima corresponding to different classes now present non-trivial overlaps, in such a way that the relaxation of the system could end far away from the class for which the initial condition was generated.

An analogy with a clustering algorithm
In this section, we use another approach to check the emergence of overfitting and generalization regimes as the system parameters are tuned. The idea is to use an unsupervised clustering algorithm to partition the final configurations σ^(∞) obtained by applying the dynamics (2.1) to the test configurations; here, unsupervised refers to the fact that the clustering algorithm is unaware of the number of effective clusters.
We start the experiment by generating a random synthetic dataset G made of K Rademacher ground patterns, whence we build a training set S and a test set S̃, both characterized by a quality r and a size M. We use the former to construct J^(D) and the latter to initialize the neural configuration. We collect the final configurations σ^(∞) obtained by iterating the neural dynamics and we expect that, if the network correctly generalizes, the clustering algorithm applied to the final configurations will return an estimated number of clusters K̂ which is (approximately) K, with each cluster containing a number of items which is (approximately) M. Conversely, if the network overfits, we expect that K̂ > K.
In order to quantify the likelihood of these outcomes we introduce the accuracy M̂/(M × K) ∈ [0, 1], where M̂ is the total number of examples correctly clustered by the algorithm. The unsupervised clustering algorithm considered here is based on the Disjoint Set Union data structure [73] and works as follows. Initially, it associates to each item σ^(∞) a different class, in such a way that, at this stage, the number of estimated classes is M × K. Next, we consider all the couples of configurations σ^(∞) and check whether their normalized Hamming distance is smaller than a threshold d*(r); if so, the two items are merged in the same class. Once all the couples have been examined, the algorithm stops. The threshold value is chosen equal to the minimum of the normalized Hamming distance between all pairs of examples belonging to the test set:

d*(r) = min_{(μ,A) ≠ (ν,B)} d(ξ̃^{μ,A}, ξ̃^{ν,B}).

The idea underlying this choice is that, after applying the dynamics to the test examples, if the network performs well, the examples belonging to the same class get closer to the common ground pattern, so their distance is reduced and expected to be smaller than d*(r); on the other hand, for two examples belonging to different classes the distance is expected not to vary significantly and to remain larger than d*(r).
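A minimal implementation of this merging procedure, assuming the final configurations are collected in a list; the path-halving variant of union-find is our choice, not specified in the paper.

```python
import numpy as np

class DisjointSetUnion:
    """Union-find over item indices [73]."""
    def __init__(self, n):
        self.parent = list(range(n))
    def find(self, i):
        while self.parent[i] != i:
            self.parent[i] = self.parent[self.parent[i]]  # path halving
            i = self.parent[i]
        return i
    def union(self, i, j):
        self.parent[self.find(i)] = self.find(j)

def cluster_outputs(sigma_out, d_star):
    """Merge final configurations closer than the threshold d_star."""
    n = len(sigma_out)
    dsu = DisjointSetUnion(n)
    for i in range(n):
        for j in range(i + 1, n):
            if np.mean(sigma_out[i] != sigma_out[j]) < d_star:
                dsu.union(i, j)
    labels = [dsu.find(i) for i in range(n)]
    return len(set(labels)), labels   # estimated number of clusters, labels
```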
The results of this experiment, repeated for different loads and different dreaming times, are reported in Fig. 7. The panels in the first row show the accuracy (left axis) and the difference between K̂ and the true number of classes K (right axis) as a function of t_d/(1 + t_d), while the panels in the second row show the average distances d̄_ζ and d̄_ξ between σ^(∞) and, respectively, the ground truth and the nearest example, as a function of t_d/(1 + t_d). The colours in the background correspond to the different regimes of the network and we used the same colormap previously adopted in Fig. 6, to highlight the consistency. In fact, in the failure region the accuracy is low and the number of clusters is underestimated; in the success region the accuracy is unitary and K̂ = K; in the overfitting region the accuracy is suboptimal and the number of clusters is overestimated. Further, these outcomes are nicely mirrored by the behavior of d̄_ζ and d̄_ξ: the transition from failure to success corresponds to an abrupt decrease of d̄_ζ that leaves d̄_ξ behind; the transition from success to overfitting corresponds to d̄_ξ dropping below d̄_ζ.

Conclusions
The main results obtained in this work are listed hereafter:
• We introduced a regularized loss function whose minimum provides the interaction matrix J of an associative neural network; according to the dataset provided, the neural network equipped with the solution J is able to retrieve a set of stored ground patterns or to generalize starting from a corrupted version of unknown ground patterns.
• We proved that the solution J of the loss function corresponds to a Hebbian-like kernel J^(D), known as the dreaming Hebbian kernel and parametrized by the "dreaming time" t_d, as long as t_d is identified with the inverse of the regularization parameter ϵ_J.
• In the absence of regularization (ϵ_J = 0), the solution J^(D) can be recovered by applying an early-stop strategy to the gradient descent over the loss function. This suggests that t_d (or, equivalently, ϵ_J) plays a role in preventing overspecialization on the training set.
• Focusing on the case of a training set made of corrupted versions of some unknown ground patterns, we found robust numerical evidence that relatively large values of t_d and relatively sparse training sets can yield overfitting.
• The emergence of overfitting is related to the structure of the Lyapunov function associated to the neural dynamics and this picture allowed us to speculate on optimal settings for the loss hyperparameters and/or for the training time.This picture is sketched in Fig. 8.
To conclude, our results highlight the relevant mechanism allowing for the emergence of generalization capabilities of Hopfield-like networks: this is identified as the coalescence of attractors associated to training points, giving rise to wide minima around the underlying ground truths (which is, indeed, a crucial ingredient for general models exhibiting robust generalization properties, see for example [74,75] and references therein). In this scenario we give a comprehensive characterization of generalization and overfitting for synthetic random datasets. Developments of the present work would require extensions to structured data, as well as a statistical mechanics characterization of the relevant collective phenomena.

Figure 8: Emergence of fixed points for a system trained without supervision. This schematic picture shows the evolution of the landscape generated by J^(D) as t_d is varied; the panels indicate the intra-class spurious states, the training points (attractors), the ground-truths and the resulting landscape. When t_d is relatively low (left), the mélange of minima has been partially disentangled: the weak inter-class interference is removed, while the minima corresponding to examples of the same class are still clusterized; when t_d is relatively large (right), the interference among training examples has been removed as well. This picture is compatible with our numerical results reported in Fig. 4.

B.1 Training procedure

In the simulations with structured datasets such as MNIST and Fashion-MNIST, since we want to reconstruct the original dataset and not its centered version, we preserve the field h in the dynamics and, for the sake of convenience and simplicity, we set ϵ_h = 0 in all the simulations. The pseudo-code presented in Algorithm 1 highlights the steps followed to train the model. If we run Algorithm 1 up to convergence, the resulting interaction matrix recovers the dreaming kernel, denoted as J^(D) and reported in Eq. (3.5), as long as we set ϵ_J = t_d^{-1} and γ = 1. Thus, if we are interested in the regularized, fully-trained model, we can directly pose J = J^(D) without the need of running the training procedure. If we set ϵ_J = ϵ_h = 0 and we stop the gradient descent at the training time t = t* given in (3.12), then the resulting coupling matrix J(t*) is the closest possible to the interaction matrix J^(D).

B.2 Formulation of the performance metrics
Once the training is over and we have the desired expression of the coupling matrix J, the retrieval capabilities of the machine are investigated. The initial configuration σ^(0) is taken as a corrupted version of one of the training patterns (with a fraction of flipped entries w.r.t. the reference configuration) or as an item of a test set (whose elements are statistically analogous to the training patterns, but were not involved in the training procedure); in general, we denote with Q the sample of initial configurations. Then, the system relaxes according to the evolution rule introduced in Eq. (2.1), reported hereafter in a discrete-time notation:

σ^(n+1)_i = sgn(Σ_{j≠i} J_ij σ^(n)_j).    (B.1)

The pseudo-code for the dynamics can be found in Algorithm 2. After the relaxation towards an equilibrium configuration σ^(∞), we check its proximity to a specific configuration ξ* by exploiting the normalized Hamming distance as a performance metric. The latter is the number of misaligned entries between two configurations σ^1, σ^2 ∈ {−1, 1}^N divided by N, that is

d(σ^1, σ^2) = (1/2N) Σ_{i=1}^N |σ^1_i − σ^2_i|.    (B.2)

Then, we evaluate d(σ^(∞), ξ*), which depends on the initial configuration, due to the fact that σ^(∞) depends on σ^(0). Next, we average d(σ^(∞), ξ*) over the |Q| different realizations of the initial condition; when σ^(0) is meant as an item of the test set, this operation corresponds to a batch average. This way, we get the average distance, defined as

d̄(ξ*) = (1/|Q|) Σ_{σ^(0) ∈ Q} d(σ^(∞), ξ*),

that is expected to depend on the reference point ξ* and on the system parameters. Finally, we notice that d(σ^(∞), ξ*) is nothing but the (normalized) absolute error made by the machine outputting σ^(∞) when asked to reconstruct ξ*.

B.3 Numerical signatures of overfitting
In this section, we give further numerical indications of the emergence of overfitting for the model we are dealing with. Specifically, we check that overfitting can take place in the non-regularized training procedure. To do this, we follow the evolution of the learning and validation loss functions over the training time; in particular, the loss functions have the following structure:

L_ξ(J(t), h(t)) = (1/2P) Σ_{μ,i} (ξ^μ_i − Σ_j J_ij(t) ξ^μ_j)^2,    (B.3)
L_ξ̃(J(t), h(t)) = (1/2P) Σ_{μ,i} (ξ̃^μ_i − Σ_j J_ij(t) ξ̃^μ_j)^2.    (B.4)

In the following numerical simulations, J_ij(t) evolves according to Algorithm 1 with vanishing regularization parameters. Equation (B.3) is the loss given in Eq. (2.7) of the main text with ϵ_J = ϵ_h = 0 and γ = 1. The learning and validation loss functions are obtained by evaluating the previous equations on, respectively, the items of the training and of the test dataset. For the synthetic random dataset, as well as for the MNIST and Fashion-MNIST cases, we found that, even if the training loss goes to zero, the validation one exhibits a global minimum at a finite training time and then starts to increase, thus signaling a worsening in generalization performance. These results are reported in Fig. 9.
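A hypothetical monitoring loop in the spirit of this experiment, assuming the unregularized descent reconstructed in Sec. 2 with h = 0 and γ = 1; recording both losses per step locates the validation minimum.

```python
import numpy as np

def mse_loss(J, xi, gamma=1.0):
    """Unregularized MSE loss in the style of (B.3)-(B.4), with h = 0."""
    P = xi.shape[1]
    residual = gamma * xi - J @ xi
    return 0.5 * np.sum(residual ** 2) / P

def loss_curves(xi_train, xi_test, lr=0.01, n_steps=2000):
    """Track training and validation losses along the eps_J = 0 descent."""
    N, P = xi_train.shape
    Omega = (xi_train @ xi_train.T) / P
    J = np.zeros((N, N))
    train_hist, val_hist = [], []
    for _ in range(n_steps):
        J += lr * (2 * Omega - (J @ Omega + Omega @ J))  # eps_J = 0 update
        train_hist.append(mse_loss(J, xi_train))
        val_hist.append(mse_loss(J, xi_test))
    return train_hist, val_hist
```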

C A faster way to compute the early-stopping time
The estimate of the early-stopping time reported in Eq. (3.12) is based on the computation of the empirical spectral distribution ρ_E of Ω, which is an N × N matrix. For high-dimensional datasets, this can constitute a bottleneck in the training procedure, so operative criteria need to be provided. To do this, we can Taylor expand the quantity δ(λ, t, ϵ_J) around the average eigenvalue of the empirical distribution, i.e., λ̄ = Tr(Ω)/N. Stopping at the first order in λ − λ̄ and then solving the minimization problem, we are left with the prescription

t* = (2λ̄)^{-1} log(1 + λ̄ t_d).    (C.1)

The results are reported in Fig. 10, which shows again a substantial agreement between the early-stopping procedure and the dreaming kernel scenario. Notice that this criterion precisely accounts for the correct logarithmic dependence of the early-stopping time w.r.t. the dreaming time t_d. Clearly, this prescription can provide, in the general case, only a rough estimate of the early-stopping time. In that case, one can also decide to work out sub-leading orders in λ − λ̄: regardless of the order at which the computation is performed, the numerical estimate of t* is based on the computation of low-order moments of the (modified) Hebbian kernel Ω, which is far easier than computing the whole empirical spectral distribution.
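The prescription (C.1) as reconstructed above reduces to a one-liner, since only the trace of Ω is needed:

```python
import numpy as np

def early_stopping_time_approx(Omega, t_d):
    """First-order estimate (C.1): t* = log(1 + lam_bar * t_d) / (2 * lam_bar),
    with lam_bar = Tr(Omega)/N; no diagonalization required."""
    lam_bar = np.trace(Omega) / Omega.shape[0]
    return np.log(1.0 + lam_bar * t_d) / (2.0 * lam_bar)
```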

D Details on spurious states
In this Appendix, we report some details about spurious configurations of training examples in the synthetic random dataset. We recall that ξ^{μ,A} = χ^{μ,A} ζ^μ (with entry-wise multiplication), with the χ^{μ,A}_i variables extracted as

P(χ^{μ,A}_i = +1) = (1 + r)/2,  P(χ^{μ,A}_i = −1) = (1 − r)/2.

Relevant spurious configurations in this setup are of Hopfield type, so that we can consider combinations of the form

ξ^μ_L = sgn(Σ_{l=1}^L ξ^{μ,A_l}).

The correlation of this new configuration with the ground pattern ζ^μ is, for odd L,

c_L(r) = E sgn(Σ_{l=1}^L χ_l) = Σ_{k=0}^L binom(L, k) [(1+r)/2]^k [(1−r)/2]^{L−k} sgn(2k − L),

where the χ_l are i.i.d. copies of the noise variables; since E[χ] = r > 0, the law of large numbers yields c_L(r) → 1 as L → ∞.
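A quick Monte-Carlo check of the binomial expression above, under the stated noise model (sample sizes are illustrative):

```python
import numpy as np
from math import comb

def c_L_exact(L, r):
    """Binomial expression for the mixture / ground-truth correlation (odd L)."""
    p, q = (1 + r) / 2, (1 - r) / 2
    return sum(comb(L, k) * p**k * q**(L - k) * np.sign(2 * k - L)
               for k in range(L + 1))

rng = np.random.default_rng(0)
L, r, n_samples = 5, 0.8, 200_000
chi = np.where(rng.random((n_samples, L)) < (1 + r) / 2, 1, -1)
print(c_L_exact(L, r), np.mean(np.sign(chi.sum(axis=1))))  # should agree
```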

Figure 1: Retrieval performance of the dreaming kernel versus early-stopping. The three panels show a comparison between the fully-trained solution (3.4) with ϵ_J ≠ 0 and the solution of the early-stopped training procedure with ϵ_J = 0; for the latter, the final training time is chosen according to (3.12). In the leftmost panel, the dataset ξ is made of P Rademacher vectors that naturally display zero mean, while in the central and in the rightmost panels the dataset ξ is made of P items randomly drawn from, respectively, the MNIST and the Fashion-MNIST datasets, and these vectors were pre-processed by the Otsu method [68] to make them binary. The items in these datasets were used to build up the interaction matrices J^(D) and J(t*). For the random dataset N = 200 and γ = 1, whereas for MNIST and Fashion-MNIST N = 784 and γ = 1; also, different values of the ratio P/N are considered, as reported in the common legend. The performance of the system is measured in terms of the normalized Hamming distance d(ξ^μ, σ^(∞)) between the target pattern ξ^μ and the final configuration σ^(∞), obtained by initializing the system in a corrupted version of ξ^μ (obtained by flipping the pattern entries with probability q = 0.1) and iterating (2.1) up to convergence. By averaging over all the P patterns we obtain d̄(ξ, σ^(∞)) = (1/P) Σ_μ d(ξ^μ, σ^(∞)), which is plotted versus the dreaming time. We refer to App. B for further details on numerics.

Figure 2: Stopping time as a function of dreaming time. The plot shows the early-stopping time t* as a function of the dreaming time t_d, obtained by a numerical estimate (solid line) from Eq. (3.12) and by a fit (dashed line) based on the functional relation t*(t_d) = a log(1 + b t_d), suggested by the analytical findings presented in App. C. The network parameters for the three cases are γ = 1, N = 784, and P/N = 0.2. The pairs of coefficients (a, b) estimated via linear least squares are (a = 0.66, b = 0.54), (a = 0.11, b = 3.25), (a = 0.19, b = 1.67) for the random, MNIST and Fashion-MNIST datasets, respectively.

Figure 3: Schematic representation of training points, spurious combinations and ground pattern. The figure sketches the organization of attracting configurations within each class in the dataset. The class is represented by a ground pattern ζ (the red dot in the center), while the training points are located at distance (1 − r)/2 from it (i.e., they have correlation r). Spurious combinations of training points are themselves attracting points, and their correlation c_L(r) increases with the number L of training points involved in the combination. For large enough M, the resulting landscape consists of many local minima very close to each other, so that they coalesce and form flat valleys around the ground pattern.


Figure 4: Relaxation to fixed points from perturbed training examples and spurious states. The plots show the retrieval capabilities of the model initialized in a configuration consisting of a perturbed version of a training example (i.e., L = 1) or of an intra-class spurious configuration ξ_L (as given by Eq. (4.1)). The network size is fixed to N = 500, the number of classes is K = 10, and the quality of the dataset is r = 0.8, while different values of the dreaming time (from left to right t_d = 0.1, 2, 10) and of the load (from top to bottom α = 0.4, 0.6, 0.8, that is, M = 20, 30, 40) are considered. The analysis is performed by taking a reference configuration ξ_L (with L = 1, 3, 5, 20, as explained in the legend) and applying a perturbation that consists in randomly flipping a fraction q of the entries; preparing the system in this configuration σ^(0), we update the network up to convergence towards the fixed point σ^(∞). Then, we compare the average distances d(ξ_L, σ^(0)) and d(ξ_L, σ^(∞)) between the reference configuration and, respectively, the initial and the final configurations. The dashed black lines correspond to the distance between the training examples used to build J^(D) and the associated ground-truths. The results are averaged over 50 different realizations of the dataset.

Figure 5: Retrieval on synthetic and structured datasets. The retrieval performance is measured in terms of the normalized Hamming distance d between the final configuration σ^(∞) and the nearest training example ξ (dotted curve, see Eq. (5.1)) and the nearest ground-truth ζ (solid curve, see Eq. (5.2)); the results presented have been averaged over the K × M different initial configurations which constitute the test set (see App. B for further details). The network parameters for the random dataset are N = 200, K = 10 and r = 0.8, whereas for the structured datasets they are N = 784, K = 10. For all the datasets, we reported results for different choices of α = 0.1, 0.2 and 0.8, retaining η = K/N fixed; therefore, recalling that α = KM/N, we varied α by increasing M.

Figure 6: Generalization diagrams. The four diagrams show the generalization outcomes of the neural network where the interaction matrix J^(D) is built on a sample of random, synthetic examples {ξ^{μ,A}}_{μ=1,...,K}^{A=1,...,M} with N = 200 and M tunable (M = αN/K). In the (α, t_d) plane, for various values of K and r, we outline three regions: success (S), overfitting (O), failure (F). In any case the initial conditions σ^(0) are taken as perturbed versions of the ground-truths, sampled with the same quality r as the training examples.

Figure 7: Clustering of the test set. Application of the clustering algorithm in case of a random synthetic dataset with parameters K = 10, r = 0.8, N = 200 for two different load values, α = 0.15 and α = 0.3, keeping fixed the number of ground-truths K. In the simulation with α = 0.15, the overfitting region occurs when d̄_ξ is overcome by d̄_ζ; in this region the clustering algorithm is no longer able to correctly cluster the examples and the estimated number of clusters starts to grow. In the simulation with α = 0.3, the number of examples per ground truth in the training set is double that of the previous simulation, d̄_ζ always stays below the distance d̄_ξ to the nearest training example, and the overfitting region is no longer present.

Figure 9: Training and test losses as a function of the training time. The figure shows the comparison between the training loss and the validation loss for the synthetic dataset (blue curves), MNIST (orange curves) and Fashion-MNIST (green curves), for a training procedure without the regularization term.

Figure 10: Comparison between early-stopping and the dreaming kernel with approximated stopping time. The three plots show the comparison between the retrieval performances of the dreaming kernel and of the early-stopped training procedure. The content parallels Fig. 1, with the only exception that the early-stopping time here is computed with the first-order approximation (C.1).

Table 1: Training parameters.

Algorithm 1: Training of the coupling matrix J.
Input: training set ξ = {ξ^μ}_{μ=1}^P ∈ {−1, +1}^{N×P}, stopping time t, regularizer ϵ_J.
Settings for t and ϵ_J: full training with L_2-regularization, t = ∞ and ϵ_J > 0; early-stopped training, t = t* as given by (3.12) and ϵ_J = 0; the discretization time step is chosen as given by (A.4).
Output: J(t), h(t).