Thermodynamic efficiency of learning a rule in neural networks

Biological systems have to build models from their sensory data that allow them to efficiently process previously unseen inputs. Here, we study a neural network learning a linearly separable rule using examples provided by a teacher. We analyse the ability of the network to apply the rule to new inputs, that is to generalise from past experience. Using stochastic thermodynamics, we show that the thermodynamic costs of the learning process provide an upper bound on the amount of information that the network is able to learn from its teacher for both batch and online learning. This allows us to introduce a thermodynamic efficiency of learning. We analytically compute the dynamics and the efficiency of a noisy neural network performing online learning in the thermodynamic limit. In particular, we analyse three popular learning algorithms, namely Hebbian, Perceptron and AdaTron learning. Our work extends the methods of stochastic thermodynamics to a new type of learning problem and might form a suitable basis for investigating the thermodynamics of decision-making.


I. INTRODUCTION
Biological information processing occurs in three steps. First, an organism needs to acquire information about the external state of affairs by sensing its environment, for example by monitoring the concentration of nutrients in the surrounding solution. Second, a model or a representation of the data is built to allow for efficient processing. Such a model would then be the basis for the third and final step: processing previously unseen inputs by applying the model and making decisions based on the model's output, i.e. to "generalise" from past experience.
The first step, sensing, has a history of interest from physicists going back at least to the seminal work of Berg and Purcell [1] searching for the fundamental limits on sensing imposed by physics. More recently, the application of stochastic thermodynamics [2,3], an integrated framework to analyse the interplay of dissipation and information processing in fluctuating systems far from equilibrium, has provided some intriguing results with regards to the physical limits on the speed, precision and dissipation of sensing in living systems [4][5][6][7][8][9][10][11][12][13][14][15][16][17].
Neural networks, well known from statistical physics and machine learning [18][19][20], form a mature framework to investigate learning and generalising. We have recently introduced the methods of stochastic thermodynamics to study the thermodynamic efficiency of the second step, building an efficient representation of uncorrelated data [21], and a recent study has looked at the non-equilibrium thermodynamics of unsupervised learning with restricted Boltzmann machines [22].
In this paper, we study the learning of a rule by a neural network. The rules we want to learn are Boolean functions: they take an input, call it ξ, and map it to a binary output, the true label ξ → σ_T = ±1. The network has to build a model of this rule by looking at a number of pairs (ξ, σ_T). Our focus is on the final step of information processing: how well can the network emulate the function after a training period, i.e. how well do the outputs of the network, σ, match the correct outputs of the function, σ_T? We will show that the ability of the network to generalise such a rule from the examples it has seen to previously unseen inputs is bounded by the dissipation of free energy by the components of the network, as a consequence of the second law of stochastic thermodynamics.
Our results apply to a wide variety of learning algorithms. For illustration purposes, we analyse three learning algorithms in particular: Hebbian learning [23,24], which was inspired by the neurobiology of memory formation; the celebrated Perceptron algorithm [25], whose discovery led to a surge of interest in neural networks in the 1960s and which is still very influential; and finally AdaTron learning [26], a refinement of the Perceptron algorithm with surprising dynamical features. This paper is organised as follows. We give a detailed description of our model and its dynamics in Secs. II and III. We derive a general bound in Sec. IV and discuss a number of simple examples with different learning algorithms in Sec. V. We then derive a second, sharper bound in Sec. VI and analyse the efficiency of learning in large networks in Sec. VII. We give some concluding perspectives in Sec. VIII. Detailed proofs and a number of technical points are discussed in the appendices.

FIG. 1. In this small network (N = 3), the neuron of interest (red) has three input neurons (blue), two of which are just sending a short signal, an action potential, while the second input neuron is silent. This behaviour is captured by the input vector ξ = {1, −1, 1} for the network shown. Each of the connections between the neuron of interest and an input neuron has a weight w_n. The neuron, which is fully characterised by its weights w = (w_1, . . . , w_N) ∈ R^N, will also send an action potential or not, depending on its activation A ∼ w · ξ. The response of the neuron is denoted by σ. In this example, σ = 1.
II. MODEL

We consider a single neuron which receives signals from the N input neurons that it is connected to. Since we are not interested in the precise temporal dynamics of the action potentials, we model the input of the neuron, i.e. the signals that the neuron receives at a particular point in time, as vectors ξ = (ξ_1, . . . , ξ_N), where ξ_n = 1 if the nth connected neuron is firing an action potential in that input. For symmetry reasons, we set ξ_n = −1 if the nth neuron is silent. The inputs are distributed according to

p(ξ) = ∏_{n=1}^N p(ξ_n),   p(ξ_n = ±1) = 1/2.   (1)

The neuron itself is fully characterised by the N weights w ∈ R^N of its N afferent connections. The weights obey noisy dynamics, to be specified in Sec. III. Presented with a given input ξ, the neuron computes an input-dependent activation

A(t) = w(t) · ξ / √N,   (2)

where the prefactor ensures normalisation. The activation determines whether the neuron will fire an action potential or not, σ = 1 or −1, respectively. If the prediction were noise-free, we would have σ = sgn(A), where sgn(x > 0) = 1 and sgn(x ≤ 0) = −1; instead, the predicted label σ is stochastic with

p(σ|A) = 1 / (1 + e^{−2βσA}),   (3)

where β is the inverse temperature of the surrounding heat bath. We set k_B = β = 1 for the remainder of this article, rendering entropy and energy dimensionless without loss of generality.
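As a concrete illustration, the stochastic neuron described above can be sketched in a few lines of Python. The sigmoidal form of the firing probability used below is one common choice that recovers the noise-free rule σ = sgn(A) in the limit β → ∞; the function names are ours, introduced for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

def activation(w, xi):
    """Normalised activation A = w.xi / sqrt(N) for input xi in {-1,+1}^N."""
    return w @ xi / np.sqrt(len(w))

def p_fire(A, beta=1.0):
    """Probability that the neuron fires (sigma = +1) given activation A.
    Assumed sigmoidal form: p(sigma=+1|A) = 1/(1 + exp(-2*beta*A))."""
    return 1.0 / (1.0 + np.exp(-2.0 * beta * A))

def predict(w, xi, beta=1.0):
    """Draw a stochastic label sigma = +1 or -1 for input xi."""
    return 1 if rng.random() < p_fire(activation(w, xi), beta) else -1
```

For large β the prediction becomes effectively deterministic, `predict` returning sgn(w · ξ) with probability close to one.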
FIG. 2. A single neuron learning a rule. The neuron, characterised by its weights w ∈ R^N, is presented with a succession of inputs ξ^μ ∈ {−1, 1}^N and their true labels σ_T^μ = sgn(T · ξ^μ) = ±1, which are determined by a random, static teacher T ∈ R^N. The goal of learning is to infer the teacher T using only the information provided by the samples (ξ^μ, σ_T^μ), such that the neuron is eventually able to predict the true label of a previously unseen input.
The rules we want to learn are Boolean functions of the inputs. More precisely, we will focus on realisable rules which are linearly separable, i.e. we can write

σ_T = sgn(T · ξ),   (4)

where the teacher network T ∈ R^N has the same architecture as the neural network w. The components of the teacher are independent, drawn from a normal distribution with mean 0 and variance 1, and kept fixed. We draw the teacher at random in order to make general statements about the ability of the network to infer teachers of this form. By analogy, the neuron in such a setup is often called the student. We can interpret the true label of an input as an indication of whether the student should fire an action potential in response to that input or not. We emphasise that while the response of a neuron to an input is stochastic, as is the case physiologically, we assume that the teacher does not make mistakes.
The goal of learning is to adjust the weights w of the network such that the label predicted by the neuron equals the true label for any input ξ, σ = σ_T. The adaptation of weights is thought to be a main mechanism of memory formation in biological networks [27]. To this end, the neuron needs to infer the teacher T. However, the neuron only has indirect access to the teacher via a number of samples (ξ^μ, σ_T^μ), where we have now indexed the inputs and their labels with μ = 1, 2, . . .; see Fig. 2. The exact form of the dynamics will be specified below in Sec. III. A classic example of neurons performing this kind of associative learning are the Purkinje cells in the cerebellum [28][29][30][31].

III. DYNAMICS
Let us now describe the dynamics of the weights learning a rule from a fixed teacher T. Initially, all the weights are independent of each other and in equilibrium in the potential

V(w) = (k/2) Σ_n w_n²,   (5)

which restricts the weights from increasing indefinitely, as is also the case physiologically [32]. Starting at time t = 0, the weights w_n obey overdamped Langevin equations [33]

ẇ_n(t) = F_n(w(t), σ_T^{μ(t)}, ξ^{μ(t)}) + ζ_n(t).

The thermal white noise ζ_n(t) has correlations ⟨ζ_n(t)ζ_m(t′)⟩ = 2Dδ_nm δ(t − t′), where D is the "diffusion" constant. We set the mobility of the weights to unity and impose the fluctuation-dissipation relation βD = 1 for thermodynamic consistency [2]. For the remainder of this article, we will use angled brackets ⟨·⟩ to indicate averages over the thermal noise, unless indicated otherwise. The total force F = (F_1, . . . , F_N) on the weights has a conservative contribution from the harmonic potential, −∇V(w) = −kw, and a non-conservative contribution from the learning force f, which is a function of a single sample (σ_T^{μ(t)}, ξ^{μ(t)}). The learning force changes the weights in such a way that the neuron becomes more likely to predict the true label for the input, as discussed above. In this paper, we focus on online learning, where the learning force changes the weights using just a single sample at a time. The succession of samples is described by the function μ(t) = 1, 2, . . . . This function may be deterministic or stochastic, and we make no assumptions about the rate of change of the inputs, nor about whether the same input may be shown to the neuron more than once. We note that our results also hold for batch learning, where the neuron has simultaneous access to a set of samples at any point in time, as discussed in detail in Appendix D.
We will assume that the change to the weights in response to a sample is made in the direction of that input, as is the case for most customary algorithms (see Sec. V and [19,20,34]). We thus write f = (f_1, . . . , f_N) with

f_n(t) = ν(t) F(|w(t)|, w(t) · ξ^{μ(t)}, T · ξ^{μ(t)}) σ_T^{μ(t)} ξ_n^{μ(t)},   (6)

where we have introduced a possibly time-dependent learning rate ν(t) [35] and we denote the Euclidean norm of a vector by |·|. Here, F is an as yet unspecified scalar function of the length of the weight vector, |w(t)|, the student's field w(t) · ξ^{μ(t)} and the teacher's field T · ξ^{μ(t)}.
The learning force may only depend on the sign of the teacher's field. Its precise form is specified by the learning algorithm; some popular forms are summarised in Tab. I and described in more detail in Sec. V. However, we stress that the bounds that we derive in this paper do not depend on the particular form of the learning force and hold for all learning dynamics of the form (6). The full Langevin equation for a weight then reads

ẇ_n(t) = −k w_n(t) + f_n(t) + ζ_n(t).   (8)

On the ensemble level, the system is fully described by the distribution p(T, w, t). Its equation of motion is given by a Fokker-Planck equation [33] whose form is simplified by the fact that the noise ζ_n(t) of the different weights is uncorrelated. The dynamics are hence multipartite [36,37] and the Fokker-Planck equation corresponding to the Langevin dynamics (8) separates into one probability current for every weight w_n,

∂_t p(T, w, t) = −Σ_n ∂_n j_n(T, w, t),   (9)

where ∂_t ≡ ∂/∂t, ∂_n ≡ ∂/∂w_n, and the probability currents are given by

j_n(T, w, t) = [−k w_n + f_n(t)] p(T, w, t) − D ∂_n p(T, w, t).   (10)

There are hence three sources of stochasticity in the system. First, the weights w(t) fluctuate due to thermal noise. Second, firing an action potential or not, σ, for a given activation (2) is itself a stochastic process. Third, there is randomness in the choice of samples during learning. Since the neuron learns using just a single randomly drawn input and its label at a time, the system performs stochastic gradient descent in the sense that the direction of the learning force fluctuates from one input to the next and only yields the appropriate direction for the weights on average.
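The weight dynamics above can be integrated numerically with a standard Euler-Maruyama step. The sketch below assumes a learning force of the product form f_n = ν F(·) σ_T ξ_n discussed in the text (we omit any input normalisation for simplicity); the callable `F` encodes the algorithm-specific scalar factor, and all names are ours.

```python
import numpy as np

def em_step(w, xi, sigma_T, F, nu, k=1.0, D=1.0, dt=1e-3, rng=None):
    """One Euler-Maruyama step of dw_n = (-k*w_n + f_n)*dt + sqrt(2*D*dt)*noise,
    with learning force f_n = nu * F(w, xi, sigma_T) * sigma_T * xi_n
    for a single online sample (sigma_T, xi)."""
    if rng is None:
        rng = np.random.default_rng()
    f = nu * F(w, xi, sigma_T) * sigma_T * xi     # non-conservative learning force
    drift = -k * w + f                            # harmonic restoring force + learning
    return w + drift * dt + np.sqrt(2.0 * D * dt) * rng.standard_normal(w.shape)
```

With ν = 0 and D = 0 the weights simply relax deterministically in the harmonic potential, which is a convenient sanity check.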

IV. A FIRST THERMODYNAMIC BOUND ON GENERALISING
The aim of the neuron is to predict the label of a previously unseen input ξ as well as possible. In the following discussion, we consider the generalisation properties of the neuron, i.e. its performance on an input drawn at random from the distribution (1), so we drop the superscript μ on inputs and labels. We quantify the accuracy of the predictions using information theory [20,38]. The a priori uncertainty about the true label of some input, σ_T, is quantified by the Shannon entropy of the random variable σ_T, defined as

S(σ_T) ≡ −Σ_{σ_T = ±1} p(σ_T) ln p(σ_T) = ln 2,   (11)

where the equality follows from the fact that for an arbitrary input with probability distribution (1), p(σ_T = ±1) = 1/2. We have included the variable σ_T as an argument of the Shannon entropy in a slight abuse of notation to emphasise that S(σ_T) is the entropy of the marginalised distribution of just σ_T. The definition (11) readily carries over to continuous random variables, where the sum is replaced by an integral over the support of the random variable. The uncertainty about the true label given the label predicted by the neuron, σ, is given by the conditional Shannon entropy

S(σ_T|σ) ≡ −Σ_σ p(σ) Σ_{σ_T} p(σ_T|σ) ln p(σ_T|σ) ≤ S(σ_T),   (12)

where p(σ_T|σ) = p(σ_T, σ)/p(σ) and the inequality indicates that, on average, knowing the label predicted by the neuron reduces the uncertainty about its true label. The natural quantity to measure the information learnt by the neuron is the mutual information

I(σ_T : σ) ≡ S(σ_T) − S(σ_T|σ),   (13)

which measures by how much, on average, the uncertainty about σ_T is reduced by knowing σ. If learning and predicting went perfectly, then by knowing the neuron's output σ one could predict the true label σ_T with perfect accuracy, such that S(σ_T|σ) = 0 and hence I(σ_T : σ) = ln 2. On the other hand, when the weights are in equilibrium in their potential V(w) before learning, there is no correlation between the weights of the student and those of its teacher, such that I(σ_T : σ) = 0.
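The mutual information between the binary labels can be estimated directly from paired samples of (σ_T, σ) with a plug-in estimator over the four joint outcomes; this short helper (our own, for illustration) makes the limiting values ln 2 (perfect prediction) and 0 (independence) concrete.

```python
import numpy as np

def mutual_information(sigma_T, sigma):
    """Plug-in estimate of I(sigma_T : sigma) in nats from paired +/-1 samples."""
    sT, s = np.asarray(sigma_T), np.asarray(sigma)
    I = 0.0
    for a in (-1, 1):
        for b in (-1, 1):
            p_ab = np.mean((sT == a) & (s == b))   # joint probability estimate
            pa, pb = np.mean(sT == a), np.mean(s == b)
            if p_ab > 0:
                I += p_ab * np.log(p_ab / (pa * pb))
    return I
```

For perfectly correlated labels the estimator returns ln 2 ≈ 0.693 nats, and 0 for independent ones, matching the two extreme cases discussed above.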
We can connect the mutual information I(σ_T : σ) to the well-established generalisation error ε of neural networks [19,34]. It gives the probability that the neuron predicts the wrong label for an arbitrary input ξ, assuming that the prediction of the neuron is noise-free, i.e. σ = sgn(w · ξ), and is defined as

ε ≡ ⟨θ(−(w · ξ)(T · ξ))⟩_ξ,   (15)

where θ is the Heaviside step function. If the neuron predicted a label based on its activity reliably via σ = sgn(w · ξ), like the teacher, Eq. (4), the mutual information between the true and the predicted label for an arbitrarily drawn input could be expressed as

I(σ_T : σ) = ln 2 − S(ε),   (16)

where

S(p) ≡ −p ln p − (1 − p) ln(1 − p)

is shorthand for the Shannon entropy of a binary stochastic variable with probability p. For a realistic neuron, the activity gives only the probability that the neuron will fire an action potential, see Eq. (3); hence Eq. (16) constitutes an upper bound on its actual performance with noisy predictions. In the following, we will focus on deriving thermodynamic bounds on the amount of information that the neuron can learn from its teacher for the ideal case of noise-free predictions. Thermodynamics enters the picture by considering the free energy costs of the non-equilibrium dynamics of the weights. They can be quantified by the total entropy production of a single weight of the network,

ΔS_n^tot = ΔS(w_n) + ΔQ_n ≥ 0,

which is guaranteed to be non-negative by the second law of stochastic thermodynamics and has two contributions: the heat dissipated by the nth weight into the connected heat bath, ΔQ_n, and the change in the Shannon entropy of the marginalised distribution p(w_n) [2].
For a neural network learning with the dynamics (8), we can show both for N = 1 and in the thermodynamic limit that

I(σ_T : σ) ≤ ΔS(w_n) + ΔQ_n

follows from the second law for the network (see Appendices A and B for details). This suggests the introduction of an efficiency of learning,

η ≡ I(σ_T : σ) / (ΔS(w_n) + ΔQ_n) ≤ 1.   (17)

This inequality is our first main result and holds at all times t > 0 for the dynamics (8) and (9). We note that while this result is superficially similar to the inequality we have derived previously [21], here we consider an entirely different scenario. In our previous work, there was no teacher; instead, we considered the learning of a number of fixed inputs with true labels drawn at random, such that the true labels were uncorrelated with the inputs and with each other. Hence the concept of a generalisation error did not apply and "the information" was always related to the labels of the fixed set of inputs. Here, we are learning from a number of samples (σ_T^μ, ξ^μ) which are examples of the function that the teacher performs, Eq. (4). The network tries to infer this function in order to be able to correctly classify previously unseen inputs. What we show here is that the ability to learn from a teacher and to generalise accordingly is bounded by the total entropy production per weight.
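For noise-free predictions, the generalisation error depends only on the angle between student and teacher, ε = arccos(ρ)/π with ρ = T · w/(|T||w|), a classic result of the statistical mechanics of learning [19,34]. The following Monte Carlo sketch (our own construction, with a student built at a prescribed overlap ρ) illustrates this relation for binary inputs.

```python
import numpy as np

def gen_error_mc(w, T, n_samples=20000, rng=None):
    """Monte Carlo estimate of the noise-free generalisation error:
    probability that sgn(w.xi) != sgn(T.xi) for random inputs xi in {-1,+1}^N."""
    if rng is None:
        rng = np.random.default_rng(0)
    xi = rng.choice([-1.0, 1.0], size=(n_samples, len(w)))
    return np.mean(np.sign(xi @ w) != np.sign(xi @ T))

# Build a student with overlap rho = 0.5 with a random Gaussian teacher.
rng = np.random.default_rng(1)
N, rho = 1000, 0.5
T = rng.standard_normal(N)
perp = rng.standard_normal(N)
perp -= (perp @ T) / (T @ T) * T              # component orthogonal to the teacher
w = rho * T / np.linalg.norm(T) + np.sqrt(1 - rho**2) * perp / np.linalg.norm(perp)
eps = gen_error_mc(w, T)                      # should be close to arccos(0.5)/pi = 1/3
```

For large N the fields w · ξ and T · ξ become jointly Gaussian with correlation ρ, which is why the arccos law applies even for binary inputs.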
TABLE I. Different learning algorithms for a neuron with weights w online-learning a sample (σ_T^μ, ξ^μ), together with the colour code used throughout the paper.

Algorithm | F | Reference
Hebbian | 1 | [23,24]
Perceptron | θ(−σ_T^μ w · ξ^μ) | [25]
AdaTron | |w · ξ^μ| θ(−σ_T^μ w · ξ^μ) | [26]

Here, θ(·) is the Heaviside step function. References are given to where the algorithm first appeared in a discussion of (the statistical mechanics of) neural learning, to the best of our knowledge. A detailed discussion of the form of these algorithms is given in Sec. V.

V. SIMPLE EXAMPLES

Let us look at a toy model of a neuron with a single weight. The weight is initially in equilibrium in the harmonic potential V(w) = kw²/2. Without loss of generality, we can set k = β = D = 1, making energy, entropy, time and the weight dimensionless. At time t = 0, the learning rate is suddenly increased from 0 to a constant value ν_0.
The neuron learns using one of the three learning algorithms, each defined by a particular choice of F and summarised in Tab. I. The simplest non-trivial choice is F = 1, which is Hebbian learning [23,24]. For such an algorithm, each incoming sample changes the weight by an amount ∼ σ_T^μ ξ^μ. An obvious improvement on this simple algorithm is to change the weight only if the network would currently predict the wrong label for that input, which is achieved by choosing F = θ(−σ_T^μ w ξ^μ). This is the Perceptron algorithm [25]. A further refinement of this rule is achieved by choosing F = |w ξ^μ| θ(−σ_T^μ w ξ^μ), such that the change in the weights is proportional to the confidence of the neuron in its decision, measured by |w ξ^μ|. This is the AdaTron algorithm [26].
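The three choices of the scalar factor F translate directly into code. The helpers below (our naming) take the full weight and input vectors, so they also apply beyond the single-weight toy model:

```python
import numpy as np

def F_hebbian(w, xi, sigma_T):
    # Hebbian learning: every sample changes the weights.
    return 1.0

def F_perceptron(w, xi, sigma_T):
    # Perceptron: update only if the noise-free prediction sgn(w.xi) is wrong.
    return 1.0 if sigma_T * (w @ xi) < 0 else 0.0

def F_adatron(w, xi, sigma_T):
    # AdaTron: like the Perceptron, but weighted by the confidence |w.xi|.
    return abs(w @ xi) if sigma_T * (w @ xi) < 0 else 0.0
```

Any of these can be passed as the algorithm-specific factor in a numerical integration of the weight dynamics.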
The key insight for solving the dynamics in each case is that in one dimension, σ_T^μ ξ^μ = sgn(T) for all ξ^μ, which is readily verified. This has the appealing consequence that it is possible to rewrite the Langevin equations for all three learning rules without any mention of the inputs ξ^μ. Instead, learning a rule is equivalent to a quench of the potential of the weight from the simple harmonic form V(w) = w²/2 to a new T-dependent potential V_q(T, w), the exact form of which depends on the learning algorithm chosen. They read

V_q(T, w) = w²/2 − ν_0 sgn(T) w,
V_q(T, w) = w²/2 − ν_0 sgn(T) w θ(−sgn(T) w),
V_q(T, w) = (w²/2) [1 + ν_0 θ(−sgn(T) w)]

for Hebbian, Perceptron and AdaTron learning, respectively. The weight then relaxes to the new equilibrium distribution, which is given by the Boltzmann distribution. The heat dissipated by the weight during this isothermal relaxation is given by

ΔQ = ⟨V_q(T, w)⟩_0 − ⟨V_q(T, w)⟩_eq,

where ⟨·⟩_0 and ⟨·⟩_eq indicate averages with respect to the distributions of teacher and weight at t = 0 and after relaxation, respectively.
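The dissipated heat for the quench can be evaluated by simple numerical quadrature over the initial equilibrium and final Boltzmann distributions. The sketch below assumes the Hebbian quench potential V_q(T, w) = w²/2 − ν_0 sgn(T) w, which follows from F = 1 in one dimension (by symmetry we may set sgn(T) = +1); analytically this gives ΔQ = ν_0²/2, consistent with the ΔQ ∼ ν_0² scaling quoted in the text.

```python
import numpy as np

def heat_hebbian(nu0):
    """Dissipated heat for the 1D Hebbian quench V_q(w) = w^2/2 - nu0*w
    with k = beta = D = 1 and sgn(T) = +1. Analytic value: nu0^2/2."""
    w = np.linspace(-30.0, 30.0, 200001)
    dw = w[1] - w[0]
    Vq = 0.5 * w**2 - nu0 * w
    p0 = np.exp(-0.5 * w**2); p0 /= p0.sum() * dw     # initial equilibrium in w^2/2
    peq = np.exp(-Vq);        peq /= peq.sum() * dw   # Boltzmann state after the quench
    return (Vq * p0).sum() * dw - (Vq * peq).sum() * dw
```

The same quadrature applies to the Perceptron and AdaTron quench potentials after replacing `Vq`.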
We plot the efficiency of learning (17) for t → ∞ in Fig. 3 as a function of the learning rate. While the Hebbian algorithm yields the lowest generalisation error, its efficiency is quickly dominated by the dissipated heat, ΔQ ∼ ν_0², resulting in a low efficiency. The Perceptron algorithm is the most efficient for large ν_0 and also yields a better generalisation performance than the AdaTron algorithm (see the inset of Fig. 3).
We finally note that our inequality (17) is sharp for N = 1. Optimal efficiency η → 1 can for example be reached for Hebbian learning with a time-dependent learning rate ν(t), where we first linearly increase ν(t) from 0 to ν_0 over a period of time τ and then keep it at the final value ν_0:

ν(t) = ν_0 t/τ for t ≤ τ,   ν(t) = ν_0 for t > τ,

which is similar to an example discussed in [21]. In the limit of slow driving, τ → ∞, the dissipated heat ΔQ → 0. If additionally the learning rate ν_0 → ∞, the efficiency η → 1.

VI. LEARNING IN LARGE NETWORKS AND A SECOND BOUND
We just saw in Sec. V that for N = 1, ξ^μ sgn(T ξ^μ) = sgn(T), which simplifies the analysis because the inputs ξ^μ do not appear explicitly in the equation of motion of the weight. In higher dimensions, we have instead a learning force on the nth weight

f_n ∝ σ_T^μ ξ_n^μ = sgn(T_n ξ_n^μ + Σ_{m≠n} T_m ξ_m^μ) ξ_n^μ,

which will fluctuate between the desired value sgn(T_n) and −sgn(T_n) due to the second term inside the sign function, which is effectively a noise term corrupting the signal from the nth component of the teacher. So instead of relaxing to a new equilibrium as for N = 1, the weights relax to a steady state with a constant, positive rate of thermodynamic entropy production [2]. Our inequality (17) still applies to this process, but it is not very sharp anymore: I(σ_T : σ) ∼ 1 and ΔS(w_n) ∼ 1, but a steady state comes with a non-zero rate of heat dissipation, such that ΔQ ∼ t. This issue was not addressed in our previous work [21]. In this section, we derive a sharper bound using concepts from steady state thermodynamics [40]. We start with the explicit expression for the total entropy production of the weights of the network [2],

Ṡ^tot(t) = Σ_n ∫ dT dw j_n²(T, w, t) / [D p(T, w, t)].   (23)

In our problem, the learning rate ν(t) acts as a control parameter. For every value of ν(t), there is a well-defined steady state p^s(T, w; ν(t)) with ∂_t p^s(T, w; ν(t)) = 0 as in equilibrium, but where at least some of the currents j_n^s(T, w; ν(t)) ≠ 0, leading, inter alia, to a constant rate of total entropy production Ṡ^tot > 0 in the steady state. For the remainder of this section, we will use the shorthands

p ≡ p(T, w, t),   p^s ≡ p^s(T, w; ν(t)),   j_n ≡ j_n(T, w, t),   j_n^s ≡ j_n^s(T, w; ν(t))

to keep our notation slim. We can rewrite the total entropy production (23) by adding and subtracting (p/p^s) j_n^s in each current,

Ṡ^tot(t) = Σ_n ∫ dT dw [j_n − (p/p^s) j_n^s]² / (D p) + Σ_n ∫ dT dw p (j_n^s)² / [D (p^s)²] + cross-term.

The cross-term is identically zero, which can be seen by first rewriting the integrand and then integrating by parts, using the fact that the distributions vanish at infinity and applying the Fokker-Planck equation in the last step.
We can hence write [2,41]

Ṡ^tot(t) = Ṡ^na(t) + Ṡ^a(t),

where we have introduced the non-adiabatic entropy production

Ṡ^na(t) ≡ Σ_n ∫ dT dw [j_n − (p/p^s) j_n^s]² / (D p)

and the adiabatic entropy production

Ṡ^a(t) ≡ Σ_n ∫ dT dw p (j_n^s)² / [D (p^s)²].

Both entropy production rates are evidently non-negative. They each correspond to a possible mechanism that leads to the breaking of time symmetry and hence to dissipation: the application of non-equilibrium constraints (Ṡ^a) and the presence of driving (Ṡ^na). The non-adiabatic entropy production of the system can also be written as [41]

Ṡ^na(t) = −∫ dT dw ṗ(T, w, t) ln [p(T, w, t) / p^s(T, w; ν(t))].
It becomes identically zero once the steady state is reached, as is easily seen from its definition. By splitting the logarithm, we find the second law of steady-state thermodynamics [40][41][42],

Ṡ^na(t) = Ṡ(T, w, t) + Q̇_ex(t) ≥ 0,   (35)

where Ṡ(T, w, t) is the rate of change of the Shannon entropy of the distribution p(T, w, t) and we have identified the rate of excess heat [40][41][42]

Q̇_ex(t) ≡ ∫ dT dw ṗ(T, w, t) ln p^s(T, w; ν(t)),   (36)

which has no definite sign. Starting from the second law of steady-state thermodynamics (35), we can derive our second, sharper bound on the accuracy of learning,

I(σ_T : σ) ≤ ΔS(w_n) + ΔQ_n^ex,   (37)

which leads to the efficiency

η̃ ≡ I(σ_T : σ) / (ΔS(w_n) + ΔQ_n^ex) ≤ 1.   (38)

This is the second main result of our paper. It also holds at all times and applies to any learning algorithm that depends on the weights, w, and the samples (σ_T^{μ(t)}, ξ^{μ(t)}). We give the details of its derivation in Appendix C and show that our result applies to batch learning in Appendix D.

VII. ONLINE LEARNING IN LARGE NETWORKS
The number of afferent connections to a single neuron in a realistic network may be on the order of thousands [27], so it is sensible to analyse learning in the limit N → ∞. We will focus on online learning [39,43] using the algorithms introduced in Sec. V and summarised in Table I. We will assume that the samples, indexed by µ(t), change much faster than the weights relax. This assumption is central to virtually all of the existing literature on the analysis of online learning algorithms.

A. Scaling of the learning rate
We have noted that for N ≥ 2, the learning force on the nth weight will fluctuate between two values proportional to ±sgn(T_n), leading to a steady state with a constant rate of entropy production and a constant rate of heat dissipation. Let us make this statement more quantitative by looking at the learning force averaged over the inputs ξ in the limit N → ∞. Setting F = 1 for the moment for simplicity of notation, we have

f_n = ν(t) σ_T^μ ξ_n^μ = ν(t) sgn(T_n ξ_n^μ + ψ) ξ_n^μ,

where we have written μ = μ(t) to simplify our notation and we have introduced the noise term inside the sgn(·) function,

ψ ≡ Σ_{m≠n} T_m ξ_m^μ.

ψ is uncorrelated with T_n ξ_n^μ and, by the central limit theorem, normally distributed with zero mean and variance N − 1 ≈ N, since the teacher components and the inputs are uncorrelated. We are interested in the probability p that

sgn(T_n ξ_n^μ + ψ) = sgn(T_n ξ_n^μ),   (41)

i.e. the probability that the learning force points in the right direction despite the noise term ψ. This probability is found by integrating the binormal distribution p(T_n, ψ) = p(T_n)p(ψ) over the region where (41) holds for ξ_n^μ = 1 and ξ_n^μ = −1, respectively. We find

p = 1/2 + 1/(π√N) + O(N^{−3/2}),   (42)

where we have expanded p for large N [44]. Hence the larger the network, the smaller the information that σ_T = sgn(T · ξ) carries about a single component of the teacher network. This analysis suggests choosing a learning rate ν(t) ≡ ν̃(t)√N with a normalised learning rate ν̃(t) ∼ 1. This choice corresponds to the conventional scaling of time with the inverse of the network size in the machine learning literature [19,34], and amounts to nothing more than an increase in the number of samples shown to the network to compensate for the dilution of the signal.
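The dilution of the signal can be checked by direct Monte Carlo sampling. The sketch below (our own; the constant 1/π comes from evaluating the Gaussian integral for p at leading order) estimates the probability that the learning force points in the right direction:

```python
import numpy as np

def p_correct_sign(N, n_samples=200000, rng=None):
    """Probability that sgn(T_n*xi_n + psi) = sgn(T_n*xi_n) with
    psi = sum_{m != n} T_m*xi_m ~ N(0, N-1). Multiplying through by
    xi_n = +/-1 leaves the distributions unchanged, so we compare
    sgn(T_n + psi) with sgn(T_n) directly."""
    if rng is None:
        rng = np.random.default_rng(2)
    Tn = rng.standard_normal(n_samples)                    # the component being learnt
    psi = np.sqrt(N - 1) * rng.standard_normal(n_samples)  # CLT approximation of the noise
    return np.mean(np.sign(Tn + psi) == np.sign(Tn))
```

For N = 100 this gives p ≈ 0.532, in good agreement with 1/2 + 1/(π√N) ≈ 0.5318, and the excess over 1/2 shrinks as the network grows.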

B. Dynamics
First of all, we would like to compute the time-dependent generalisation error ε(t) for online learning with the three algorithms from Tab. I in a large network with dynamics given by (8). We keep the inverse temperature and the diffusion constant at β = D = 1 and again consider the case where the learning rate is quenched to a constant value ν̃ = ν̃_0 at t = 0, leaving us with two free parameters: ν̃_0 and the stiffness k of the harmonic potential, see Eq. (5).
We thus introduce two new parameters, which go back to the original proof of convergence of the Perceptron algorithm [45] and play an important role in the statistical mechanics of learning [19],

Q(t) ≡ w(t) · w(t)/N,   R(t) ≡ T · w(t)/N.

These quantities have the appealing property of being self-averaging in the thermodynamic limit, where they become the second moment of w_n and the covariance of (T_n, w_n), respectively. Using geometrical [19] or analytical [34] arguments, it can be shown that the generalisation error (15) becomes

ε = arccos(R/√Q)/π.

Hence it is sufficient to find and solve the equations of motion for Q and R in order to solve the dynamics of the generalisation error. We can indeed derive such equations directly from the Langevin equation (8) for the weights (see Appendix E). They read

Ṙ = −kR + ν̃ F |y|,
Q̇ = −2kQ + 2ν̃ F sgn(y) x + ν̃² F² + 2D,

where we have introduced the auxiliary random variables

x ≡ w · ξ/√N,   y ≡ T · ξ/√N.

Since we are assuming that the inputs change on a timescale much faster than the relaxation time of the weights, we need to average these equations over the inputs ξ. This average is simplified by noting that the inputs enter the equations only via x and y. Thus the average over the inputs can be replaced by an average over x and y, which are binormally distributed by the central limit theorem, with moments

⟨x⟩_ξ = ⟨y⟩_ξ = 0,   ⟨x²⟩_ξ = Q,   ⟨y²⟩_ξ = 1,   ⟨xy⟩_ξ = R.

The averages ⟨·⟩_ξ can be performed analytically for all three learning algorithms and their particular choices of F, see Tab. I. We give the results in Appendix E. This procedure eventually yields a set of closed equations for R and Q for each learning algorithm, which can be solved numerically. Fig. 4 shows the generalisation error obtained from numerical simulations of the Langevin equation (8) for a network with N = 10000 and ν̃_0 = β = D = 1, and compares it to the result of the analytical calculation just discussed. First, we note that ε is a self-averaging quantity, i.e. each simulation run yields the same ε over time, up to small fluctuations which scale inversely with N. Furthermore, the dynamics of ε are well described by our analytical result.
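For Hebbian learning (F = 1) the input averages close immediately, since ⟨|y|⟩ = √(2/π), ⟨sgn(y) x⟩ = √(2/π) R and ⟨F²⟩ = 1 for binormal (x, y). The sketch below integrates the resulting pair of ODEs, Ṙ = −kR + ν̃√(2/π) and Q̇ = −2kQ + 2ν̃√(2/π)R + ν̃² + 2D; this is our reading of the averaged equations, not a verbatim transcription of Appendix E.

```python
import numpy as np

def hebbian_eps(nu, k=1.0, D=1.0, dt=1e-3, t_max=20.0):
    """Integrate the input-averaged (Q, R) equations for Hebbian learning
    (assumed form, see lead-in) and return the steady-state generalisation
    error eps = arccos(R/sqrt(Q))/pi."""
    c = np.sqrt(2.0 / np.pi)          # <|y|> for a standard Gaussian field y
    Q, R = D / k, 0.0                 # weights start in equilibrium, uncorrelated with T
    for _ in range(int(t_max / dt)):  # simple forward-Euler integration
        dR = -k * R + nu * c
        dQ = -2.0 * k * Q + 2.0 * nu * c * R + nu**2 + 2.0 * D
        R += dR * dt
        Q += dQ * dt
    return np.arccos(R / np.sqrt(Q)) / np.pi
```

Within this sketch the steady-state error decreases monotonically with the normalised learning rate, consistent with the Hebbian curves discussed below.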
While Hebbian learning takes the longest time to converge, it is, perhaps surprisingly, the most robust algorithm in the presence of noise, consistently yielding the lowest generalisation errors. Indeed, for online learning with k = 0 and no noise, it is well established that ε decays more slowly for the Perceptron than for the Hebbian algorithm; on the other hand, Hebbian learning fails miserably for non-uniform input distributions [19]. The performance of the Perceptron is significantly improved by a choice of time-dependent learning rates in a process called annealing. This is beyond the scope of this paper, but see [46] for a detailed discussion of the impact of time-dependent learning rates on the convergence of learning algorithms.
A remarkable property of AdaTron learning is demonstrated in Fig. 5, where we plot the final, steady-state generalisation error against the normalised learning rate ν̃_0. While Hebbian and Perceptron learning (green and blue, resp.) show the expected decrease of ε with ν̃_0, there is a sharp increase of ε for AdaTron learning at ν̃_c = 3. Indeed, for large learning rates, the AdaTron algorithm fails completely. This sensitivity of the algorithm to the value of the learning rate is well-known in the noise-free case without the potential V(w) [19] and persists in our model with noise, most markedly for low potential stiffness k. A detailed analysis of the first-order system (Q, R) reveals that the fixed point with ρ = R/√Q close to 1 becomes unstable as the learning rate crosses ν̃_c = 3, and a second, stable fixed point emerges with ε → 1/2 (see Appendix F).

C. Efficiency of learning
We can also derive an ordinary differential equation for the ensemble average of the excess heat (36) in terms of Q and R, with the details to be found in Appendix G. Since the components of the teacher and the weights are normally distributed, the change in the Shannon entropy of the marginalised distribution of a weight, ΔS(w_n), can be expressed in terms of Q alone, giving us all the information necessary to compute the efficiency of learning η̃, Eq. (38). We plot the efficiency η̃ in the thermodynamic limit in Fig. 6 against the normalised learning rate ν̃_0 and the potential stiffness k, which are the only remaining free parameters of this model.
The efficiency of Hebbian learning is roughly symmetric in k around k = 1, while Perceptron and AdaTron learning display highly asymmetric patterns. However, we find that despite the different patterns, the maximum efficiency for all three algorithms is η̃ ≈ 0.2. We can dig a little deeper by first noting that since p(T_n, w_n) is normally distributed for the learning algorithms we have considered, both the mutual information I(T_n : w_n) and the mutual information between the true and the predicted label for an arbitrary input, I(σ_T : σ), can be written as functions of only the correlation between T_n and w_n, ρ ≡ R/√Q. Expanding around ρ = 0 yields

I(σ_T : σ)/I(T_n : w_n) = [ln 2 − S(arccos(ρ)/π)] / [−(1/2) ln(1 − ρ²)] = 4/π² + O(ρ²) ≈ 0.4,   (48)

which turns out to be a good approximation for ρ ≲ 0.9. So at maximum efficiency,

I(T_n : w_n) / [ΔS(w_n) + ΔQ_n^ex] ≈ 1/2   (49)

for all three algorithms. The last plot in Fig. 6 shows that the bifurcation we discussed for AdaTron learning in Sec. VII B leads to a decaying efficiency η̃ → 0, since I(σ_T : σ) → 0. This effect is smoothed out with increasing potential stiffness.
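The ratio in Eq. (48) is easy to verify numerically. The helper below (our own) evaluates it for a given correlation ρ and confirms the limiting value 4/π² ≈ 0.405 as ρ → 0:

```python
import numpy as np

def info_ratio(rho):
    """Ratio I(sigma_T:sigma) / I(T_n:w_n)
    = [ln2 - S(arccos(rho)/pi)] / [-(1/2)*ln(1 - rho^2)]."""
    eps = np.arccos(rho) / np.pi                           # generalisation error
    S = -eps * np.log(eps) - (1 - eps) * np.log(1 - eps)   # binary entropy in nats
    return (np.log(2) - S) / (-0.5 * np.log(1 - rho**2))
```

Even at ρ = 0.5 the ratio deviates from 4/π² by only a few per cent, illustrating why the expansion remains a good approximation well beyond ρ ≈ 0.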
Let us finally note that, in this model, the rate of heat dissipation of a single weight diverges in the thermodynamic limit, Q̇_n → ∞ as N → ∞. This is readily understood from a physical point of view, since the weights experience a large force, f ∼ √N, which fluctuates very quickly. This observation reinforces the importance of our second bound involving the excess heat (36), which does not diverge even in the limit N → ∞.

VIII. CONCLUDING PERSPECTIVES
We have analysed the learning of linearly separable rules by neural networks as a model for the thermodynamics of generalisation. Using stochastic thermodynamics and information theory, we have shown that the accuracy with which the neuron is able to apply the rule to previously unseen inputs is constrained by the free energy dissipated by a single weight during the learning process. Our results hold for all learning algorithms that have access to all the weights, w, and either a set of samples or a succession of samples (σ_T^{µ(t)}, ξ^{µ(t)}), for batch or online learning, respectively. We have furthermore given a detailed analysis of both the dynamics and the thermodynamics of online learning in large neural networks with noisy dynamics and weights constrained by an external potential.
It is worthwhile to revisit the results of our earlier work [21] in the light of these results. In that paper, we studied a different learning problem, namely the learning of P mappings ξ^µ → σ_T^µ from fixed inputs ξ^µ, µ = 1, …, P, to their true labels σ_T^µ. The true labels were drawn at random for each input, and were hence uncorrelated with the inputs and with each other. Hence there is no generalisation in this problem: if the true label of every input is determined by pure chance, the mappings {ξ^µ → σ_T^µ}_{µ=1}^P carry no information about the label of a previously unseen input. Instead, the challenge is to find a set of weights that reproduces the mappings faithfully.
The two problems are, however, related in the following way. It is possible to construct a teacher T that reproduces all the mappings ξ^µ → σ_T^µ via σ_T^µ = sgn(T · ξ^µ) if and only if the number of mappings P is less than the capacity of the network. This capacity is usually defined in the thermodynamic limit, where the number of weights N → ∞ and we are interested in the relative number of inputs α_c ≡ P_c/N ∼ 1 up to which there exists a teacher T that reproduces all the true labels via σ_T = sgn(T · ξ) with probability 1 [20]. Its numerical value can be derived analytically from replica calculations [47], but it was first understood using geometrical arguments [48,49] (see also [20] for a pedagogical discussion).
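The geometrical argument can be made concrete with Cover's counting function, C(P, N) = 2 Σ_{k=0}^{N−1} C(P−1, k), which counts the linearly separable dichotomies of P points in general position in N dimensions. A short sketch (ours, not part of the paper) shows the separable fraction pinned at exactly 1/2 at α = P/N = 2 and collapsing beyond it as N grows, reflecting the capacity α_c = 2 of the simple perceptron:

```python
from math import comb

def separable_fraction(P, N):
    """Fraction of the 2^P dichotomies of P points in general position
    in N dimensions that are linearly separable (Cover's counting function)."""
    C = 2 * sum(comb(P - 1, k) for k in range(N))
    return C / 2**P

# At P = 2N the fraction is exactly 1/2 for every N; for alpha = P/N > 2
# it vanishes rapidly with increasing N.
for N in (10, 50, 100):
    print(N, separable_fraction(2 * N, N), separable_fraction(3 * N, N))
```

The sharp drop of the fraction around α = 2 for large N is the phase-transition picture behind the capacity.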
If it is possible to construct such a teacher T, the rule implicitly defined by the mappings is realisable and can, at least in principle, be learned. Even in that case, however, the issue remains for the scenario considered in [21] that the number of samples from which the neuron learns is limited and might not be sufficient to learn the underlying "rule" effectively. On the other hand, learning the mappings {ξ^µ → σ_T^µ}_{µ=1}^P is still a meaningful task even if no network can reproduce them all, provided one is willing to accept a certain error in the predictions of the network.
Neurons with the single-layer architecture described here are generally limited to implementing linearly separable Boolean functions ξ → ±1. A natural generalisation is to consider networks with several layers, where the output of the neurons in one layer is the input for the neurons in the next layer [19,20]. The capabilities of networks with intermediate layers are remarkable: a network of binary neurons (σ = ±1) with just a single intermediate layer can implement any Boolean function of its inputs [50], while a network of continuous neurons (σ ∈ R, e.g. σ = tanh(A)) is capable of approximating any function of its inputs to any required accuracy [51,52]. However, the analysis of these networks is much more involved than that of the single-layer feedforward network discussed here and is left for future work.
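The gap between single-layer and multilayer networks is illustrated by XOR, the smallest Boolean function that is not linearly separable. A hand-wired network with a single hidden layer of two threshold units implements it (a minimal sketch of ours, not from the paper; the weights are chosen by hand, not learned):

```python
def step(a):
    """Binary threshold neuron: fires (1) iff its input is positive."""
    return 1 if a > 0 else 0

def xor_net(x1, x2):
    """Two hidden threshold units (OR and AND) feed one output unit."""
    h_or = step(x1 + x2 - 0.5)    # fires if at least one input is on
    h_and = step(x1 + x2 - 1.5)   # fires only if both inputs are on
    return step(h_or - h_and - 0.5)  # fires iff OR but not AND: XOR

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, xor_net(x1, x2))
```

No single threshold unit can reproduce this truth table, since the four input points cannot be split by one hyperplane.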
Besides the generalisation to networks with intermediate layers of neurons, this work opens up numerous other avenues for further research. It would be intriguing to consider the generalisation of our model to multi-valued teacher functions, e.g. for a network learning to classify digits. The teacher could also be made subject to noise in its outputs σ_T, in its components T_n, or both. Another intriguing learning problem is that of a changing environment, modelled by a drifting teacher [53,54]. Designing a learning algorithm that optimises the thermodynamic efficiency looks like a serious challenge. More broadly, studying the thermodynamic costs of learning to generalise might form a suitable basis to consider the thermodynamics of decision-making [55].
We have considered the limitations on computation in neural networks that are a consequence of the second law of thermodynamics and account for the free energy costs of computations. However, there are further limiting factors on the ability to compute. One is the availability of data: for a given inference problem, such as inferring a teacher T from a number of samples (σ T , ξ), there is a minimum amount of data that is required to make any prediction that is better than simply flipping a coin. A second limiting factor is time: ideally, we would like to have an algorithm whose completion time scales polynomially with the system size, rather than exponentially. It has recently become clear that these two constraints can be understood using statistical mechanics in terms of phase transitions [56]. It will be interesting to see whether the thermodynamic limits that we have derived in this paper fit into this picture, and if so, where they can be found.

ACKNOWLEDGMENTS
We thank David Hartich for valuable discussions and careful reading of the manuscript and Hans-Günther Döbereiner for stimulating discussions of the science of decisions.

APPENDICES
The following appendices give a detailed proof of our main results, inequalities (17) and (37), in Appendices A to C. Appendix D discusses how our results apply to batch learning. Detailed calculations for the learning dynamics in the thermodynamic limit are given in Appendices E to G.
Appendix A: Derivation of inequality (17)

Our first main result, Eq. (17), can be derived from the second law of stochastic thermodynamics [2], which states that the rate of total entropy production of the full system is non-negative, Ṡ_tot ≥ 0. We will drop the explicit time argument in the following discussion, but emphasise that since the distribution p(T, w, t) is time-dependent, so of course are all the quantities derived from it. We discussed in Sec. III that the total probability current of the system decomposes into a separate current due to the fluctuations in every subsystem w_n, see Eq. (9). As a consequence, the rate of total entropy production Ṡ_tot of the network can be split into separate, non-negative contributions Ṡ_n^tot due to the fluctuations of every subsystem w_n, each of which can further be split into two separate contributions, so that we have

Ṡ_n^tot = Ṡ_n(T, w) + Ṡ_n^m ≥ 0.

The first part is the rate of change of the Shannon entropy of the full distribution p(T, w, t) due to the dynamics of w_n and reads

Ṡ_n(T, w) = −∫ dT dw j_n(T, w) ∂_n ln p(T, w), (A3)

while Ṡ_n^m is the rate of thermodynamic entropy production in the medium by the nth weight, with the temperature set to unity. Writing p(T, w) = p(w_n) p(T, w_n̄ | w_n), where w_n̄ ≡ (w_1, …, w_{n−1}, w_{n+1}, …, w_N), we have

Ṡ_n(T, w) = ∂_t S(w_n) − l_n(w_n : T, w_n̄), (A5)

where ∂_t S(w_n) is the change of the Shannon entropy (11) of the marginalised distribution p(w_n) = ∫ dT dw_n̄ p(T, w) due to the dynamics of the nth weight. Throughout this paper, we use the dot notation to denote time-dependent rates, as opposed to the time derivative of state functions like the Shannon entropy. The second term in Eq. (A5) is an information-theoretic piece which gives the rate at which the dynamics of w_n changes its correlations with the other degrees of freedom in the system, namely the other weights w_n̄ and the teacher T, as measured by the mutual information I(w_n : T, w_n̄).
This quantity has been introduced as the (thermodynamic) learning rate [36,57] (not to be confused with the learning rate ν(t) introduced in Eq. (7)) or the "information flow" [37,58]. Its explicit form is

l_n(w_n : T, w_n̄) = ∫ dT dw j_n(T, w, t) ∂_n ln p(T, w_n̄ | w_n). (A6)

We finally note that for the isothermal environment that we assume in this paper, the rate of thermodynamic entropy production equals the rate of heat dissipation into the environment, Ṡ_n^m = Q̇_n, where we remind ourselves that we have set the temperature to unity.
Putting it all together, we can formulate the second law for a subsystem,

∂_t S(w_n) + Q̇_n ≥ l_n(w_n : T, w_n̄), (A7)

which is the starting point of our derivation.
Integrating the N second laws (A7) with respect to time from t = 0 to t > 0 yields

∆S(w_n) + ∆Q_n ≥ ∫_0^t dt' l_n(w_n : T, w_n̄), (A8)

where we write ∆S(w_n) and ∆Q_n to denote the total change in Shannon entropy of the distribution p(w_n) and the total heat dissipated by the dynamics of the nth weight up to time t, respectively. We can interpret the right-hand side of (A8) by computing the time derivative of the mutual information I(T : w),

∂_t I(T : w) = ∫ dT dw [∂_t p(T, w, t)] ln [p(T, w, t) / (p(T, t) p(w, t))]. (A9)

Using the Fokker-Planck Eq. (9) and integrating by parts, we find an expression in which, in the penultimate line, we recover the integrand on the right-hand side of (A8). Integrating the term in brackets in Eq. (A13) with respect to time yields, for all times t > 0, an inequality in which we have used that at time t = 0 all the weights are independent of each other, and hence S(w) = Σ_n S(w_n). The inequality follows from the fact that for any set of random variables, Σ_n S(w_n) ≥ S(w) [38]. Using this inequality, we can deduce from (A13) a chain of inequalities that follow again from the non-negativity of mutual information and the fact that all the weights and all the components of the teacher are statistically identical. Using the latter argument for the total entropy production and inserting our last result into the integrated form of the second law (A8), we find that

∆S(w_n) + ∆Q_n ≥ I(w_n : T_n). (A21)

Finally, we need to show that the mutual information between the nth components of the weight and teacher vectors is an upper bound on the mutual information between the true and the predicted label of any sample ξ,

I(w_n : T_n) ≥ I(σ_T : σ). (A22)

Our strategy will be to show that the inequality (A22) holds even if the neuron predicts a label deterministically, σ = sgn(w · ξ). The generalisation error ε is then the lowest possible for a given noise level in the weights, and I(σ_T : σ) = ln 2 − S(ε). Let us first consider a network with N = 1.
We start by noting that for arbitrary random variables X and Y and an arbitrary function F(Y), we can always write p(x, y, F(y)) = p(x) p(y|x) p(F(y)|y). We thus identify X → Y → F(Y) as a Markov chain and find

I(X : Y) ≥ I(X : F(Y)), (A23)

using the data processing inequality [38]. For N = 1, we can apply this result twice to show that I(w_1 : T_1) ≥ I(σ_T : σ), as required.
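The data processing inequality used here can be illustrated with a small discrete example (our sketch, not from the paper): coarse-graining Y into a function F(Y) can only destroy information about X.

```python
import math
from collections import defaultdict

def mutual_info(joint):
    """I(X:Y) in nats from a dict {(x, y): probability}."""
    px, py = defaultdict(float), defaultdict(float)
    for (x, y), p in joint.items():
        px[x] += p
        py[y] += p
    return sum(p * math.log(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)

# X in {0,1}, Y in {0,1,2,3} correlated with X; F(Y) = Y mod 2 coarse-grains Y,
# forming the Markov chain X -> Y -> F(Y).
joint_xy = {(0, 0): 0.3, (0, 1): 0.1, (0, 2): 0.05, (0, 3): 0.05,
            (1, 0): 0.05, (1, 1): 0.05, (1, 2): 0.1, (1, 3): 0.3}
joint_xf = defaultdict(float)
for (x, y), p in joint_xy.items():
    joint_xf[(x, y % 2)] += p

print(mutual_info(joint_xy), mutual_info(dict(joint_xf)))
```

For this table I(X : Y) ≈ 0.22 nats while I(X : F(Y)) ≈ 0.08 nats, consistent with (A23).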
In the thermodynamic limit N → ∞, we use the auxiliary variables x ≡ w · ξ/√N and y ≡ T · ξ/√N. We then have from (A23), since σ_T and σ are functions of x and y, Eq. (4) and Eq. (3), respectively,

I(x : y) ≥ I(σ_T : σ).

We can now average x and y over the inputs (1) using ⟨ξ_n⟩_ξ = 0 and ⟨ξ_n ξ_m⟩_ξ = δ_nm. By the central limit theorem, x and y are then distributed according to a bivariate Gaussian distribution with correlation [38]

ρ = ⟨xy⟩/√(⟨x²⟩⟨y²⟩) = R/√Q.

This is a crucial step in our derivation, since it allows us to connect the statistics of teacher and weight in one dimension to the statistics of the true and predicted labels, which are functions of the vectors T and w. The mutual information of two variables with a bivariate Gaussian distribution is a function of their correlation alone [38], I_G = −(1/2) ln(1 − ρ²), which would also be the mutual information I(w_n : T_n) if w_n and T_n were jointly normally distributed, which they are not necessarily. However, we can show that I_G(w_n : T_n) is a lower bound on I(w_n : T_n) using the maximum entropy principle. This is a prescription for finding the probability distribution that maximises the Shannon entropy given a number of constraints on the distribution, usually in the form of fixed moments. We briefly review this concept in Appendix B. The crucial point here is that a Gaussian distribution is the maximum entropy distribution for a given covariance matrix. We will denote maximum entropy quantities with an asterisk, e.g. p*. The mutual information I(w_n : T_n) can be expressed as the relative entropy or Kullback-Leibler divergence between the joint distribution p(w_n, T_n) and the factorised distribution p(T_n) p(w_n) [38],

I(w_n : T_n) ≡ ∫ dT_n dw_n p(T_n, w_n) ln [p(T_n, w_n) / (p(T_n) p(w_n))]. (A29)

Appendix D: Batch learning

Our two main results, inequalities (17) and (37), apply to batch learning as well. This is because in our derivation, we only used the fact that the teacher enters the force on the weights, albeit indirectly.
We did not have to specify the exact form of the learning force that introduces the correlations between the weight and the teacher. Hence it makes no difference in the derivation of the inequalities whether the learning force is computed from just a single sample or averaged over a set of samples.
Appendix E: Solving the learning dynamics in the thermodynamic limit

Here we give a detailed derivation of the equations of motion for the order parameters Q and R introduced in Sec. VII in the thermodynamic limit N → ∞. These are most easily derived by rewriting the Langevin equations for the weights, Eq. (8), as Itô stochastic differential equations [60],

dw(t) = −kw(t) dt + ν(t) ξ^{µ(t)} σ_T^{µ(t)} F(|w(t)|, w(t) · ξ^{µ(t)}, T · ξ^{µ(t)}) dt + dW(t). (E1)

The random Wiener process has components dW_n(t) which are normally distributed with mean 0 and variance 2D dt = 2 dt in our choice of units. It is related to the noise term of the Langevin equation via dW_n(t) = ∫_t^{t+dt} dt' ζ_n(t'); see [60] for more details. All other symbols have the same meaning as discussed before Eq. (8). We assume that the inputs entering the equation change on a timescale much faster than the relaxation time of the weights. Hence it is only the statistical properties of the inputs that determine the dynamics of w in the thermodynamic limit. We can thus average over the inputs, making the detailed dynamics of µ(t) unimportant. We can derive the equations of motion for the means of Q ≡ w · w/N and R ≡ T · w/N by expanding to second order in dw and keeping terms of order dt,

dQ ≡ Q(w + dw) − Q(w), (E2)

where, contrary to ordinary calculus, the term dw · dw contributes two terms, one from the Wiener process and one because ξ · ξ ≈ N ≈ 1/dt. We have also used the scaling ν(t) = √N ν̂(t) that we discussed in Sec. VII A of the main text. In this section, we will denote by ⟨·⟩ the average with respect to the distribution of inputs (1).
Likewise, we have

dR = T · dw/N = −kR dt + ν̂(t) sgn(T · ξ) (T · ξ/√N) F dt. (E3)

We can simplify the averages over the inputs, and thus these equations, by noting that the inputs only enter through the products x ≡ T · ξ/√N and y ≡ w · ξ/√N, such that

dQ = 2(1 − kQ) dt + 2ν̂(t) ⟨sgn(x) y F(x, y)⟩ dt + ν̂(t)² ⟨F(x, y)²⟩ dt, (E5)

dR = −kR dt + ν̂(t) ⟨sgn(x) x F(x, y)⟩ dt. (E6)

The crucial point in computing the averages ⟨·⟩ is now to realise that p(x, y) is a bivariate Gaussian distribution due to the central limit theorem, with moments

⟨x⟩ = ⟨y⟩ = 0, ⟨x²⟩ = 1, ⟨y²⟩ = Q, ⟨xy⟩ = (1/N) Σ_{n,m} T_m w_n ⟨ξ_n ξ_m⟩ = T · w/N = R,

where we have used ⟨ξ_n⟩ = 0 and ⟨ξ_n ξ_m⟩ = δ_nm from Eq. (1). Here we only give the results of these integrals for the different learning algorithms for completeness; see e.g. [34] for details on how to perform these integrals [61].

Figure: Critical learning rate for AdaTron learning. The first three plots, from left to right, are vector plots in phase space for the first-order system (Q̇, Ṡ), Eq. (F2), for AdaTron learning in the thermodynamic limit N → ∞ with constant normalised learning rate ν̂(t) = ν̂_0 and k = 0.01. For ν̂_0 ≤ 2, there is an attracting state with S → 1. As we increase the learning rate ν̂_0, another attracting state appears with S → 1/2 and hence ε = 1/2. In the limit k ≪ 1, this behaviour can be understood from the bifurcation diagram of the closed, single equation for S, Eq. (F2b), shown on the far right, where stable (unstable) fixed points are indicated by straight (dashed) lines. Parameters: β = D = 1.
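For Hebbian learning, the force function is simply F = 1 (the standard Hebbian choice, an assumption of this sketch), so the Gaussian averages in (E5) and (E6) evaluate in closed form: ⟨sgn(x) y⟩ = √(2/π) R, ⟨sgn(x) x⟩ = √(2/π) and ⟨F²⟩ = 1. The sketch below (ours, not part of the paper; constant ν̂ and k) integrates the resulting ODEs with the Euler method and checks the result against the analytic fixed point:

```python
import math

nu, k = 1.0, 1.0          # constant normalised learning rate and stiffness
c = math.sqrt(2 / math.pi)

# Euler integration of dQ/dt = 2(1 - k Q) + 2 nu c R + nu^2,
#                      dR/dt = -k R + nu c   (Hebbian force F = 1)
Q, R, dt = 1.0, 0.0, 1e-3
for _ in range(int(20 / dt)):
    dQ = 2 * (1 - k * Q) + 2 * nu * c * R + nu**2
    dR = -k * R + nu * c
    Q += dQ * dt
    R += dR * dt

# Analytic fixed point of the same ODEs
R_star = nu * c / k
Q_star = (2 + nu**2 + 2 * nu * c * R_star) / (2 * k)
rho = R / math.sqrt(Q)           # teacher-weight correlation
eps = math.acos(rho) / math.pi   # generalisation error
print(Q, Q_star, R, R_star, eps)
```

The stationary correlation ρ = R*/√Q* then yields a finite generalisation error ε = arccos(ρ)/π, illustrating how the confining potential (k > 0) prevents perfect learning.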
where ⟨·⟩ now indicates an average over the distribution p(x, y) as described in Appendix E, and we remind ourselves that we chose units in which D = 1.