Frozen algorithms: how the brain's wiring facilitates learning

Highlights
• Theory tells us why learning is hard to implement in the brain using local plasticity mechanisms.
• These difficulties can be overcome by feedback pathways and redundant connectivity.
• Connectomics reveals such connectivity motifs in learning centers across species.


Introduction
The brain learns by modifying itself in response to information from the external world. Since the earliest experimental demonstration of how learning can be implemented by synaptic modification [1,2], experimental attention has focused on mechanisms that link experience-dependent signals to modifications in synaptic strength. Over many decades of research we have witnessed an explosion of different synaptic learning rules, from timing-dependent mechanisms to neuromodulator-induced plasticity, as well as combinations of both [3,4].
Paradoxically, we are still unable to connect these many kinds of synaptic modification mechanisms to high-level biological learning at the circuit or behavioural level. Basic theory from optimization sheds light on the problem of connecting biological learning to low-level plasticity rules [5,6,7,8,9,10,11]. A key insight is the so-called credit assignment problem: in a large network with many interconnected cells, there is a complex relationship between a change in any given connection and a change in overall behaviour. This difficulty means that each synapse may require detailed information in order to modify in the correct way during learning [12].
Recent progress in AI shows that conceptually simple optimization processes can allow artificial neural networks to achieve superhuman performance on cognitive tasks. The high-level design of such AI systems has informed systems-level hypotheses on biological learning (e.g. [13]). However, the training step requires exchange across the network of vast amounts of information, depending upon the state of each synapse.
To a greater extent than artificial neural networks, biological neural circuits face limiting constraints on the speed, accuracy, and quality of information that can be exchanged between synapses and from the external world. This limits the set of plausible learning mechanisms, and, we argue here, imposes heavy constraints on circuit architecture that are beginning to be revealed in recent connectomics work [14,15,16].
We will outline key theoretical aspects of learning from an optimization perspective. This will give a concrete framework for understanding why learning is hard computationally, and how this problem can be partially solved with specific circuit connectivity. This topic does not address all aspects of learning, nor does it address all ways in which circuit architecture aids or hinders learning. We will cite recent experimental connectomics work that provides striking examples of circuit motifs that make sense in light of concrete theory. We will also review recent theoretical work that shows how circuit architecture can aid learning, providing hypotheses for future experimental studies.
Theoretical background: why is learning hard in large neural circuits?
Consider a neural circuit with N synapses. Abstractly, learning minimises some measure of behavioural error (a 'loss function') that depends on all N synapses. Any change in the synapses can be represented as an N-dimensional vector, endowing collective synaptic changes with a direction and a magnitude. The credit assignment problem boils down to three general factors that determine the difficulty of learning:

Factor 1: The sensitivity of behavioural performance to 'good' synaptic changes.

Factor 2: The robustness of behavioural performance to 'bad' synaptic changes.

Factor 3: The relative proportions of 'good' and 'bad' changes induced by a synaptic learning rule.
What do we mean by 'good' and 'bad' changes? At any point in time during learning, there is a direction of maximal sensitivity of the loss function to changes in the synapses. This corresponds to the gradient of the loss function, which defines the 'good' direction of change.
Artificial learning algorithms compute gradients of loss functions directly to solve the credit assignment problem.
In optimization theory, this is known as a 'first-order' method [17], which explicitly computes the sensitivity of a circuit-level property to a change in synaptic weights, before adjusting all weights collectively to steer the network to a desired state. However, computing the full gradient generally requires knowledge of all synaptic weight values. While possible on a computer, there is no feasible means of exchanging this amount of weight information explicitly in the brain [12].
Biologically, this means that synaptic changes may only correlate with the gradient, with some proportion of synaptic changes that are not related to the gradient. These correspond to 'bad' changes.
An example of a class of biologically plausible learning rules that suffer from a problematic proportion of bad changes is as follows. Apply a random perturbation to all N synapses, then retain the perturbation or reverse it depending on the observed effect on behavioural performance. It is easy to see how a biological synapse might implement such a rule: all it requires is access to the recent error history, for example by integrating a global neuromodulatory signal over time. Learning rules of this form are known as 'zero-order', as they do not directly approximate the gradient. Examples include the MIT rule [18], the REINFORCE algorithm [19], and node perturbation [19,20]. Biologically plausible implementations have been proposed as the underlying mechanisms for birdsong learning [21] and cerebellar learning [22].
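As a concrete illustration (a minimal sketch of the general idea, not a model from the cited work), such a keep-or-reverse perturbation rule takes only a few lines; the `loss` function here is a stand-in for whatever behavioural error a global neuromodulatory signal might broadcast:

```python
import numpy as np

def zero_order_step(w, loss, rng, sigma=0.02):
    """One update of a keep-or-reverse perturbation ('zero-order') rule.

    All N synapses are perturbed at once; the only feedback used is
    the scalar change in behavioural error, of the kind a global
    neuromodulatory signal could provide.
    """
    dw = sigma * rng.standard_normal(w.shape)
    if loss(w + dw) < loss(w):
        return w + dw   # error decreased: keep the perturbation
    return w - dw       # error increased: reverse it

# Toy behavioural error: distance of N = 5 weights from a target.
rng = np.random.default_rng(0)
target = np.ones(5)
loss = lambda w: float(np.sum((w - target) ** 2))

w = np.zeros(5)
for _ in range(2000):
    w = zero_order_step(w, loss, rng)

print(loss(w))  # small residual error for this small N
```

For N = 5 the rule converges quickly; the point of the next section is that its efficiency collapses as N grows.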
Unfortunately, although biologically plausible, such learning rules do not work well when they operate on large populations of synapses. A calculation outlined in Box 1 shows that the correlation of such learning rules with the gradient drops in proportion to $1/\sqrt{N}$, where N is the number of synapses. Put another way, biologically plausible zero-order rules only provide feedback on a single direction in weight space, while the full gradient provides information on all directions. Intuitively, we can see that as N increases, the chance that a random direction in weight space aligns with the gradient drops. For zero-order rules this means that the proportion of bad synaptic changes is likely to grow with network size.

Box 1 Mathematical analysis of learning

Denote behavioural performance in a task by a loss function $L[\mathbf{w}]$, where $\mathbf{w}$ is the N-dimensional vector of synaptic weights in a network. Learning involves descending this loss function. 'Gradient-based' learning rules explicitly form an approximation of the negative gradient $-\nabla L[\mathbf{w}(t)]$ and adjust synaptic weights in this direction during learning. However, any learning rule, even a 'gradient-free' one, must adjust synapses in a direction that anticorrelates with $\nabla L[\mathbf{w}]$ on average.

Consider a small time interval $\delta t$, in which the weights change by an amount $\delta\mathbf{w}$. To second order, the change in behavioural error over the interval is

$$L[\mathbf{w} + \delta\mathbf{w}] - L[\mathbf{w}] \approx \delta\mathbf{w}^\top \nabla L[\mathbf{w}] + \tfrac{1}{2}\,\delta\mathbf{w}^\top \nabla^2 L[\mathbf{w}]\,\delta\mathbf{w}, \qquad (1)$$

where $\nabla L[\mathbf{w}]$ is the gradient of the loss function, and $\nabla^2 L[\mathbf{w}]$ is the Hessian (second derivative). If learning occurs in $\delta t$, the change in error in (1) will be negative.

First, consider the average effect of a completely random weight change $\delta\mathbf{w}$, with independent components of variance $\sigma^2/N$ (so that the expected perturbation magnitude does not grow with network size). A random perturbation is uncorrelated, in expectation, with the gradient, so the first term on the right-hand side of (1) vanishes and

$$\mathbb{E}\big[L[\mathbf{w}+\delta\mathbf{w}] - L[\mathbf{w}]\big] \approx \frac{\sigma^2}{2}\,\frac{\mathrm{Tr}\big(\nabla^2 L[\mathbf{w}]\big)}{N},$$

where $\mathrm{Tr}(\cdot)$ is the trace of a matrix (the sum of its eigenvalues). Therefore $\mathrm{Tr}(\nabla^2 L[\mathbf{w}])/N$ quantifies the robustness of the learning system to random perturbations (Factor 2 in the main text). In a partially trained network, random perturbations are expected to negatively affect behavioural performance, that is, $\mathrm{Tr}(\nabla^2 L[\mathbf{w}]) > 0$.

Now consider a perturbation-based (zero-order) learning rule. Let $\delta\mathbf{w}$ be a random Gaussian perturbation that is kept or reversed so that it anticorrelates with the gradient. In this case, $\delta\mathbf{w}^\top \nabla L[\mathbf{w}]$ has a half-normal distribution, and the first term on the right-hand side of Equation (1) reduces, in expectation, to

$$\mathbb{E}\big[\delta\mathbf{w}^\top \nabla L[\mathbf{w}]\big] = -\sigma\sqrt{\frac{2}{\pi N}}\;\big\lVert\nabla L[\mathbf{w}]\big\rVert. \qquad (2a)$$

Thus the correlation with the direction of steepest descent decreases at a rate proportional to $1/\sqrt{N}$. At the other extreme, exact (batch) backpropagation uses the update $\delta\mathbf{w} = -c\,\nabla L[\mathbf{w}]$, for some constant of proportionality $c$. In this case, the correlation with the gradient is

$$\delta\mathbf{w}^\top \nabla L[\mathbf{w}] = -c\,\big\lVert\nabla L[\mathbf{w}]\big\rVert^2, \qquad (2b)$$

which is independent of N. In either case, $\lVert\nabla L[\mathbf{w}]\rVert^2$ defines the sensitivity of behavioural performance to plasticity in 'good' directions, and is a mathematical analogue of Factor 1 in the main text.

Intermediate between Equations (2a) and (2b), a noisy first-order rule might align only a small amount with the gradient, like (2a), but with a constant degree of alignment that is independent of N, like (2b). For these learning rules (or indeed whenever $\mathbb{E}[\delta\mathbf{w}^\top \nabla L[\mathbf{w}]]$ degrades at a slower rate than $1/\sqrt{N}$), the benefits of redundantly large networks described in the main text can compensate for an inaccurate approximation of the gradient.

There are two ways to mitigate this problem: (1) Increase the dimensionality of feedback signals. A better correlation between weight changes and the full gradient can be achieved if the feedback signal itself has multiple components. This can be achieved in the brain via multiple, independent feedback pathways that carry vectorised error information to neural circuits. Biologically, error signals can be encoded in neural activity and, in particular, in neuromodulatory signals that convey behavioural valence directly to synapses where learning occurs. We will review recent connectomic studies, particularly in invertebrates, that have uncovered the existence of such precise modulatory feedback in learning centers in the brain.
(2) Use circuit architectures that are robust to 'bad' perturbations (i.e. improve Factor 2), while maintaining sensitivity to the 'good' component (i.e. maintain Factor 1). It turns out, perhaps surprisingly, that redundant connectivity in a neural circuit can reduce behavioural sensitivity to bad synaptic changes while maintaining sensitivity to good changes. This means that even crude but biologically plausible synaptic learning rules may achieve efficient learning with the appropriate circuit architecture.
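The $1/\sqrt{N}$ scaling of Equation (2a) can be checked numerically. The sketch below (purely illustrative; a unit-norm gradient is assumed) measures the average alignment of sign-corrected random perturbations with the gradient:

```python
import numpy as np

def mean_alignment(N, trials=20000, sigma=1.0, seed=0):
    """Average dw . grad for keep-or-reverse Gaussian perturbations.

    Components of dw have variance sigma^2 / N, so the expected
    perturbation magnitude does not change with N (as in Box 1).
    """
    rng = np.random.default_rng(seed)
    g = np.ones(N) / np.sqrt(N)                    # unit-norm 'gradient'
    dw = (sigma / np.sqrt(N)) * rng.standard_normal((trials, N))
    # Keeping or reversing each perturbation makes dw . g half-normal:
    return float(np.mean(-np.abs(dw @ g)))

# Equation (2a) predicts E[dw . g] = -sigma * sqrt(2 / (pi N)) * ||g||.
for N in (10, 100, 1000):
    print(N, mean_alignment(N), -np.sqrt(2 / (np.pi * N)))
```

Each hundredfold increase in N shrinks the average alignment roughly tenfold, matching the analytical prediction.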
In the following sections we review recent work that suggests these principles are at work in the brain.
Connectomic evidence for vectorised learning signals in the brain

Imagine repetitively practicing and attempting to improve a tennis serve. A coach evaluates each serve with a mark out of 10. This is an example of scalar feedback, since the entire behaviour is reduced to a single number. With scalar feedback and multiple behavioural parameters, it is hard to determine which behavioural aspects should be altered, and how. Now suppose you can correlate aspects of the behaviour (the serve) with the coach's mark, and gradually build up a picture of which behavioural aspects contribute to a good mark (hit the ball harder, keep the head raised higher, etc.). Such multifaceted error information is an example of vector feedback. Figure 1a depicts the architecture of feedback signals in a generic neural circuit, from a coarse (scalar) feedback error signal to a fine, vector signal that decomposes error across different behavioural readouts and/or subsets of circuit connections. Higher dimensional feedback eases the difficulty of learning through Factor 3. This is because the more connections that are affected by a given component of an error signal, the worse the ratio of 'good' changes to 'bad' changes becomes (see Equation (2a)).
Regardless of the specific learning mechanism, more targeted information on behavioural contributions to a task cannot be a bad thing. This benefit can be enhanced further by the partitioning of subcircuits according to the behaviour they influence. For example, suppose separate neural subcircuits control head direction and racket force during the tennis serve in the example above. If these behavioural variables are uncorrelated and targeted feedback on how to change each variable reaches separate neural circuits, then the credit assignment problem need only be solved in each separate subcircuit. On the other hand, if head direction and racket force influence each other then this benefit disappears, as both subcircuits collectively influence both behavioural variables.
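To make this concrete, here is a toy simulation (our illustration; the four 'subcircuits' and the quadratic error are assumptions) comparing a perturbation rule driven by one global scalar error against the same rule in which each subcircuit keeps or reverses its own block of the perturbation using only its own error component:

```python
import numpy as np

def run(feedback, steps=1500, blocks=4, n=25, sigma=0.01, seed=1):
    """Perturbation learning with scalar versus vector feedback.

    'scalar': one global error decides whether the whole perturbation
    is kept or reversed. 'vector': each of the `blocks` independent
    subcircuits decides for its own slice of the perturbation, using
    only its own error component.
    """
    rng = np.random.default_rng(seed)
    w = np.zeros((blocks, n))
    t = np.ones((blocks, n))                   # hypothetical targets
    for _ in range(steps):
        dw = sigma * rng.standard_normal(w.shape)
        block_dL = (np.sum((w + dw - t) ** 2, axis=1)
                    - np.sum((w - t) ** 2, axis=1))
        if feedback == "scalar":
            w += dw if block_dL.sum() < 0 else -dw
        else:
            sign = np.where(block_dL < 0, 1.0, -1.0)[:, None]
            w += sign * dw
    return float(np.sum((w - t) ** 2))

print(run("scalar"), run("vector"))  # vector feedback reaches lower error
```

With the same perturbations and the same number of steps, per-subcircuit (vector) feedback reaches a lower error than a single global scalar signal, because each block is guaranteed a 'good' update on every step.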
There is a large body of literature detailing how abstract properties of behaviour, such as reward prediction error, are represented in the brain and inform learning [3,5,23,24]. Minimally, such error information can correspond to a scalar error signal from bulk release of neuromodulators such as dopamine.
However, more recently, detailed connectomics in Drosophila suggests that evolution has exploited the advantages of vector neuromodulatory feedback for learning. Feedback pathways necessary for learning target discrete neural subcircuits, each responsible for different aspects of behaviour. Moreover, these separate pathways seem to provide weakly correlated signals.
The Drosophila mushroom body (MB) mediates associative learning across sensory modalities, especially olfaction. This circuit has an architecture that enables weighted combinations of stimuli, such as a combination of odour signals, to become associated with behavioural actions in an experience-dependent manner [25,26]. Distinct modulatory subpopulations represent different error signals. For example, the optogenetic activation of specific modulatory subpopulations, in concert with an odour stimulus, is sufficient to form an aversive or appetitive association [27].
Even for a particular valence (e.g. appetitive), distinct modulatory subpopulations encode a more finely tuned, multidimensional reward signal. As an example, [28] showed that different subsets of modulatory neurons were required for the associative learning of an odour with two different kinds of reward (fructose versus aspartic acid), even while both triggered the same behaviour of learned odour attraction.
Is the functional specificity of these modulatory neurons encoded anatomically? Evidence points to different modulatory neurons receiving different sets of sensory afferents, and therefore potentially computing somewhat independent error signals. The mushroom body has a segregated architecture consisting of largely independent neural subpopulations learning associations from different neuromodulatory feedback signals. In [29], the MB was conceptually divided into 15 compartments, defined by the presence of specific dopaminergic (modulatory) and output neuron cell types. Tracing studies including [14,15] found that each modulatory neuron received a unique combination of sensory afferents, with same-type modulatory neurons receiving more similar sensory input. Moreover, many modulatory neurons receive over half of their dendritic input from recurrent connections [14]. This recurrence links different MB compartments, such as those responsible for aversive and appetitive associations (see Figure 2b), thus allowing for reward signals that incorporate both sensory input and integrated past experience from multiple associative systems. The recurrent input is highly diverse across different modulatory neuron types [15], emanating from different regions including gustatory interneurons and the lateral horn.
Together, this work supports a hypothesis that different neuromodulatory pathways are affected by different aspects of past experience, as well as by unique combinations of sensory stimuli. This may provide nonredundant, vector feedback signals to independent neural subpopulations in the MB. This circuit architecture is expensive in terms of developmental complexity and wiring cost, yet the theory discussed in the previous section tells us that this price may be necessary for efficient learning. It would be interesting, in turn, to see whether different neuromodulatory pathways are associated with the formation of distinct behavioural responses.
We have reviewed the anatomical features in the Drosophila mushroom body that could ease the burden of the credit assignment problem; this was made possible by the detailed connectomic information available for Drosophila. Zero-order learning mechanisms have also been proposed in higher organisms, notably the song-learning system of zebra finches [16,21,30]. The songbird provides an excellent model system for a difficult learning problem, because learning to mimic and recognise birdsong corresponds to learning complex temporal sequences of muscle movements and auditory stimuli: a rich feedback signal. A proposed implementation of local credit assignment using the node-perturbation algorithm [21] led to specific anatomical predictions that were subsequently verified by connectomics studies [16].

Figure 1 (Current Opinion in Neurobiology). (a) Schematic of a learning circuit with four behavioural outputs (denoted by colour) and feedback signals targeting connections that adapt during learning. Each row depicts a feedback signal with a different degree of coarseness. Top: A scalar (coarse) feedback signal provides information on overall behavioural performance to all connections. In this scenario the credit assignment problem is most onerous, because the individual contributions of circuit connections to behavioural performance are hard to disentangle. Middle: A more detailed vector feedback signal specifies how changes in each of the behavioural outputs contribute to overall performance. Bottom: Separate subsets of the learning system inform separate behavioural outputs. Now vector feedback helps even a perturbation-based learning rule, as the number of synapses per behavioural output decreases by a factor of four. (b) Schematic wiring diagram of the extended MB circuit, taken from [13]. Separate compartments are innervated by separate neuromodulators encoding distinct forms of feedback on behavioural performance.

Redundant connectivity can boost learning efficiency of noisy, local synaptic plasticity mechanisms
Clearly, the difficulty of a specific learning problem is influenced by the architecture of the neural circuit used to solve it. We now review some recent theoretical results that provide mechanistic reasons as to why particular, observed architectures are favourable for the learning problems they solve.
One generic architecture that recurs across organisms is the 'input expansion'. Here a large, horizontal 'mixed layer' of cells receives a small number of inputs and projects to a small number of outputs. For example, this architecture occurs in Kenyon cells of the insect mushroom body and in cerebellar granule cells. One advantage of such an architecture is that it can 'separate out' correlated inputs by embedding them in a higher dimensional space, as proposed decades ago in [31]. Recent modelling work has justified more detailed aspects of these architectures [6,7,8]. In [8], a simplified circuit model of the input expansion motif was constructed. The sparsity (i.e. the number of inputs projecting onto each mixed-layer neuron) and the degree of feedforward inhibition onto the mixed layer were set as tunable parameters, as was the number of neurons in the two layers. The authors investigated how these tunable parameters related to the dimensionality of the representation. This latter quantity is important because the higher it is, the greater the efficacy of associative learning using a Hebbian readout neuron. The optimal sparsity of the input connections to the mixed layer was calculated, and found to correspond exactly to the anatomically observed sparsity of mossy fibre inputs to cerebellar granule cells in the human brain, as well as of antennal lobe inputs to Kenyon cells in Drosophila.
We have discussed how a particular anatomical circuit motif supports pattern separation, an unsupervised learning problem. The specific anatomical predictions of [8] did not apply in the case where synapses were subject to supervised plasticity, although it was noted that dense connectivity seemed more supportive of fast learning in this case. More recently, [5] showed quantitatively how and why adding neurons to densely connected circuits could speed learning, even when smaller circuits could theoretically solve the learning problem to perfect accuracy. Thus, adding 'redundant' neurons can compensate for an inaccurate learning rule that cannot induce plasticity well-aligned with the gradient. The key insight was that adding redundant neurons could increase the sensitivity of behavioural performance to 'good' directions of plasticity, without decreasing the robustness of behavioural performance to 'bad' directions. Mathematically, the sensitivity of behavioural performance corresponded to the magnitude of the gradient, and the robustness depended upon the eigenvalues of the Hessian (see Box 1 for insight and some details). Note that this benefit of redundancy is counteracted, for zero-order learning rules, by the increased burden of the credit assignment problem in large networks (i.e. the inability to accurately find the gradient).

Figure caption: (a) Embedding an input-output mapping into a higher dimensional space reduces the difficulty of feedback-based learning. Noisy gradient descent, with a fixed signal-to-noise ratio, will learn faster and to a higher steady-state performance in a higher dimensional network. The effect is not limited to the particular linear network architecture shown, but also holds for multilayer, nonlinear networks.
The result of [5] suggests that learning rules could be judged not only by how well they perform in in silico circuits, but also by how their performance scales as network size increases to biologically realistic levels. A given learning rule may provide a highly inaccurate approximation of the gradient. Nevertheless, if the quality of this approximation does not decrease with increasing network size (or decreases more slowly than for zero-order, perturbation-based learning rules), then the described benefits of redundantly large circuits could compensate, enabling fast, accurate learning. However, these benefits might only emerge at extremely large network sizes.
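The redundancy effect can be illustrated with a toy model (our construction, not the model of [5]): the task depends on a single direction in weight space, the learning-rule noise has a fixed magnitude relative to the gradient, and increasing N dilutes the fraction of that noise that falls along the task-relevant direction:

```python
import numpy as np

def final_loss(N, beta=3.0, eta=0.3, steps=30, trials=500, seed=0):
    """Noisy gradient descent on a redundant linear readout.

    The task depends on a single direction u in the N-dimensional
    weight space; the learning-rule noise has fixed magnitude relative
    to the gradient (fixed 'signal-to-noise ratio') but is spread
    isotropically over all N dimensions, so in larger networks less
    of it corrupts the task-relevant direction.
    """
    rng = np.random.default_rng(seed)
    u = np.ones(N) / np.sqrt(N)            # task-relevant direction
    losses = []
    for _ in range(trials):
        w = np.zeros(N)
        target = 1.0
        for _ in range(steps):
            err = w @ u - target
            grad = err * u                 # gradient of 0.5 * err**2
            noise = rng.standard_normal(N)
            noise *= beta * abs(err) / np.linalg.norm(noise)
            w = w - eta * (grad + noise)
        losses.append(0.5 * (w @ u - target) ** 2)
    return float(np.mean(losses))

print(final_loss(10), final_loss(1000))  # larger network: lower final loss
```

With the noise-to-gradient ratio held fixed, the 1000-dimensional network reaches a much lower average loss than the 10-dimensional one after the same number of steps, because only the noise component along `u` affects performance.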
The previous point is relevant given the plethora of proposed, biologically plausible approximations to backpropagation that have emerged in the literature [12]. Backpropagation itself requires individual synapses to know the states of their downstream counterparts. This is known as the 'weight transport' problem. Various biologically plausible surrogates partially circumvent weight transport. One solution is to use parallel feedback connections that instead provide synapses with information on downstream neural activities, which serve as a rough proxy [32–41]. It is worth noting that these feedback connections effectively provide an error signal with vectorised information on the errors of different output neurons. Another solution is to let individual neurons take the role of an abstract 'network', with internal 'layers' [9]. This makes weight transport an intracellular problem, and allows for the separation of feedback pathways that respectively provide reward and weight-transport signals [10,11,42].
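One well-known surrogate of this kind, feedback alignment, replaces the transposed weight matrix of backpropagation with a fixed random feedback matrix, so no synapse needs to know its downstream counterpart's weight. A hedged sketch (the toy linear task and all parameter values are our assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, n_out = 10, 30, 2

W1 = 0.1 * rng.standard_normal((n_hid, n_in))
W2 = 0.1 * rng.standard_normal((n_out, n_hid))
B = 0.1 * rng.standard_normal((n_hid, n_out))  # fixed random feedback weights

T = rng.standard_normal((n_out, n_in))         # target linear mapping
eta = 0.01

losses = []
for step in range(2000):
    x = rng.standard_normal(n_in)
    h = np.tanh(W1 @ x)
    y = W2 @ h
    e = y - T @ x                              # vector error signal
    losses.append(0.5 * float(np.sum(e ** 2)))
    # Backprop would propagate W2.T @ e; feedback alignment uses the
    # fixed random matrix B instead, avoiding weight transport.
    delta_h = (B @ e) * (1 - h ** 2)
    W2 -= eta * np.outer(e, h)
    W1 -= eta * np.outer(delta_h, x)

print(np.mean(losses[:100]), np.mean(losses[-100:]))  # error decreases
```

Note that the feedback pathway `B` delivers a full error vector, one component per output neuron, in line with the vectorised-feedback theme above.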
All of these proposed algorithms explicitly, albeit inaccurately, compute the sensitivities of behavioural performance to individual synapses. This inaccuracy has led to mixed success when benchmarking their performance against artificial learning rules not subject to biological plausibility constraints [43]. However, the results of [5] suggest that learning performance could improve as network size increases to biological levels.

Conclusion
It is difficult, if not impossible, to understand how synaptic-level learning rules enable behavioural-level learning without studying the structure of neural circuits and the abstract mathematical nature of learning as an optimisation process. Even with intricate synaptic plasticity mechanisms, there is a fundamental algorithmic difficulty in implementing learning at the whole-circuit or behavioural level due to the credit assignment problem. We have reviewed two plausible classes of solutions to this problem: vector feedback and redundant connectivity.
Recent advances in quantitative neuroanatomy [25,44–48] have made it possible to measure and quantify circuit architecture, revealing striking evidence that these solutions are used biologically. We need principles to make sense of the mass of complex data that connectomics reveals. The question of how learning might work algorithmically was successfully addressed long before the era of modern connectomics, in Marr and Albus's groundbreaking analyses of cerebellar architecture [31,49]. New theoretical insights, and insights from artificial learning, coupled with extraordinarily detailed and comprehensive connectomics data, place us in an exciting era that will leverage experiment and theory to understand biological learning.

12. Lillicrap TP, Santoro A, Marris L, Akerman CJ, Hinton GE: Backpropagation and the brain. Nat Rev Neurosci 2020:1-12.
This review highlights how key bottlenecks in the biologically plausible implementation of backpropagation could be circumvented, which would result in high-performance, biologically plausible learning rules.