1 Introduction

The mammalian brain has an astonishing ability to continually form new memories while preserving previous ones. In contrast, artificial neural networks are prone to catastrophic forgetting when trained on a sequence of tasks or datasets (McCloskey and Cohen 1989). This is true even if the tasks are very similar to each other and are likely to benefit from similar features. For example, learning to recognize different pairs of hand-written digits in sequence is notoriously difficult for artificial neural networks trained with backpropagation (Van de Ven and Tolias 2019).

For multi-layer artificial neural networks, a range of continual learning (CL) approaches have been devised that include modifications to the network architecture, loss function, or the implicit or explicit storage of previous task data (Van de Ven and Tolias 2019). Usually, these methods require external information about a task switch. This is in stark contrast to natural environments, where tasks are usually not well defined and need to be inferred from context.

To address the CL problem, brain-inspired approaches have been developed (Kudithipudi et al. 2022; Parisi et al. 2019). For example, French (1991) pointed out that the problem of catastrophic forgetting might not be intrinsic to biological neural networks, but is rather an effect of distributed and overlapping task representations that emerge when using the standard backpropagation (BP) algorithm. In line with this idea, it has been suggested that biological networks might avoid catastrophic forgetting by representing information through a sparse, but task-specific subset of neurons and synapses to which learning is restricted (Lin et al. 2014; Manneschi et al. 2021; French 1991). Other approaches relax the idea of restricting learning to sub-populations to the more general notion of learning within restricted subspaces (Duncker et al. 2020).

In this work, we exploit the idea of restricting learning to task-specific, sparse representations with the goal of deriving a novel, bio-inspired, task-free CL method. In line with the pervasive recurrence observed in the visual cortex (van Bergen and Kriegeskorte 2020), we argue that a task-specific sparsity mechanism should incorporate not only feedforward (bottom-up) information from lower hierarchical layers but also error feedback (top-down) information from higher areas. To render both forms of information usable for such informed sparsity, we adopt Deep Feedback Control (DFC), a bio-plausible deep learning framework in which every neuron integrates inputs from the previous layer as well as top-down error feedback during learning (Meulemans et al. 2022). To enforce sparsity, we combine DFC with a winner-take-all (WTA) mechanism and restrict learning of the feedforward weights to active neurons. To stabilize and protect previously learned representations, we further introduce intra-layer recurrent weights that are updated through a Hebbian-type learning rule. In the following, we term this new, combined method sparse-recurrent DFC.

We first review related work in Sect. 2. Then, in Sect. 3, we provide implementation details on how we modified the DFC learning dynamics to integrate the two major ingredients required for CL: sparsity and intra-layer recurrent connections. In Sect. 4, we show that the introduction of these additional bio-plausible elements helps to stabilize learning and to reduce forgetting by regularizing neural activity. We compare our approach with other established regularization-based CL methods and show that sparse-recurrent DFC performs comparably well despite completely lacking information on task boundaries. Finally, we analyse the resulting task representations in order to better understand the mechanisms behind the observed improvement in CL performance.

Fig. 1

a Schematic of the sparse-recurrent DFC network and its top-down feedback controller. The \(r_i(t)\) values denote neuron activation vectors for layer i, whereas \(r_L^*\) represents the desired network output. Learning is based on a dynamic process during which neurons integrate feedforward and feedback signals until the network converges to a sparse target representation minimizing the loss. Weight updates (dashed lines) of forward weights \(W_i\) are restricted to neurons that are active at convergence (red). Lateral recurrent weights \(R_i\) into inactive neurons are updated via a Hebbian-like learning rule. The \(Q_i\) values denote feedback weights, and u(t) refers to the control signal. b Detailed zoom into layer i showing one active (pink) and one suppressed (grey) neuron. \(v_i^\textrm{ff}\), \(v_i^\textrm{fb}\), and \(v_i\) represent feedforward, feedback and combined activity, respectively. The solid lines represent weights that will not be changed, whereas dashed lines show weights which will be updated

2 Background

2.1 Computational strategies for continual learning

To overcome catastrophic forgetting, researchers have developed a variety of strategies that can be roughly classified into three categories:

(1) Replay methods rely on implicitly or explicitly storing and revisiting previous data while learning new tasks. This can be accomplished by storing small subsets of previously seen data in a memory buffer, or by training a generative model (Shin et al. 2017). However, we do not consider data replay in this work, since we are interested in methods based on bio-plausible plasticity, without relying on external data storage.

(2) Regularization methods constrain learning to preserve parameters that are important for previous tasks, usually by adding specialized loss terms. Elastic Weight Consolidation (EWC) and Synaptic Intelligence (SI) are commonly used representatives of this family, which we adopt as comparison benchmarks. In EWC (Kirkpatrick et al. 2017), after the network converges on a task, the Fisher information of that task’s loss is computed through a sampling mechanism. The resulting Fisher term quantifies each parameter’s importance for the task just learned and is added as a quadratic regularization term to the loss of the following task (see the sketch after this list). Synaptic Intelligence (Zenke et al. 2017) works through a similar mechanism, but parameter importance is estimated online based on how much of the decrease in loss can be attributed to the variation of each given parameter. In both cases, the regularization term is added to the loss at the end of each task, and information on task boundaries is therefore required.

(3) Architectural methods are based on structural changes such as freezing weights, or adding and removing neurons (Rusu et al. 2016). Alternatively, neurons can be dynamically gated based on context (Masse et al. 2018; von Oswald et al. 2020). Context, however, is usually externally provided rather than inferred by the network itself, which is a strong assumption that may not always hold in real-world scenarios. In another approach, a dedicated system, inspired by the role of the prefrontal cortex, is used to detect contextual information instead (Zeng et al. 2019). In this work, we adopt a similar gating-based approach; in contrast to these methods, however, gating is provided by recurrent activity, independently of external task information.

2.2 Continual learning in the brain

Although CL in the brain is not well understood, it is likely that various mechanisms are at play simultaneously, with some being loosely connected to the three CL strategies described above (Kudithipudi et al. 2022).

In neuroscience, the trade-off between fast learning and slow forgetting is known as the stability-plasticity dilemma. To resolve this dilemma, the interaction between a more plastic system, the hippocampus, and a more stable system, the neocortex, has been suggested as a long-term memory storage mechanism, akin to a data replay strategy (van de Ven et al. 2020). On the other hand, biological networks might control the stability/plasticity of individual synapses through mechanisms collectively referred to as metaplasticity. Through metaplasticity, synapses that are particularly important for solving previously learned tasks are left unaltered when learning new tasks, while less relevant synapses are made available to store new information, analogously to certain regularization-based approaches in CL (Jedlicka et al. 2022).

Next, neurogenesis, the birth of new neurons, is sometimes considered equivalent to architectural approaches that gradually grow the network. However, neurogenesis is believed to be limited to very specific brain areas, with small numbers of new neurons, and it is unclear whether it occurs in adult humans. It is therefore contested whether neurogenesis plays a role in CL (Parisi et al. 2018).

Finally, animal brains heavily rely on context to flexibly switch between tasks and to direct learning to task-specific neurons and synapses. For example, previous studies have shown that afferents of the olfactory nucleus in rats provide contextual input from other brain areas, thereby enabling dynamic and flexible task learning (Levinson et al. 2020). This not only enables context-specific gating of neuronal responses to the same stimulus in different environments or tasks, but also facilitates forward generalization. Similarly, the release of specific neuromodulators (e.g. dopamine) has been linked to the gating of activity and to learning based on context (Kudithipudi et al. 2022). Overall, it is likely that in biological networks the modulation of neuronal activities, either through hierarchical top-down feedback or specific neuromodulators, directs learning to the most salient aspects of the task, while protecting older memories that are irrelevant in the current context.

2.3 Task-free continual learning

Van de Ven and Tolias (2019) defined three CL scenarios in which training proceeds sequentially over tasks and performance is evaluated as the average accuracy over all previously learned tasks:

(1) in task-incremental learning (task-IL), the task ID is available during training and at test time;

(2) in domain-IL, the task ID is available during training but not at test time;

(3) in class-IL, the task ID is available during training, but not at test time, where the model must additionally infer the task ID while solving the task.

In all these scenarios, however, information on task boundaries is provided during training, i.e. the model knows when training on one task \(i\) ends and training on a new task \(i+1\) begins. Most CL strategies need this information to update the loss or the network structure in preparation for the new task. However, such discrete changes in the loss or network structure do not seem biologically plausible. Therefore, in this paper, we focus on domain-IL, and on the more challenging class-IL, but in a setting where task information is entirely omitted during both training and testing.

This so-called task-free form of continual learning is generally less studied, although a few examples have appeared in recent years. The majority of these follow a data storage and replay paradigm (Aljundi et al. 2019b; Wang et al. 2022; Rao et al. 2019), which we do not consider in this work. Lee et al. (2020) adopt an architectural approach based on an expanding set of experts that successively deal with new tasks. Among regularization-based methods, Laborieux et al. (2021) propose a metaplasticity-inspired mechanism, which is so far limited to feedforward, binary networks. Aljundi et al. (2019a) circumvent the problem of task boundaries by heuristically detecting plateaus in the evolution of the loss, which signal the end of learning for a task, and use a mixed replay and regularization strategy. Finally, Pourcel et al. (2022) mix an architectural method with replay, using a dynamic content-addressable memory for online class-IL.

To clarify how our method fits into this landscape of brain-inspired algorithms, we next provide details on our CL approach, which combines DFC, sparsity, and recurrent Hebbian-like connections.

3 Activity regularization through sparsity and recurrent gating

3.1 Deep feedback control

During training, the neuronal dynamics within the DFC network (Meulemans et al. 2022) can be described by a differential equation that takes into account the feedforward inputs \(v_{i}^{\textrm{ff}}\) as well as the feedback control signal \(v_{i}^{\textrm{fb}}\) according to

$$\begin{aligned} \tau _v {\dot{v}}_i(t)&= - v_i(t) + v_{i}^{\textrm{ff}}(t) + v_{i}^{\textrm{fb}}(t)\\&= - v_i(t) + W_i \phi \big (v_{i-1}(t)\big ) + Q_iu(t) \end{aligned}$$
(1)

where the pre-nonlinearity neuron activations in layer i at time t are denoted by \(v_i(t)\), and the incoming weights by \(W_i\). \(\phi \) refers to the activation function, while the neuron output is given by \(r_i=\phi (v_i(t))\). The feedback signal u(t) is calculated by summing the integral and proportional parts of the network output error e(t) as described by Meulemans et al. (2022). u(t) is then fed back to each neuron of the network via the feedback weights \(Q_i\). During learning, the feedforward network and the feedback controller constitute a recurrent dynamical system that converges to a stable state (ss) at which the neuron activity \(v_{i,\textrm{ss}}\) minimizes the output error and stabilizes the feedback signal u(t). In practice, we simulate the dynamics for a set number of iterations and utilize the final activations as stable state values. The number of iterations is chosen to be high enough such that most simulations converge.
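To make these dynamics concrete, the following NumPy sketch integrates Eq. 1 for a single layer with a forward-Euler scheme. In the full network, all layers and the controller evolve jointly; for illustration, we assume the previous layer's trajectory and the control signal are given, and all names are ours.

```python
import numpy as np

def simulate_layer(v_prev, u, W, Q, phi, n_steps=500, dt=0.1, tau_v=1.0):
    """Forward-Euler integration of Eq. 1 for layer i.

    v_prev : activations of layer i-1 over time, shape (n_steps, d_prev).
    u      : control signal over time, shape (n_steps, d_u).
    W, Q   : feedforward and feedback weight matrices.
    phi    : elementwise activation function, e.g. np.tanh.
    """
    v = np.zeros(W.shape[0])
    for t in range(n_steps):
        v_ff = W @ phi(v_prev[t])               # feedforward drive
        v_fb = Q @ u[t]                         # top-down feedback drive
        v += (dt / tau_v) * (-v + v_ff + v_fb)  # Euler step of Eq. 1
    return v  # taken as the stable-state activation v_ss
```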

The final neuron activations \(r_{i,\textrm{ss}} = \phi (v_{i,\textrm{ss}})\) are referred to as ‘targets’ or ‘target activations’ since they represent the values we want the network to produce without feedback. To achieve this, the forward weights are learned by comparing each neuron’s target activation \(r_{i,\textrm{ss}}\) to its feedforward-driven activation \(\phi (v_{i,\textrm{ss}}^{\textrm{ff}})\) upon converging to the stable state:

$$\begin{aligned} \Delta W_{i} = \eta (r_{i,\textrm{ss}} - \phi (v_{i,\textrm{ss}}^{\textrm{ff}})) r_{i-1,\textrm{ss}}^T \end{aligned}$$
(2)

where \(r_{i-1,\textrm{ss}}\) is the presynaptic, post-nonlinearity activity with controller feedback, \(r_{i,\textrm{ss}}\) is the activity of the neuron with feedback, and \(\phi (v_{i,\textrm{ss}}^{\textrm{ff}})\) is the postsynaptic neuron activity without feedback. In sparse-recurrent DFC, we additionally centre each weight update to have zero mean before applying it. This prevents a small group of neurons from becoming more excitable and dominating the winner-take-all mechanism described in the next subsection. The feedback weights \(Q_i\) can be learned (Meulemans et al. 2021, 2022), but we simplify the learning of the feedback pathway and re-initialize \(Q_i\) as the Jacobian of the loss with respect to the neuron activations for every data point.
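A minimal sketch of this update under our reading of the text (the paper does not state the axis of the zero-mean centring, so we centre over all entries):

```python
import numpy as np

def forward_weight_update(v_ss, v_ff_ss, r_prev_ss, phi, eta=0.01):
    """Eq. 2 with the zero-mean centring used in sparse-recurrent DFC.

    v_ss      : stable-state activations of layer i with feedback.
    v_ff_ss   : stable-state feedforward-only activations of layer i.
    r_prev_ss : stable-state post-nonlinearity activity of layer i-1.
    """
    delta = eta * np.outer(phi(v_ss) - phi(v_ff_ss), r_prev_ss)
    delta -= delta.mean()  # centre the update to zero mean
    return delta
```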

The update rule from Eq. 2 implements a learning paradigm where weight updates are determined by neural activity. This opens the possibility of regularizing weight updates indirectly by modulating neural activity. We will refer to this strategy as activity regularization. In the next sections, we describe how activity regularization (e.g. sparsity and recurrent gating) can be utilized to reduce interfering weight updates between representations of different inputs belonging to different tasks.

3.2 Dynamic sparsity

To gradually modulate the network activations towards sparse, non-overlapping representations, we add a winner-take-all mechanism on top of the existing DFC network. At each time step t, we set the activations of a fraction \(s_{i}(t)\) of the neurons in layer i to zero. \(s_{i}(t)\) is initialized to zero at \(t=0\) and grows incrementally over time until it reaches the desired stable-state sparsity \(s_{i, \textrm{ss}}\), which is a hyperparameter fixed for each layer i. We refer to these hyperparameters as sparsity levels. As long as different inputs to the network lead to sufficiently different activation profiles, this technique should reduce the overlap between the active populations pertaining to different data points. As a result, interference during learning should be reduced, because only the weights of active populations are updated.
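A sketch of this mechanism is given below; the paper specifies only that \(s_i(t)\) grows incrementally from zero, so the linear ramp is our assumption:

```python
import numpy as np

def sparsity_schedule(t, n_steps, s_ss):
    """Grow the sparsity fraction from 0 to its stable-state value s_ss
    (linear ramp; the exact schedule is our assumption)."""
    return s_ss * min(1.0, t / n_steps)

def wta_mask(v, s_t):
    """Winner-take-all mask: suppress the s_t fraction of units with
    the smallest activity magnitude."""
    k = int(round(s_t * v.size))        # number of units to suppress
    mask = np.ones_like(v)
    if k > 0:
        mask[np.argsort(np.abs(v))[:k]] = 0.0
    return mask
```

At the stable state, this mask defines the active population to which the forward weight updates are restricted.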

However, the network cannot learn to suppress specific neurons because forward connections to inactivated neurons are frozen. This is an issue because, while we aim to decrease overlap between representations of different classes, inputs belonging to the same class should be represented similarly. WTA sparsity based on feedforward and feedback activity alone does not ensure this. Our intuition is that, if neurons keep dropping in and out of active populations during training, no consistent representations can be learned, leading to forgetting. To address this problem, we introduce an additional set of connections with the aim of learning which neurons are allowed to fire together, and which neurons are mutually exclusive. This way, we provide a way for the network to stabilize and protect the neuron populations that together constitute specific representations.

3.3 Gating neuron activity through lateral recurrent connections

We stabilize neuron populations involved in learned representations by introducing lateral recurrent connections. Because we want to strongly influence which neurons are active, we implement lateral connections with a gating effect that multiplies activations by a factor between 0 and 1, similar to ‘forget’ gates used in LSTMs (Hochreiter and Schmidhuber 1997). We then calculate the neuron feedforward activity before the nonlinearity as

$$\begin{aligned} v_{i}^{\textrm{ff}}(t) = W_i \phi \left( v_{i-1}(t)\right) \odot \sigma \left( R_i|r_i(t)|\right) \end{aligned}$$
(3)

where \(R_{i}\) refers to the recurrent weight matrix in the i-th layer, \(\sigma \) to the sigmoid function, and \(\phi \) to the same activation function as used in Eq. 1. After applying the effect of the recurrent gating, we re-scale the population activity to have the same overall magnitude as before applying the gating. We thus only change the distribution, but not the total level of activity. At convergence, we learn the recurrent gating weights according to a rule inspired by the feedforward updates from Eq. 2

$$\begin{aligned} \Delta R_{i} = \eta (|r_{i,\textrm{ss}}| - |\phi (v_{i,\textrm{ss}}^{\textrm{ff}})|) |r_{i,\textrm{ss}}|^T \end{aligned}$$
(4)

where \(r_{i,\textrm{ss}}\) are the target activations of the presynaptic neurons in the same layer. Because our multiplicative gating mechanism affects the magnitude but not the sign of the activity, we make this inhibition depend on the magnitude of the presynaptic activity. We therefore use absolute values of activity in both the dynamics (Eq. 3) and the update rule (Eq. 4). Like the forward weight updates, we normalize recurrent weight updates to zero mean. In contrast to the feedforward weights, however, we only update the incoming weights of inactivated neurons (i.e. neurons whose activity is set to zero by the winner-take-all sparsity mechanism). This lets us simplify the above equation to a Hebbian-like update rule for suppressed neurons:

$$\begin{aligned} \Delta R_{i} = - \eta |\phi (v_{i,\textrm{ss}}^{\textrm{ff}})| |r_{i,\textrm{ss}}|^T. \end{aligned}$$
(5)

As a result, we only update incoming recurrent weights for inactive neurons within the target representation, while for active neurons, we only update the incoming feedforward weights. Figure 1 (dashed lines) summarizes the weight updates. As in standard DFC, we use a simple feedforward pass during test time, for which neither top-down feedback nor lateral recurrent effects are taken into account. Therefore, the number of parameters of the trained model is equivalent to a conventional feedforward network with the same number of neurons (see “Appendix A.3” for a further discussion on model complexity).
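The following sketch summarizes the gating dynamics (Eq. 3) and the restricted Hebbian-like update (Eq. 5) under our assumptions; the rescaling norm and the centring axis are not specified in the text, so we use the L1 norm and centre over the updated rows:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_feedforward(v_prev, r_i, W, R, phi):
    """Eq. 3: lateral weights multiplicatively gate the feedforward
    drive; the result is rescaled to preserve the total magnitude."""
    drive = W @ phi(v_prev)
    gated = drive * sigmoid(R @ np.abs(r_i))
    scale = np.abs(drive).sum() / (np.abs(gated).sum() + 1e-12)
    return gated * scale

def recurrent_weight_update(v_ff_ss, r_ss, suppressed, phi, eta=0.01):
    """Eq. 5: Hebbian-like update of lateral weights, restricted to the
    incoming weights of WTA-suppressed neurons.

    suppressed : boolean array marking neurons zeroed by the WTA."""
    delta = -eta * np.outer(np.abs(phi(v_ff_ss)), np.abs(r_ss))
    delta[~suppressed] = 0.0                       # active neurons: no update
    delta[suppressed] -= delta[suppressed].mean()  # zero-mean centring
    return delta
```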

Please note that gating through lateral connections, while crucially influencing the WTA selection of the active neuron population by modulating neuron activity, does not determine the level of sparsity. WTA sparsity and lateral connections are interconnected, but distinct mechanisms.

4 Experiments

To test the CL capabilities of our approach, we train sparse-recurrent DFC on the split-MNIST dataset, according to the domain-IL and class-IL paradigms (Van de Ven and Tolias 2019). Split-MNIST is a simple computer vision CL benchmark in which five pairs of consecutive digits are presented as a sequence of individual supervised learning tasks. In domain-IL, all tasks involve predicting the parity (even/odd) of the input digit, meaning that the output labels stay the same across tasks, but the input data changes. In class-IL, a different class has to be predicted for every digit, so that, across tasks, both the input digits and the class labels change.
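For illustration, the sketch below shows one way to construct the five tasks and their labels for both paradigms, assuming the MNIST images and digit labels are already loaded as NumPy arrays (all names are ours):

```python
# Split-MNIST: five tasks over the digit pairs (0,1), (2,3), ..., (8,9).
TASKS = [(0, 1), (2, 3), (4, 5), (6, 7), (8, 9)]

def make_task(images, digits, task_id, scenario="domain"):
    """Select one digit pair and relabel it for the given scenario.

    domain-IL: label = parity (0 even, 1 odd); outputs shared across tasks.
    class-IL : label = digit identity; new classes with every task.
    """
    a, b = TASKS[task_id]
    keep = (digits == a) | (digits == b)
    labels = digits[keep] % 2 if scenario == "domain" else digits[keep]
    return images[keep], labels
```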

4.1 Performance

Fig. 2

Performance evaluation on split-MNIST for BP, EWC, SI, and DFC-sparse-rec for domain-IL (left column) and class-IL (right column). Error bars represent standard deviations using five random seeds. a Split-MNIST accuracy at the end of training in the domain-IL paradigm on the whole test set (all digits) for a range of learning rates (LRs). The number of training iterations is fixed at four epochs. Stars indicate average performance on an accuracy-maximizing window of six LRs. b Accuracy of models at the end of training in the class-IL paradigm on the whole test set for every LR. Stars indicate average performance on an accuracy-maximizing window of six LRs. c Accuracy of models at the end of training in the domain-IL paradigm on the whole test set for a range of minimum early stop accuracies. The LR is fixed, and training is stopped on every task once the training accuracy for the current batch reaches the given minimum accuracy value. The maximal number of epochs trained for is 10. d Accuracy of models at the end of training in the class-IL paradigm on the whole test set for a range of minimum early stop accuracies

To establish whether sparse-recurrent DFC actually succeeds at CL, we compare its performance against other learning algorithms, namely Synaptic Intelligence (SI), Elastic Weight Consolidation (EWC), and standard BP as a baseline. Previous studies evaluated models at a fixed learning rate (LR) for a fixed number of epochs (Kirkpatrick et al. 2017; Van de Ven and Tolias 2019); however, we consider this problematic. Both the LR and the number of epochs can be seen as indicators of how much a network learns, pointing to an inherent trade-off between learning the current task well and forgetting previous tasks. Less learning generally leads to less forgetting, but may prevent training from converging on the current task. Comparing CL algorithms at a single LR for a fixed number of training samples is therefore problematic for two reasons. First, it does not account for different (model-specific) optimal amounts of training. Second, it fails to capture how robust a CL approach is to more learning, beyond its optimal LR and number of training samples per task. To overcome this issue, we evaluate learning algorithms in two different scenarios. In the first scenario, we fix the number of epochs and vary the LR. In the second scenario, we fix the LR and vary the training accuracy that we require on the current task before training on the next task, which results in different numbers of training batches for different models and tasks (a sketch of this protocol follows below). In both scenarios, we cover a wide spectrum between minimizing forgetting and optimizing the current task.
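A sketch of the second protocol, with a hypothetical model.train_step interface that performs one update and returns the accuracy on that batch:

```python
import numpy as np

def batches(x, y, size=128):
    idx = np.random.permutation(len(x))
    for i in range(0, len(x), size):
        j = idx[i:i + size]
        yield x[j], y[j]

def train_with_early_stop(model, task_data, min_acc, max_epochs=10):
    """Train on each task until the current batch's training accuracy
    reaches min_acc, capped at max_epochs per task."""
    for x, y in task_data:
        done = False
        for _ in range(max_epochs):
            for xb, yb in batches(x, y):
                if model.train_step(xb, yb) >= min_acc:
                    done = True
                    break
            if done:
                break
```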

4.1.1 Learning rate performance evaluation

Figures 2a and 2b show performance for a fixed number of training samples across a range of LRs for domain-IL and class-IL, respectively. The initial rise in performance followed by a decay can be explained by the fact that very small LRs (left of the peak) generally prevent sufficient learning, while high LRs (right of the peak) lead to catastrophic forgetting. These CL performance profiles confirm our initial intuition that choosing a single LR to compare CL methods may lead to overestimating one method over another. We regard good performance in this setting as a function of both peak accuracy and the degree to which accuracy is maintained beyond the optimal LR. In domain-IL, sparse-recurrent DFC significantly outperforms BP and achieves a performance profile similar to EWC. Compared to SI, our approach performs worse in terms of peak accuracy, but maintains accuracy above 70% for a wider range of LRs. In class-IL, sparse-recurrent DFC outperforms all other methods in both peak and average accuracy.

4.1.2 Early stop performance evaluation

Figures 2c and 2d show performance for a fixed LR across a range of early stop accuracies for domain-IL and class-IL, respectively. In domain-IL, sparse-recurrent DFC outperforms BP for almost all minimum accuracies. However, it is most competitive when each task is trained to convergence. For training up to very high accuracies, sparse-recurrent DFC is comparable to both EWC and SI. In class-IL, sparse-recurrent DFC outperforms all other CL algorithms for the majority of accuracies.

Overall, we conclude that sparse-recurrent DFC is a competitive CL method that shows robust performance independent of the amount of learning on each individual task. In the following sections, we investigate in more detail how accuracy depends on the main components of our method: feedback, sparsity, and intra-layer recurrence.

4.2 Integrating feedback signals facilitates CL

A major difference between standard BP and DFC is that in DFC, the activity of each neuron during training reflects feedforward as well as feedback (error) signals coming from the top-down controller. As a result, target representations \(r_{i, \textrm{ss}}\) are specific to both input and output, with data points exhibiting larger overlaps in active neuron populations if they share similar features or the same label. Figure 3a shows that CL performance is improved across a wide range of LRs if we take feedback signals into account when selecting the remaining active population within the sparse target. Although the combination of equal parts feedforward and feedback activity yields the best results overall, feedback activity alone achieves high accuracy for \(\text {LR}=10^{-3.5}\). We hypothesize that low LRs lead to less training of forward weights, rendering input selectivity less useful. Thus, it may be beneficial for the network to rely solely on feedback when determining the active population. This is consistent with our idea that incorporating feedback signals generally facilitates the sparsity selection process, allowing the learning of more task-specific representations.

4.3 Sparsity and recurrent gating are required for CL

We next investigate whether both sparsity and intra-layer recurrence in the DFC framework are crucial for CL. We compare the accuracy of sparse-recurrent DFC against standard DFC, sparse DFC and recurrent DFC. As opposed to sparse-recurrent DFC, recurrent DFC has no inactivated neurons to constrain the recurrent weight updates to. We thus apply the recurrent weight update rule from Eq. 4 to all neurons. Figure 3b shows that neither sparsity nor recurrent gating alone significantly alters CL performance across LRs. However, the combination of the two leads to better performance across a wide range of LRs.

Figure 3c shows accuracy as a function of the sparsity parameters \(s_{i, \textrm{ss}}\). For the first hidden layer, a small but nonzero sparsity level yields the best performance, while for the second hidden layer, higher sparsity levels work best. This dependence on layer depth is expected, because the early layers of multi-layer neural networks encode low-level features common to multiple classes and class-selectivity is a disadvantage for these neurons (Morcos et al. 2018), while the later layers encode higher-level features which are more specific to individual classes (Zeiler and Fergus 2014; Mahendran and Vedaldi 2016).

Fig. 3

Necessity of sparse-recurrent DFC components and activity separation analysis for domain-IL. Error bars represent standard deviations using five random seeds. a Split-MNIST performance when using varying ratios of feedforward and feedback activity to select the suppressed population for different learning rates (LRs). The x-axis represents the fraction of feedback activity used for the selection of neurons to be suppressed. A value of 0 means only feedforward activity (ff) is considered, a value of 1 means only feedback (fb) is taken into consideration, and 0.5 corresponds to an equal mix of the two activities. This activity mix is only used for selecting the active neuron population, but the activity flowing through the neurons corresponds to the normal network activity given by Eq. 1. b Cross-LR evaluation for all DFC variants. The plot reflects the overall performance on all split-MNIST digits at the end of training. c Cross-LR accuracy for different combinations of hidden layer sparsity levels. The accuracies were aggregated to single numbers by averaging over a contiguous window of six LRs that maximizes average performance, and over five random seeds. d Inter- and intra-label separations for DFC variants after all five tasks have been learned. Intra-label separations are calculated for all digit pairs with the same label, inter-label separations for all pairs of digits with different labels. Results are averaged over nine LR values. e Normalized inter-label separation calculated as the difference between inter-label separation and intra-label separation at the end of training across a range of LRs. f Visualization of intra- and inter-label distances in the space of activity separation

4.4 Aligning sparse, separated representations across tasks facilitates domain-IL

Next, we test whether the combination of sparsity and recurrent gating facilitates CL by reducing representational overlap, in a domain-IL setting. We compute the reduction in overlap (i.e. separation) between last hidden layer representations of all pairs of digits, at the end of training. We distinguish between intra-label separation (MNIST digits with the same parity label) and inter-label separation (digits with different parity labels), as shown in Fig. 3f. We compute representational separation between digits as

$$\begin{aligned} s(d_1, d_2) = 1 - \frac{a_{l}^{d_1}\cdot a_{l}^{d_2}}{\Vert a_{l}^{d_1}\Vert \Vert a_{l}^{d_2}\Vert }; \qquad a_{l}^d = \sum _{j = 1}^n |r_{l,j}^d| \end{aligned}$$
(6)

where \(r_{l,j}^d\) represents the activations in layer l elicited by the j-th sample of digit d. Figure 3d shows the average inter- and intra-label representational separations for the DFC variants. Interestingly, sparse DFC shows high representational separation, but does not yield significantly higher accuracies than standard DFC or BP. This suggests that an overall increase in representational separation alone does not account for the performance improvements we observe in Fig. 3b.
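In code, Eq. 6 amounts to one minus the cosine similarity between the summed absolute activation profiles of two digits (a minimal NumPy sketch; names are ours):

```python
import numpy as np

def separation(acts_d1, acts_d2):
    """Eq. 6: representational separation between two digits.

    acts_d : array of shape (n_samples, n_units) holding last hidden
             layer activations for one digit."""
    a1 = np.abs(acts_d1).sum(axis=0)
    a2 = np.abs(acts_d2).sum(axis=0)
    return 1.0 - (a1 @ a2) / (np.linalg.norm(a1) * np.linalg.norm(a2))
```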

To better understand this result, we next devise a new measure of separation, which we term normalized inter-label separation, defined as the average difference between inter-label and intra-label separation. Figure 3e shows this metric over a wide range of LRs. For the LRs where sparse-recurrent DFC yields higher normalized inter-label separation, we also observe better CL performance (compare with Fig. 3b), suggesting that the relative degree of representational overlap between digits can explain the CL performance profile we observe for sparse-recurrent DFC. This indicates that sparse-recurrent DFC facilitates domain-IL performance by representing even and odd digits in two partially separated neuron populations that are reused across tasks. As a first result, we conclude that, although sparsity is necessary to create non-overlapping representations, sparsity alone is not sufficient for aligning these representations across tasks. Such alignment, however, seems beneficial for domain-IL, where several digits are represented by the same label. We next investigate how recurrent gating helps to learn representations that are compatible across tasks.
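This metric can be sketched as follows, reusing the separation function above; acts is assumed to map each digit to its activation matrix and labels to its parity label:

```python
import numpy as np

def normalized_inter_label_separation(acts, labels):
    """Average inter-label separation minus average intra-label
    separation over all digit pairs (our reading of the metric)."""
    inter, intra = [], []
    digits = sorted(acts)
    for i, d1 in enumerate(digits):
        for d2 in digits[i + 1:]:
            s = separation(acts[d1], acts[d2])
            (inter if labels[d1] != labels[d2] else intra).append(s)
    return np.mean(inter) - np.mean(intra)
```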

Fig. 4

Effects of recurrent gating on last hidden layer targets \(r_{L-1,\textrm{ss}}\) and feedforward activations \(\phi (v_{L-1}^{\textrm{ff}})\) during learning. Error bars represent standard deviations using five random seeds. a Schematic of task 1 and 2 representations with respect to the hyperplane (dashed line) dividing task 1 target activations (grey) according to their label. This diagram illustrates two things: First, the new target representations align with the previously learned hyperplane in terms of label separation (supported by b). In other words, the hyperplane that separates task 1 targets also separates task 2 targets. Second, task 1 representations generally move less towards the separating hyperplane as subsequent tasks are learned (supported by c). This is represented by the arrows. b Alignment of new task target activations with previous hyperplanes. This is measured as the fraction of initial target representations (\(r_{L-1,\textrm{ss}}\), before learning task i) of the new task i that are correctly separated according to the hyperplane learned on the previous task \(i-1\). c Movement of feedforward activations \(\phi (v_{L-1}^{\textrm{ff}})\) of previous tasks towards the hyperplane after learning subsequent tasks, normalized by movement in all directions

The final hidden layer of a network has to learn representations of the input that are linearly separable by its readout weights. One possible way to prevent catastrophic forgetting is to ensure two things. Condition 1: the hyperplane separating representations of different labels (implemented in the network by the readout layer) needs to stay the same or similar across tasks. Condition 2: data points represented in the final hidden layer need to stay on the same side of the initially learned classification hyperplane as we train on subsequent tasks. We measure feedforward activations \(\phi (v_{L-1}^{\textrm{ff}})\) (no recurrent gating) and target activations \(r_{L-1,\textrm{ss}}\) (including effects of controller and recurrent gating) to test whether recurrent gating helps to achieve this. Regarding condition 1, Fig. 4b shows that, if we classify target activations at the training onset of a new task according to the previously learned separation boundary, sparse-recurrent DFC consistently yields higher classification accuracies than sparse DFC. This suggests that lateral connections regularize new target activations such that they better align with previously learned task boundaries. This idea is illustrated in Fig. 4a, showing that target activations of the second task are separated by the same hyperplane that divides targets of the first task. Regarding condition 2, we measure the direction of movement of feedforward activations from the beginning to the end of training and quantify how much the data points move towards the initially learned separation boundary. Figure 4c suggests that sparse-recurrent DFC reduces the movement towards the previous decision boundary compared to sparse DFC. Taken together, our results suggest that recurrent gating helps fulfil both conditions. For more details on the calculation of these metrics involving hyperplanes, see “Appendix C”.
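Our reading of the condition 1 measurement can be sketched as follows, assuming the parity readout is reduced to a single score per sample (e.g. the difference between the two output units); the actual computation is detailed in “Appendix C”:

```python
import numpy as np

def hyperplane_alignment(targets_new, labels_new, w_out, b_out):
    """Fraction of new-task target activations that fall on the correct
    side of the separation hyperplane learned on the previous task.

    targets_new : (n_samples, n_units) targets at the onset of task i.
    labels_new  : binary parity labels in {0, 1}.
    w_out, b_out: readout weights and bias from task i-1."""
    scores = targets_new @ w_out + b_out
    return float(np.mean((scores > 0).astype(int) == labels_new))
```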

4.5 Learning within separate subspaces facilitates class-IL

One possible strategy to address class-IL is to enforce sparse, non-overlapping representations of different digits, thereby preventing interfering weight updates between classes. To test whether sparse-recurrent DFC utilizes this strategy, we record target activities of different digits after they are first learned and measure the representational overlap of all pairs of digits using Eq. 6. Figure 5a shows that, while sparse DFC leads to some increase in representational separation, sparse-recurrent DFC maximizes separation across all LRs compared to other DFC variants. These results are consistent with our initial idea of reduced representational overlap facilitating CL. Intuitively, if different neurons are used for different tasks, weights of neurons that were important in early tasks are less likely to be changed. Similar to domain-IL, sparsity in class-IL can thus be seen as a necessary condition for the formation of non-overlapping representations.

Fig. 5

Last hidden layer target activation (\(r_{L-1,\textrm{ss}}\) belonging to task i, after learning task i) analysis for class-IL. Error bars represent standard deviations over five random seeds. a Representational separation (Eq. 6) between pairs of digits for DFC variants for a range of learning rates (LRs). b Effective dimensionality (Roy and Vetterli 2007) of targets averaged over tasks and random seeds for DFC variants for \(\text {LR}=0.001\). c Visualization of the ‘unaltered dimensionality fraction’ \(\gamma \) measure described in “Appendix D”. The left and right ellipses represent the subspace used by the first \(i-1\) tasks, and by task i, respectively. \(\gamma \) quantifies the dimensionality of the coloured area as a fraction of the dimensionality of the area of the left ellipse. d Unaltered dimensionality fraction \(\gamma \) (described in Eq. 18 from “Appendix D” and visualized in subplot c) for DFC variants

To better understand why recurrent gating increases representational separation in class-IL, we next analyse its effect on the dimensionality of targets. Figure 5b shows the effective dimensionality (Roy and Vetterli 2007) of the target activations of different tasks after learning, for recurrent DFC, sparse DFC and sparse-recurrent DFC. The results suggest that the combination of sparsity and recurrent gating significantly decreases the effective dimensionality of the target activations. This led us to hypothesize that representations learned for a new task are less likely to affect dimensions that were important for previous tasks. To investigate whether recurrent gating reduces the reuse of previously learned subspaces, we compute the fraction of the effective dimensionality used by previous tasks that is left unaltered by the current task (Fig. 5c); for more details on the calculation of this metric, see “Appendix D”. Figure 5d validates our hypothesis that recurrent gating reduces the fraction of dimensions that are altered by new tasks, thus reducing the extent to which new weight updates interfere with parameters important for previous tasks.
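The effective dimensionality of Roy and Vetterli (2007) is the exponential of the Shannon entropy of the normalized singular value distribution; a minimal sketch:

```python
import numpy as np

def effective_dimensionality(acts):
    """Effective rank of an activation matrix (Roy and Vetterli 2007).

    acts : array (n_samples, n_units) of target activations for one task."""
    s = np.linalg.svd(acts, compute_uv=False)
    p = s / s.sum()
    p = p[p > 0]                          # drop zero singular values
    return float(np.exp(-(p * np.log(p)).sum()))
```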

5 Discussion

In summary, we have presented a new, bio-inspired, task-free CL approach that yields competitive performance compared to other CL methods on a simple computer vision benchmark. To restrict learning to a reduced set of task-specific parameters, our method (sparse-recurrent DFC) integrates feedforward and feedback information to constrain activity to a sub-population of neurons. In addition to being more biologically plausible, we show that including top-down signals is beneficial for CL. Our results are consistent with the idea that sparsity is a requirement for reducing representational overlap, but suggest that sparsity alone is insufficient for protecting previously learned model parameters. We show that intra-layer recurrent connections, when combined with sparsity, facilitate the protection of old task representations, leading to competitive CL performance of DFC on split-MNIST. For both domain- and class-IL, recurrent gating in combination with sparsity restricts learning to low-dimensional subspaces. In domain-IL, the same subspace, consisting of two separated neuron populations, is shared across tasks; in class-IL, learning is restricted to multiple distinct subspaces.

From a neuroscience perspective, our findings might allow experimental researchers to derive new hypotheses about how the brain minimizes catastrophic forgetting. One prediction of our sparse-recurrent DFC network is that intra-layer recurrent connections are critical only during learning but not during inference, since we only use recurrence at training time. Although this may seem surprising, there are data suggesting that biological brains operate similarly. Van Rullen et al. (1998) argue that, given the short response time in face recognition tasks, neurons do not have the time to emit much more than one spike at each processing stage. This would imply that initial inference can happen before recurrence takes effect. Based on our work, neuroscientists could, for example, manipulate recurrent communication within cortical hierarchies to test whether an animal's ability to perform inference or to learn multiple tasks sequentially is affected.

From a machine learning perspective, our new method is relevant because it is based on a novel set of working principles to achieve CL. As sparse-recurrent DFC naturally infers non-overlapping representations and thus non-interfering parameter updates, it does not require any task boundaries or task information during either training or testing. While other task-free CL methods exist and achieve competitive performance, they are not exclusively based on specialized weight update rules, as they use either data replay or expanding architectures. The only exception we could find is limited to binary networks (Laborieux et al. 2021). Moreover, in future work, our approach could be combined with other task-free CL methods (replay and non-replay-based), which might lead to even better CL performance. Although the current implementation of sparse-recurrent DFC is computationally less efficient than standard CL algorithms running on GPUs, DFC is ideally suited for a neuromorphic hardware implementation that might be more energy-efficient. Finally, we want to acknowledge the limitations of our experimental paradigm: MNIST is a simple dataset, and the number of tasks is limited. While results from additional experiments suggest that our method generalizes to other datasets (Appendix A.1) and more tasks (Appendix A.2), performance gains diminish in a mixed-dataset training paradigm (Appendix A.2). This suggests that sparse-recurrent DFC requires an overlap in useful features between tasks to facilitate CL.

Overall, our work showcases the idea of adopting biological principles of neural computation and learning to derive new CL methods that not only perform significantly better than BP, but also show performance comparable to existing CL algorithms.