LLEDA - Lifelong Self-Supervised Domain Adaptation

Humans and animals have the ability to continuously learn new information over their lifetime without losing previously acquired knowledge. Artificial neural networks, however, struggle with this: new information conflicts with old knowledge, resulting in catastrophic forgetting. The complementary learning systems (CLS) theory [1, 2] suggests that the interplay between the hippocampus and the neocortex enables long-term and efficient learning in the mammalian brain, with memory replay facilitating the interaction between these two systems to reduce forgetting. The proposed Lifelong Self-Supervised Domain Adaptation (LLEDA) framework draws inspiration from the CLS theory and mimics the interaction between two networks: a DA network inspired by the hippocampus that quickly adjusts to changes in data distribution, and an SSL network inspired by the neocortex that gradually learns domain-agnostic general representations. LLEDA's latent replay technique facilitates communication between these two networks by reactivating and replaying past latent memory representations to stabilise long-term generalisation and retention without interfering with previously learned information. Extensive experiments demonstrate that the proposed method outperforms several other methods, achieving long-term adaptation while being less prone to catastrophic forgetting when transferred to new domains.


Introduction
Deep neural networks have shown near human-level capabilities in many fundamental computer vision tasks [3,4,5,6,7]. Humans and animals can continuously acquire new information over their lifetime without catastrophically forgetting the prior knowledge learned. This ability to continually learn over time by accommodating new knowledge while retaining previously learned knowledge is referred to as lifelong or continual learning (in our paper, we will continue to refer to it as lifelong learning). However, artificial neural networks lack these capabilities: new information interferes with previously learned knowledge, and sometimes the old knowledge is completely overwritten by the new, leading to impaired performance [8]. The root cause of catastrophic forgetting is that learning necessitates changes in the weights of a neural network; however, these changes also result in the forgetting of previous learning.
The focus of this paper is on lifelong domain adaptation, in which the model is trained on multiple sequential domains, continuously adapting to new domains with changing distributions as they become available, while maintaining its knowledge of previously encountered domains.
Domain adaptation (DA) methods based on deep learning have received significant attention in recent years for mitigating the domain shift from the training domain to the inference domain [9,10,11,12], and have even been suggested as transformative technologies in settings such as agriculture [13] and the arts [14]. However, current domain adaptation methods operate under the assumption that datasets from both the source and the target domains are accessible at the same time during training, which may not be feasible in practice. In addition, DA algorithms require fully labelled datasets; even state-of-the-art Unsupervised Domain Adaptation (UDA) methods need access at least to the labelled source dataset. These algorithms therefore require persistent manual annotation, which is time-consuming, cumbersome and expensive. Finally, simply updating the underlying model is not sufficient, as the model would likely forget the previously learned domain information, resulting in catastrophic forgetting. Acknowledging these issues, we propose LLEDA, which addresses both catastrophic forgetting and domain-agnostic knowledge transfer using solely unlabelled datasets, with access to a single domain at any given time.
The mammalian brain can continually acquire, process, consolidate, retrieve, and infer knowledge over time without catastrophically forgetting the previously learned information which can be explained using CLS theory [1,2]. It suggests that efficient learning in the mammalian brain requires two learning systems: the neocortex and the hippocampus. The first system gradually acquires structured generalised knowledge, while the second system quickly learns the specific experiences, and the interplay between these two systems enables long-term retention. It also implies that memory replay is the mechanism that facilitates interaction between these two systems to consolidate and stabilise new memories for long-term generalisation to reduce catastrophic forgetting.
Recently, a study by [15] identified that existing lifelong learning techniques are missing a few biological elements. It highlights that many existing approaches solely focus on modelling the cortex directly and lack a rapid learning network, which is essential for facilitating effective lifelong learning in the brain. Additionally, the study points out that none of the current methods employs information from the neocortex-inspired network to influence the training of the hippocampal-inspired network, whereas in biological networks the neocortex influences learning in the hippocampus and vice versa.
Our proposed LLEDA network addresses the first issue by utilising two distinct networks: a DA network for rapid learning and an SSL network for gradual acquisition. LLEDA mimics the interplay between the neocortex and the hippocampus: the hippocampal-inspired DA network functions as a rapid acquisition mechanism that adapts to the distribution shift between the given data stream and the data from memory, while the neocortex-inspired SSL network acts as a gradual learning mechanism that generalises representations by slowly acquiring structured knowledge using self-supervised techniques, enabling effective lifelong learning. LLEDA's latent memory replay facilitates communication between these two networks by reactivating the neural activity patterns representing previous experiences, stabilising new memories for long-term generalisation and retention without interfering with previously learned information. LLEDA addresses the second issue by querying information from the neocortex-inspired network to influence the training of the hippocampal-inspired network during training.
Overall, our framework reduces catastrophic forgetting while facilitating domain-agnostic knowledge transfer without accessing labelled data from either the source or target domains at any given time. To the best of our knowledge, this is an area of domain adaptation that has not yet been explored. In summary, our work makes the following contributions:
1. Inspired by the CLS theory, LLEDA mimics the interplay between the DA network, which helps to rapidly adapt to distribution shifts between domains, and the SSL network, which helps with the gradual acquisition of domain-agnostic general representations; the latent representation replay technique replays past memory representations, instead of raw image pixels, to overcome catastrophic forgetting.
2. Our proposed self-supervised approach does not require access to either source or target labels, saving the time and effort of annotating data and mitigating labelling bias.
3. Extensive empirical results demonstrate that our method performs competitively across several benchmarks when compared against other approaches.

Related Work
Domain Adaptation: Under the assumption of independent and identically distributed (iid) data, a deep neural network trained on one set of data is expected to perform well on a new, unseen set of data. However, this assumption may not always hold in real-world applications due to the discrepancy between domain distributions, and applying the trained model to the new dataset may result in degraded performance. Domain adaptation is a special case of transfer learning where the goal is to learn a discriminative model in the presence of domain shift between source and target datasets. Various methods have been introduced to minimise the domain discrepancy in order to learn domain-invariant features. Some involve adversarial methods like DANN [11] and ADDA [16] that help align source and target distributions. Other methods propose aligning distributions by minimising a divergence, using popular measures such as maximum mean discrepancy (MMD) [17,9,18,5,10,12], correlation alignment [19,20], and the Wasserstein metric [21,22]. MMD was first introduced for two-sample tests of the hypothesis that two distributions are equal, based on observed samples from the two distributions [17], and is currently the most widely used metric to measure the distance between two feature distributions. The Deep Domain Confusion Network proposed by Tzeng et al. [23] learns representations that are both semantically meaningful and domain invariant, while Long et al. proposed DAN [9] and JAN [18], which perform domain matching via multi-kernel MMD (MK-MMD) or joint MMD (J-MMD) criteria in multiple domain-specific layers across domains.
Self-Supervised Learning: Self-Supervised Learning (SSL) is a paradigm developed to learn visual features from unlabelled data. Recently, SSL approaches have shown significant performance, sometimes even surpassing that of supervised baselines [24,25,26,27,28,29,30,31,32]. These methods use image augmentation techniques to generate multiple views of a given image and learn a model that is invariant to these augmentations. Most recent approaches fall into two main categories: contrastive and non-contrastive methods. Contrastive methods learn an embedding space where positive pairs are pulled together, whilst negative pairs are pushed away from each other [24,25,26]. Non-contrastive methods, on the other hand, remove the need for explicit negative pairs, either by using distillation or by regularising the variance and covariance of the embeddings [30,31,28,29]. However, none of these works studied the ability of SSL methods, applied directly, to learn continually and adaptively. Moreover, very few works have attempted to use SSL in the lifelong domain adaptation setting: [33] is designed around contrastive learning, so it lacks the capability to adapt using other SSL paradigms; [34] trains the model stepwise by generating pseudo-labels and fine-tuning on intermediate domains until it reaches the target domain, so it adapts well only if the domain shift between intermediate domains is small, and it also uses source-labelled data. In this paper, we present a general-purpose framework that incorporates self-supervised learning approaches into the lifelong learning process to extract generalised representations.
Continual learning: Continual learning strategies aim to find the right balance between preventing catastrophic forgetting and acquiring new information. According to [35], catastrophic forgetting can be mitigated using model regularisation, memory replay, or by expanding and training the network. Regularisation methods identify the network weights that contribute significantly to retaining knowledge about a previously learned task and consolidate them when the model is updated to learn subsequent tasks [36,37,38]. Dynamic architectures, on the other hand, modify the model's underlying architecture by dynamically allocating neural resources as it learns new patterns [39,40,41]. Alternatively, the model can be expanded progressively to learn new tasks using added weights, with constraints on the tasks' objectives to avoid forgetting [42,43,44]. CLS and replay methods rely on memory replay, storing samples from old distributions and regularly feeding them back to the model to overcome catastrophic forgetting. Some existing CL methods [45,46,47] store raw inputs of previous data in memory; however, replaying raw pixels is not biologically plausible. Generative replay approaches [48,49] are very difficult to train due to issues such as convergence and mode collapse, and scaling generative replay up to complex datasets is challenging. Latent replay is the most biologically plausible approach [50]; hence we adopt latent replay in LLEDA.
The existing literature has limited research on domain adaptation within the context of lifelong learning. [51] tackles continual and supervised adaptation using labelled data, while [52] and [53] address continual domain adaptation assuming gradual target shifts in changing environments, which limits the practicality of these works. We relax these limitations and develop a unified solution, LLEDA, inspired by the functioning of the mammalian brain and the CLS theory. LLEDA tackles the problem of catastrophic forgetting and enables domain-agnostic knowledge transfer without the need for labelled data from either the source or target domains. It operates exclusively with unlabelled datasets, learning from a single domain at a time. This work sits at the junction of lifelong learning, self-supervised learning, and domain adaptation.

Methodology
Our overall objective is continually updating a model to learn distributional shifts while retaining knowledge about past learnings. We propose a novel lifelong domain adaptation framework (depicted in figure 1 and algorithm 1), which has three key components and is motivated by the CLS theory [1]. The DA network in LLEDA swiftly adapts to changes in the data distribution between the current domain and previously encountered domains. The SSL network learns to generalise representations through self-supervised learning of domain-agnostic data, while the latent memory component facilitates the interaction between the two networks. By replaying and reactivating past experiences, this component stabilises new memories for long-term retention and generalisation. The combined operation of the DA and SSL networks integrates new information into the long-term network without compromising previous knowledge.
The LLEDA framework process involves the following steps: first, the SSL network learns the visual features and their relationships from the unlabeled input data using self-supervised techniques. As the SSL network is not task-specific, the learned representations are more general, capturing the underlying structure of the data. Next, the DA network uses Maximum Mean Discrepancy (MMD) loss to address domain shift between the current domain and previous domains stored in memory. This loss is backpropagated to both networks for consolidation and to prevent interference. The latent memory component stores and replays past experiences as representations, rather than raw input pixels, to aid interaction between the two networks. All learning occurs in a synchronous and interleaved manner.
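The interleaved procedure above can be sketched as a single training step. The snippet below is purely illustrative: the toy feature maps, the simple losses, and the names (`lleda_step`, `mmd_like`) are our own stand-ins under stated assumptions, not the paper's implementation.

```python
import numpy as np

# Toy stand-ins for the real components (illustrative only): each "network"
# is a fixed feature map, the SSL loss is a plain MSE between two augmented
# views, and the DA loss is a crude mean-embedding distance.
ssl_net = lambda x: np.tanh(x)
da_net = lambda x: 0.5 * x
ssl_loss = lambda a, b: float(np.mean((a - b) ** 2))
mmd_like = lambda s, t: float(np.sum((s.mean(axis=0) - t.mean(axis=0)) ** 2))

def lleda_step(view1, view2, memory_latents):
    """One interleaved LLEDA-style step: a gradual self-supervised loss on
    two augmented views, plus a rapid DA loss between current-stream
    features and replayed memory representations. Both losses are summed
    so that, in a real implementation, the gradient reaches both networks."""
    z1, z2 = ssl_net(view1), ssl_net(view2)
    loss_ssl = ssl_loss(z1, z2)            # gradual, domain-agnostic learning
    d = da_net(view1)
    loss_da = mmd_like(d, memory_latents)  # rapid adaptation via latent replay
    return loss_ssl + loss_da
```

In an actual training loop, `memory_latents` would be a mini-batch drawn from the latent replay buffer and interleaved with the current domain's representations.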

Generalised Feature Learning
LLEDA employs the SSL network to gradually learn and capture the visual features, their underlying structure, and their relationships. As this network is trained independently from the DA network, LLEDA's SSL backbone is compatible with all existing SSL models (SimCLR [24], BYOL [30], etc.), so any generic SSL model can be used as the backbone. However, we have chosen VICReg [54] as our backbone due to its simplicity; additionally, it does not require a memory bank, contrastive samples, or a large batch size. We have also conducted ablation studies using alternative SSL models such as SimCLR [24] and BYOL [30] as the backbone network, which are discussed later in section 4.5. The VICReg model calculates the loss between two embeddings z_i and z_j as a weighted average of invariance, variance and covariance terms. The SSL loss is defined as follows:

L_SSL(z_i, z_j) = λ s(z_i, z_j) + µ [υ(z_i) + υ(z_j)] + ν [c(z_i) + c(z_j)]

where λ, µ, ν are the hyper-parameters controlling the importance of each term in the loss, s(z_i, z_j) is the invariance term, υ(z_i), υ(z_j) are the variance terms, and c(z_i), c(z_j) are the covariance terms. The overall objective is this loss aggregated over all pairs of augmented views in the current batch.
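The three VICReg terms can be sketched as follows. This is a minimal NumPy reconstruction assuming the standard VICReg definitions (mean-squared invariance, a hinge on each dimension's standard deviation, and squared off-diagonal covariance); the coefficient defaults are common choices, not necessarily the paper's.

```python
import numpy as np

def vicreg_loss(z_i, z_j, lam=25.0, mu=25.0, nu=1.0, gamma=1.0, eps=1e-4):
    """Minimal sketch of the VICReg objective: a weighted combination of
    invariance, variance and covariance terms over two embedding batches
    of shape (n, d). Coefficient defaults are assumptions, not the paper's."""
    n, d = z_i.shape

    # Invariance s(z_i, z_j): mean squared distance between the two views.
    invariance = float(np.mean(np.sum((z_i - z_j) ** 2, axis=1)))

    # Variance v(z): hinge keeping each dimension's std above gamma.
    def variance(z):
        std = np.sqrt(z.var(axis=0) + eps)
        return float(np.mean(np.maximum(0.0, gamma - std)))

    # Covariance c(z): squared off-diagonal entries of the covariance
    # matrix, decorrelating the embedding dimensions.
    def covariance(z):
        zc = z - z.mean(axis=0)
        cov = (zc.T @ zc) / (n - 1)
        off_diag = cov - np.diag(np.diag(cov))
        return float(np.sum(off_diag ** 2) / d)

    return (lam * invariance
            + mu * (variance(z_i) + variance(z_j))
            + nu * (covariance(z_i) + covariance(z_j)))
```

Note that none of the three terms requires negative pairs or a memory bank, which is the property motivating the choice of VICReg above.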

Domain-specific Representations Learning
The goal of the DA network is to rapidly learn to reduce the domain discrepancy for incoming domains, while simultaneously performing well on previous domains without catastrophically forgetting what was learned. The DA network uses the Maximum Mean Discrepancy (MMD) loss to address domain shift. Inspired by Dualnet [47], the DA network also interacts with the SSL network and acquires generic representations that influence its learning in a manner akin to biological networks, improving its capacity to reduce the discrepancy between domains. It reduces the discrepancy in two stages: first, it calculates the MMD loss using representations from block 4 of the Resnet (DA1); second, following the element-wise multiplication, it calculates the MMD loss between the memory representations and the current data stream propagation (DA2). Calculating the MMD loss at two stages (DA1 and DA2), as seen in figure 2, reduces the domain shift more effectively than a single domain adaptation loss.
Let s_4 be the feature representation from the SSL network's residual block, and d_4 be the feature representation from the DA network's residual block, as shown in figure 2. The adapted feature is obtained during network interaction as follows:

d̃_4 = d_4 ⊗ s_4

where ⊗ denotes element-wise multiplication; the output of the rapid DA network d_4, the output of the gradual SSL network s_4, and the transformed feature d̃_4 all have the same dimension.
The final layer's transformed feature d̃_4 is fed into the DA network's head to calculate the DA2 loss using MMD. The rapid DA network thus takes advantage of the gradual SSL learner's generalised feature representations, resulting in quick adaptation, reduced domain shift, and improved generalisation, which in turn leads to better identification of classes in the downstream classification task.
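A minimal sketch of this two-stage DA loss is given below, with a toy mean-embedding distance standing in for the real MMD estimator; the function and variable names are ours, not the paper's.

```python
import numpy as np

def mmd_stub(a, b):
    """Toy mean-embedding distance standing in for a real MMD estimator."""
    return float(np.sum((a.mean(axis=0) - b.mean(axis=0)) ** 2))

def two_stage_da_loss(d4_cur, d4_mem, s4_cur, s4_mem):
    """Sketch of the two-stage DA loss (variable names are ours).
    DA1: discrepancy on the raw block-4 DA features, current vs memory.
    DA2: discrepancy on the adapted features d4 * s4 obtained after the
    element-wise interaction with the SSL network's block-4 features."""
    loss_da1 = mmd_stub(d4_cur, d4_mem)
    adapted_cur = d4_cur * s4_cur  # element-wise multiplication
    adapted_mem = d4_mem * s4_mem
    loss_da2 = mmd_stub(adapted_cur, adapted_mem)
    return loss_da1 + loss_da2
```

The sum of the two stages is what gets backpropagated to both networks during consolidation.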
MMD defines the distance between two distributions via their mean embeddings in a Reproducing Kernel Hilbert Space (RKHS). MMD is a two-sample kernel test used to determine whether to accept or reject the null hypothesis p = q [17], where p and q are the source and target domain probability distributions. In short, the MMD between the distributions of two datasets is equivalent to the distance between the sample means in a high-dimensional feature space, and is computed by the following equation:

MMD²(X_s, X_t) = || (1/N) Σ_{i=1}^{N} φ(x_i^s) − (1/M) Σ_{j=1}^{M} φ(x_j^t) ||²_H

where φ(·) is the mapping to the RKHS H, k(·,·) = ⟨φ(·), φ(·)⟩ is the universal kernel associated with this mapping, and N, M are the total numbers of samples in the source and target domains respectively.
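Expanding the squared norm above via the kernel trick gives the usual biased empirical estimate, sketched below with a Gaussian (RBF) kernel; the kernel choice and bandwidth are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    """RBF kernel matrix K[i, j] = exp(-||x_i - y_j||^2 / (2 * sigma^2))."""
    sq_dist = (np.sum(x ** 2, axis=1)[:, None]
               + np.sum(y ** 2, axis=1)[None, :]
               - 2.0 * x @ y.T)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

def mmd2(source, target, sigma=1.0):
    """Biased empirical estimate of squared MMD between two samples:
    MMD^2 = E[k(s, s')] + E[k(t, t')] - 2 E[k(s, t)], which is zero when
    the two samples coincide and grows with the distribution gap."""
    k_ss = gaussian_kernel(source, source, sigma)
    k_tt = gaussian_kernel(target, target, sigma)
    k_st = gaussian_kernel(source, target, sigma)
    return float(k_ss.mean() + k_tt.mean() - 2.0 * k_st.mean())
```

Because the estimate is differentiable in the inputs, it can be used directly as a training loss on the feature representations.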

Latent Replay
The mammalian brain has successfully evolved to resist catastrophic forgetting by reactivating, replaying, and recreating the experiences preserved in memories [55,56]. It retains compressed versions of the crucial information from past experiences and reactivates them by replaying these neural activity patterns of prior experiences. Inspired by this, LLEDA stores feature representations from a given (specific) layer in memory instead of raw input pixels, and reactivates and replays these representations to overcome catastrophic forgetting. In LLEDA, we chose to store the representations from block-1 of our backbone Resnet network. We freeze the network layers below block-1 (below the latent replay layer) to ensure the stability and accuracy of these representations and to prevent the aging effect [57]. Freezing the network also keeps the stored representations stable; otherwise, they would differ from the feature representations that would have been generated by feed-forwarding from the input layer.
Table 3: Comparison of the proposed LLEDA method on Office-Caltech datasets with state-of-the-art methods, using Average Accuracy (Avg) as the performance metric. The best average is indicated in bold.

As our model does not have access to labels, we follow a simple approach: we store a random subset of past latent representations in memory and train the network while interleaving them with new domain representations [58]. While selective replay has shown promising results in a few settings, several studies have found that random sampling works equally well [59,60], achieving similar performance to more complicated techniques while using less compute; hence we store a random subset of representations in memory. For each randomly selected image, we save the latent representations from both the DA and the self-supervised networks. During memory consolidation, these memories are interleaved with new latent representations to form a more general representation, supporting long-term retention and generalisation when encountering new domain experiences. As it would be inefficient and impractical to store all past latent representations in the latent buffer, we instead store only a small number of latent representations per domain until the buffer reaches the given capacity. Thus, at any point in time, the buffer contains a limited number of past random experiences, as shown in algorithm 2.
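One simple way to realise such a fixed-size random memory is reservoir sampling. The class below is our own illustrative sketch (the names and the exact sampling scheme are assumptions; the paper's algorithm 2 may differ in detail).

```python
import random

class LatentReplayBuffer:
    """Illustrative latent replay memory: stores a random subset of latent
    representation pairs (one from the DA network, one from the SSL network)
    instead of raw pixels, up to a fixed capacity."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.buffer = []
        self.rng = random.Random(seed)
        self.seen = 0  # total number of representations observed so far

    def add(self, latent_pair):
        """Reservoir sampling: every representation seen so far ends up in
        the fixed-size buffer with equal probability."""
        self.seen += 1
        if len(self.buffer) < self.capacity:
            self.buffer.append(latent_pair)
        else:
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.buffer[j] = latent_pair

    def sample(self, batch_size):
        """Draw a random mini-batch to interleave with the current domain."""
        k = min(batch_size, len(self.buffer))
        return self.rng.sample(self.buffer, k)
```

Because insertion is O(1) and the buffer never exceeds its capacity, memory cost stays constant no matter how many domains have been seen.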

Datasets
We compare and evaluate our method against baseline approaches on a number of benchmark domain adaptation datasets, such as Digits, Office-Home [61] and Office-CalTech [62].
Digit Datasets: We consider the standard digits datasets broadly adopted by the computer vision community. MNIST [63] and USPS [64] are hand-written grey-scale images, with relatively small domain differences. SVHN [65] contains images of street numbers with more than one digit in each image. We conducted experiments on two tasks: SVHN → USPS → MNIST and MNIST → USPS → SVHN and reported the average accuracy of the trained model in the context of lifelong learning setting. These two scenarios will allow us to reflect on the performance of lifelong learning scenarios starting from easy datasets, moving to harder ones and vice versa. Sample images of the digit datasets are presented in figure 3.
Office-Home [61]: The Office-Home dataset consists of four visual domains: Art (A), Clipart (C), Real World (R), and Product (P), each consisting of images from 65 visual categories, totalling 15,500 images in office and home settings and allowing the definition of 12 pair-wise UDA tasks. We conducted several experiments on two tasks, Art → Realworld → Clipart → Product and Product → Clipart → Realworld → Art, and report the average accuracy of the trained model in the lifelong learning setting. Sample images of the Office-Home dataset are presented in figure 4.
Office-CalTech [62]: This dataset is an extension of Office-31 [66], containing the 10 common categories shared by Office-31 and Caltech-256, which yields four domains: Amazon (A), Webcam (W), DSLR (D), and Caltech (C).

Training Methods
We benchmark LLEDA against a baseline method that uses a single network and finetunes the model as new training domains come along. We then compare LLEDA with DANN [11] and DAN [9], two classic domain adaptation methods that both have access to source and target data during training. We also compare LLEDA with CUA [53] and GRCL [33], which are replay-based continual learning methods with access to source labels. Finally, we compare these methods with the supervised version of our approach, LLEDA-S. Most of these methods report results in the domain adaptation setting, but we provide our results in the lifelong learning setting described in section 4.1.

Implementation Details
Our implementation consists of three stages. In the first stage, we pre-train the model on ImageNet. In the second stage, we use the pre-trained model and further train the LLEDA model as outlined in the methodology section. In the final stage, we train a linear classifier on a fixed representation while removing the MMD projection head, and evaluate it on the domain datasets.
We pretrain our network on ImageNet using Resnet18 [68] as our backbone model. During the pretraining phase, we train LLEDA on two nodes, each consisting of 4 Titan Xp GPUs, using the LARS optimizer [69] with a batch size of 512 and a weight decay of 1e-6 for a total of 100 epochs.
During the training phase, we use the pretrained network from the previous step as a base, and train on the datasets by interleaving the stored memory representations from the given layer with the current domain's data representations. During finetuning and evaluation, we freeze the trained network, discard the MMD part of the network, and train a linear classifier on top of the fixed representation, which we then use for evaluation. Similar to most self-supervised models [24,25,26,28,29,30,31,70], we report performance using this linear evaluation protocol, a standard benchmark adopted by many papers in the literature.

Results and Analysis
Baseline: We start by training a basic model M_i on domain D_i, then finetune the model by training on the next available sequential domain D_{i+1}. When this training reaches the end of the cycle, the model often performs badly on older domains due to catastrophic forgetting; we treat this as our baseline. In our experiments, we use Resnet18 as our baseline model.
Digits dataset: To start with, LLEDA improves on the baseline from 56.7% to 86.6% (table 1), although the baseline is a very basic finetuned model. Overall, table 1 shows that the performance of LLEDA and LLEDA-S on the Digits dataset is significantly better than that of the other state-of-the-art methods.
Office-Home dataset: As with the Digits dataset, performance on the Office-Home dataset improves over the baseline method from 28.7% to 58.2%, as seen in table 2, which is expected. The CUA method has a slight advantage, probably due to its access to source labels; compared to LLEDA, its accuracy is 0.4% higher. On the other hand, LLEDA-S, which has access to labels, outperforms the CUA methodology by 1.7%.
Office-Caltech dataset: We make a similar observation with respect to the baseline comparison: performance increases from 52.3% to 86.1%, as expected. LLEDA performs well compared to the other state-of-the-art methods in table 3. GRCL is 1.1% higher than LLEDA, as it has a slight advantage due to its access to source labels. However, comparing GRCL with LLEDA-S, LLEDA-S shows a marginal (0.1%) increase in performance over the GRCL methodology.
Overall, even though LLEDA has access to neither labels nor source datasets, we can clearly observe from tables 1-3 that LLEDA's performance, i.e., the average accuracy, is similar to or better than that of the other methods.

Ablation Studies
Ablation: LLEDA's SSL network using state-of-the-art self-supervised methods as building blocks. We evaluated the effectiveness of LLEDA by replacing LLEDA's SSL network with some of the state-of-the-art SSL networks. To assess lifelong learning performance, we start by training on image samples from one dataset, and then continue to train on image samples from the next dataset, and so on. The two cycles we use are SVHN → USPS → MNIST and MNIST → USPS → SVHN, referred to as cycle-1 and cycle-2 respectively.
We analyse LLEDA's accuracy with respect to the method used for the gradual learning SSL network. In table 4 and figure 6, we compare three SSL methods: SimCLR [24], BYOL [30] and VICReg. We chose these SSL networks because the three methods feature different losses and use different techniques to avoid collapse, such as negative samples and redundancy reduction. Additionally, the first is a contrastive method, whereas the latter two are non-contrastive. Table 4 shows that the average performance of VICReg is robust in comparison to that of contrastive-based SimCLR [24], as the latter requires large amounts of contrastive pairs and a higher batch size to converge. The average performance of VICReg slightly underperforms compared to BYOL [30], but overall the performance of all three SSL methods is very similar; hence any generic SSL method will work with LLEDA.
LLEDA SSL and DA network interaction using element-wise operations: We analysed different types of operations used for the interaction and influence between the DA network and the SSL network. Besides the adopted element-wise multiplication, we considered element-wise addition, element-wise maximum, and element-wise mean to test the generalisation ability. Table 5 demonstrates that the element-wise maximum is a poor choice, since the interaction between the two networks becomes more competitive than complementary. Element-wise addition and element-wise mean perform better than the element-wise maximum, but the adopted element-wise multiplication is the best choice.

Conclusion & Future Work
Inspired by how the human brain works and the CLS theory, we developed LLEDA, a model that can perform well in a lifelong domain adaptation setting. Our experiments demonstrate that LLEDA can effectively tackle downstream domain adaptation tasks without access to labelled data, outperforming other existing methods. We believe our study will encourage future research in lifelong domain adaptation using unlabelled source and target data. As our next step, we aim to investigate efficient lossy and lossless compression techniques for compressing latent representations in LLEDA.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.