Domain Adaptation: Challenges, Methods, Datasets, and Applications

Deep Neural Networks (DNNs) trained on one dataset (source domain) do not perform well on another set of data (target domain) that differs from, but shares similar properties with, the source domain. Domain Adaptation (DA) strives to alleviate this problem and has great potential for application in practical settings, real-world scenarios, industrial applications, and many data domains. Various DA methods aimed at individual data domains have been reported in the last few years; however, there is no comprehensive survey that encompasses all these data domains, focuses on the datasets available, the methods relevant to each domain, and, importantly, the applications and challenges. To that end, this survey paper discusses how DA can help DNNs work efficiently in these settings by reviewing DA methods and techniques. We consider five data domains: computer vision, natural language processing, speech, time-series, and multi-modal data. We present a comprehensive taxonomy, including the methods, datasets, challenges, and applications corresponding to each domain. We also discuss industrial use cases and how DA can be implemented for them. Our final aim is to provide future research directions based on evolving methods and results, the datasets used, and industrial applications.


I. INTRODUCTION
Leon C. Megginson summed up Charles Darwin's work [1] by saying, ''It is not the strongest of the species that survives, not the most intelligent that survives. It is the one that is most adaptable to change''. The same can be said about technology. The workhorse of Machine Learning (ML) and Artificial Intelligence (AI), supervised learning, has a severe limitation: it works well only when the samples used for training and testing belong to the same distribution and are independent and identically distributed (i.i.d.). Domain Adaptation (DA) is a special case of Transfer Learning (TL) that supports and solves real-world (including in-the-wild) challenges by effectively applying a model trained on one dataset (source) for testing on another domain (target) with a different distribution.
Domain Adaptation (DA) is increasingly acquiring traction from academia and industry since it promises the practical and evolving side of AI and ML. DA, in many ways, mimics how humans learn and adapt to the real world around them. In practice, we see that a supervised learning model's accuracy (or another performance metric) does not transfer, for the same task, to datasets not used as part of the training. The primary reason for this failure is a deviation from a core assumption: that the source and target domain data are drawn from the same distribution. The problem is further accentuated when we recognize that acquiring labeled data is time-consuming, costly, and at times infeasible, which means that state-of-the-art models are limited to only some academic datasets. The performance degradation is caused by domain shift (domain gap or dataset bias): the difference in data distributions between source and target domains. DA is a field of AI that aims to alleviate, as far as possible, the impact of domain shift and to ensure that models perform well in the target domain after being trained on the source domain.
The target and source domains should have some similarities (e.g., features) for a meaningful adaptation. DA provides an attractive option for Deep Learning (DL), which, more often than not, provides higher performance than shallow learning or classical learning algorithms. DA removes the need for vast amounts of labeled data in the target domain and typically uses the available (labeled) data in the source domain, a boon to data-hungry supervised DL algorithms. Realistically, there is an excessive amount of unlabeled data available, but labeled data is scarce. Several techniques have been tried to improve the performance of deep networks, such as using more labeled data from the target domain, better or alternative architectures and backbones, normalization layers (e.g., Instance Normalization (IN) [2], Batch Normalization (BN) [3]), and data generation and augmentation. By far, DA appears to provide a more robust alternative to all the mentioned techniques.
Initial work on DA is related to shallow (or classical) learning. With DL becoming more prevalent in recent years, the focus of research shifted to DA in DL. The invention of GANs [4] and of attention and attention-based Transformers [5] has boosted various deep DA methods. The research direction and focus now is to solve real-world and practical-setting problems with the latest methods and techniques (e.g., few- or zero-shot learning, self-supervised learning, meta-learning) and with real-world data situations (e.g., multi-modal data, multiple domains, continuous/incremental domains, and data restrictions). This survey does not focus at length on Domain Generalization (DG), a related area where information about the target domain is unknown.
A number of survey papers on DA have been reported. The difference between this and previous works is threefold. First, this survey encompasses various data domains instead of focusing only on a specific (text/image-based) modality. Secondly, the survey is conducted with a primary focus on the applications of DA in these data domains, the challenges faced, and how those can be mitigated using DA. Thirdly, it tries to understand the application of DA approaches across data domains/modalities and also what makes a particular DA approach data-domain specific. In summary, the primary goals of this work are:
1. To provide a joint perspective and recent updates on domain adaptation in five deep learning data domains: visual or Computer Vision (CV), Natural Language Processing (NLP), speech, time-series, and multi-modal domains. Most previous surveys focused only on the visual (CV) or NLP domain and missed areas of cross-pollination. This survey, we believe, discusses DA in multi-modal data settings for the first time. We also aim to understand which DA methods and techniques are specific to a data domain (CV, NLP, speech, time-series, multi-modal) and which are used across data domains.
2. To compile a list of existing and emerging DA datasets and tasks in the five data domains.
3. To review recent DA methods and techniques for more practical DA settings, such as learning with fewer data, learning on the go, continuous adaptation, and the presence of a domain or category gap, across data domains.
4. To understand the challenges and issues that hinder the adoption of DA; based on these challenges and issues, research directions are also provided.
5. To understand and review industrial use-cases where DA has been employed, and to appreciate use-cases where DA, if deployed, would provide rich dividends.
Organization of the paper: A pictorial view of the organization of the paper can be seen in Figure 1. For completeness, the survey first briefly discusses the background, definition, and theory of DA in section II and then discusses DA in shallow or classical learning in section III. DA in DL is discussed in section IV; this section also focuses on more practical DA settings. Datasets used in the five data domains, and observations on them, are presented in section V. Challenges and issues being worked on in this field are discussed in section VI. Section VII looks at common and specific DA use-cases across industries and provides a perspective on how DA can be helpful. Section VIII provides future research frontiers. The paper is concluded in section IX.

II. BACKGROUND
This section aims to succinctly provide the formal definition of DA, the categories of transfer learning and domain adaptation, and a theoretical foundation of domain adaptation.

A. FORMAL DEFINITION OF DOMAIN ADAPTATION
Let there be a source domain D_s, composed of a feature space χ_s and a marginal probability distribution P(X_s), such that D_s = {χ_s, P(X_s)}. There exists a sample set X_s = {x^s_1, x^s_2, ..., x^s_n} with corresponding labels Y_s = {y^s_1, y^s_2, ..., y^s_n} drawn from a label space Υ. Similarly, there is a target domain D_t, composed of a feature space χ_t and data with marginal probability distribution P(X_t), such that D_t = {χ_t, P(X_t)}; it has a sample set X_t = {x^t_1, x^t_2, ..., x^t_n} and corresponding labels Y_t = {y^t_1, y^t_2, ..., y^t_n} from Υ. Sometimes labels in the target domain are unavailable (unsupervised DA), only a few are available (semi-supervised DA), or no target data at all is available (domain generalization or zero-shot DA); supervised versus unsupervised DA thus refers to whether target-domain labels are available for training. A domain shift exists between D_s and D_t. The task in the source domain is T_s = {Υ_s, P(Y_s|X_s)}, and in the target domain T_t = {Υ_t, P(Y_t|X_t)}. If T_s is related to T_t, and a model f learned for the source mapping X_s → Y_s also works for X_t → Y_t with minimal or acceptable error, the model f is said to have adapted from the source domain D_s to the target domain D_t.

B. CATEGORIES OF TRANSFER LEARNING AND DOMAIN ADAPTATION
The seminal work on DA by Pan and Yang [6] mentions that DA is a specific case of transfer learning (TL). The commonality between DA and TL is that some learning based on source domain data is utilized for a task in another domain. Hence, it is beneficial to understand the different instances/types of TL.
Transfer learning approaches can be categorized in two ways: A. based on the feature set and data distributions (refer to Table 1), and B. based on the task difference and the corresponding source and target domain data (refer to Table 2). Figure 2 shows DA categories based on various source and target domain characteristics. DA work typically falls under homogeneous and transductive TL; however, in the recent past there have been reasonable attempts to focus on heterogeneous DA. DA can be categorized based on the availability of labels in the target domain (refer to Table 3 and Table 4).
DA can also be categorized based on the label (class) sets in the source and target domain data (refer to Table 4).
Typically, domain classification represents the scenario where there is only a single source domain and adaptation is to a single target domain (called single-target DA). However, DA to multiple target domains (multi-target DA) has recently been reported, and adaptation from multiple source domains (multi-source DA) has also been researched. Until recently, the DA focus was on reducing the dependency on labeled instances in the target domain; now, researchers are also focusing on reducing the dependency on target-domain data itself. Few-shot DA, single-shot DA, and zero-shot DA are examples of efforts to incrementally reduce the requirement for target domain data. Predictive DA uses metadata in the target domain to adapt. Domain generalization (DG) can be seen as zero-shot DA, but it is bereft of any knowledge about the target; more robust DG methods should, however, also include some essence of multi-target DA and universal DA. DA techniques also address the absence of source data during the DA process, whether due to privacy reasons (federated DA) or plain unavailability (universal source-free DA).
FIGURE 3. Domain adaptation categories plotted based on the availability of annotated data in the source and the target domain (forming a horizontal plane) and the category (class) set difference between source and target (forming the vertical axis); enhanced and adapted from Tommasi [7].
Tommasi [7] (refer to Figure 3) categorized different DA approaches based on the amount of data available and the number of classes in the source and target domain.

C. THEORY OF DOMAIN ADAPTATION
The works of Ben-David and collaborators ([8] and [9]) formulated the theoretical assumptions of the DA problem; they, and later researchers, were interested in how real-world challenges deviate from these theoretical assumptions. Ben-David et al. [9] derived a bound on the DA error (empirical target error) for a semi-supervised case, given in (1). In (1), ε_T(ĥ) is the empirical target error; α is the weight of the linear combination of source and target errors; m is the sample size, with (1−β)m points drawn from the source domain and βm from the target; δ is the confidence probability; d_{HΔH}(U_S, U_T) is the HΔH-divergence (or simply H-divergence) between source and target samples; and λ is the error of the ideal joint predictor.
Researchers involved in the theoretical formulation of DA mention three primary conditions required for DA. 1) Low source error: the model must perform well (have low error) on the source task. 2) Distribution similarity: the source and target distributions must be similar; typically (as in [9]), the H-divergence is used to quantify the difference in distributions. 3) Joint error minimization: DA works to minimize the joint error on source and target.
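For intuition, these conditions mirror the terms of the widely cited (and simpler) single-source bound from [8]; the form below is a hedged restatement for reference, not the exact semi-supervised bound (1):

\epsilon_T(h) \;\le\; \epsilon_S(h) \;+\; \tfrac{1}{2}\, d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{D}_S,\mathcal{D}_T) \;+\; \lambda,
\qquad \lambda \;=\; \min_{h' \in \mathcal{H}} \big[\epsilon_S(h') + \epsilon_T(h')\big]

Low source error, small divergence between the domain distributions, and a small ideal joint error λ correspond directly to the three conditions above.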
However, various works ([8], [9], and [10]) then focused on unraveling why the above conditions are not sufficient to guarantee good DA in the real world. Zhao et al., in their theoretical study [10], concentrated on domain-invariant learning methods and proposed removing the joint error minimization condition mentioned above. Another theoretical basis for DA in DL was offered by Le et al. [11], explaining why it is possible to close the gap between domains in a joint space.

III. DOMAIN ADAPTATION IN SHALLOW (OR CLASSICAL) LEARNING
To grasp DA in DL, it is important first to understand DA in shallow (or classical) learning, which also provides the chronology. Any DA work that does not include DL is considered shallow learning; it covers the DA work done before the use of DNNs became prevalent. The DA methodologies in both shallow and deep learning aim to strengthen the model by using features that are invariant to domains (domain-invariant features) or by transforming the target data into a form/space on which the model was trained, so as to reduce the task error. However, given that most features in shallow learning are handcrafted and explainable, the features can be inspected separately. DA in shallow learning is mostly feature-based (matching, alignment, transformation, or augmentation) and less often instance-based. Csurka [12] provided a comprehensive survey of shallow DA methods in the visual domain. This section extends Gabriela Csurka's work [12] by including frequently used shallow DA strategies in NLP, time-series, and other data domains, along with CV-domain DA methods.
A. FEATURE BASED APPROACHES
Feature-based approaches (refer to Table 5) are prevalent in both shallow (their origin) and deep DA. The main idea behind feature-based approaches (matching/alignment/transformation/augmentation) is to find a shared feature embedding/representation by reducing the data distribution difference while trying to preserve input data properties. From Table 5, we observe two important aspects: 1) Maximum Mean Discrepancy (MMD) [14] is used by multiple methods (Table 5 column ''The criterion (/criteria) for distribution difference / Discriminative Methods'') to measure the distance between the source and target distributions.
If sample sets X_s = {x^s_1, x^s_2, ..., x^s_n} and X_t = {x^t_1, x^t_2, ..., x^t_m} are drawn from distributions P(X_s) and P(X_t) respectively, then MMD is defined in (2):
MMD(X_s, X_t) = || (1/n) Σ_{i=1..n} ϕ(x^s_i) − (1/m) Σ_{j=1..m} ϕ(x^t_j) ||_H    (2)
where H is a universal RKHS and ϕ : χ → H is the corresponding feature map. We see from the definition that: a. MMD is non-parametric, which leads to a closed-form (essentially trivial to compute) empirical estimate. b. MMD depends only on the features and is independent of classes and class labels, and therefore supports unsupervised DA; in semi-supervised, supervised, or pseudo-semi-supervised settings, class-conditioned MMD can be used to further improve DA. Further, the kernel trick (dependency on inner products only) simplifies MMD estimation, since distances between samples can be expressed through inner products, and an inner product can be represented as a kernel. (A minimal numerical sketch of the empirical MMD follows this list.)
2) Use of a Reproducing Kernel Hilbert Space (RKHS): a. When data is transformed into sparse spaces (like an RKHS), the chances that it is linearly separable are high. b. The representer theorem implies that the inner product of samples and the inner product of their RKHS representations coincide; therefore, transforming the samples to the RKHS yields a distribution difference not only in the RKHS but also in the original feature space dimension.
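As referenced in the list above, a minimal NumPy sketch of the empirical MMD with a Gaussian RBF kernel follows; the kernel choice, the bandwidth gamma, and the function names are illustrative assumptions rather than a reconstruction of any specific method from Table 5:

import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # Pairwise Gaussian kernel k(a, b) = exp(-gamma * ||a - b||^2).
    sq_dists = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq_dists)

def mmd2(Xs, Xt, gamma=1.0):
    # Biased estimate of the squared MMD between source samples Xs and target samples Xt.
    return (rbf_kernel(Xs, Xs, gamma).mean()
            + rbf_kernel(Xt, Xt, gamma).mean()
            - 2 * rbf_kernel(Xs, Xt, gamma).mean())

# Synthetic "domain shift": two 2-D Gaussians with shifted means.
rng = np.random.default_rng(0)
Xs = rng.normal(0.0, 1.0, size=(200, 2))   # source samples
Xt = rng.normal(1.5, 1.0, size=(200, 2))   # target samples
print(mmd2(Xs, Xt))                        # larger value => larger distribution gap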

B. INSTANCE RE-WEIGHTING AND SELECTION APPROACHES
Another widely used strategy is instance re-weighting; here the focus is on the input data as a whole rather than on features. The distribution difference is minimized by re-weighting the source data for the task. The instance re-weighting approach is also called instance selection, as it leads to a soft/hard selection of data. Table 6 lists the instance re-weighting and selection approaches. However, the re-weighting strategy does not help much when there is little overlap between the source and target domains: little overlap means that only a small set of source-domain examples is assigned high weights, leading to a sub-optimal classifier because it effectively sees fewer samples. However, in specific scenarios, as mentioned by Jong [23], re-weighting can provide a decision boundary closer to the optimal decision boundary of the target data.
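A common way to obtain such weights, shown here only as a hedged illustration (it is not any specific method from Table 6), is to train a domain classifier and use its probabilities as a density-ratio estimate, so that target-like source instances receive larger weights:

import numpy as np
from sklearn.linear_model import LogisticRegression

def instance_weights(Xs, Xt):
    # Domain classifier: label 0 = source sample, 1 = target sample.
    X = np.vstack([Xs, Xt])
    d = np.concatenate([np.zeros(len(Xs)), np.ones(len(Xt))])
    clf = LogisticRegression(max_iter=1000).fit(X, d)
    p_t = clf.predict_proba(Xs)[:, 1]           # P(target | x) for each source point
    w = p_t / np.clip(1.0 - p_t, 1e-6, None)    # density-ratio estimate p_T(x) / p_S(x)
    return w * (len(w) / w.sum())               # normalize to mean weight 1

# The weights are then passed to the task learner, e.g.
# task_clf.fit(Xs, ys, sample_weight=instance_weights(Xs, Xt)).

Source points that look target-like dominate the re-weighted task loss, which is exactly why the strategy struggles when source and target barely overlap: only a handful of points receive large weights.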

C. HYBRID APPROACHES
Hybrid approaches typically use both feature-based and instance re-weighting methods. An example is Transfer Joint Matching [24], wherein feature matching is done by minimizing MMD in an infinite-dimensional reproducing kernel Hilbert space (RKHS), and instances are re-weighted by minimizing the ℓ2-norm. In the previous subsections, the two DA domains were homogeneous, i.e., χ_s = χ_t; however, DA techniques have also been applied to heterogeneous data (including multi-modal data), where χ_s ≠ χ_t. Primarily, two transformation strategies are seen in heterogeneous shallow DA: symmetric and asymmetric transformation.
When an attempt is made to project the source domain and target domain features into a common (domain-invariant) latent subspace, the learned feature transformation is known as a symmetric transformation. In an asymmetric transformation, either the source features or the target features are transformed and aligned to the target or source features, respectively. An example of symmetric transformation is Heterogeneous Feature Augmentation (HFA) [29]. First, HFA transforms data from both domains into a common subspace (using projection matrices). Then HFA augments the transformed data, using two feature mapping functions, with the original features and zeros. An SVM with hinge loss is applied to the augmented features to learn the projection matrices. Asymmetric Regularized Cross-domain Transformation (ARC-t) [30] is an example of asymmetric transformation. It uses a Gaussian radial basis function (RBF) kernel to learn an asymmetric, non-linear transformation while mapping target data to source data. The authors of ARC-t [30] note that its strength is that it can also be applied to categories that were unavailable during training.
Another perspective, according to Csurka [12], is that multi-view learning is strongly related to heterogeneous DA: multi-view learning solves the task by looking at the features of each view simultaneously, assuming the features are not (much) shared (i.e., χ_viewi ≠ χ_viewj), very similar to heterogeneous DA, where χ_source ≠ χ_target. Domain Separation Networks (DSN) [31] are similar in spirit: private and shared feature spaces are kept orthogonal as far as possible, and the endeavor (as in different co-training strategies) is to split features into two mutually exclusive views. Blum and Mitchell, in their co-training strategy [32], solved an NLP text classification problem using, as one view, the anchor texts of hyperlinks on pages pointing to the page and, as the other view, the text of the page itself; the features are taken to be dissimilar. Because it reduces the bias of predictions on unlabeled data, Ruder [33] mentions that tri-training is one of the best multi-view training methods.

IV. DOMAIN ADAPTATION IN DEEP LEARNING
Since deep neural networks provide high accuracy (or other required metrics) and can deliver state-of-the-art (SOTA) results, their usage in AI and ML applications and tasks has increased. However, these networks also face the domain shift problem: they cannot adapt to data distributions different from the source domain and still provide the same SOTA results. Further, given that deep neural networks require a large amount of labeled data to train, and the availability of labeled data is a concern (it is costly, arduous, or at times infeasible to obtain), DA support for deep neural networks is much needed. Unlike DA in shallow learning, the focus of DA in deep learning is to include DA in the deep learning process and pipeline such that transferable representations are learned. In this direction, the earliest work, Glorot et al. [34], used Stacked Denoising Autoencoders (SDA) on amazon.com product reviews to do sentiment analysis across different products. After that, substantial work has been done in the CV area, with NLP picking up (again) fast in the recent past, primarily due to the availability of transfer learning in NLP using transformer and attention architectures. DA research has now gathered pace to solve real-world problems (like multi-modal data support, data restrictions, and scarcity). Table 7 lists the deep DA methods and approaches and further extends the deep DA categorization of Wang and Deng [35]. However, [35] focused only on deep DA techniques for the visual domain. In contrast, we aim to include more deep DA approaches, including data-domain-specific ones, and to review progress on other existing approaches. Also, our emphasis is on DA in unsupervised settings; supervised, semi-supervised, and pseudo-semi-supervised settings are included for completeness or novelty.

A. DISCREPANCY-BASED METHODS
These methods build on the shallow domain adaptation methods: they map features to a high-dimensional RKHS and measure the discrepancy using metrics like MMD or similar. The difference is that the distribution difference is estimated and aligned using deep features rather than the hand-crafted features of shallow DA methods. Figure 4 shows the typical structure/architecture of networks implementing discrepancy-based methods: a discrepancy metric, a representation of the network (along with a discrepancy metric), or a loss is used to regularize the network. Domain adaptation can happen at single or multiple layers (called adaptation layers in Figure 4).

1) DISCREPANCY METHODS: METRICS-BASED
Deep Domain Confusion (DDC) [45] was the first key idea that jointly optimized the task (classification) and domain confusion. Similar to Figure 4, DDC used two parallel networks, one supervised (with a classification loss) and one without supervision. A domain (confusion) loss is used to adapt two fully connected layers with the idea that the features the network learns should be domain-agnostic, i.e., they should lie in a feature space where domain information is lost while the class information is intact. MMD [14] is used as the discrepancy metric for the domain loss. Extending (2), DDC expressed the joint loss for domain adaptation as
L = L_C(X_L, y) + λ MMD^2(X_s, X_t)    (4)
where L_C is the classification loss on labeled data X_L with labels y and λ weighs the domain confusion term. Equation (4) also helps us understand that the discrepancy metric (MMD or similar) acts like a regularizer for the overall network. Later, many works built on DDC's key idea by using different or similar discrepancy metrics. Discrepancy metrics often used in deep DA are listed in Table 8.
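A hedged PyTorch-style sketch of such a joint objective (a classification loss plus an MMD penalty on an adaptation layer) is shown below; the linear-kernel MMD, the attribute names features/classifier, and the trade-off weight lam are illustrative assumptions, not DDC's exact implementation:

import torch
import torch.nn.functional as F

def linear_mmd2(fs, ft):
    # Squared MMD with a linear kernel: distance between feature means.
    delta = fs.mean(dim=0) - ft.mean(dim=0)
    return (delta * delta).sum()

def ddc_style_loss(model, xs, ys, xt, lam=0.25):
    fs = model.features(xs)            # source features at the adaptation layer
    ft = model.features(xt)            # target features (no labels needed)
    task_loss = F.cross_entropy(model.classifier(fs), ys)
    domain_loss = linear_mmd2(fs, ft)  # the regularizer of Equation (4)
    return task_loss + lam * domain_loss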
The work of Kashyap et al. [46] further segregates the divergences into three classes: geometric (distance between vectors), information-theoretic (distance between probability distributions), and higher-order measures (higher-moment distances between distributions, distances between projections, or distances between representations).

2) DISCREPANCY METHODS: ARCHITECTURE-BASED
In these methods, the focus is more on learning transferable features and architectures than on the metric. The underlying principle of architecture-based discrepancy methods is that the information about the domain change (source to target) is only an affine transformation away, i.e., there exists a small transformation on the weights that can map source features to target features. This small transformation can be an affine transformation or a Multi-Layer Perceptron (MLP) / deep network itself. Deep Adaptation Network (DAN) [40] uses the observation that, in deep convolutional networks, earlier layers learn generic features while later layers learn task-specific features. The authors froze the initial layers, fine-tuned the middle layers, and used a discrepancy-based method, MK-MMD (multiple-kernel MMD), a variant of MMD [13], to adapt the later layers. Typically, discrepancy-based methods align the marginal distributions of source and target data, but there are different approaches too, like the Joint Adaptation Network (JAN) [41]. JAN improved on the DAN architecture by learning the joint distributions of multiple domain-specific layers across domains using the joint maximum mean discrepancy (JMMD) criterion; for this, a representation (ϕ) of the network itself is used.
Similarly, JAN-A [41] builds further on the JAN architecture: another network (θ) computes a representation on top of the network representation (ϕ). Training not only minimizes the JMMD but also learns the network θ; the maximum is taken over the θ network and the minimum over the JMMD, yielding an adversarial (min-max) objective. Computer Vision (CV) also uses normalization layers as a key architectural concept for DA; given that this is specific to CV, it is detailed later (refer to section Normalization Layers). The hypothesis is that the batch normalization (BN) layer represents domain-related knowledge. Transferrable Prototypical Networks (TPN) [47] focus on the discrepancy (distances) for each class in an embedding space across three data settings: source only, target only, and a mix of source and target. TPN also assigns ''pseudo-labels'' to unlabeled target samples. Adaptation is done so that the prototypes of each class are close in the embedding space.

B. ADVERSARIAL METHODS
The idea behind the adversarial set of methods is to enhance domain confusion while still training robustly to understand domain segregation (an adversarial objective). This is closely related to Generative Adversarial Networks (GANs) [4], which include two networks, a generator and a discriminator, in an adversarial setting. The generator aims to produce output (typically images) that fools or confuses the discriminator, while the discriminator tries to separate real from fake. In DA, the borrowed idea is that the discriminator should be able to segregate the domain distributions of the source and target domains (say, by using domain-invariant features). Adversarial Discriminative Domain Adaptation (ADDA) [48] introduced a generic framework (similar to Figure 5) for DA using adversarial models. Typical adversarial discriminative architectures follow a Siamese architecture with a source and a target stream and are trained on a task loss (typically classification) and either an adversarial loss or a discrepancy loss. In contrast, an adversarial generative architecture (in its simplest form) includes a generator that generates a mapping of the first domain (typically source) into the other domain (typically target); the generated mapping and the other domain's data then follow the adversarial discriminative architecture.

1) ADVERSARIAL DISCRIMINATIVE MODELS
One of the seminal works in deep DA is the Domain-Adversarial Neural Network (DANN) [49] (refer to Figure 6), which supports the idea of adversarial domain adaptation, i.e., the learned representation should be discriminative for the task yet encourage domain confusion. The authors showed that any feed-forward model can support adaptation if augmented with a novel gradient reversal layer.
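A minimal PyTorch sketch of such a gradient reversal layer (GRL) is given below, assuming the standard custom-autograd mechanism; the surrounding DANN components (feature extractor, label predictor, domain classifier) are omitted:

import torch

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)                     # identity in the forward pass

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse (and scale by lambda) the gradient flowing back from the
        # domain classifier so the feature extractor learns domain confusion.
        return -ctx.lam * grad_output, None

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)

# features = feature_extractor(x)
# domain_logits = domain_classifier(grad_reverse(features, lam))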
DANN is the most widely used DA approach across all data domains. In CV, DANN was initially used for digit recognition and image classification; later, DANN or its derivatives were also used for more complex tasks like semantic segmentation and object detection. In the case of semantic segmentation, a Siamese network (consisting of two parallel tracks) approach is taken, where one track processes source samples and the other processes target samples. Due to the inherent complexity of the task, domain alignment (the domain classifier, the pink network in Figure 6) is present at various layers/stages, and the convolution layers (input to feature extractors) and deconvolution layers (feature extractors to semantic map) are aligned (shared, mapped, or compared with a statistical metric). Hoffman et al. [50] used two losses in addition to the regular semantic loss: one to adapt category-specific parameters (category-specific adaptation) and the other to reduce the ''global distribution distance'' (global domain alignment). Huang et al. [51] looked at aligning features at each layer of the network.
The Adversarial Discriminative Domain Adaptation (ADDA) [48] model uses a philosophy similar to DANN but differs in that the feature extractors are not shared between source and target, the loss used is a GAN loss rather than DANN's min-max loss, and the training is multi-step. Conditional Domain Adversarial Networks (CDAN) [58] use a conditional discriminator, taking input from both the feature extractor and the classifier. The work of Shen et al. [54], instead of using a pure classifier in the discriminator, used the Wasserstein distance (similar to the Wasserstein GAN of Arjovsky et al. [59]) as the training loss between source and target samples. Inspired by the multi-view strategy, Du et al. [60] proposed Dual Adversarial Domain Adaptation (DADA), with two ''joint'' discriminators supporting all classes of the source and target domains (a 2K-dimensional output), pitted against each other and back-propagating into the feature extractor. They also used a source class predictor to classify source labels and provide pseudo-labels. A recent attempt to improve adversarial discriminative models is Smooth Domain Adversarial Training (SDAT) [61], which shows that reaching a smooth minimum only for the task-specific loss (and not the domain discriminator loss) helps adapt better to the target domain.

2) ADVERSARIAL GENERATIVE MODELS
Adversarial generative models are different from Adversarial discriminative models in that they have a generative component (typically, a generator of GAN) along with the discriminative component of discriminative models. This generative component typically creates synthetic target data from labeled source data. This synthetic labeled target data alleviates the need for labeled examples in target domains. Then the network is trained to assume there is no or little domain shift present in the synthetic data. The source mapping component is the generator that maps the source domain into the target domain. Therefore, colloquially these generators are also known as domain mappers. One of the earliest works in adversarial generative models is Coupled Generative Adversarial Network -CoGAN [60]. As the name suggests, two GANs run parallel, and weight sharing happens in the initial layers for generators and the final layers of discriminators. These layers capture high-level features in discriminators and high-level semantics in generators. This helps the GAN to understand the joint distribution of domains. In CoGAN, the target domain is transformed into the source domain, and then the classification happens.
Typically, DA is specific to a task (shared across the two domains); however, PixelDA [61] used an adversarial generative DA setup to provide a framework that is decoupled from task-related aspects. Also, whereas source images are typically transformed into target-like images, Generate-to-Adapt [62] uses GANs for domain adaptation with the generator creating source-like images for target domain cases. It uses image embeddings (learned during training) as a latent-space auxiliary input to the GAN; the generator creates source-like images, and the discriminator both discriminates the domain (real/fake) and provides class labels. Other examples of adversarial generative models in the speech domain are Park et al. [63] and Augmented Cyclic Adversarial Learning (ACAL) [64].

3) ADVERSARIAL RECONSTRUCTION-BASED METHODS
Another variation of adversarial generative methods is reconstruction-based methods (along the same lines as the shallow feature-matching DA strategy): reconstruction methods typically use adversarial GAN-based networks or Autoencoder (AE) based networks to reconstruct the content of one domain in the style of another. Table 9 provides the key ideas behind some adversarial reconstruction-based methods. Other methods in the literature do not fully comply with the adversarial reconstruction definition but operate very similarly.
• An example in NLP is AE-SCL: Ziser and Reichart [66] brought SCL [16] into the neural networks using Autoencoders; their network is called Autoencoder-SCL or AE-SCL. AE-SCL does not reconstruct the input but predicts if the pivot features will be present in the input or not. They used this for cross-domain sentiment analysis. They further improved AE-SCL using Pivot-Based Language Modeling (PBLM) [67] and Task Refinement Learning using PBLM (TRL-PBLM) [68].
• An example in CV is DiscoGAN: DiscoGAN [69] is very similar to CycleGAN, the difference being that it does not have a cyclic reconstruction loss.
FIGURE 6. The Domain-Adversarial Neural Network (DANN) trains two networks together. DANN trains the feature extractor (green network) and class/label predictor (blue network) on source data, and trains the feature extractor (green network) and domain classifier (pink network) on source and target data. The gradient reversal layer (GRL) lets the forward pass proceed unchanged; during backpropagation, however, it reverses the gradient from the domain discriminator (multiplies it by a negative quantity), which leads the feature extractor (green network) to learn domain-invariant features (domain confusion). λ helps the network first learn the classification features and then slowly learn the domain features. Best viewed in color.

C. MULTI-DOMAIN ADAPTATION
The multi-domain DA setting differs from a typical DA setting in that either there are multiple source domains (called multi-source adaptation) or multiple target domains (called multi-target adaptation).

1) MULTI-SOURCE ADAPTATION
To create more robust domain-adapted models, it makes sense to train the models on multiple sources. Multi-source DA in the pre-deep-learning era was surveyed by Sun et al. [70]. In deep learning, [73] introduced the Multi-source Adversarial Domain Aggregation Network (MADAN), which essentially uses CycleGAN (a sub-domain aggregator discriminator for source domains and a cross-domain cycle discriminator for source-target domains) and creates a latent adapted domain for all source and target data. Similarly, Russo, Tommasi, and Caputo [74] used CoGAN to adapt each source and target domain. Rebuffi et al. [75] used one residual adapter (which sits on the residual branch) for each domain. Yang and Hospedales [76] provided both multi-task and multi-domain perspectives using low-rank tensor methods; this work also provides an alternative to zero-shot learning.
In NLP, Guo et al. [77] introduced DistanceNet-Bandit, where distance metrics (DistanceNet) provide loss functions in addition to the task loss, and a multi-armed bandit dynamically controls switching between multiple domains. Guo et al. [78] used meta-learning to combine predictors from each source-target domain pair.
In time-series, Zhu et al. [79] used a multi-adversarial strategy where multiple source domains (sample of roller bearings) were projected into a shared subspace, and domain invariant features were obtained. Xia et al. [80] introduced a moment matching-based intraclass multisource domain adaptation network, which measures the discrepancy (MMD) between each source domain and target domain samples.

2) MULTI-TARGET ADAPTATION
Typically, DA follows a pairwise approach, with one source domain linked to one target domain. Inspired by [73], Gholami et al. [81] also look for shared information across domains. They propose the Multi-Target DA Information-Theoretic Approach (MTDA-ITA), which uses private and shared spaces between each source-target combination, much like Domain Separation Networks (DSN) [31]. Isobe et al. [82] used multi-target DA for semantic segmentation, training individual source-target models and creating bridges amongst the pairs for collaboration; a student model is then learned from all the individual source-target model pairs, with regularization applied to each pair. A similar knowledge-distillation idea appears in Multi-Teacher Multi-Target DA (MT-MTDA) [83].

D. HYBRID METHODS
Hybrid methods indicate the amalgamation of multiple previously discussed techniques for executing DA.

1) ENSEMBLE-BASED METHODS
Ensemble methods combine the outputs of multiple models, typically by averaging for regression and voting for classification tasks.
The diversity of the models ensures that the deviation from correctness remains small. One of the most significant drawbacks of these models is that they are computationally expensive. Ensemble methods for DA can be segregated into two sub-techniques: pseudo-labeling ensembling and self-ensembling.
In the self-ensembling method, multiple outputs of a single model over time are combined; combining outputs over time is also known as temporal ensembling. French et al. [84] used the teacher-student (mean teacher variant) architecture proposed by Tarvainen and Valpola [85] as a self-ensembling technique for visual DA. The teacher network is first trained on the task and outputs floats (probabilities) instead of Boolean (0-1 integer) labels. The student then learns from the teacher and can learn better because the teacher conveys the nuances. Gradient descent is used to train the student network, while the teacher network's weights are an exponential moving average of the student network's weights. The training loss is a combination of a supervised and an unsupervised component. This architecture dramatically reduces the model parameters without compromising on accuracy metrics. In NLP, [86] also used adaptive ensembling, an extension of temporal ensembling, to classify political data while studying temporal and topic drift; they used a temporal curriculum and a student-teacher network.
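The teacher update in the mean-teacher scheme above is just an exponential moving average (EMA) of the student weights; a hedged PyTorch sketch (the decay value is an assumption) is:

import torch

@torch.no_grad()
def update_teacher(teacher, student, decay=0.99):
    # The teacher's weights track an exponential moving average of the student's;
    # only the student is trained by gradient descent.
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(decay).add_(s_param, alpha=1.0 - decay)

# Per step (schematically):
#   loss = task_loss(student(x_src), y_src) + consistency_loss(student(x_tgt), teacher(x_tgt))
#   loss.backward(); optimizer.step(); update_teacher(teacher, student)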
Another data-centric variant of ensembling is pseudo-labeling ensembling, wherein target-domain labels are assigned based on the combined perspective of the constituent models. If most models agree, i.e., there is high confidence in the label class for a particular target-domain instance, that instance (not a source-domain instance) is used for training the target classifier; hence the name pseudo-labeling. In computer vision, Saito et al. [87] proposed Asymmetric Tri-Training (ATT), in which two networks, first trained on the source domain, provide labels for target-domain instances; if the two networks agree, the pseudo-label is assigned to the target instance, and that data is used to train a third network. Hard final labels do not have to be provided at all times; a probability score can also be used instead (examples: Zou et al. [88] and, to some extent, French et al. [84]).
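A minimal sketch of confidence-based pseudo-label selection is shown below; the single-model confidence threshold is an assumed simplification (ATT instead uses agreement between two source-trained networks):

import torch

@torch.no_grad()
def select_pseudo_labels(model, target_loader, threshold=0.9):
    xs, ys = [], []
    for x in target_loader:                       # unlabeled target batches
        probs = torch.softmax(model(x), dim=1)
        conf, pred = probs.max(dim=1)
        keep = conf > threshold                   # keep only confident predictions
        xs.append(x[keep])
        ys.append(pred[keep])
    # The selected target instances and their pseudo-labels are then used to
    # train the target classifier.
    return torch.cat(xs), torch.cat(ys)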

E. MULTI-MODAL DOMAIN ADAPTATION
Multi-modal is a complex data domain with respect to DA, as the DA process has to take into account the different modality structures and different domain shifts (for each modality). In the case of heterogeneous multi-modal DA, the DA process must also take care of different feature spaces/ feature representations/ dimensions of feature spaces.

1) HOMOGENEOUS MULTI-MODAL DOMAIN ADAPTATION
Most of the work in DA supports homogeneous data, i.e., the feature space remains the same (χ_s = χ_t) but a shift exists because of different data distributions, i.e., P(X_s) ≠ P(X_t). When both the source and target domains have at least two modalities (i.e., are multi-modal) but the feature space (the features fed to the task) is still the same, it is called homogeneous multi-modal DA. A typical homogeneous multi-modal architecture (refer to Figure 7) always implements intra-modality interaction, whereas the inter-modality and inter-domain aspects are optional.
Qi et al. [89] created a multi-modal DA network with attention and fusion modules along with hybrid domain constraints to learn domain-invariant features. The intra- and inter-units in the attention module help capture the relationships among modalities. The bilinear model approach ([90], [91]) was used for fusion, and Tucker decomposition [92] was then used to cope with computational (GPU) restrictions.
For social media event rumor detection, Zhang et al. [93] proposed Multi-modal Disentangled Domain Adaptation (MDDA), which addresses two challenges: entanglement and domain shift. The first is handled by disentangling event content from rumor style; the second is tackled afterwards using only the rumor style retained from the first step. The network thus learns only a transferable rumor style by aligning feature distributions over different events.
Multi-Modal Self-Supervised Adversarial Domain Adaptation (MM-SADA) [94] uses two modalities, optical flow and RGB, of the EPIC-Kitchens video dataset and investigates whether fine-grained action recognition (which depends highly on the environment) can be improved across dataset domains. The authors used self-supervision across the two domains with both modalities and adversarial adaptation between each modality of the source and target data (i.e., one discriminator for RGB and one for optical flow).
Li et al. [95] look at DA amongst multiple modalities from domains (scripted source, improvised source). They use an emotion recognition model based on adversarial training (which helps to remove domain difference between emotion elicitation approaches) and a soft label loss approach (which helps to understand non-rigid emotions and to consider emotion and domain categories simultaneously).

2) HETEROGENEOUS MULTI-MODAL DOMAIN ADAPTATION
Heterogeneous multi-modal data is among the most prevalent real-world data; as deep networks use more heterogeneous multi-modal data, it is imperative to study DA in heterogeneous multi-modal settings. For heterogeneous data, DA is carried out by extracting the features of the two domains with separate networks and handling the task-level layers either by sharing weights (strong parameter sharing) or with weakly parameter-shared weights, as in the work of Shu et al. [96].
Heterogeneous multi-modal DA becomes especially important when one of the modalities is missing in the target domain: the source domain may have modalities m1 and m2, while the target domain may contain only m3, with m4 missing. Ding et al. [97] solve a real-world 'Missing Modality Problem' by introducing Missing Modality Transfer Learning via latent low-rank constraint (M2TL). The transfer of learning is two-fold: from one database to another (cross-database transfer) and from the source modality to the target modality (cross-modality transfer). They use a low-rank matrix constraint to learn a subspace within a database across modalities and MMD to couple the databases in the source domain (known modalities).
Conditional adversarial domain adaptation [58] uses Conditional Domain Adversarial Networks (CDAN), a variant of the adversarial discriminative model that assists adversarial adaptation by employing the discriminative information contained in the classifier predictions. The discriminator is conditioned on the cross-covariance of domain-specific feature representations and classifier predictions. CDAN can adapt to multi-modal data distributions and also supports higher-dimensional scenarios (via a variant called Randomized Multilinear (RM) conditioning).
Athanasiadis et al. [98] present Domain Adaptation Conditional Semi-Supervised Generative Adversarial Networks (dacssGAN) in the realm of emotion recognition, where domains (audio, video) are heterogeneous and multi-modal. The network uses GANs and conformal prediction techniques [99] to implement DA.
Seo et al. [100] aim to improve audio-visual sentiment analysis performance using text modality during the training phase by ''transferring knowledge'' of unimodal (text modality) to other modalities (audio and visual). The knowledge transfer employs the reduction of distribution differences of feature representation in data for each modality.
In NLP, cross-lingual translation also falls under heterogeneous tasks, as the words and constructs of the two languages are very different, leading to the assumption that the input features do not match, i.e., χ_s ≠ χ_t. Various attempts, including Conneau et al. [101], have been made to support cross-lingual DA as an unsupervised task; however, Søgaard et al. [102] showed that the underlying assumption that word embedding spaces are isomorphic across languages is incorrect. They further suggested that a weakly supervised solution outperforms unsupervised cross-lingual DA (the metric used was bilingual dictionary induction scores). Conneau et al. [103] mention that pre-trained models (discussed later in the section Pre-Trained Models) achieve better results in unsupervised cross-lingual representation learning tasks. Generative adversarial text-to-image synthesis [104] provided a way to generate an image from text, translating visual concepts from characters to pixels using a convolutional-recurrent neural network. Along similar lines, StackGAN [105] also created photo-realistic images from text in two stacked stages.

F. DOMAIN ADAPTATION IN COMPUTER VISION (CV)
This section focuses on DA strategies typically only seen in the computer vision data domain and not shared with other data domains.

1) NORMALIZATION LAYERS
Normalization layers help maintain a stable training of neural networks and are used in nearly all neural networks. A few examples of normalization layers in regular neural networks are batch normalization or batchnorm [3], layer normalization or layernorm [106], instance normalization or instancenorm [2], and group normalization or groupnorm [107].
Chang [108] created a DA framework using a domain-specific batch normalization layer; the other model parameters were shared between domains. Li et al. [109] proposed the Adaptive Batch Normalization (AdaBN) layer. The intuition behind these layers is that they learn domain knowledge, in contrast to weights learning task knowledge and biases learning some sort of priors. Carlucci et al. [110], in Auto-DIAL, built further on the AdaBN layers of [109] and used DA layers amongst the standard CNN layers. The purpose of these layers was to normalize the target and source mini-batches (separately for the two domains) but with the two influenced by each other through a parameter learned as part of the training process.
FIGURE 8. Typical self-supervision network structure. It is a multi-task network and includes an auxiliary task which aims at understanding the feature distribution; it does not impact the core DA task but ''provides'' knowledge of sorts to it. Best viewed in color.
Roy et al. [111] proposed Domain-specific Whitening Transform (DWT) domain alignment layers to compute intermediate feature covariance matrices, along with a Min-Entropy Consensus (MEC) loss (a merger of entropy and consistency losses) for coherent predictions across samples.
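The AdaBN intuition above can be sketched very simply: keep every learned weight fixed and re-estimate only the batch-normalization running statistics on (unlabeled) target data. A hedged PyTorch illustration (one possible realization, not the authors' code):

import torch

@torch.no_grad()
def adapt_bn_statistics(model, target_loader, passes=1):
    # Reset BN running statistics, then re-estimate them on target-domain data
    # while convolution weights and BN affine parameters stay untouched.
    for m in model.modules():
        if isinstance(m, torch.nn.modules.batchnorm._BatchNorm):
            m.reset_running_stats()
            m.momentum = None                     # use a cumulative moving average
    model.train()                                 # BN updates its stats only in train mode
    for _ in range(passes):
        for x in target_loader:
            model(x)
    model.eval()
    return model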

2) SELF-SUPERVISION METHODS
Self-supervision DA methods jointly train an auxiliary self-supervision task alongside the main task and are therefore also aligned with multi-task learning. The Deep Reconstruction Classification Network (DRCN) [65] used a deconvolution network to reconstruct the image (the auxiliary self-supervised task) while the convolution network performed label prediction (the main task). The feature mapping parameters were shared in DRCN, much as in Figure 8. The intuition is that the main task receives knowledge transfer from the auxiliary task.
Carlucci et al. [112] used the auxiliary task of jigsaw puzzle solving (predicting the permutation index) while solving the main task, as a DA/DG strategy. Note that the auxiliary task is typically unsupervised, whereas the main task is supervised. Xu et al. [113] further increased the number of auxiliary tasks (image rotation prediction, flip prediction, and patch location prediction), further underlining that low-level tasks (like pixel-level reconstruction/prediction) are not very useful for DA, whereas high-level structural tasks (like predicting the rotation of part of an image) are very useful. Kim et al. [114] showed that the self-supervision technique is useful even with few labeled instances in the source domain. They used within-domain instance discrimination (in-domain self-supervision) and cross-domain matching (across-domain self-supervision) to learn features that are both domain-invariant and discriminative.
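A hedged sketch of one auxiliary task from this family, rotation prediction with a shared feature extractor and separate heads, is given below; the head names and the loss weighting are illustrative assumptions:

import torch
import torch.nn.functional as F

def rotate_batch(x):
    # Auxiliary task: rotate each image by 0/90/180/270 degrees and predict which.
    rots = torch.randint(0, 4, (x.size(0),), device=x.device)
    x_rot = torch.stack([torch.rot90(img, int(k), dims=(1, 2))
                         for img, k in zip(x, rots)])
    return x_rot, rots

def multi_task_loss(features, heads, xs, ys, xt, aux_weight=0.5):
    # Main (supervised) task on labeled source data.
    main = F.cross_entropy(heads["cls"](features(xs)), ys)
    # Auxiliary (self-supervised) task on unlabeled target data.
    xt_rot, rot_labels = rotate_batch(xt)
    aux = F.cross_entropy(heads["rot"](features(xt_rot)), rot_labels)
    return main + aux_weight * aux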

G. DOMAIN ADAPTATION IN NATURAL LANGUAGE PROCESSING (NLP)
This section focuses on DA strategies typically seen only in the NLP data domain and not shared with other data domains. Most of the work in DA has been done in the CV area, though the origins of DA lie in NLP. For example, DANN [49] was initially applied to sentiment classification but was later used for computer vision classification tasks. Ramponi and Plank [115] categorize NLP domain adaptation models into model-centric, data-centric, and hybrid. Model-centric approaches (which focus on augmenting the feature space, tinkering with loss functions, and changing the model architecture), discussed before, have also been used in computer vision and other applications. Pre-trained models are data-centric models and are discussed below; hybrid models are discussed in the section Hybrid Methods.

1) PRE-TRAINED MODELS
The data-centric models are not shared with computer vision tasks, perhaps because these models focus on data elements, which differ between computer vision and NLP, to support adaptation. These models are less prevalent but have lately picked up the interest of researchers. BERT (Devlin et al. [116]) was a model that revolutionized transfer learning; related approaches include pseudo-labeling, pre-training (zero-shot; example: Multilingual BERT), and fine-tuning (including multi-phase; examples: SciBERT [117], BioBERT [118]). Figure 9 shows a typical pre-training strategy, and Table 10 lists different pre-training data and strategies. Strictly by the DA definition, pre-training and fine-tuning are not DA processes, but these transformer-based language models are task-agnostic in the sense that they can be fine-tuned on specific tasks using a small dataset; they are included in this survey for completeness.
AdaptaBERT [119] used a two-step approach for domain-adaptive fine-tuning. In the first step, they performed domain tuning by taking contextualized word embeddings (over unlabeled source and target domain data) and maximizing the probability of masked tokens. In the second step, they focused on task tuning by taking labeled source data and backpropagating for the desired task (PoS tagging in this case).
FIGURE 9. Typical pre-training strategy. Pre-training is typically task-agnostic; further steps are required to adapt the model to the task in question. An optional multi-step pre-training is done to reduce the data distribution gap between source and target data. Best viewed in color.
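Schematically, this two-step recipe can be sketched with the Hugging Face transformers API; this is a rough illustration under the assumption that unlabeled_texts, labeled_source_dataset, and num_pos_tags already exist, not the authors' code:

from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          AutoModelForTokenClassification,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

tok = AutoTokenizer.from_pretrained("bert-base-cased")

# Step 1: domain tuning -- continue masked-language-model training on
# unlabeled source + target text so the encoder adapts to the target domain.
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-cased")
mlm_dataset = [tok(t, truncation=True) for t in unlabeled_texts]   # assumed corpus
domain_trainer = Trainer(
    model=mlm,
    args=TrainingArguments(output_dir="domain-tuned", num_train_epochs=1),
    train_dataset=mlm_dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer=tok, mlm_probability=0.15),
)
domain_trainer.train()
domain_trainer.save_model("domain-tuned")

# Step 2: task tuning -- fine-tune the domain-tuned encoder on labeled source
# data for the downstream task (PoS tagging here; dataset is assumed).
tagger = AutoModelForTokenClassification.from_pretrained("domain-tuned",
                                                         num_labels=num_pos_tags)
Trainer(model=tagger,
        args=TrainingArguments(output_dir="task-tuned", num_train_epochs=3),
        train_dataset=labeled_source_dataset).train()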

2) MULTI-VIEW LEARNING
Another NLP-specific DA technique is multi-view training (also discussed briefly under heterogeneous DA). Different views of the data are used to train different models. The views differ from each other in one (or a combination) of the following dimensions: 1) the architecture of the models, 2) the features, and 3) the data used for training. The philosophy behind multi-view training is that the views complement each other, and the collaborating models improve each other's performance. Examples of multi-view training are Co-Training [31], Democratic Co-Training [129], and Tri-Training [130].

H. DOMAIN ADAPTATION IN SPEECH
In speech domain adaptation tasks, the focus is first to identify which elements of the data are actually speech and not noise; for the elements identified as speech, the focus is then either speech recognition, called Automatic Speech Recognition (ASR), or adaptation to a speaker. Text-to-speech (TTS) is a multi-modal variety where the output modality (space) is speech. The DA strategies typically employed are discrepancy-based ([131]) (refer to section Discrepancy-Based Methods), adversarial-based ([132], [133]) (refer to section Adversarial Methods), pseudo-semi-supervised training based ([131]) (refer to section Pseudo-Semi-Supervised Domain Adaptation), and knowledge-distillation based ([134], [135]) (ensemble-based or teacher-student methods; refer to section Ensemble-Based Methods).
One speech-specific strategy is the work by Zhang [136], where the DNN model is first pre-trained using unlabeled target-domain data, and labeled source data is then used to fine-tune the network. The intuition behind the pre-training is to seek a shared representation.

I. DOMAIN ADAPTATION IN TIME-SERIES
Typically, the tasks prevalent in time-series DA are classification (generally two-class classification) and forecasting (predicting based on past time-stamped information). Further, the problems solved are univariate and multivariate, i.e., involving multiple time-stamped variables used for prediction, e.g., pressure, temperature, and flow rate predicting a fault in a power station. Jin et al. [137] describe the complexity of time-series DA as two-fold: 1) Varying input and output spaces: the output space of the source domain time-series (say, the flow rate in a power station) may differ from the output space of the target domain time-series (say, a count of units in a warehouse). Hence, it is imperative that not only domain-invariant features but also domain-specific features be captured, as in the Domain Adaptation Forecaster (DAF) [137]. Similarly, the input spaces may differ.

2) Dependence on different time period subsets:
The outcome (classification/forecasting) may not be captured by the overall history representation; in all likelihood, a subset of the overall time-period representation impacts the outcome. A survey on sensor time-series [138] mentions that the strategies used for time-series DA bear much resemblance to non-time-series DA, with two strategies specific to time-series DA: input space adaptation and output space adaptation.

1) INPUT SPACE ADAPTATION
In the input space DA strategy, the impetus is to use/generate the source domain samples which resemble the target domain samples, much like reconstruction-based methods. Typically, prior knowledge (Wang et al. [139]) or GANs (Contra-GAN [140]) are used in this strategy.

2) OUTPUT SPACE ADAPTATION
The output space DA strategy is used both for classification and forecasting (DAF [137]). For classification (Yang et al. [141]), high-confidence labels on the target domain are selected for training, analogous to pseudo-semi-supervised training (refer to section Pseudo-Semi-Supervised Domain Adaptation). For forecasting, domain-specific features are used (the values of the transformer network in DAF [137]).

J. EMERGING DOMAIN ADAPTATION FOR PRACTICAL SETTINGS AND REAL-WORLD CHALLENGES
Some models and techniques available in the literature do not fit into existing categories, have gained a lot of traction, and are, to some extent, very innovative and adapted to more practical settings and/or real-world challenges. These emerging DA techniques are mentioned below.

1) FEW-SHOT DOMAIN ADAPTATION
The challenge with few-shot DA is that there is not enough target data to conclusively satisfy the simultaneous requirements of DA: domain confusion and representation alignment between the two domains. One of the first works on few-shot DA is Motiian et al. [142], who introduced Few-Shot Adversarial Domain Adaptation (FADA) using adversarial learning with a focus on speed of adaptation. They alleviated the difficulty mentioned above by grouping source and target samples into four categories based on domain and class labels (see the sketch below), and the classifier then worked on these four categories instead of the standard two. Further, they initialized the network (feature extractor and label classifier) using source data only, then updated the domain-class discriminator (freezing the feature extractor). Finally, they froze the domain-class discriminator and updated the feature extractor and label classifier.
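The four pair categories can be sketched as below; the group numbering follows one common convention and may differ from the exact ordering used in FADA [142].

```python
# Sketch: FADA-style grouping of a pair of samples by domain and class agreement.
def pair_group(domain_a, domain_b, label_a, label_b):
    """Return the pair group id: 1 same-domain/same-class, 2 cross-domain/same-class,
    3 same-domain/different-class, 4 cross-domain/different-class."""
    same_domain = (domain_a == domain_b)
    same_class = (label_a == label_b)
    if same_domain and same_class:
        return 1
    if not same_domain and same_class:
        return 2
    if same_domain and not same_class:
        return 3
    return 4
```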
Domain-Adaptive Few-Shot Learning (DA-FSL) [143] looks to solve an even more complex problem related to few-shot learning, i.e., target data may have classes that come from a different domain. The focus of the domain-adversarial prototypical network (DAPN) in DA-FSL is to attain alignment of the global domain distribution while keeping class discriminativeness intact by introducing new losses (domain discrimination, domain confusion, classification). The losses are weighted using an adaptive re-weighting mechanism. Another novel aspect is the use of attention before the embedding of the source.
Further, Yue et al. [144] proposed an end-to-end few-shot domain adaptation method that includes self-learning (called the Prototypical Cross-domain Self-Supervised Learning (PCS) framework) and is unsupervised. The main idea is that knowledge transfer from source to target relies on finding similarities between an instance and a prototype (representative), making the transfer more robust; a rough sketch of this idea is given below.
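A rough sketch of the instance-to-prototype similarity idea follows; the actual PCS framework [144] builds prototypes with memory banks and momentum updates, which are omitted here.

```python
# Sketch: class prototypes from source features and cosine matching of target instances.
import torch

def class_prototypes(features, labels, num_classes):
    """Mean (L2-normalized) source feature per class; assumes every class appears in the batch."""
    protos = torch.stack([features[labels == c].mean(dim=0) for c in range(num_classes)])
    return torch.nn.functional.normalize(protos, dim=1)

def assign_to_prototypes(target_features, prototypes):
    """Cosine similarity of each target instance to each class prototype."""
    t = torch.nn.functional.normalize(target_features, dim=1)
    sims = t @ prototypes.T                      # shape: (N_target, num_classes)
    return sims.argmax(dim=1), sims
```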

2) ZERO-SHOT DOMAIN ADAPTATION
Zero-shot DA is a complex scenario because actual target domain data is not present at training time; only some information about it (typically target metadata) is available. Zero-shot DA differs from DG in that DG does not have any information about the target data, not even the metadata.

1) Zero-Shot Learning (usage of task-irrelevant data):
For the computer vision task, Peng et al. [145] used information in task-irrelevant data (domain pairs) to help the network understand information about the unavailable task-relevant target domain.

2) Zero-Shot Learning (new labels in the target domain):
The intention is to learn ''different'' class labels in the target domain, given labels in the source. This is not genuinely a DA scenario, as the label domain differs between source and target. An example mentioned in Kodirov et al. [146] is that the label ''Polar Bear'' can be represented as embedding vectors of 'has fur,' 'is white,' and 'eats fish.' Any semantic embedding close to these embedding vectors can help label effectively (a minimal sketch of this nearest-embedding labeling follows).
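A toy sketch of such nearest-semantic-embedding labeling is given below; the attribute vectors and the mapping from an input to the embedding space are hypothetical placeholders.

```python
# Sketch: assign the label whose semantic (attribute) embedding is closest to the input embedding.
import torch

def zero_shot_label(input_embedding, class_attribute_vectors):
    """class_attribute_vectors: dict mapping label -> semantic embedding tensor."""
    labels = list(class_attribute_vectors.keys())
    attrs = torch.stack([class_attribute_vectors[l] for l in labels])
    attrs = torch.nn.functional.normalize(attrs, dim=1)
    q = torch.nn.functional.normalize(input_embedding, dim=0)
    return labels[int((attrs @ q).argmax())]      # nearest class in embedding space
```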

3) LABEL SET DIFFERENCE IN DOMAINS
This perspective helps to close the category (label) gap in DA: the target label set may contain more classes than the source (open-set) or fewer (partial). The typical DA scenario, where the label sets in source and target are the same, is called closed-set DA. The solution that supports both open-set and partial settings is called universal domain adaptation. The approach in [153] is typically to appreciate two elements: domain similarity (which helps to understand whether the task can be supported) and prediction uncertainty. Domain similarity identifies samples coming from similar labels, while prediction uncertainty identifies the unknown class (a simplified sketch of uncertainty-based rejection follows). The approach further includes aspects of the partial domain adaptation strategies by the same research group and supports all settings (closed/partial/open-set variations). The training tries to find an optimal probability (that the sample belongs to a source class) that can help decide whether the sample can be worked on; otherwise, it is marked as unknown. Kundu et al. [154] support universal DA by using a proxy for the unobserved class (a hypothetical negative class), which helps in class separability.
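A much-simplified sketch of rejecting target samples as ''unknown'' by prediction uncertainty is shown below; the actual universal DA criterion in [153] also incorporates domain similarity, which is not reproduced here.

```python
# Sketch: mark high-entropy (uncertain) predictions as the "unknown" class.
import torch

@torch.no_grad()
def predict_or_reject(model, x, entropy_threshold=1.0):
    probs = torch.softmax(model(x), dim=1)
    entropy = -(probs * probs.clamp_min(1e-8).log()).sum(dim=1)
    preds = probs.argmax(dim=1)
    preds[entropy > entropy_threshold] = -1       # -1 marks the "unknown" class
    return preds
```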

4) CONTINUOUS / SEQUENTIAL / INCREMENTAL DOMAIN ADAPTATION
In a representative DA setting, the source data and target data are available at training time. However, in real-world settings, target data may become available progressively as DA testing proceeds over time, or the target domain itself may change.
In these settings, continuous (or sequential or incremental) DA is imperative.

1) Online domain adaptation:
In the work of Mancini et al. [155], continuous domain adaptation is performed using batch normalization for unsupervised domain adaptation. Network parameters are shared between source and target (online), except for the batch normalization parameters, which are updated on the go (over time). This online DA strategy was used in a robotics setting where objects were lit differently in different environments; a minimal sketch follows this list.
2) Predictive and online domain adaptation:
For unsupervised learning scenarios, Mancini et al. in AdaGraph [156] focused on a predictive domain adaptation scenario with an online learning component.
The system learns by generalizing from annotated source images alongside unlabeled samples (with associated metadata) from secondary domains. AdaGraph is used to understand the domain-specific parameters and provides them to the batch normalization layers as part of predictive DA.
3) Continuously changing domains:
Sometimes, the task involved is such that domains vary continuously (e.g., a self-driving car driving on a sunny day when it suddenly starts raining); we cannot treat the shift as discrete or static domains. Continuous Unsupervised Adaptation (CUA) [157] learns to adapt to a new distribution without deviating (via replay) from how it performed on previous distributions. CUA has an element of adaptation (Adapt Module) and memory (Replay Module, used to replay if the same domain is encountered again).
4) Continuously indexed domain adaptation:
One drawback of existing DA techniques is that they transfer knowledge between categorical (A and B) domains. However, many real-world tasks involve continuously indexed domains. Continuously Indexed Domain Adaptation (CIDA) [158] uses a discriminator that models the domain index distribution conditioned on the encoding. A variant of CIDA, Probabilistic CIDA (PCIDA), provides the mean and variance for the domain instead of a single predicted domain index.
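A minimal sketch of the statistics-only online adaptation from item 1 is given below: all weights are frozen and only the batch-normalization running statistics are refreshed from incoming target batches. The model and target stream are placeholders, and this is only an approximation of the method in [155].

```python
# Sketch: online adaptation by refreshing BatchNorm running statistics on target data.
import torch
import torch.nn as nn

def adapt_bn_statistics(model, target_stream, momentum=0.1):
    model.train()                                 # BN layers update running stats in train mode
    for m in model.modules():
        if isinstance(m, (nn.BatchNorm1d, nn.BatchNorm2d)):
            m.momentum = momentum
    for p in model.parameters():
        p.requires_grad_(False)                   # no gradient updates; statistics only
    with torch.no_grad():
        for x_t in target_stream:                 # unlabeled target batches arriving online
            model(x_t)                            # forward pass refreshes BN running mean/var
    model.eval()
    return model
```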

5) OPEN COMPOUND DOMAIN ADAPTATION (OCDA)
At times, there are no clear boundaries between the source and multiple target domains. Liu et al. [159] concentrated on open compound domain adaptation (OCDA), where the target domain is a composite of numerous unlabeled and homogeneous domains. To bootstrap generalization, they used curriculum domain adaptation in a data-driven, self-organizing fashion, proceeding from easy to hard based on domain gaps. OCDA also separates characteristics that are discriminative between classes from those specific to domains. The curriculum for domain-robust learning is constructed from the teased-out domain features. Further, the use of memory modules increases support for new domains. Knowledge transfer happens from the source domain to target domain instances, and the network can dynamically balance memory-transferred knowledge and input information.
If the new domain is close to a source domain, the method works as a typical domain adaptation; in case of a larger difference, the memory module helps. A loose sketch of an easy-to-hard ordering follows.
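A loose illustration of an easy-to-hard ordering of target samples is given below; the real OCDA curriculum [159] is built from learned domain-specific factors rather than this simple distance-to-source heuristic.

```python
# Sketch: rank target samples from "easy" (close to the source feature center) to "hard".
import torch

def easy_to_hard_order(source_features, target_features):
    center = source_features.mean(dim=0)
    gaps = (target_features - center).norm(dim=1)   # smaller gap = "easier" sample
    return torch.argsort(gaps)                      # indices ordered easy -> hard
```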

6) SOURCE DATA RESTRICTIONS
There are conditions where data privacy is a concern or source data is not available. A DA model that relies on little or no source data (after model creation) is a boon in those conditions. For example, Source Hypothesis Transfer (SHOT) by Liang et al. [184] uses only the source model instead of the source data. It aligns the source model with target data by learning target-specific features (using information maximization and self-supervised pseudo-labeling); see the sketch after this paragraph. Universal source-free domain adaptation [154] and federated domain adaptation [160] also aim to support DA where the availability of source data during training is not assured. Kundu et al. [154] support universal DA (closed, open-set, and partial domain adaptation) and use synthetically generated hypothetical negative classes, which act as a proxy for the unobserved class and provide knowledge of class separability and the category gap. In federated domain adaptation [160], model parameters are trained for each source node separately, converging at different speeds. The use of dynamic attention helps understand the weightage of each source model. Federated domain adaptation also uses the concepts of domain alignment, domain disentanglement, and mutual information minimization.
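A hedged sketch of the information-maximization objective used in source-free adaptation such as SHOT is shown below; the weighting between terms and SHOT's self-supervised pseudo-labeling branch are omitted.

```python
# Sketch: information maximization = confident per-sample predictions + diverse batch predictions.
import torch

def information_maximization_loss(logits, eps=1e-8):
    probs = torch.softmax(logits, dim=1)
    # per-sample entropy: encourage confident predictions on target data
    ent = -(probs * (probs + eps).log()).sum(dim=1).mean()
    # diversity: entropy of the mean prediction, which we want to be high
    mean_p = probs.mean(dim=0)
    div = -(mean_p * (mean_p + eps).log()).sum()
    return ent - div                              # minimize entropy, maximize diversity
```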

7) SELF-SUPERVISED LEARNING IN DOMAIN ADAPTATION
Self-supervised learning (including for domain adaptation) is typically a two-step sequential process. The first step involves unsupervised learning from a pretext task (in CV: rotation, image reorganization, inpainting, colorization, etc.), which is used to understand intrinsic domain information (in CV: say, the semantic information of images in a particular domain); a compact rotation-prediction sketch follows. In the second step, this learning is applied to a new task, which further broadens it. Bucci et al. [161] implemented a similar process for object recognition across domains. The first task broadens the previous supervised learning of semantic labels, and the second task focuses on understanding the structure of the objects and their orientation. Given that label bias does not affect self-supervised learning, it can be used in partial (Bucci et al. [162]) and open-set (Bucci et al. [163]) DA areas.
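As an example of a pretext task, a compact rotation-prediction sketch is given below; the backbone and rotation head are illustrative placeholders and not the specific architecture of [161].

```python
# Sketch: rotation-prediction pretext task on unlabeled images (NCHW tensors).
import torch
import torch.nn as nn

def rotate_batch(images):
    """Rotate each image by 0/90/180/270 degrees and return rotated images with rotation labels."""
    rotated, labels = [], []
    for k in range(4):
        rotated.append(torch.rot90(images, k, dims=[2, 3]))
        labels.append(torch.full((images.size(0),), k, dtype=torch.long))
    return torch.cat(rotated), torch.cat(labels)

def rotation_pretext_loss(backbone, rotation_head, images):
    x, y = rotate_batch(images)
    return nn.functional.cross_entropy(rotation_head(backbone(x)), y)
```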

8) META-LEARNING IN DOMAIN ADAPTATION
Meta-learning (or learning-to-learn) refers to algorithms that learn from the output of other algorithms. These sit one level above the standard task algorithms (they can be visualized as outer-loop algorithms) and are vital in model selection and tuning. Li and Hospedales [164] implemented meta-learning for semi-supervised DA and multi-source DA; they also mention that meta-learning could be used for good initialization. Meta-learning in DA improves evaluation metrics by 0.7% (DANN) to 2.5% (MCD). Another example, in the speech domain, is the adaptation of generative dialogue systems to unseen domains: Ribeiro et al. [165] improved the adaptation of DiKTNet (a dialogue model) to unseen domains using meta-learning. Meta-learning also finds use in domain generalization ([166], [167]).

9) PSEUDO-SEMI-SUPERVISED DOMAIN ADAPTATION
This set of methods involves taking a subset of unlabeled target domain data and labeling it before the start of the ''core'' DA process (refer to Figure 10). Therefore, for the ''core'' DA process, there exists a subset of target domain data that is labeled, hence the name pseudo-semi-supervised DA. Note that the initial labeling of the unlabeled target domain data may be accurate or inaccurate and is further refined during the ''core'' DA process.

1) Active Learning in Domain Adaptation (Active DA):
While DA attains excellent results, the performance of DA methods often falls far behind their supervised counterparts. In such cases, active domain adaptation (Active DA) has recently gained a lot of interest. In the Active DA method, a subset of target samples is annotated, which further helps to improve the performance of the ''core'' DA. The focus is on selecting samples that not only capture the diversity of target data but also represent its complexity. Su et al. [168], in Active Adversarial Domain Adaptation (AADA), used selection criteria based on a diversity cue (dependent on the optimal discriminator in an adversarial setting) and an uncertainty cue (dependent on cross-entropy, a proxy for empirical risk). They showed superior performance for digit recognition and object detection tasks. Prabhu et al. [169] further improved on the basic active learning techniques of diversity and uncertainty cues by proposing Clustering Uncertainty-weighted Embeddings (CLUE). They weighted and selected samples; here, diversity was supported by clustering and uncertainty by entropy weighting. They surpassed the previous active-learning-based SOTA (i.e., AADA) results in digit recognition and object detection.
2) Pseudo-labeling in domain adaptation:
Unlike active learning, pseudo-label DA applies the model trained on labeled source data to a batch of unlabeled target data to predict labels/annotations. Here the labels/annotations on target data are not accurate but a reflection of the labeled source data. Thereafter, one technique is to train a new model with the labeled source data and the pseudo-labeled target data. However, this method has the inherent weakness of propagating noisy (incorrect) labels. In CV, Kim and Kim [170] worked on abating the noisy-label problem by implementing a joint optimization framework, i.e., iteratively updating the model (network) and the pseudo-labels. In NLP, Wang et al. [171] used Generative Pseudo Labeling (GPL) for query-passage extraction: they retrieved positive passages from labeled data and applied that model to retrieve negative passages in target data. Thereafter, they used a Margin-MSE loss, which helped the cross-encoder soft-label query-passage pairs effectively. They then used the soft-labeled pairs for the core task.
In time-series, as part of the output space strategy, Yang et al. [141] selected high-confidence labels on the target domain for training. The Moving Semantic Transfer Network (MSTN) [174] looked to align the centroid of each class in both the labeled source and pseudo-labeled target data; a simplified sketch follows. Chen et al. [175], in the Progressive Feature Alignment Network (PFAN), formulated an easy-to-hard strategy (ETHS) and used only easy samples for the downstream network (Adaptive Prototype Alignment, APA). ETHS and APA were then used iteratively until convergence for best results.
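A simplified sketch of per-class centroid alignment between labeled source features and pseudo-labeled target features, in the spirit of MSTN [174], is given below; the exponential-moving-average centroid update of the original method is omitted.

```python
# Sketch: align per-class feature centroids of source (true labels) and target (pseudo-labels).
import torch

def centroid_alignment_loss(src_feat, src_labels, tgt_feat, tgt_pseudo, num_classes):
    loss, used = 0.0, 0
    for c in range(num_classes):
        s_mask, t_mask = src_labels == c, tgt_pseudo == c
        if s_mask.any() and t_mask.any():
            # squared distance between the two class centroids
            loss = loss + (src_feat[s_mask].mean(0) - tgt_feat[t_mask].mean(0)).pow(2).sum()
            used += 1
    return loss / max(used, 1)
```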

V. DATASETS USED IN DOMAIN ADAPTATION
This section captures the existing and emerging datasets used for DA across the CV, NLP, speech, time-series, and multi-modal data domains. One observation is that researchers use very few benchmark DA datasets, and the research is done on a very narrow set of tasks.

A. COMPUTER VISION (CV) DATASETS
In Computer Vision (CV), most of the DA work has been done on digit recognition and image classification. Complex CV tasks (like pose estimation) are now getting traction. Table 11 lists common CV datasets used in DA in recent times.

B. NATURAL LANGUAGE PROCESSING (NLP) DATASETS
Most of the domain adaptation work in NLP has happened for the sentiment analysis task. In recent years, more tasks have been explored. Table 7 lists common NLP datasets used in DA in recent times.

C. SPEECH DATASETS
Table 8 provides a list of common speech datasets used in DA in recent times. We can see that most of the speech data domain DA work has happened on the speech recognition task.

D. TIME-SERIES DATASETS
In industry, many time-series-related DA problems are expected; however, the number of public time-series datasets used in DA remains very small.

E. MULTI-MODAL DATASETS
The core field of multi-modal deep learning is still developing, yet advances have been made in multi-modal DA. Table 15 lists common multi-modal datasets used in domain adaptation. The diversity of tasks is limited, with the majority related to facial expression / emotion recognition.

VI. CHALLENGES
Typical challenges of DA in real-world and practical settings include:
1) Few datasets in DA use: Only a few datasets (Table 11 and Table 12), viz., MNIST, MNIST-M, SVHN, USPS, Office, and Amazon reviews, are typically used by researchers. There is a need to include more datasets, both in number and in size, and to develop DA frameworks for specific applications. Further, the common datasets have few classes and instances. Results shown by researchers on diverse datasets would promote the creation of more datasets, and this diversity would lead to capturing more practical settings.
2) DA has the promise to apply to real-world problems and solve them. Researchers have started investigating and solving some of these challenges, and some are yet to be explored.
5) Need for bi-directional DA strategies: The accuracy of DA from SVHN to MNIST is high [84], while MNIST to SVHN is not very high. A general-purpose strategy is required for bi-directional DA.
6) Effective comparison metrics missing for some DA scenarios: Typically, absolute mAP is used for object detection tasks; however, for DA it is the relative mAP (source-only baseline versus after DA) that matters. It is much better than absolute mAP, as different papers also use models trained with different hyperparameters. There is a need for similarly effective comparison metrics.
7) Varied model and data parameters in DA: Fair and comprehensive evaluation of DA approaches and reusability comparison is difficult due to varied metrics, hyperparameters, and data inputs (e.g., image size). There is an imperative need for standardization of parameters where possible, e.g., image size.

VII. APPLICATIONS OF DOMAIN ADAPTATION
Given that DA includes relevant elements and supports generalization, it has found usage in many applications. Below are some motivating examples and possible future usage.

A. COMPUTER VISION (CV) DOMAIN ADAPTATION USAGE
DA in CV continues to mirror the progress of CV tasks and techniques, with a lag. The initial focus of DA in CV was on simple CV tasks like digit recognition and image classification, but later the focus expanded to the complex tasks of object detection, segmentation, depth estimation, and similar. Surveys have been done on domain adaptation for specific computer vision tasks, e.g., semantic segmentation [294] and object detection [295]. The current focus is increasingly on even more complex tasks (e.g., pose estimation, video classification), complex datasets (e.g., in-the-wild, 3D), and improving state-of-the-art DA metrics on the previously mentioned tasks. Also, due to the scarcity of data in the target domain, most DA methods adapt from synthetic or other-domain data to real data. Most of the DA work in CV is on 2-dimensional (2D) data, e.g., camera images, followed by 2D data with time, e.g., video, followed by 3D data, e.g., LiDAR (Light Detection And Ranging). A survey on LiDAR perception [214] further captures deep DA techniques. Table 17 provides a view of different CV tasks and key DA advances in those specific tasks. These tasks and techniques have found much industrial use (further discussed in the section Industrial Applications), e.g., AI imaging is widely used in the healthcare sector, while LiDAR DA is used in Advanced Driver Assistance Systems (ADAS) and autonomous driving. These techniques are also used in situations where the data is derived from different foundations (geographic, genetic, cultural, age, etc.).

B. NATURAL LANGUAGE PROCESSING (NLP) DOMAIN ADAPTATION USAGE
Similar to CV, DA in NLP has also mirrored NLP task and technique progress with a small lag. Recurrent Neural Network (RNN) based models (including LSTMs) are widely used in NLP settings. Initial research in NLP focused on improving the embedding layer and the vocabulary difference between source and target domains. Thereafter, adversarial-based methods (including GANs) were also employed for the NLP domain adaptation task. Post-2017, after the advent of attention and attention-based Transformers [5], considerable NLP research has addressed how to use pre-trained models for the task at hand. This deviates from the typical DA technique where both source and target domain data are available at once; in the case of pre-trained models, source data is not available.
NLP, and DA in NLP, have been popular in industry because data creation is much easier than in CV: no camera is needed in a business process, and much of the data is generated in the form of social media content, literature by authors, and news articles. Tasks like sentiment analysis, text classification, natural language inference, language identification, part-of-speech (POS) tagging, dependency parsing, named entity recognition (NER), question answering (Q&A), relation extraction (RE), neural machine translation (NMT), and sentence specificity prediction are used in document- and information-focused industries where a lot of text data is generated by the business processes involved. The tasks discussed in Table 18 are widely used in NLP applications in industry (industrial applications are further discussed in the section Industrial Applications).

C. SPEECH DOMAIN ADAPTATION USAGE
Most of the DA work in the speech area is in Automatic Speech Recognition (ASR). Environmental noise is the main reason a model trained on a manually collected dataset (source) does not perform well on real-world data (target) in ASR. Table 19 lists many references on how DA is used to address this mismatch and enhance quality. There is also work on text-to-speech (TTS) that employs DA to increase application domains and robustness.

D. TIME-SERIES DOMAIN ADAPTATION USAGE
The main idea of using DA on time-series data is to learn temporal latent representations of the data that are domain-invariant. However, learning these temporal representations is arduous due to dependencies among timestamps, and changes in lags/offsets make it difficult to extract a domain-invariant representation. Table 20 provides a view of how DA is used to solve the two major time-series tasks of classification and forecasting.
DA is used to improve the performance of time-series systems in healthcare [264], driver assistance systems [267], and others [319]. Also seen is a movement from univariate to multivariate time-series problem-solving.

E. MULTI-MODAL DOMAIN ADAPTATION USAGE
Domain-adapting multimodal data is highly relevant, as much real-world data is multi-modal. Multi-modal DA systems can support missing modalities in the target data ([315], [100]), and the adaptation process is much more robust than unimodal DA, reinforcing that AI and ML systems can improve by learning from multiple modalities. DA has been used in various multi-modal settings, i.e., tasks and modalities (refer to Table 21). The advances made here mirror the advances in the individual modalities and other trends (e.g., knowledge distillation).

F. INDUSTRIAL APPLICATIONS
Domain Adaptation has been widely adopted by the industry and is of relevance in Industry 4.0. Table 22 provides different use cases and how DA is used.
DA has uses in cross-industry and industry-specific use cases. DA drastically reduces not only the data requirements but also the number of machine learning / artificial intelligence models. This leads to reduced capital expenditure (CAPEX) and upfront effort; the reduced CAPEX is due to truncated activities of data procurement, data annotation, multiple model training, etc. Further, a decrease in operational expenditure (OPEX) costs and effort follows, as Machine Learning Operations (MLOps) efforts are reduced. The MLOps efforts that are reduced involve monitoring, retraining, versioning, and serving, all because of the smaller number of domain-adapted models.

VIII. FUTURE RESEARCH FRONTIERS
The future research frontiers must look at solving the challenges mentioned in section VI. Also, the body of research in DA is increasingly adopting techniques whose usage has proliferated in NLP and CV, such as attention (DAF [137], Adversarial Memory Network (AMN) (Attention + DANN + SCL MemNet) [55], federated domain adaptation [160]). Another focus is on combining two or more DA techniques. Wilson and Cook [356] mention the combination of the teacher-student network [84] and AutoDIAL [110]: AutoDIAL can replace the student network to understand the degree of adaptation. Similarly, a GAN, as a data augmentation technique, can replace the stochastic data augmentation in [84]. Such combinations of multiple techniques or methods can be useful in multimodal DA.
• Multi-domain support: To support multiple domains in DA, techniques or methods are required to deal with larger domain shifts and/or are robust. StarGAN [357] looks at multi-domain image-to-image translation and can be used in multi-domain adaptation.
• Cross-modal application: DA techniques or methods primarily developed for one modality (say text) can be used in another modality (say an image).
It is currently observed that, other than adversarial methods, not many methods are used across modalities.
• Supporting more real-world scenarios: DA researchers are looking to support more real-world scenarios. These scenarios are motivated by data limitations (unavailability, label-set difference, etc.) and environmental limitations (restricted, sequential, etc.). The current research endeavor is to support a larger domain shift in DA when applied to real-world applications. The WILDS datasets [358] provide 10 curated real-world benchmark datasets with a varied range of domain shifts. Further, DA offers the potential for reinforcement learning applications to learn in a simulated environment and then apply the learned policy to the real-world environment. More industrial applications, as part of Industry 4.0, can be supported by DA. For example, IoT or edge devices are quite varied, and they are installed in varied environments and used by varied users; this variation provides good ground for using DA.
• Use of more stable training approaches: Adversarial feature learning approaches are still the most utilized by researchers, even though training is at times unstable in practice and requires careful selection and tuning of parameters. Pseudo-learning-based approaches (including pseudo-label-based self-training) are being adopted more and more by researchers based on their strong performance and training stability. However, one drawback of pseudo-learning-based approaches is noise in the pseudo-labels, which can lead to underperformance. Researchers are now looking to employ only the more confident pseudo-predictions for training. Similarly, the use of the mean-teacher strategy is on the rise, as the approach utilizes additional regularization or feature-matching strategies that improve performance.
• Post-DA over pre-DA strategies: Post-DA techniques are becoming more common to improve ''fallen'' task accuracy. For example, Saunders and Byrne [355] used Elastic Weight Consolidation (EWC) and a lattice-rescoring technique to prop up accuracy that ''fell'' due to catastrophic forgetting during DA. However, pre-DA methods are not often found in the literature. Incorporating pre-DA knowledge of domain gaps arising from data processing (image processing techniques, text extraction techniques) may lead to a performance increase. One possible way to incorporate this would be to use multi-level constraints in adversarial-based approaches. Further research on both pre- and post-DA strategies would improve task accuracies.
• Removing bias for specific frameworks: Just as we see a classification-task bias in nearly all DA work, there also exists a research bias toward specific frameworks. A case in point is object detection DA, where nearly all DA strategies focus on Faster R-CNN. Other frameworks like YOLO, SSD, and DETR must also be evaluated for DA performance.
• Solving industrial use-cases: DA has the potential to solve many industrial AI use-cases that are not implemented today because of the economics of implementing them across multiple locations, cultures, demographics, etc., or because large domain gaps are encountered frequently. Table 23 provides a list of industrial use-cases where applying DA would lead to enormous benefits for the industry.

IX. CONCLUSION
There is an imperative need for deep networks to adapt to multiple domains to reduce costs, increase applicability, and be more human-like, the ultimate aim of artificial intelligence. This paper explores the work done on DA in deep neural networks (also known as deep DA) in multiple data domains (computer vision, NLP, multimodal, speech, time-series), reviews different methods and techniques, and mentions emerging datasets related to DA. This paper focuses on applying DA in more practical settings, in various industries, in the wild, and in real-world scenarios where the DA challenges lie. We believe that research undertaken on the mentioned future research frontiers would greatly impact DA and AI as a whole.