An Alternative Hard-Parameter Sharing Paradigm for Multi-Domain Learning

Hard-parameter sharing in multi-domain learning (MDL) allows domains to share some model parameters in order to reduce storage cost while improving prediction accuracy. One traditional paradigm of this sharing practice borrows an idea from multi-task learning (MTL): share the bottom layers of a deep neural network among domains while using separate top layers for each domain. However, it is unclear whether the effectiveness of sharing bottom parameters in MTL transfers to MDL. In this work, we therefore revisit this common practice via an empirical study on image classification tasks over a diverse set of visual domains and make two surprising observations. (1) Using separate bottom-layer parameters can achieve significantly better performance than the common practice, and this phenomenon holds across different numbers of jointly trained domains, different backbone architectures, and different quantities of domain-specific parameters. (2) A multi-domain model with a small proportion of domain-specific parameters from bottom layers can achieve competitive performance with independent models trained on each domain separately. Our observations suggest adopting the new paradigm of separate bottom-layer parameters as a stronger baseline for model design in MDL.


I. INTRODUCTION
Deep neural networks (DNNs) have evolved rapidly in recent years, showing excellent performance in many areas of artificial intelligence (AI), including vision tasks. AI-powered applications increasingly adopt DNNs to solve many application-related tasks on single or multiple data streams, resulting in more than one DNN model running simultaneously on resource-constrained embedded devices. For example, a social robot may perform many fine-grained image classification tasks, including identifying facial expressions, gestures, and postures to communicate smoothly with human beings, and recognizing scene-specific objects such as blackboards, desks, and monitors in a classroom for environment perception [1].
Although the recent progress on efficient model design and model compression has made it easier to deploy a single model on device, supporting many models for different tasks is still challenging due to the linearly increasing bandwidth, energy, and storage costs. The problem becomes even worse when users are allowed to personalize their applications and add more tasks or datasets, leading to a continuously expanding set of models to support on device.

(The associate editor coordinating the review of this manuscript and approving it for publication was Shovan Barma.)
An effective approach to address the problem is multi-domain learning (MDL), where several different domains are learned jointly. For example, one can build a single model that learns across images collected with different camera types or images that contain different object types for image classification tasks. Compared to learning each domain separately, MDL allows some parameter sharing across domains to take advantage of the domains' similarities for improved task performance and reduced storage cost.
MDL is closely related to multi-task learning (MTL), where a set of tasks are learned jointly. Typically, MTL refers to learning different downstream tasks (e.g., depth estimation and semantic segmentation) together, while MDL aims to learn on multiple datasets (e.g., data collected from multiple sources with differing statistical bias), simultaneously addressing the task corresponding to each dataset [2]. Despite the subtle differences, both MDL and MTL face an open question of how to share parameters across domains or across tasks. In this paper, we use the terms ''domain'' and ''task'' interchangeably as each domain is associated with one task.

FIGURE 1. The forward propagation of a multi-domain model given a domain's input. The model consists of domain-specific convolution filters, BN layers, and a classifier. When a set of filters in a convolutional layer is domain-specific, the activation maps from these filters (called the filter-level conv feature) replace the corresponding activation maps from the backbone architecture (called the backbone feature). The combined feature is fed into the next layer. (Best viewed on screen.)
Inspired by the effectiveness of the hard-parameter sharing strategy in MTL for improving task performance, researchers believe the same strategy is also a promising solution for MDL. Hard-parameter sharing is generally applied by sharing the bottom layers among all tasks, while keeping several top layers and an output layer task-specific [3]. It is commonly used in designing multi-task DNN models in the literature [4], [5], [6] and gains popularity in MDL [7], [8], [9].
However, it is unclear whether the effectiveness of sharing parameters in bottom layers in MTL can transfer to MDL. As shown in Figure 1, a DNN model usually relies on a stack of layers (including bottom and top layers) to transform inputs into features and then an output layer to produce predictions based on the features. Researchers [10], [11], [12], [13] believe that the first several layers (bottom layers) serve as low-level feature extractors such as edge detectors and corner detectors, which should be shareable across multiple tasks. In contrast, the top layers are more sensitive to the input data (i.e., the training task), implying that different tasks should possess distinct top layers to generate diverse high-level features. This belief is intuitively reasonable but lacks solid experimental verification for tasks in diverse domains. Whether the commonly used hard-parameter sharing strategy can lead to satisfying performance in MDL remains an open question.
In this work, we show through an empirical study on fine-grained image classification tasks and a popular multi-domain learning benchmark, Decathlon [14], that the above sharing strategy does not produce the best prediction accuracy in MDL. Specifically, the common hard-parameter sharing strategy, which shares low-level (bottom-layer) parameters while keeping high-level (top-layer) ones domain-specific, is compared to its counterpart, which uses separate bottom-layer parameters for each domain and shares top-layer ones. For ease of description, we refer to the common hard-parameter sharing strategy as top-specific and the counterpart as bottom-specific, which we consider an alternative hard-parameter sharing paradigm for MDL. Both strategies use a separate output layer for each domain to fit the needs of different output dimensions. We refer to a model that is trained on diverse domains as a multi-domain model. Based on an extensive evaluation on four representative convolutional neural network (CNN) architectures, we make two major observations.
• Multi-domain models constructed using the bottom-specific strategy could achieve significantly better performance than those constructed using the top-specific strategy (i.e., the common practice). Controlled experiments show that this phenomenon can be reproduced with different numbers of domains trained together, on different backbone architectures, and using different quantities of domain-specific parameters.
• Multi-domain models with few domain-specific parameters from bottom layers can achieve the same, if not better, performance as independent models trained separately on each domain in terms of validation accuracy. This not only confirms the benefits of hard-parameter sharing in reducing overfitting, but also introduces the bottom-specific strategy as a strong baseline for model design in MDL.
These observations suggest that the community rethink the design of hard-parameter sharing strategies in MDL. In particular, because the top layers of a modern CNN architecture are usually wider, they tend to have higher redundancy and a representation capability that is not fully exploited when trained on a single task. Several prior studies [15], [16], [17] empirically demonstrate that bottom layers are less redundant than top layers in existing architectures. These studies suggest a potential explanation for our observation: the bottom-specific strategy could achieve better performance than the top-specific strategy because it makes full use of the capacity in top layers while alleviating task interference by increasing the representation power of bottom layers.
The rest of the article is organized as follows. Section II introduces related work in multi-domain learning. The methodology and experimental settings for our empirical study are presented in Sections III and IV. Comparison results and related discussions are presented in Sections V and VI. Section VII summarizes this article.
Note that a preliminary version of this manuscript was accepted by ICME 2022 [18]. The following improvements are made in this version: 1) In Section IV, two initialization options for domain-specific classifiers are compared to suggest classifier pre-training in MDL for better validation performance. 2) As an empirical study, we provide more experimental evidence for using the bottom-specific strategy as a stronger paradigm for model design in MDL. Specifically, more comparisons under different numbers of domains and different quantities of domain-specific parameters are included in Section V-A. 3) In Section V-D, the bottom-specific strategy is also compared to a state-of-the-art task grouping framework in MTL under the same resource constraint to indicate that the new paradigm could serve as a stronger baseline for future works. Besides, the performance gap between the baseline models with and without separate batch normalization (BN) layers also verifies the benefits of our methodology in Section III.

II. RELATED WORK

A. MULTI-TASK LEARNING
Multi-task models can achieve higher prediction accuracy and improved efficiency for each task because they can benefit from commonalities across related tasks [3], [19], [20]. Recent works in MTL create multi-task models based on popular DNN architectures called backbone architectures. They fall into either hard-parameter sharing or soft parameter sharing [3]. In hard-parameter sharing, most or all of the parameters in the backbone architecture are shared among tasks. In soft parameter sharing, each task has its own set of parameters. Task information is shared by applying regularization on parameters during training (e.g., enforcing the weights of the model for each task to be similar). Compared with soft parameter sharing where each task still keeps its own model and parameters, hard-parameter sharing allows multiple tasks to share some model parameters and enjoys the benefits of reduced storage cost and inference latency. This paper thus focuses on hard-parameter sharing.
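As a concrete illustration of the hard-parameter sharing structure described above (not the paper's actual models), the following PyTorch sketch uses a shared trunk with one output head per task; the layer sizes and task names are hypothetical:

```python
import torch
import torch.nn as nn

class HardSharingModel(nn.Module):
    """Hard-parameter sharing: a shared trunk plus one head per task."""
    def __init__(self, num_classes_per_task):
        super().__init__()
        # Shared bottom layers (trunk): every task reuses these weights.
        self.trunk = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Task-specific output layers (heads).
        self.heads = nn.ModuleDict({
            name: nn.Linear(16, n) for name, n in num_classes_per_task.items()
        })

    def forward(self, x, task):
        return self.heads[task](self.trunk(x))

model = HardSharingModel({"aircraft": 100, "birds": 200})
logits = model(torch.randn(2, 3, 32, 32), task="birds")
```

Soft parameter sharing, by contrast, would keep one full copy of the trunk per task and couple them only through a regularization term during training.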
One of the widely used hard-parameter sharing strategies was proposed by Caruana [19], [21]: share the bottom layers of a model across tasks. For instance, Multi-linear Relationship Networks [4] share the first five convolutional layers of AlexNet and use task-specific fully-connected layers for different tasks. Similarly, Meta Multi-Task Learning [6] shares the input layer, makes the two top layers task-specific, and uses trainable parameters to decide whether the inner hidden layers are shared or task-specific. This structure substantially reduces the risk of overfitting but may suffer from optimization conflicts caused by task differences, because tasks may compete for the same set of parameters in the shared bottom layers. Recent works [22], [23] attempt to determine which layers to share via neural architecture search. This line of work can automatically generate a compact multi-task model fit for the given tasks. However, it needs a complex training procedure, and the learned sharing strategy cannot be generalized to other tasks directly. Although researchers resort to different strategies when developing multi-task models, it is unclear how to effectively decide which parameters to share given a backbone architecture and a set of tasks of interest. Sharing bottom layers while keeping top layers task-specific is still the most commonly used paradigm.

B. MULTI-DOMAIN LEARNING
Multi-domain learning (MDL) aims at utilizing a single network to perform target tasks in a diverse set of domains. This paper focuses on MDL and studies how to design a compact model that jointly learns representations for all the domains with a few domain-specific parameters.
There are two types of approaches to developing multi-domain models. The first designs various adapter modules (e.g., the Batch Normalization adapter [24], the 1 × 1 convolutional adapter [25], and the residual adapter [14]) and plugs them into the backbone architecture. The entire backbone architecture remains domain-agnostic and is shared across domains, while the adapters are domain-specific. A recent study [26] has shown that the choice of adapters and the locations where they are plugged in depend on the set of domains; it leverages neural architecture search to figure out which adapter to use and where to add adapters for a given set of domains.
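To make the adapter idea concrete, the PyTorch sketch below wraps a shared convolution with a per-domain residual 1 × 1 adapter in the spirit of the residual adapter of [14]; the module and domain names are our own illustration, not code from the cited works:

```python
import torch
import torch.nn as nn

class ResidualAdapterConv(nn.Module):
    """A shared 3x3 convolution followed by a per-domain residual
    1x1 adapter: the backbone conv stays domain-agnostic, while each
    domain owns a small corrective branch."""
    def __init__(self, in_ch, out_ch, domains):
        super().__init__()
        self.shared = nn.Conv2d(in_ch, out_ch, 3, padding=1)  # domain-agnostic
        self.adapters = nn.ModuleDict({
            d: nn.Conv2d(out_ch, out_ch, 1) for d in domains  # domain-specific
        })

    def forward(self, x, domain):
        h = self.shared(x)
        return h + self.adapters[domain](h)  # residual per-domain correction

layer = ResidualAdapterConv(3, 8, ["sketch", "photo"])
out = layer(torch.randn(2, 3, 16, 16), domain="photo")
```

The adapter adds only out_ch × out_ch (plus bias) parameters per domain per layer, which is why this family of approaches keeps the per-domain overhead small.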
The second type of approach follows the common practice of hard-parameter sharing in MTL. Some researchers [7], [8], [9] propose to share bottom layers and design sophisticated top layers for each domain. However, it remains an open question whether the effectiveness of sharing parameters in bottom layers in MTL can transfer to MDL. In other words, it is unclear whether the common practice (i.e., the top-specific strategy) adopted for tasks from a single input domain can lead to similarly satisfactory performance for tasks from different domains. In this work, we aim to answer this open question by revisiting the common practice using typical CNN architectures and providing insights on designing a better hard-parameter sharing paradigm for MDL.

C. MULTI-DOMAIN LEARNING APPLICATIONS
Many real-world applications have adopted MDL algorithms and observed substantial performance improvements. For multi-lingual machine translation tasks, when some model parameters are allowed to be shared, the performance on translation tasks having limited training data can be improved by jointly learning with tasks having a large amount of training data [27]. When building recommendation systems, MDL is also found helpful in providing context-aware recommendations. In [28], a text recommendation task is improved by sharing feature representations for items or users. In [29], a top-specific model is used to learn a ranking algorithm for video recommendation. Motivated by the wide adoption of MDL, we aim to revisit the traditional hard-parameter sharing paradigm with typical architectures and provide insights on designing better hard-parameter sharing strategies.

III. METHODOLOGY
Our goal is to study the performance of different hard-parameter sharing strategies via controlled experiments, in which the number of tasks, the backbone architectures, and the quantity of task-specific parameters are taken into account. This section describes three design considerations we apply to the studied sharing strategies. Details of the experiment settings and results are reported in Sections IV and V.

A. DOMAIN-SPECIFIC PARAMETERS IN THE FILTER GRANULARITY
Our experiments focus on fine-grained image classification tasks, where each task has its own dataset. CNN models naturally become the backbone architectures due to their superior performance on vision tasks and their popularity in the MDL literature. To create multi-domain models, we determine domain-specific parameters in the granularity of filters instead of layers. Specifically, we will compare the performance of multi-domain models created using the common practice (i.e., the top-specific strategy) and its counterparts (i.e., the bottom-specific strategy) given the same targeted amount of domain-specific weights. The filter-level granularity allows us to precisely control the percentage of domain-specific parameters and ensures that each strategy in comparison can have the same percentage in controlled experiments. Our experiments show that both the amount of domain-specific parameters and where these parameters come from (e.g., bottom layers or top layers) have a significant impact on the performance of a multi-domain model.
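To illustrate why filter-level granularity permits precise budget control, the pure-Python sketch below greedily marks whole filters as domain-specific until their weights approach a target fraction of all convolution weights; the layer sizes are toy values and the function name is ours:

```python
def filters_for_budget(layer_filter_params, target_frac, from_bottom=True):
    """Greedily pick whole filters until their weight count reaches
    target_frac of all convolution weights.
    layer_filter_params[l][f] = number of weights in filter f of layer l."""
    total = sum(sum(layer) for layer in layer_filter_params)
    budget = target_frac * total
    layers = list(range(len(layer_filter_params)))
    if not from_bottom:
        layers.reverse()  # start from the top layers instead
    picked, used = [], 0
    for l in layers:
        for f, n in enumerate(layer_filter_params[l]):
            if used + n > budget:
                return picked, used / total
            picked.append((l, f))
            used += n
    return picked, used / total

# Toy network: 3 conv layers whose filters get heavier toward the top.
net = [[27] * 4, [108] * 8, [432] * 16]
selected, achieved = filters_for_budget(net, target_frac=0.20)
```

Because filters are much smaller than whole layers, the achieved fraction lands close to the 20% target, which is what makes the controlled comparisons between strategies fair.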

B. SEPARATE CLASSIFIER FOR EACH DOMAIN
It is common that different domains expect a different size of outputs or even have diverse prediction goals. In our experiments, the backbone architectures use a separate classifier (i.e., the output layer) for each domain to fit the needs of different output dimensions. Although it is still feasible to allow multiple image classification tasks to share the same classifier, we observe serious performance degradation due to the aggressive sharing. Thus, following the practice in prior work [30], [31], we adopt a separate classifier for each domain.

C. SEPARATE BATCH NORMALIZATION LAYERS FOR EACH DOMAIN
We use separate Batch Normalization (BN) layers for each domain in our multi-domain models. It is motivated by a prior study [32], which shows that re-learning a set of scales and biases is sufficient to achieve comparable performance as re-learning the entire set of parameters when a pre-trained model is transferred to another task. The scales and biases correspond to parameters in BN layers in typical CNN architectures. In our experiments, we adopt the idea of making BN parameters domain-specific and observe a significant improvement on the performance of multi-domain models. Figure 1 illustrates the forward propagation of a multi-domain model given a specific domain. All domains to be learned together share the same backbone architecture. Each domain has its own BN layers, the output layer that produces logits, and a subset of convolutional filters from the backbone architecture. The remaining convolutional filters are shared by all domains. When a set of filters in a convolutional layer is domain-specific, we use the activation maps produced by these filters to replace the corresponding activation maps from the backbone architecture before the activation maps are fed into the next layer.
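The forward propagation described above can be sketched as an illustrative PyTorch module (names and sizes are ours, not the paper's released code) in which the first k activation maps from the backbone are replaced by domain-specific ones before a per-domain BN:

```python
import torch
import torch.nn as nn

class FilterSplitConv(nn.Module):
    """A conv layer where the first k output filters are domain-specific:
    their activation maps (the filter-level conv feature) overwrite the
    backbone's maps, then a per-domain BN normalizes the combined feature."""
    def __init__(self, in_ch, out_ch, k_specific, domains):
        super().__init__()
        self.shared = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.k = k_specific
        self.specific = nn.ModuleDict({
            d: nn.Conv2d(in_ch, k_specific, 3, padding=1) for d in domains
        })
        self.bn = nn.ModuleDict({d: nn.BatchNorm2d(out_ch) for d in domains})

    def forward(self, x, domain):
        feat = self.shared(x)                    # backbone feature
        spec = self.specific[domain](x)          # filter-level conv feature
        # Replace the first k activation maps with the domain-specific ones.
        feat = torch.cat([spec, feat[:, self.k:]], dim=1)
        return self.bn[domain](feat)             # domain-specific BN

layer = FilterSplitConv(3, 8, k_specific=2, domains=["cars", "dogs"])
y = layer(torch.randn(2, 3, 16, 16), domain="cars")
```

Each domain thus touches only its own filters, BN statistics, and classifier, while the remaining filters stay shared across all domains.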

IV. EXPERIMENTAL SETTINGS
To comprehensively and fairly compare the performance of different hard-parameter sharing strategies, we need to consider several potentially influential factors, including the backbone architectures, the set of domains and domain-specific parameters, parameter initialization, and hyper-parameters. We next explain each factor in detail.

A. BACKBONE ARCHITECTURES
Our experiments use four popular CNNs, including MobileNetV2 [33], ResNet50 [34], MNasNet [35], and SqueezeNet [36]. When creating multi-domain models based on each of these backbone architectures, we add a separate set of BN layers and a separate output layer for each domain. We also select a subset of filters as domain-specific parameters and share the rest among all domains to be learned jointly. Since SqueezeNet does not have BN layers, we do not have domain-specific BN layers in its multi-domain versions. The backbone models are pre-trained using ImageNet [37], and their multi-domain versions are trained jointly on target domains.

B. DATASETS
We conduct extensive experiments on five fine-grained image classification tasks: FGVC Aircraft [38], CUB-200-2011 [39], Stanford Cars [40], Stanford Dogs [41], and MIT Indoor Scenes [42]. For brevity, we refer to this collection as the FGC benchmark. Table 1 shows the dataset statistics. The number of domains to be learned jointly could affect the performance of multi-domain models. To quantify its effects, we consider four domain combinations: A + C, A + B + C, A + B + C + D, and A + B + C + D + I, where A, B, C, D, and I are the IDs of the datasets listed in Table 1.

C. DOMAIN-SPECIFIC PARAMETERS
We experiment with different quantities of domain-specific parameters and sharing strategies. Specifically, we pick different quantities of domain-specific filters such that the number of weights in these filters accounts for 0% to 100% of the total number of convolution parameters in the backbone architecture. We use a step size of 10%. Given the same percentage of domain-specific parameters, we compare three hard-parameter sharing strategies as shown in Figure 2:

• Top-Specific. As in Figure 2(a), this strategy shares filters in bottom layers and makes filters in top layers domain-specific. It is the common practice used in hard-parameter sharing [3], [31], [52].
• Bottom-Specific. Figure 2(b) shares filters in top layers while making filters in bottom layers domain-specific. This is a direct counterpart of the commonly-used topspecific strategy.
• Random. It randomly selects a subset of filters from all convolutional layers as domain-specific parameters and shares the rest. As shown in Figure 2(c), some layers will be shared or domain-specific randomly.
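The three strategies differ only in which filters they mark as domain-specific under the same count budget; a pure-Python sketch (the function name and layer sizes are illustrative):

```python
import random

def domain_specific_masks(filters_per_layer, frac, strategy, seed=0):
    """Mark a fraction of all filters as domain-specific.
    strategy: 'top' (the common practice), 'bottom' (the counterpart),
    or 'random'. Returns one boolean mask per layer."""
    flat = [(l, f) for l, n in enumerate(filters_per_layer) for f in range(n)]
    budget = int(frac * len(flat))
    if strategy == "top":
        chosen = flat[len(flat) - budget:]      # filters in top layers
    elif strategy == "bottom":
        chosen = flat[:budget]                  # filters in bottom layers
    else:
        chosen = random.Random(seed).sample(flat, budget)
    masks = [[False] * n for n in filters_per_layer]
    for l, f in chosen:
        masks[l][f] = True
    return masks

# Toy network with 4, 8, and 16 filters per layer; 25% domain-specific.
masks = domain_specific_masks([4, 8, 16], frac=0.25, strategy="bottom")
```

Under the bottom-specific strategy the budget is consumed entirely by the earliest layers, mirroring Figure 2(b).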

D. INITIALIZATION
Our goal is to eliminate randomness caused by initialization during controlled experiments to ensure the fairness of the comparison and the reproducibility of the experimental results. All multi-domain models are initialized with their corresponding backbone architecture pre-trained on ImageNet. However, the classifiers of these multi-domain models need to be initialized separately since they are specific to the targeted domains and pre-trained weights from ImageNet are not applicable. We consider two options to initialize each domain's classifier: 1) random initialization: set random seeds to guarantee the same initialization for each domain-model pair; 2) classifier pre-training: pre-train the classifier of each domain for the four backbone architectures and use the pre-trained weights to initialize the classifiers in multi-domain models. Although both options can achieve the goal of eliminating randomness, we adopt the second option due to its positive side effect on performance according to our experiments. Figure 3 shows the validation accuracy of MobileNetV2 on FGC when its classifier is initialized randomly (blue lines) or is pre-trained for the first 20 epochs (red lines). Under the same learning rate scheduling, the models with classifier pre-training consistently achieve higher validation accuracy in the end. We observed similar trends when training multi-domain models. The results also indicate that classifier pre-training could provide a better initialization for transfer learning.
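The freeze-then-finetune mechanics of classifier pre-training can be sketched in PyTorch as below; the helper and the tiny backbone are hypothetical stand-ins for the paper's setup:

```python
import torch.nn as nn

def set_classifier_pretraining(backbone, classifier, pretrain):
    """Phase 1 (pretrain=True): freeze the ImageNet-initialized backbone
    and train only the domain classifier. Phase 2 (pretrain=False):
    unfreeze everything for joint fine-tuning."""
    for p in backbone.parameters():
        p.requires_grad = not pretrain
    for p in classifier.parameters():
        p.requires_grad = True  # the classifier is always trainable

# Toy backbone and one domain's classifier head.
backbone = nn.Sequential(nn.Conv2d(3, 8, 3), nn.Flatten())
head = nn.Linear(8 * 30 * 30, 10)
set_classifier_pretraining(backbone, head, pretrain=True)
frozen = sum(p.numel() for p in backbone.parameters() if not p.requires_grad)
```

After the pre-training epochs, calling the helper with pretrain=False restores gradients to the backbone so the whole multi-domain model trains jointly.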
E. HYPER-PARAMETERS

The batch size is 64 for all domains. All the multi-domain models are trained for 150 epochs using the Adam optimizer [53] and the same learning rate scheduling. Specifically, the initial learning rate for MobileNetV2, ResNet50, and MNasNet is 0.001 and is decreased to 0.0001 after 100 epochs. The initial learning rate for SqueezeNet is 0.0001 and is then decreased to 0.00001. All experiments are conducted on a GPU cluster with Xeon E5 CPUs and NVIDIA Titan-X GPUs.
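The learning-rate schedule above can be expressed as a small helper (a sketch of the stated schedule, not released training code):

```python
def lr_at_epoch(epoch, backbone):
    """Step schedule used in the experiments: keep the initial learning
    rate for the first 100 epochs, then divide it by 10 for the
    remaining 50 (150 epochs total)."""
    base = 0.0001 if backbone == "SqueezeNet" else 0.001
    return base if epoch < 100 else base / 10

schedule = [lr_at_epoch(e, "MobileNetV2") for e in (0, 99, 100, 149)]
```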

V. RESULTS AND ANALYSIS

A. COMPARISONS BETWEEN SHARING STRATEGIES ON FGC
Our first surprising discovery is that when constructing multi-domain models, using separate bottom-layer parameters could achieve much better performance than using separate top-layer parameters, which contradicts the common belief in hard-parameter sharing in MDL.
In the experiment, we construct three multi-domain models from each backbone architecture using the three sharing strategies (top-specific, bottom-specific, and random) and train these models on FGC. We strictly control the percentage of domain-specific parameters to be the same for the multi-domain models created using the three strategies. Table 3 reports the validation accuracy of the 12 multi-domain models (4 backbone architectures × 3 sharing strategies) on the five domains. The number of domain-specific parameters is 20% of the total amount of convolution parameters in the backbone model. The rows ''random'' report the mean accuracy of two multi-domain models constructed using the random strategy. The results indicate that the bottom-specific strategy consistently achieves better performance than the top-specific one and outperforms the random strategy in most cases.
We further conduct an independent t-test [54] to show the statistical significance of adopting the bottom-specific strategy over the top-specific one. Specifically, we repeat the exact same experiment as in Table 3 ten times independently, i.e., 4 backbone architectures × 2 sharing strategies × 10 runs on the FGC benchmark. Then, for each architecture and each task, we compare the two sharing strategies by calculating the mean of the task accuracy samples and the p-value between samples from the two strategies. Table 4 includes the average task accuracy for different settings, the standard deviation, and the corresponding p-value. The higher average task accuracy of the bottom-specific strategy and the small p-values (≪ 0.01) suggest that the bottom-specific strategy is significantly better than the top-specific one with high confidence.
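This significance test corresponds to SciPy's independent two-sample t-test; the accuracy samples below are illustrative numbers, not the paper's measurements:

```python
from scipy.stats import ttest_ind

# Hypothetical per-run accuracies from 10 independent repetitions
# of each strategy (illustrative values only).
bottom_specific = [82.1, 81.9, 82.4, 82.0, 82.3, 81.8, 82.2, 82.1, 82.5, 82.0]
top_specific    = [80.2, 80.5, 80.1, 80.4, 80.3, 80.6, 80.2, 80.0, 80.4, 80.3]

# Independent two-sample t-test between the two strategies.
t_stat, p_value = ttest_ind(bottom_specific, top_specific)
significant = p_value < 0.01
```

A positive t-statistic together with a p-value far below 0.01 is exactly the pattern the paper reports in Table 4.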

1) DIFFERENT BACKBONE ARCHITECTURES
We observed the same phenomenon using different backbone architectures. Figure 4 shows the accuracy curves of multi-domain models created with different strategies. The accuracy curves of the bottom-specific strategy are always above the ones with the top-specific strategy on all five domains with different backbone architectures, indicating the better performance of the bottom-specific strategy.

2) DIFFERENT NUMBERS OF DOMAINS
The same observation also holds with different numbers of domains. Table 3 shows the performance when all five domains are trained together, while Tables 5, 6, and 7 summarize the prediction performance when two, three, and four domains, respectively, are used to jointly train multi-domain models built with the four typical backbone architectures. All the multi-domain models have the same amount of domain-specific parameters, accounting for 20% of the total number of convolution parameters in the backbone architecture. Similar to the observations from Table 3, the bottom-specific strategy consistently achieves much better performance than the top-specific strategy and outperforms the random one in most cases, regardless of the number of domains learned jointly.

3) DIFFERENT QUANTITIES OF DOMAIN-SPECIFIC PARAMETERS
The superiority of the bottom-specific strategy still holds with different quantities of domain-specific parameters. Figure 5 shows the validation accuracy of the multi-domain models whose domain-specific parameters account for 0% to 100% of the total amount of convolution parameters of their backbone architecture MobileNetV2. Note that picking 0% domain-specific parameters results in a multi-domain model with only separate BN layers and domain-specific classifiers, while possessing 100% domain-specific parameters is equivalent to building a completely independent model for each domain. The results show that the performance of the bottom-specific strategy is always higher than that of the top-specific one, as indicated by the significant gap between the red and green curves. Besides, randomly selecting domain-specific filters also consistently produces higher validation accuracy than the top-specific strategy but is worse than the bottom-specific one in most cases.

TABLE 3. The validation accuracy of the 12 multi-domain models (4 backbone architectures × 3 sharing strategies) on FGC with the same amount of domain-specific parameters for each domain (20% of the total number of convolution parameters in the backbone model). ''independent'' reports the results of independent models trained on each domain separately.

TABLE 4. T-test results on 8 multi-domain models (4 backbone architectures × 2 sharing strategies) on FGC. Each setting is executed 10 times to record the average task accuracy, the standard deviation, and the p-value between samples from different strategies.

Table 8 reports the validation accuracy of the 12 multi-domain models (4 backbone architectures × 3 sharing strategies) on FGC. The number of domain-specific parameters is 40% of the total amount of convolution parameters in the backbone architecture. This table as well as Table 3 show that even with different percentages of domain-specific parameters, the superiority of the bottom-specific strategy over the top-specific strategy still holds for multi-domain models created with four different backbone architectures.

B. COMPARISONS BETWEEN SHARING STRATEGIES ON DECATHLON
The same phenomenon can be observed when constructing multi-domain models on Decathlon with the bottom-specific and the top-specific sharing strategies. Figure 6 shows the accuracy curves of multi-domain models built on MobileNetV2 and ResNet50 with different sharing strategies under the same amount of domain-specific parameters (20% of the total number of convolution parameters in the backbone model). The bottom-specific strategy produces better accuracy than the top-specific one for all or most of the ten domains, which is consistent with the observations on FGC.

C. COMPARISON WITH INDEPENDENT MODELS
Our second discovery is that multi-domain models with a relatively small proportion of parameters selected from bottom layers for each domain can achieve competitive performance with independent models trained on each domain separately.
In Figure 5, the performance of independent models corresponds to the point where the percentage of domain-specific parameters is 100%. Overall, multi-domain models constructed with the bottom-specific strategy achieve validation accuracy competitive with independent models once the percentage of domain-specific parameters exceeds 20% for all five domains. This observation is consistent with the well-recognized benefits of MDL in reducing overfitting and improving prediction accuracy.

D. COMPARISON WITH A TASK GROUPING FRAMEWORK
The bottom-specific strategy is also compared to a state-of-the-art task grouping framework [31] in MTL. Given a set of tasks, the framework finds a set of multi-task models, each trained on a subset of the tasks, that together yield the highest overall prediction performance within a given resource constraint. The reason for comparing with the task grouping framework is that it provides another perspective on designing multi-task models within a limited computation budget. To meet the resource constraint, the traditional paradigm as well as our alternative one focus on deciding which parameters should be shared, while the task grouping framework still shares parameters throughout the whole backbone model but tries to decide which tasks should be learned together. The framework also allows some tasks to serve as auxiliary tasks if, when co-trained with targeted tasks, they improve the accuracy of a multi-task model on the targeted tasks. It is worth noting that this MTL framework shares the entire backbone model and hence does not include separate BN layers for each task. We denote this framework as Task Grouping (w/o Ind. BN). We also compare with a variant of the framework that uses task-specific BN layers and denote the variant as Task Grouping (w/ Ind. BN).
In our experiments, we assume that the resource constraint limits multi-domain models to have only twice the number of parameters in MobileNetV2 (excluding weights in the classifier). Following the steps of the task grouping framework, we obtain two domain sets, each containing domains that should be trained together. Table 9 reports the validation accuracy of our bottom-specific approach and the task grouping framework. Our multi-domain model produces the best performance in all five domains, echoing our first discovery in constructing effective multi-domain models. Meanwhile, the large gap between the performance of the task grouping framework with and without separate BN layers also provides strong evidence for the performance benefits of making BN parameters domain-specific.

VI. DISCUSSIONS
We summarize the two main observations from our experiments as follows.
• Multi-domain models with domain-specific parameters from bottom layers could achieve better performance than the ones with the same amount of domain-specific parameters from top layers.
• Multi-domain models with a relatively small quantity of domain-specific parameters from bottom layers match, if not exceed, the performance of their independent counterparts.
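The two strategies differ only in which end of the network the domain-specific budget is spent on. A minimal sketch (the layer sizes and the 40% budget below are illustrative, not MobileNetV2's actual counts) selects layers to duplicate per domain by walking from the bottom up or from the top down until the budget is exhausted:

```python
def select_domain_specific(layer_params, budget, from_bottom=True):
    """Pick layer indices to duplicate per domain until `budget` is spent.

    layer_params: parameter count of each layer, listed bottom to top.
    budget: maximum number of parameters allowed to be domain-specific.
    """
    order = range(len(layer_params)) if from_bottom else reversed(range(len(layer_params)))
    chosen, spent = [], 0
    for i in order:
        if spent + layer_params[i] > budget:
            break
        chosen.append(i)
        spent += layer_params[i]
    return sorted(chosen), spent

# Toy backbone: bottom layers are narrow, top layers wide (counts hypothetical).
layers = [10, 20, 40, 80, 160]                 # 310 parameters in total
budget = 124                                   # 40% of the total, as in Table 8
bottom, b_cost = select_domain_specific(layers, budget, from_bottom=True)
top, t_cost = select_domain_specific(layers, budget, from_bottom=False)
print(bottom, b_cost)  # [0, 1, 2] 70 -> three whole bottom layers fit
print(top, t_cost)     # [] 0        -> the widest top layer alone exceeds the budget
```

Because bottom layers are narrow, the same budget covers many more of them, which is why the bottom-specific strategy can make a larger share of the network domain-specific in depth at equal parameter cost.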
Below, we provide explanations for these observations and insights on designing better hard-parameter sharing strategies in MDL.

A. WHY DOES THE BOTTOM-SPECIFIC SHARING STRATEGY OUTPERFORM THE TOP-SPECIFIC ONE?
A potential explanation is that the top layers of a modern CNN architecture are usually much wider than the bottom layers and thus have higher representation capacity. Prior studies [16], [17], [55] have shown that bottom layers have less redundancy than top layers in existing architectures. When tackling multiple domains together, top layers may have sufficient capacity to learn diverse features, while bottom layers are easily distracted by the different domains during training. The bottom-specific strategy achieves better performance than the top-specific strategy because it makes full use of the capacity of the top layers while alleviating task interference by increasing the representation power of the bottom layers.

To validate this explanation, we conduct an experiment that compares the performance of multi-domain models built on pruned CNN variants. The rationale is that network pruning removes redundancy in CNNs: if bottom layers are less redundant than top layers, multi-domain models built on networks whose bottom layers are pruned should suffer more severely than those built on networks whose top layers are pruned. We use MobileNetV2 in the experiment and create two pruned variants. The first variant loses half of the parameters in the last top layer, a reduction of 10% of the parameters in MobileNetV2. The second variant loses half of the parameters in several bottom layers, pruned from the bottom upward until the reduction also reaches 10%. We refer to the two variants as top-pruned and bottom-pruned, respectively. We then construct multi-domain models based on these variants as well as the original MobileNetV2. All three multi-domain models have separate classifiers and BN layers for each domain, are initialized randomly, and are trained with the same learning rate schedule. Figure 7 shows the accuracy curves of the multi-domain models built on the pruned MobileNetV2 variants.
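The construction of the two variants can be sketched with hypothetical per-layer parameter counts (not MobileNetV2's real ones); here, for a matched comparison, the bottom-pruned variant removes at least as many parameters as halving the top layer does:

```python
def prune_top(layer_params):
    """Top-pruned variant: halve the parameter count of the last (top) layer."""
    pruned = list(layer_params)
    pruned[-1] //= 2
    return pruned

def prune_bottom(layer_params, target_reduction):
    """Bottom-pruned variant: halve layers from the bottom upward until at
    least `target_reduction` parameters have been removed."""
    pruned = list(layer_params)
    removed = 0
    for i in range(len(pruned)):
        if removed >= target_reduction:
            break
        removed += pruned[i] - pruned[i] // 2
        pruned[i] //= 2
    return pruned

# Hypothetical per-layer counts, listed bottom to top (648 parameters total).
layers = [64, 96, 128, 160, 200]
top_pruned = prune_top(layers)                         # [64, 96, 128, 160, 100]
bottom_pruned = prune_bottom(layers, layers[-1] // 2)  # [32, 48, 64, 160, 200]
```

Note that reaching the same reduction requires touching three bottom layers but only one top layer, since the bottom layers individually hold far fewer parameters.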
Overall, the accuracy of the model built on the top-pruned backbone is consistently higher than that of the model built on the bottom-pruned variant, and is comparable with the accuracy of the model built on the original MobileNetV2 on all five tasks. The results indicate that pruning bottom layers has a larger negative effect on validation performance than pruning top ones, implying the low redundancy and capacity of bottom layers. This experimental evidence suggests that increasing the capacity of bottom layers when constructing multi-domain models brings larger performance benefits. The reason is twofold. First, DNNs are well known to be over-parameterized [56], [57], [58], [59]. It is therefore reasonable to assume that a single DNN, especially its top layers, can largely accommodate the representation requirements of multiple domains by exploiting redundant parameters that are not fully utilized on a single domain. A small portion of domain-specific parameters from bottom layers increases the representation power of bottom layers, alleviates domain interference, and thus improves model performance. Second, the commonalities among tasks serve as implicit data augmentation [3] and help avoid overfitting. This is consistent with the common belief about the benefits of MDL.

TABLE 8. The validation accuracy of the 12 multi-domain models on FGC with the same amount of domain-specific parameters for each domain (40% of the total number of convolution parameters in the backbone architecture).
Based on the above observations and explanations, we propose to use the bottom-specific strategy as an alternative paradigm for designing multi-domain models. We further suggest several practices for model design in MDL.
(1) Using separate BN layers for each domain can significantly boost the performance of multi-domain models.
(2) Classifier pre-training can help improve the performance of multi-domain models.
(3) The representation power of a multi-domain model tends to be limited by its bottom layers rather than its top layers given tasks on a diverse set of domains; selecting domain-specific parameters from bottom layers is therefore more beneficial to performance.
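Practice (1) amounts to keeping one set of normalization statistics and affine parameters per domain while all other weights stay shared. A minimal dependency-free sketch of the idea (a real implementation would instead instantiate one framework BatchNorm layer per domain; the class and domain names here are hypothetical):

```python
import math

class DomainSpecificNorm:
    """One (gamma, beta) pair per domain; every other weight in the
    surrounding network would remain shared across domains."""

    def __init__(self, domains, eps=1e-5):
        self.eps = eps
        self.params = {d: {"gamma": 1.0, "beta": 0.0} for d in domains}

    def __call__(self, batch, domain):
        # Normalize with the current batch's statistics, then apply the
        # affine transform belonging to `domain`.
        mean = sum(batch) / len(batch)
        var = sum((x - mean) ** 2 for x in batch) / len(batch)
        g, b = self.params[domain]["gamma"], self.params[domain]["beta"]
        return [g * (x - mean) / math.sqrt(var + self.eps) + b for x in batch]

norm = DomainSpecificNorm(domains=["aircraft", "birds"])
out = norm([1.0, 2.0, 3.0], domain="aircraft")  # zero-mean, unit-scale output
```

Since BN layers hold only a tiny fraction of a CNN's parameters, making them domain-specific buys a large accuracy gain at almost no storage cost.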

VII. CONCLUSION
In this work, we revisit the common practice of hard-parameter sharing for multi-domain learning (MDL) and conduct an empirical study on fine-grained image classification tasks and Decathlon to compare the performance of different sharing strategies. Experiments show that the common sharing strategy is outperformed by its direct counterpart, namely selecting domain-specific parameters from bottom layers rather than top layers. The counterpart also achieves competitive performance compared with independent models. We further provide explanations for these observations and insights into model design in MDL.

A. LIMITATION AND FUTURE WORKS
Note that the alternative hard-parameter sharing paradigm we provide for MDL is not guaranteed to outperform more complex multi-domain model designs, but it can serve as a more robust baseline for related works. In practice, our paradigm is easy to implement for any given set of tasks and backbone models, and is suitable for deployment on resource-constrained devices. For future work, a possible extension is to determine the quantity of shared parameters for each layer based on each layer's measurable parameter redundancy.