Fast and Easy Sensor Adaptation With Self-Training

Object detectors based on deep neural networks have the disadvantage that new labels must be acquired whenever the complementary metal-oxide semiconductor (CMOS) image sensor (CIS) is changed. In this study, we propose a fast and easy two-step sensor-adaptation method that requires no labels for the target domain: 1) simple adaptation and 2) self-training. The simple-adaptation process transfers the knowledge of the source model to the target model by updating the batch-normalization parameters, matching the feature distributions of the source and target domains. In the self-training process, we employ an ensemble-model strategy to mitigate the over-fitting problem caused by the noisy pseudo labels generated by the simple-adaptation model. Quantitative and qualitative experiments show that the proposed method can transfer knowledge from one CIS model to another, even if the data format of the target domain differs from that of the source CIS domain.


I. INTRODUCTION
Advances in deep neural networks (DNNs) have significantly improved the performance of image-based object-detection algorithms. With such improvements, DNN-based object detectors have drawn much interest in advanced driver assistance systems (ADAS) and autonomous driving (AD) [1], [2]. In general, the RGB images produced by a complementary metal-oxide semiconductor (CMOS) image sensor (CIS) with an integrated image signal processor (ISP) are fed to image-based object-detection algorithms as input. Depending on the characteristics of the ISP, RGB images exhibit different properties in terms of color, contrast, and brightness. This implies that a change in the CIS changes the RGB distributions and causes a domain-adaptation problem, resulting in performance degradation in otherwise similar scenes.
Recently, domain-adaptation algorithms for object detection have been developed based on the faster region-based convolutional neural network (Faster R-CNN) architecture [3], [4], [5], [6]. These algorithms transfer the instance-level representation from the source domain to the target domain to improve object-detection performance. However, their instance-level representation is extracted from the region-of-interest (ROI) pooling process; therefore, these domain-adaptation algorithms [3], [4], [5], [6] can only be applied to the Faster R-CNN architecture. For example, cross-domain algorithms such as GPA [5], which can work on both the source and target domains, are designed to use the ground-truth (GT) labels of the source and target datasets while training the domain-adapted model. This procedure requires an expensive GT-labeling process for the target domain.

(The associate editor coordinating the review of this manuscript and approving it for publication was Taous Meriem Laleg-Kirati.)
FIGURE 1. Overview of the proposed method. From the source model, the simple-adaptation model is trained by updating the parameters in the batch normalization (BN) layers. Then, with the ensemble self-training process illustrated in the yellow box, multiple submodels are trained using the corresponding ensemble training sets and merged into the ensemble model. The ensemble self-training can be applied repeatedly.

Other approaches have been investigated to solve the domain-adaptation problem through batch normalization (BN) parameter updates [7], [9]. In particular, because the BN layer normalizes the feature distribution, BN-based domain-adaptation algorithms can be a solution to the domain-adaptation problem invoked by a CIS change, which we call the sensor-adaptation problem. However, these domain-adaptation methods do not fully ensure stable detection performance in ADAS/AD applications. To achieve stable performance, it is essential to carry out additional training with a sufficient amount of data from the new CIS and their GTs. However, the GT-labeling process is costly
and time-consuming. The self-training method suggested in [9] and [10] is a possible solution to the problem of insufficient GT labels.
The self-training method [9], [10] generates pseudo labels, and boosts the performance of the network model using these pseudo labels [11], [12]. However, the pseudo labels include noise, i.e., false positives and false negatives, which may disturb the neural-network model training.
Several studies [13] have attempted to suppress the effect of noisy pseudo labels, including [14], which shows that ensemble filtering generalizes well on the noisy-labeling problem.
In this paper, we propose a method for solving the sensor-adaptation problem without large-scale labeled data. As shown in Fig. 1, the proposed method involves two steps: simple adaptation by updating the parameters of the BN layers, and ensemble self-training. Through the simple-adaptation process, the network model can be adapted to the feature distribution of the target sensor image, and pseudo labels can be generated using the adapted model. In the ensemble self-training process, we generate several training pseudo-label sets based on their confidence and then train multiple models with the corresponding sets. The multiple models are merged into a final ensemble model to cope with the noisy labels.
The contributions of this study are as follows.
• We propose simple adaptation, which adapts a well-trained model to the new sensor domain, even if the data format of the source domain is different from that of the target domain.
• We propose an ensemble self-training method, which trains the model from the noisy pseudo labels.
• Through experiments, we demonstrate that the proposed method transfers the knowledge of a pretrained RGB model to a model for a raw image domain.

This paper is organized as follows. In Section II, we describe existing domain-adaptation algorithms, and in Section III, the proposed method is introduced. Section III-A introduces the domain-adaptation problem and the self-training method to provide an overview of our algorithm. Sections III-B and III-C describe the simple adaptation and the ensemble self-training, respectively. We evaluate the proposed method in Section IV: the experimental settings, the evaluation on the private dataset, and the ablation studies are described in Sections IV-A to IV-C. Finally, Section V summarizes this study and presents conclusions.

II. RELATED WORKS
Domain adaptation refers to the transfer of knowledge from a source domain to a target domain. Research on domain adaptation for object-detection tasks has been based on the architecture of two-stage detectors such as Faster R-CNN. These domain-adaptation frameworks contain a classifier that distinguishes the domain of the input image. Chen et al. [3] proposed an algorithm that performs domain adaptation by training a classifier that distinguishes domains at both the instance and image levels. Chen et al. applied adversarial training to solve the domain-adaptation problem and proposed HTCN [4]. Using a domain discriminator, HTCN [4] classifies domains at the pixel level. Xu et al. classified domains at the instance and image levels and used an adversarial training method to increase the recognition performance in the target domain [6]. Xu et al. proposed GPA [5], which constructs graph relations between object proposals and trains a cross-domain object detector that can detect objects in both the source and target domains. Li et al. proposed SIGMA [15], which establishes the affinity of semantic-aware nodes utilizing graph nodes and achieves fine-grained adaptation with node-to-node graph matching by applying constraints in a structure-aware matching loss. He et al. [16] proposed an algorithm that integrates branches of both the source and target domains in a unified teacher-student learning scheme. Li et al. proposed AT [17], which trains a student network for the cross domain and a teacher network for the target domain simultaneously. The cross-domain student network is trained using strong augmentations and an adversarial training method, and the target-domain teacher network is trained from the student network using an exponential moving average. However, these methods were proposed for networks associated with a two-stage object detector and are not applicable to one-stage object detectors.
Furthermore, they did not consider data-format conversion; therefore, they cannot be applied to sensor adaptation. In the following sections, we describe a method that can be applied regardless of the network architecture and data format.

III. PROPOSED METHOD
A. OVERVIEW
A domain-adaptation algorithm is a type of transfer-learning method. A model trained on a source domain may perform poorly on a different, but related, target domain. Domain-adaptation algorithms enable the model to learn transformations that match the distributions between domains. Self-training is a variant of semi-supervised learning that trains a model using unlabeled datasets. Self-training algorithms generate pseudo labels using pretrained models, select feasible labels from the generated pseudo labels using a classifier, and train the model using the selected labels.
The proposed method is a domain-adaptation algorithm that uses self-training. Our target scenario is domain adaptation under a conversion of the input data format from RGB to raw. The following sections describe the proposed method in detail.

B. SIMPLE ADAPTATION
The first step of our method is simple adaptation. In the simple-adaptation process, the parameters of the BN layers are updated to adapt the source model to the target domain. Fig. 2 shows that the simple-adaptation model, i.e., the model whose BN-layer parameters have been updated, detects most vehicles in the target domain even when the data format is converted. However, Fig. 2 also contains many false negatives and false positives, which could be critical problems in applications such as ADAS/AD; for example, a crash or an unexpected stop may occur due to such a false negative or false positive. To cope with the effects of these noisy labels, additional training that avoids the costly and time-consuming GT annotation is essential.
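To illustrate this step, the BN statistics can be re-estimated from target-domain activations while the affine parameters inherited from the source model are kept. The following is a minimal NumPy sketch for a single BN layer; the function name and the (N, C) feature layout are our own assumptions, not the paper's implementation.

```python
import numpy as np

def adapt_bn_stats(target_features, gamma, beta, eps=1e-5):
    """Re-estimate BN statistics on target-domain features.

    target_features: (N, C) activations collected from the target domain.
    gamma, beta: (C,) affine parameters kept from the source model.
    Returns the updated (mean, var) and the renormalized features.
    """
    mean = target_features.mean(axis=0)          # new running mean
    var = target_features.var(axis=0)            # new running variance
    normalized = gamma * (target_features - mean) / np.sqrt(var + eps) + beta
    return mean, var, normalized
```

In a full network, the same update would be applied per BN layer by forwarding unlabeled target images, which is what makes this step fast and label-free.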

C. SELF-TRAINING
The simple-adaptation model can generate pseudo labels for the target-domain data, but these labels are too noisy to train the model directly with a supervised learning scheme. In this section, a self-training scheme designed to train a noise-robust model is described. Our self-training scheme can be applied repeatedly, and one cycle of training is called a step. Each self-training step is based on the ensemble training method [14]. We trained three models, namely, a high-precision/low-recall model, a moderate model, and a low-precision/high-recall model, and merged them into a final ensemble model. For this purpose, we generated three training sets, one per model, from the overall pseudo labels. Each training set was composed of training data suitable for the characteristics of the corresponding model.
At each step, we generated pseudo labels with the ensemble model of the previous step; in the first self-training step, the simple-adaptation model was used. Each label was composed of the pixel coordinates of a bounding box and its confidence,

l_i = (x^tl_i, y^tl_i, x^br_i, y^br_i, c_i),

where the subscript i indicates the index of a label, and the superscripts tl and br indicate the top-left and bottom-right corners, respectively; x and y are the pixel coordinates and c is the confidence. From the overall pseudo labels, three training sets were generated using confidence thresholds and a position condition. Each training set was composed of three types of labels: negative, positive, and ignored. A label whose confidence was lower than the lower threshold t^(n)_low, or whose bounding-box bottom was located above the vanishing point, was classified as negative. A label whose confidence was higher than the upper threshold t^(n)_high was classified as positive. Labels that corresponded to neither positive nor negative labels were classified as ignored. The negative label set N^(n), positive label set P^(n), and ignored label set I^(n) were defined as follows:

N^(n) = { l_i | c_i < t^(n)_low or y^br_i < p_vp,y },    (1)
P^(n) = { l_i | c_i ≥ t^(n)_high and y^br_i ≥ p_vp,y },    (2)
I^(n) = { l_i | t^(n)_low ≤ c_i < t^(n)_high and y^br_i ≥ p_vp,y },    (3)

where p_vp,y denotes the vertical position of the vanishing point. The pseudo labels in N^(n) were assumed to be false positives and were removed from the training set, and those in P^(n) were assumed to be object GTs and were used as the training targets. The ignored labels in I^(n) did not generate losses; in other words, if a network output was associated with an ignored label, then neither the classification loss nor the regression loss was added to the total loss.
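The negative/positive/ignored split described above can be sketched as follows. The function and parameter names (`t_low`, `t_high`, `vp_y`) are illustrative assumptions, and image coordinates are taken to increase downward, so a box bottom above the vanishing point satisfies `y_br < vp_y`.

```python
def partition_pseudo_labels(labels, t_low, t_high, vp_y):
    """Split pseudo labels into negative, positive, and ignored sets.

    labels: iterable of (x_tl, y_tl, x_br, y_br, conf) tuples.
    A label is negative if its confidence is below t_low or its bounding-box
    bottom lies above the vanishing point; positive if its confidence is at
    least t_high (and the box bottom is below the vanishing point);
    everything else is ignored and generates no loss during training.
    """
    negative, positive, ignored = [], [], []
    for (x_tl, y_tl, x_br, y_br, c) in labels:
        if c < t_low or y_br < vp_y:
            negative.append((x_tl, y_tl, x_br, y_br, c))
        elif c >= t_high:
            positive.append((x_tl, y_tl, x_br, y_br, c))
        else:
            ignored.append((x_tl, y_tl, x_br, y_br, c))
    return negative, positive, ignored
```

Running this once per threshold pair yields the three ensemble training sets.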
To determine the upper and lower thresholds of the three training sets, we measured the numbers of true positives (TP) and false positives (FP) on a validation set at confidence intervals of 0.025. The threshold at which TP − FP was closest to zero was set as α, and we then set β = α − 0.1 and γ = α − 0.2. From α, β, and γ, the upper and lower thresholds of each training set were assigned, with the strictest thresholds producing the high-precision/low-recall set and the loosest thresholds producing the low-precision/high-recall set. With these training sets, three models were trained in a supervised learning scheme using a focal loss [18] for classification and a smooth-L1 loss [19] for regression.
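The threshold search might look like the following sketch, which sweeps confidence values in 0.025 steps over validation detections and returns the threshold where TP − FP is closest to zero. The function name and input layout (parallel lists of confidences and true-positive flags) are assumptions for illustration.

```python
def find_alpha(confidences, is_tp, step=0.025):
    """Return the confidence threshold (swept in `step` intervals) at which
    TP - FP on the validation detections is closest to zero.

    confidences: detection confidences; is_tp: matching booleans
    (True for a true positive, False for a false positive).
    """
    best_t, best_gap = 0.0, float("inf")
    n_steps = int(round(1.0 / step)) + 1
    for k in range(n_steps):
        t = k * step
        # Count TPs and FPs among detections kept at this threshold.
        tp = sum(1 for c, ok in zip(confidences, is_tp) if c >= t and ok)
        fp = sum(1 for c, ok in zip(confidences, is_tp) if c >= t and not ok)
        if abs(tp - fp) < best_gap:
            best_gap, best_t = abs(tp - fp), t
    return best_t
```

The other two thresholds then follow directly as β = α − 0.1 and γ = α − 0.2.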
Subsequently, we set the network parameters of each layer in the ensemble model to the weighted sum of the corresponding layer parameters of the three models:

w^ens_{c_o,c_i,x,y} = Σ_{n=1}^{3} ω_n w^(n)_{c_o,c_i,x,y},    (4)

where w^(n)_{c_o,c_i,x,y} is a network parameter of the n-th trained model and ω_n is the predefined weight used to merge the trained models; these weights may affect the detection performance of the ensemble model.
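A hedged sketch of this weighted-sum merge, treating each model as a dict mapping layer names to NumPy parameter arrays (the paper merges actual network layers; this dict representation is our simplification):

```python
import numpy as np

def merge_models(model_params, weights):
    """Ensemble merge: each layer parameter of the ensemble model is the
    weighted sum of the corresponding parameters of the trained submodels.

    model_params: list of dicts mapping layer name -> np.ndarray.
    weights: list of merge weights omega_n (normally summing to 1).
    """
    merged = {}
    for name in model_params[0]:
        merged[name] = sum(w * p[name] for w, p in zip(weights, model_params))
    return merged
```

With equal weights this reduces to a plain parameter average; the precision- and recall-enhanced modes examined later simply change the weight vector.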
Our self-training method, i.e., from pseudo labeling to the ensemble scheme, can be repeatedly applied. Algorithm 1 summarizes the proposed method. In the following experiments, to distinguish the result of each self-training iteration, we refer to the results of the i-th iteration of self-training as Iter-i.
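As a structural sketch only, the repeated self-training summarized in Algorithm 1 can be organized as the loop below. Every callable is a placeholder for the corresponding stage described above, not the paper's implementation.

```python
def ensemble_self_training(generate_pseudo_labels, build_training_sets,
                           train_submodel, merge, model, num_iters=2):
    """One possible shape of the repeated self-training loop (Iter-1, Iter-2, ...).

    generate_pseudo_labels: model -> pseudo labels on target data.
    build_training_sets: labels -> three ensemble training sets.
    train_submodel: (model, training set) -> trained submodel.
    merge: list of submodels -> weighted-sum ensemble model.
    """
    for _ in range(num_iters):
        labels = generate_pseudo_labels(model)      # label with previous model
        sets = build_training_sets(labels)          # three confidence-based sets
        submodels = [train_submodel(model, s) for s in sets]
        model = merge(submodels)                    # ensemble for the next iter
    return model
```

In the first iteration, `model` would be the simple-adaptation model; afterwards, the ensemble model of the previous step takes its place.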

IV. EXPERIMENTS
A. EXPERIMENTAL SETTINGS
In the experiments, we adapted a pretrained model trained on 8-bit RGB-format data captured by a Pointgray RGB camera. The architecture of the pretrained network is FoveaBox [20], with a ResNet-50 backbone and an FPN neck [21]. Focal loss [18] and smooth-L1 loss [19] were used to train the network.
We conducted experiments to adapt the model to two types of target datasets: 16-bit raw and 8-bit RGB image datasets, both captured using a Samsung S5K2G1 image sensor. The 16-bit raw images had the RCCB Bayer-pattern format and were captured without an ISP, whereas the 8-bit RGB images were captured using the ISP installed on a Samsung evaluation board. To distinguish the 8-bit RGB images from the Pointgray RGB images (source domain), the target RGB images are called ''RA1B''. Both the source and target datasets contained daytime highway scenes, and vehicle classes such as cars, trucks, and vans were considered in the following experiments.
We captured more than 100,000 raw and RA1B images and built a private dataset. Note that because it is not possible to capture the raw and RA1B data simultaneously, the scenes in the raw and RA1B datasets are not exactly the same. We then randomly sampled, without replacement, 10,000 images for training the simple-adaptation model, 5,000 images for Iter-1, and 10,000 images for Iter-2 from both the raw and RA1B target datasets. To validate our algorithm, we annotated 3,500 raw images and 1,200 RA1B images for the validation set. As mentioned in Section III-C, the validation set was used to select the self-training parameters.
We used the average precision (AP), Precision, Recall, and F1-score to evaluate the algorithm. Precision measures the fraction of positive predictions that are correct, and Recall measures the fraction of actual positives that are correctly predicted by the classifier:

Precision = TP / (TP + FP),  Recall = TP / (TP + FN),    (5)

where TP, FP, and FN are the numbers of true positives, false positives, and false negatives, respectively. The F1-score is a measure combining both Precision and Recall:

F1 = 2 · Precision · Recall / (Precision + Recall).    (6)

AP is an evaluation metric computed by averaging the precision over recall values from 0 to 1. AP indicates the general performance of an object detector without a fixed confidence threshold, whereas the other measures indicate the detection performance at a specific threshold.
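These counts map to the metrics above in a few lines; the zero-division guards are our own choice for the degenerate cases.

```python
def detection_metrics(tp, fp, fn):
    """Precision, Recall, and F1-score from detection counts.

    tp, fp, fn: numbers of true positives, false positives, and
    false negatives. Returns (precision, recall, f1), with 0.0 used
    whenever a denominator would be zero.
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```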

B. EVALUATION OF THE PRIVATE DATASET
An experiment was conducted using the private dataset. Fig. 3 shows the qualitative results and Table 1 shows the quantitative performance of the proposed method. As shown in the first column of Fig. 3, the simple-adaptation model finds most of the objects, but some vehicles are missed. The second and third columns of Fig. 3 show the results of the Iter-1 and Iter-2 ensemble models, respectively, with the same ensemble weight, ω_1 = ω_2 = ω_3 = 1/3. The Iter-1 and Iter-2 ensemble models detect vehicles missed by the simple-adaptation model. Table 1 lists the performance enhancement at each step of the proposed method. The source model shows a high detection performance in the source domain, but its performance decreases in the other domains. The SA model denotes the simple-adaptation model; updating the parameters of the BN layers adapts the RGB model to each target domain and increases the detection performance. For the raw target dataset, our self-training process at Iter-1 and Iter-2 increases the AP by 0.011 and 0.007, respectively. In contrast to the raw-image case, no performance improvement from Iter-2 is found with the RA1B target dataset. Because the RA1B images have the same color format and bit depth as the source domain, the network for RA1B appears to converge faster than that for the raw images.

C. ABLATION STUDY
1) ENSEMBLE WEIGHTS
To confirm the effect of the ensemble weights, we tested three ensemble-weight modes, i.e., the equal-weighted mode, the recall-enhanced mode, and the precision-enhanced mode, and evaluated their detection performance on the raw-image dataset. The equal-weighted mode is the basic form that gives the same weight to the three submodels, i.e., ω_1 = ω_2 = ω_3 = 1/3. The precision-enhanced mode is a setting designed to reduce false positives, and the recall-enhanced mode is a setting designed to reduce false negatives. To this end, we set ω_1 = 0.6, ω_2 = 0.3, ω_3 = 0.1 for the precision-enhanced mode, and ω_1 = 0.1, ω_2 = 0.3, ω_3 = 0.6 for the recall-enhanced mode. Table 2 describes the detection performance according to the self-training iterations and ensemble-weight modes. FN is similar regardless of the iteration index and ensemble-weight mode, and the AP changes according to FP. In the results of Iter-1, the AP varies according to the ensemble-weight mode, whereas in the results of Iter-2, the AP is similar regardless of the ensemble-weight mode.

2) EXPERIMENT ON CITYSCAPES DATASET
We also tested our method on the Cityscapes [2] dataset. Because no detection annotations exist in the Cityscapes dataset, we converted the segmentation annotations to 2D detection annotations. The RGB images in the Cityscapes dataset were converted to a raw-image format using the reversible imaging pipeline [22], and our method was evaluated on these pseudo raw images. We compared our method with GPA [5] and AT [17]. GPA [5] is a supervised domain-adaptation algorithm for object detection that uses labels in both the source and target domains. We trained the baseline Faster R-CNN model using the KITTI [1] dataset and performed GPA [5] on the Cityscapes dataset in both the raw and RGB image formats.
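For intuition only, a toy RGB-to-RCCB mosaicking function is sketched below. The actual experiments use the full reversible imaging pipeline of [22], which also inverts tone mapping, gamma, and color correction; this simplified mosaic, which approximates the clear (C) channel by the channel mean, is purely illustrative.

```python
import numpy as np

def rgb_to_rccb_mosaic(rgb):
    """Toy RGB -> RCCB-like Bayer mosaic (illustrative only).

    rgb: (H, W, 3) float array with H and W even. Returns an (H, W)
    single-channel mosaic with the per-2x2-block pattern
        R C
        C B
    where C (clear) is approximated here as the mean of the RGB channels.
    """
    h, w, _ = rgb.shape
    clear = rgb.mean(axis=2)
    mosaic = np.empty((h, w), dtype=rgb.dtype)
    mosaic[0::2, 0::2] = rgb[0::2, 0::2, 0]   # R sites
    mosaic[0::2, 1::2] = clear[0::2, 1::2]    # C sites
    mosaic[1::2, 0::2] = clear[1::2, 0::2]    # C sites
    mosaic[1::2, 1::2] = rgb[1::2, 1::2, 2]   # B sites
    return mosaic
```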
AT [17] is a cross-domain adaptive method that learns labeled source data and unlabeled target data through student and teacher networks, respectively, with the cross-domain student and target-domain teacher models trained simultaneously. Table 3 presents the quantitative experimental results. Since GPA [5] requires a two-stage object-detector architecture, such as Faster R-CNN, we could not perform an equivalent quantitative experiment; hence, the quantitative experiment only shows a performance comparison based on the data format of each algorithm. As shown in Table 3, the proposed method enhances the detection performance at each step, whereas GPA [5] cannot be adopted in the raw-image format. We compare the qualitative results of GPA [5] and our method in Fig. 4. As shown in Fig. 4 (a) and (b), GPA performs object detection well on the RGB image but misses most of the vehicles in the 8-bit raw image. This result indicates that GPA [5] cannot perform domain adaptation across different data formats. In contrast, as illustrated in Fig. 4 (c) and (d), the proposed method performs domain adaptation from the RGB format to the raw format.
The performance of the student and teacher models of AT [17] was evaluated on the Cityscapes 8-bit raw data. The performance of the cross-domain student model was as low as that of GPA. The teacher model outperformed the student model, but not the simple-adaptation (SA) model of the proposed method.

3) EXPERIMENT ON LOW-BRIGHTNESS SCENES
To show that our method can be applied to various environments, we collected data from low-brightness environments. Because there are very few vehicles under little lighting at night, it may take a considerable amount of time to collect a sufficient number of nighttime scenes compared to daytime scenes. Therefore, we collected scenes in tunnels, whose brightness is similar to that of nighttime scenes. We attempted to train the detection network using only in-tunnel scenes without any daytime scenes, but the in-tunnel scenes alone were not enough to train a model. Hence, we mixed in-tunnel and daytime scenes into the unlabeled training sets and evaluated the detection performance on the in-tunnel validation set. Table 4 shows the quantitative results for the in-tunnel scenes, and Fig. 5 shows the qualitative results for low-brightness scenes. The detection performance of the proposed method in tunnel scenes is lower than that in daytime scenes; however, the proposed method still improves the detection performance, as in the daytime case. Fig. 5 shows that the detector trained on in-tunnel scenes can be used for low-brightness scenes, including both in-tunnel and urban nighttime scenes.

V. CONCLUSION
In this paper, we proposed a fast and easy sensor-adaptation method. In the proposed method, a simple-adaptation process transfers the knowledge of a source model to a target model suited to the target-domain dataset, which was captured by a different sensor, by updating the parameters of the BN layers. To cope with the noisy-labeling problem of pseudo labels, the proposed self-training method utilized an ensemble training method. Through experiments, we demonstrated that the proposed method enhanced the detection performance without the costly and time-consuming GT-labeling process. In the ablation studies, we adjusted the ensemble weights and showed that our method had a stable detection performance, independent of the ensemble weights. Through comparisons on the Cityscapes dataset, we showed that the proposed method performed domain adaptation between different data formats, whereas the conventional method performed domain adaptation only within the RGB format.

ACKNOWLEDGMENT
(Jinhyuk Choi and Byeongju Lee contributed equally to this work.)

JINHYUK CHOI received the B.S. degree in computer science and engineering from Korea University, Seoul, South Korea, in August 2011.
Since 2019, he has been a Staff Researcher with the Samsung Advanced Institute of Technology (SAIT), Suwon, South Korea. His research interests include visual object tracking, multiobject tracking, LiDAR-based object detection, deep learning, computer vision, machine learning, and autonomous driving.
BYEONGJU LEE was born in Daejeon, South Korea, in 1987. He received the B.S. degree in electrical and computer engineering and the Ph.D. degree in electrical engineering and computer science from Seoul National University, South Korea, in 2013 and 2020, respectively.
Since 2020, he has been a Staff Researcher with the Samsung Advanced Institute of Technology (SAIT), Suwon, South Korea. His research interests include visual object tracking, multi-object tracking, LiDAR-based object detection, deep learning, computer vision, machine learning, and autonomous driving. He is currently a Project Leader of the Autonomous Driving Team, SAIT. His current research interests include deep learning, computer vision, ADAS and autonomous driving, nonlinear control, and robotics. He has published over 30 international papers in these areas.

VOLUME 11, 2023