Recognizing New Classes with Synthetic Data in the Loop: Application to Traffic Sign Recognition

On-board vision systems may need to increase the number of classes that can be recognized in a relatively short period. For instance, a traffic sign recognition system may suddenly be required to recognize new signs. Since collecting and annotating samples of such new classes may need more time than we wish, especially for uncommon signs, we propose a method to generate these samples by combining synthetic images and Generative Adversarial Network (GAN) technology. In particular, the GAN is trained on synthetic and real-world samples from known classes to perform synthetic-to-real domain adaptation, but applied to synthetic samples of the new classes. Using the Tsinghua dataset with a synthetic counterpart, SYNTHIA-TS, we have run an extensive set of experiments. The results show that the proposed method is indeed effective, provided that we use a proper Convolutional Neural Network (CNN) to perform the traffic sign recognition (classification) task as well as a proper GAN to transform the synthetic images. Here, a ResNet101-based classifier and domain adaptation based on CycleGAN performed extremely well for a ratio ∼1/4 for new/known classes; even for more challenging ratios such as ∼4/1, the results are also very positive.

Indeed, CNNs have become very accurate models provided that there is sufficient data (in size and diversity) for their training [25,26]; data refers to both the raw images and the ground truth (GT) that we must associate with them as training supervision. In fact, since CNNs are data hungry and most GT comes from a (cumbersome and error-prone) manual labeling process, providing GT at scale, reducing manual intervention, and/or reusing previous knowledge have become emerging topics for computer vision research in general and for autonomous driving in particular. For instance, active learning techniques [27][28][29][30] focus on automatically finding the a priori best training images for their posterior manual labeling out of a large number of unlabelled ones; self-labeling minimize the lack of real-world labeled data when a computer vision system must be retrained to recognize new classes in a relatively short period.
The rest of the paper is organized as follows. Section 2 reviews the most related work to our proposal. Section 3 elaborates our proposal. Section 4 details the experimental protocol, the obtained results, and the conclusions deduced from them. Finally, Section 5 summarizes the work presented in this paper and suggests future directions of research in line with our conclusions.

Related Work
Learning to recognize new classes falls into the paradigm of lifelong learning [52][53][54], where a perception-based system has to continuously adapt to situations not previously experienced. We can think of three main components of such a learning capability (see Figure 1). The first one consists of the ability to identify unknown classes, which is not trivial, since CNNs tend to be overconfident about their classification decisions. In fact, this is an active research topic known as out-of-distribution detection (which includes novelty and anomaly detection; this paper relates to the novelty case) [55][56][57][58]. The second component consists of a data generation protocol to collect training samples of these unknown classes, provided that the CNN needs to take them into account in the future. Finally, the third component consists of a procedure that allows the CNN to recognize the new classes without deteriorating its accuracy in identifying the classes for which it was initially/previously trained; this is known as learning without forgetting [59][60][61][62] and is still an open and challenging research topic. In fact, the work of [58] focuses on out-of-distribution detection for our use case, i.e., traffic sign recognition. Finally, we retrain the models to recognize the new classes without forgetting previous ones. In this paper, we focus on the second step, assuming that rather than collecting the samples from the real world, we generate them by using a virtual world.
In this paper, we focus on data generation. We assume that the unknown (novel) traffic signs are given to us in the form of a few on-board captured images. In addition, to avoid the forgetting problem when retraining the traffic sign recognition CNNs, we simply retrain using all of the available data (known and synthesized); in this way, the paper can focus on assessing the usefulness of synthesizing samples of the unknown classes. Accordingly, the remainder of this section addresses the use of synthesized visual data for training computer vision models, as well as the use of GANs to perform task-agnostic domain adaptation.
Researchers such as Taylor et al. [63] pioneered the use of videogame data for testing vision-based tracking algorithms. Marin et al. [64] extended the use of this synthetic data to train object detectors performing in real images, while Vazquez et al. [65] drew attention to the domain gap between virtual- and real-world images. Since then, the use of synthetic visual data generated from virtual environments has kept growing. We found works using synthetic data for object detection/recognition [66][67][68][69], object viewpoint recognition [70], re-identification [71], and human pose estimation [72]; building synthetic cities for autonomous driving tasks such as semantic segmentation [44,73], place recognition [74], object tracking [45,75], object detection [76,77], stixel computation [78], and benchmarking different on-board computer vision tasks [47]; building indoor scenes for semantic segmentation [79], as well as normal and depth estimation [80]; generating GT for optical flow, scene flow, and disparity [81,82]; generating augmented reality images to support object detection [83]; simulating adverse atmospheric conditions such as rain or fog [84,85]; even performing procedural generation of videos for human action recognition [86,87]. Moreover, since robotics and autonomous driving rely on sensorimotor models worthy of being trained and tested dynamically, in recent years, the use of simulators has intensified beyond datasets [48,49,88,89].
In contrast to this literature, we can leverage the already-annotated real-world images conveying a set of classes known by our current CNN-based classifier, but we have to assess the possibility of using automatically generated synthetic images as samples of classes that are unknown for our current CNN. Therefore, these synthetic images must be used to retrain the CNN to properly classify previous and new classes. We will evaluate two different settings: First, when the synthetic images are used as they come from the virtual environment; second, when the synthetic images are transformed by a GAN to look like the real-world images, i.e., as a type of task-agnostic domain adaptation where, following the domain adaptation terminology, the synthetic world acts as the source domain and the real world as the target domain.
GANs use a generator CNN to transform the appearance of source images to look like target images, and a discriminator CNN which aims at distinguishing between the transformed and the original images in the target domain. The generator-discriminator system is trained until the discriminator is not able to distinguish the origin of the images, which is understood as the point when the source images are similar enough to the target ones. This application of GANs is known as image-to-image translation.
Isola et al. [90] proposed an encoder-decoder as the generator architecture, and a patch-based (patchGAN) approach for the discriminator. Since this approach was only able to work with low-resolution images, other approaches build upon this method to overcome this problem [91]. However, a relevant observation is that these proposals require pixel-level GT about how the generated images should look, which is termed supervised image-to-image translation. In order to avoid this kind of supervision, Taigman et al. [92] designed an encoder-decoder generator in such a way that the encoder features are indistinguishable for original and transformed images. In other words, for the GT, it is only required to know if the images come from the source domain or the target one, which is always possible at training time. Liu et al. [93] also focused on generators' feature layers. Afterwards, other alternatives were proposed that did not require the mentioned supervision. Some approaches use an auxiliary task to define the loss between input and generated images; for instance, Bousmalis et al. [94] use image-level classification while Hoffman et al. [95] use semantic segmentation as auxiliary tasks. Other approaches focus on the appearance of the input and generated images. Shrivastava et al. [96] proposed an identity loss between the input and generated images. One restriction of this approach is that the source and target domain images must have similar appearances. Zhu et al. [97] and Kim et al. [98] followed the cycle idea; i.e., from source images, target-style ones are generated which, in turn, are the input to generate new source-style images. The source-to-target and the target-to-source generators are different, and each domain has its own discriminator. The cycle idea is not only useful because it does not require image GT, but also because the input and transformed images can have relatively different appearances, especially compared to the approach in [96].
In other words, in contrast to other GAN proposals, a GAN trained according to the cycle idea has the potential of properly transforming the appearance of source images showing content unseen during its training. Accordingly, in this paper, we follow the cycle idea. In particular, since [97] has publicly available code, called CycleGAN, we use it for the experiments in this paper.
Finally, the work with the most similar goal to that of this paper has recently been presented by Beery et al. [99]. The addressed application is animal detection and classification from static cameras. The paper evaluates the use of synthetic data for classifying animals for which it is difficult to have sufficient real-world image samples. Therefore, similarly to us, previous real-world image samples from known classes (animals) are leveraged for retraining their (animal) classifier together with the synthesized images containing the new class samples (they consider deer as a new class). In this paper, rather than focusing on one new class at a time, we also evaluate different balances between known and unknown classes. We also evaluate the difference between using the synthetic images as they come from the virtual environment and transforming them via GANs. In both cases, since our application falls into fine-grained classification, we also assess the dependency on common visual cues between seen and unseen classes.

Overall Idea
Assume we need a classifier $C$ such that, given an image (e.g., framing a traffic sign), it is able to assign to it a correct label from a given set $K$ of known labels/classes (e.g., traffic sign classes). Let $I$ be a set of images collected for training such a $C$. For supervised training, we need to assign one class to each image, which is usually done offline by human annotators. Let $I_K$ be the corresponding annotated set of images. Then, we can run a supervised machine learning algorithm that uses $I_K$ to generate a classifier $C_{I_K}$, which will be used (at testing time) to support the addressed application (e.g., on-board traffic sign recognition). The problem arises when, during the execution of such an application, we realize that there are classes of interest not included in $K$ (e.g., after a warning from an out-of-distribution detection module also running as part of the application). Let us call $U$ this set of new classes such that $K \cap U = \emptyset$. For training a new supervised classifier $C_{I_{K \cup U}}$ which takes into account all classes, we need to collect and manually annotate new images covering a sufficient number of instances for each class in $U$. This may be difficult to do as quickly as we could wish, since we may be facing unusual classes (i.e., for which it is difficult to find corresponding instances by just randomly roaming in the real world) and there will also be a latency due to the manual annotation of the found instances.
Accordingly, the alternative method that we want to explore relies on automatically generating synthetic images to quickly obtain sufficient annotated instances of the new classes for training a new classifier. However, as we have already mentioned, training visual models using pure synthetic images can lead to a performance drop when performing in the real world. In order to reduce such a domain gap, GANs are a possible solution; i.e., by directly transforming the images of the source domain (e.g., synthetic world) to have a similar appearance to those of the target domain (e.g., real world). We follow this approach in this study. Now, let $I_U$ be a set of synthetically generated images automatically annotated according to the classes in $U$, and $G_{I_U}$ the corresponding set of images transformed by a GAN. We aim to train a classifier $C_{\{G_{I_U}, I_K\}}$ that (ideally) performs in the real world as if $G_{I_U}$ consisted of real-world images annotated by humans. Yet another question is how the GAN is trained. From the point of view of generating synthetic images, generating images for classes in $U$ is analogous to generating them for classes in $K$. Therefore, we require that besides the set of real-world images $I_K$, we also have a set $\tilde{I}_K$ of (automatically) annotated synthetic images for the known classes; i.e., both sets cover the same classes, but the number of samples per class may differ between the two sets. The GAN is trained to transform images from the set $\tilde{I}_K$ into the set $I_K$, but without assuming a one-to-one pairing of the images from both sets. In other words, the GAN will learn to perform domain-to-domain transformations, but not class-specific transformations between domains.
Therefore, when we need to transform synthetically generated instances for a new previously unknown class (i.e., in U ), we can apply the previously learned GAN even if it was not exposed to such a class during the training time and, in fact, it will not be exposed at this time, due to the lack of real-world instances of this class. Figure 2 depicts the overall idea.
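The data flow described above can be sketched in a few lines of Python. This is only an illustration under strong assumptions: `fit_domain_transform` is a hypothetical toy stand-in that matches per-channel color statistics, not CycleGAN, and all function names and the random images are made up for the example. The point it shows is that the transform is fitted only on known-class images, ignores their labels, and is then applied to synthetic images of unknown classes.

```python
import numpy as np

def fit_domain_transform(synthetic_known, real_known):
    """Toy stand-in for GAN training (assumption, NOT CycleGAN).

    Learns a per-channel mean/std mapping from the synthetic domain to the
    real domain using only known-class images; class labels are never used,
    so the mapping is domain-to-domain rather than class-specific.
    """
    src = np.concatenate([im.reshape(-1, 3) for im in synthetic_known]).astype(np.float64)
    tgt = np.concatenate([im.reshape(-1, 3) for im in real_known]).astype(np.float64)
    src_mean, src_std = src.mean(axis=0), src.std(axis=0) + 1e-8
    tgt_mean, tgt_std = tgt.mean(axis=0), tgt.std(axis=0) + 1e-8

    def transform(image):
        out = (image.astype(np.float64) - src_mean) / src_std * tgt_std + tgt_mean
        return np.clip(out, 0, 255).astype(np.uint8)

    return transform

def build_training_set(real_known, real_known_labels,
                       synthetic_unknown, synthetic_unknown_labels, transform):
    """Real images of known classes plus transformed synthetic images of new classes."""
    images = list(real_known) + [transform(im) for im in synthetic_unknown]
    labels = list(real_known_labels) + list(synthetic_unknown_labels)
    return images, labels

# Hypothetical data: random crops standing in for traffic sign images.
rng = np.random.default_rng(0)
real_known = [rng.integers(0, 200, size=(16, 16, 3), dtype=np.uint8) for _ in range(4)]
synthetic_known = [rng.integers(50, 256, size=(16, 16, 3), dtype=np.uint8) for _ in range(4)]
synthetic_unknown = [rng.integers(50, 256, size=(16, 16, 3), dtype=np.uint8) for _ in range(2)]

to_real = fit_domain_transform(synthetic_known, real_known)
images, labels = build_training_set(real_known, [0, 1, 0, 1],
                                    synthetic_unknown, [2, 2], to_real)
print(len(images), sorted(set(labels)))
```

In the actual method, the transform would be the learned synthetic-to-real generator, and the resulting image/label lists would feed the classifier training.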

Data Generation
We start by generating synthetic images with automatic GT for each unknown class. We require a real-world example showing the appearance of an instance of each unknown class (i.e., the example already used to decide that the class must be considered in future versions of the classifier). Then, a designer can create a textured 3D model of it. This model can then be populated in a virtual environment that we have predefined. Next, we can capture as many images as we need containing instances of the new class along with automatic GT, which is done under predefined variations regarding the environmental and image capture conditions. For instance, for the traffic sign recognition study we address in this paper, we perform the following steps: (1) We create a traffic sign 3D model for a given unknown sign; (2) we use the SYNTHIA environment [44] to populate the 3D model in locations predefined for traffic signs; (3) we automatically aim the camera that captures images towards these locations, varying the capturing angle and distance between the camera and the traffic sign, as well as the scene illumination. This procedure ensures visual variability in the collected images due to the fact that environmental shadows influence the captures, as well as global illumination, resolution, etc. The same procedure can be used to capture synthetic images of known classes intended to be used in the training of the domain-adaptation GAN. In the case of the traffic signs, using the pixel-wise semantic segmentation GT provided by the virtual environment (SYNTHIA), we create corresponding 2D bounding boxes, which we crop to obtain the final synthetic image samples.
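The last step, turning pixel-wise semantic segmentation GT into cropped samples, can be illustrated with a short NumPy sketch. It assumes a single sign instance per captured view and a hypothetical integer label ID for the sign pixels; the function name and the `margin` parameter are illustrative and not part of the SYNTHIA tooling.

```python
import numpy as np

def crop_sign_from_mask(image, seg_mask, sign_label, margin=0):
    """Crop the tight 2D bounding box around pixels labeled `sign_label`.

    Assumes one sign instance per view; returns None if the label is absent.
    """
    rows, cols = np.nonzero(seg_mask == sign_label)
    if rows.size == 0:
        return None
    height, width = seg_mask.shape
    top = max(int(rows.min()) - margin, 0)
    bottom = min(int(rows.max()) + 1 + margin, height)
    left = max(int(cols.min()) - margin, 0)
    right = min(int(cols.max()) + 1 + margin, width)
    return image[top:bottom, left:right].copy()

# Hypothetical 8x10 view with a sign occupying rows 2..4 and columns 3..6.
image = np.zeros((8, 10, 3), dtype=np.uint8)
seg_mask = np.zeros((8, 10), dtype=np.int32)
seg_mask[2:5, 3:7] = 7          # 7 is a made-up label ID for the sign
image[2:5, 3:7] = 255

crop = crop_sign_from_mask(image, seg_mask, sign_label=7)
print(crop.shape)  # (3, 4, 3)
```

If several signs of the same class could appear in one view, an instance-level mask or connected-component labeling would be needed instead of a single class label.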
As we have mentioned before, synthetic images depicting instances of new classes must still be transformed by means of a GAN in order to alleviate domain shift effects. With this aim, we used the publicly available implementation of CycleGAN, as detailed in [97], which we train using images of known classes taken from the synthetic and real-world domains. The adversarial loss, which aims at bringing the appearance of synthetic and real-world images closer, is defined for a pair of different domains $A$ and $B$ (synthetic or real in our case), where $G_{A \to B}$ refers to the GAN generator from domain $A$ to domain $B$, $D_B$ refers to the GAN discriminator that distinguishes between images really coming from domain $B$ and those output by $G_{A \to B}$, and $I_A$ is a set of images from domain $A$. The GAN discriminator is trained according to its own corresponding loss. In addition, CycleGAN uses additional losses to force the image appearance to be transformed between domains without affecting the semantic content of the transformed images. In particular, a cyclical reconstruction loss is used, which is complemented (regularized) with an additional loss aiming at not only ensuring in-domain content reconstruction, but also across-domain content similarity. The total loss function to train the GAN generator that transforms images from a synthetic domain $S$ to a real domain $R$ combines these terms. At training time, we use the real-world set $I_K$ and its synthetic counterpart $\tilde{I}_K$ as the $I_R$ and $I_S$ image sets, respectively. Then, the learned generator $G_{S \to R}$ will be the CNN that we use to transform a set of synthetic images $I_U$ into $G_{I_U}$.
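For reference, the losses described above can be written as follows. This is a sketch in the least-squares form of the standard CycleGAN formulation of [97], using the symbols defined in the text; the exact form, the loss names, the weights $\lambda_{cyc}$ and $\lambda_{idt}$, and the reading of the regularizing term as an identity loss are assumptions here rather than details confirmed by the text.

```latex
\begin{align}
% Adversarial (generator) loss: G_{A \to B} tries to make D_B output 1 on its images
\mathcal{L}_{GAN}(G_{A \to B}, D_B, I_A) &=
  \mathbb{E}_{x \sim I_A}\!\left[\big(D_B(G_{A \to B}(x)) - 1\big)^2\right] \\
% Discriminator loss: real images of B -> 1, generated images -> 0
\mathcal{L}_{D}(D_B, G_{A \to B}, I_A, I_B) &=
  \mathbb{E}_{y \sim I_B}\!\left[\big(D_B(y) - 1\big)^2\right]
  + \mathbb{E}_{x \sim I_A}\!\left[D_B(G_{A \to B}(x))^2\right] \\
% Cyclical reconstruction loss: A -> B -> A should recover the input
\mathcal{L}_{cyc}(G_{A \to B}, G_{B \to A}, I_A) &=
  \mathbb{E}_{x \sim I_A}\!\left[\big\lVert G_{B \to A}(G_{A \to B}(x)) - x \big\rVert_1\right] \\
% Regularizing (identity) loss: images already in B should be left unchanged
\mathcal{L}_{idt}(G_{A \to B}, I_B) &=
  \mathbb{E}_{y \sim I_B}\!\left[\big\lVert G_{A \to B}(y) - y \big\rVert_1\right] \\
% Total loss for the synthetic-to-real generator
\mathcal{L}(G_{S \to R}) &=
  \mathcal{L}_{GAN}(G_{S \to R}, D_R, I_S)
  + \lambda_{cyc}\big[\mathcal{L}_{cyc}(G_{S \to R}, G_{R \to S}, I_S)
  + \mathcal{L}_{cyc}(G_{R \to S}, G_{S \to R}, I_R)\big]
  + \lambda_{idt}\,\mathcal{L}_{idt}(G_{S \to R}, I_R)
\end{align}
```

Under this reading, none of the terms uses class labels, which is consistent with applying $G_{S \to R}$ to synthetic images of classes that were not seen during GAN training.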

Experimental Results
The experiments were designed to address two questions. Since we use synthetically generated instances of unknown classes to retrain the current classifier, we will have a domain shift problem. (Q1) Can we reduce this domain shift by applying an image-to-image translation GAN to the samples of the unknown classes, provided that such a GAN was trained only with samples of the known classes? and (Q2) What are the overall classification results when training the classifier using the real-world data of the known classes with the data generated for the new classes following this GAN-based proposal?
Note that question Q1 focuses on classification results in terms of new classes in isolation, while Q2 addresses the ultimate question, since we combine real-world samples from known classes with generated samples from unknown classes for training of the all-classes classifier. In the following, Section 4.1 introduces the synthetic and real-world datasets used in our experiments, and Section 4.2 elaborates the designed experiments to answer these questions, along with the obtained results and corresponding discussion.

Datasets
In order to perform our experiments, we need a dataset based on real-world images of traffic signs as well as another based on synthetic images. We selected the widely used Tsinghua traffic sign dataset [51] and a synthetic analog that we created to perform the research in this paper, which we call SYNTHIA-TS. We briefly describe them in the following.
Tsinghua is a dataset composed of outdoor scenes captured in China while driving a car in urban scenarios. Following the approach proposed in [51], we cropped the traffic signs and removed all the classes with fewer than 100 samples. The resulting dataset is composed of 21,721 cropped images, representing 42 traffic sign classes. In terms of appearance, these classes can be hierarchically organized as shown in Figure 3, where the first criterion for splitting the dataset is the external shape of the traffic signs, and the second is the textual/graphical content of the signs. Both the shape and content define the semantics of each traffic sign, i.e., the class.
SYNTHIA-TS was created by mimicking the 42 classes considered from the Tsinghua dataset, using one textured 3D model per class. Then, following the protocol explained in Section 3.2, we acquired traffic sign images within the SYNTHIA environment. The generated data is balanced for all image acquisition conditions and classes. We generated 23,222 instances in total, covering the 42 classes. Since the SYNTHIA environment was previously created for multiple purposes, obtaining these instances from it took less than 2 h using a desktop PC based on an Intel Core i7 CPU and one NVIDIA GeForce GTX 1080 GPU.

Experiments: Design, Results, and Discussion
We have not only considered the Tsinghua and SYNTHIA-TS datasets as a whole, i.e., H0-0 in terms of the hierarchy shown in Figure 3; in order to perform a finer-grained analysis regarding questions Q1 and Q2, we also conducted experiments based on different nodes of this hierarchy, which we call splits. Accordingly, our setup assumes that we have an existing split $s_1$ of real-world annotated images for training, and that we also want to learn a new split $s_2$, for which we do not have access to a sufficient number of corresponding real-world images and, therefore, have to synthesize them. It is understood that $s_1$ and $s_2$ share no classes. On the other hand, for the purpose of performing comparative evaluations in our experimental setting, we do in fact have access to the real-world annotated images of split $s_2$.
Since we will be referring to splits coming from synthetic and real-world data, the former sometimes transformed by a GAN, and the latter sometimes used as training or testing data, we have defined the compact notation of Table 1, which will allow us to be precise and concise when describing the multiple experiments we report in this section. Using this notation and given two splits $s_1$ and $s_2$, an example of an experiment for Q2 would consist of using $T^{T}_{s_1}$ and $S^{T}_{s_2}$ to train a traffic sign classifier for the known classes in split $s_1$ together with the new classes in split $s_2$, which we would like to be accurate when testing in $T^{C}_{s_1 \cup s_2}$, i.e., accurate for all classes. Alternatively, if we use a GAN to transform the synthetic images, then the training of the classifier would be done with $T^{T}_{s_1}$ and $G^{s_2}_{s_1}$. In fact, we transformed all of the synthetic images at once, which took less than 1.5 h using a desktop PC based on an Intel Core i7 CPU and one NVIDIA GeForce TITAN X Pascal GPU.
As we can see in Figure 3, we have three hierarchical levels: (1) The whole data, (2) two splits based on external shape, and (3) given a shape, different data based on content. Each considered split is defined in Table 2, which specifies their features. We do not consider splits with only one class (i.e., H2-2, H2-4, H2-6, H2-7, and H2-8) since they would not allow the addressing of Q1 (for which at least two classes are needed). However, note that although these splits are not considered in isolation, their data is considered when working with a split corresponding to their parent nodes in the hierarchy.
Now, we start the experiments by establishing the upper and lower bounds of different traffic sign classifiers. In these experiments, we use the full Tsinghua and SYNTHIA-TS datasets. Therefore, in this case, we use the split H0-0 (Figure 3) for Tsinghua, i.e., we use $T^{T}_{\text{H0-0}}$ and $T^{C}_{\text{H0-0}}$ for training and testing, respectively.
Both sets have samples of all of the traffic signs that we consider. More specifically, for each class, 60% of the samples are used for training tasks (CycleGAN and traffic sign classifiers) and the remaining 40% for testing traffic sign classifiers. The per-class training/testing sampling is performed randomly and once. Training on $S^{T}_{\text{H0-0}}$ and testing on $T^{C}_{\text{H0-0}}$ acts as the lower bound, since we are using only synthetic images (as they come from the virtual environment); therefore, we must expect a domain shift. Training on $T^{T}_{\text{H0-0}}$ acts as the upper bound, since we are using real-world images from the same distribution (camera and world area) as in the testing set. Table 3 shows these upper and lower bound results for the different architectures that we have considered, namely VGG16 and ResNet101. Moreover, since there is a certain amount of randomness during the training of CNNs (e.g., when sampling mini-batches from the datasets), we repeat each training five times and report testing accuracy in terms of the mean and standard deviation of the F1 classification score, i.e., $F1 = 2TP / (2TP + FN + FP)$, where $TP$, $FP$, and $FN$ are the numbers of true positives, false positives, and false negatives, computed on the respective classification results. These results show that: (1) We can achieve a high classification accuracy with the appropriate real-world data; (2) using the synthetic data for training produces a reasonable accuracy (far from random), but there is a dramatic domain shift, with results dropping from 97.59% to 36.05% for VGG16, and from 98.76% to 58.74% for ResNet101.
Tables 4 and 5 report results to answer Q1. We consider paired splits: one is used as the set of known classes ($s_k$), and the other as the set of unknown classes ($s_u$). These splits do not intersect, but their union does not necessarily correspond to the full traffic sign hierarchy, because only splits from Table 2 are considered. The pairs were designed to force different global appearances between known and unknown classes.
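The reporting protocol based on the F1 score (mean and standard deviation over five trainings) can be sketched as below. How the per-class scores are aggregated in the multi-class setting is not stated in the text, so the macro average used here is an assumption, as are the function name, the toy labels, and the use of the population standard deviation.

```python
import numpy as np

def macro_f1(y_true, y_pred, num_classes):
    """Per-class F1 = 2TP / (2TP + FN + FP), averaged over classes (assumed macro average)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    scores = []
    for c in range(num_classes):
        tp = int(np.sum((y_pred == c) & (y_true == c)))
        fp = int(np.sum((y_pred == c) & (y_true != c)))
        fn = int(np.sum((y_pred != c) & (y_true == c)))
        denom = 2 * tp + fn + fp
        scores.append(2 * tp / denom if denom > 0 else 0.0)
    return float(np.mean(scores))

# Hypothetical test labels and the predictions of five independently trained models.
y_true = [0, 0, 1, 1, 2, 2]
predictions_per_run = [
    [0, 0, 1, 1, 2, 2],
    [0, 1, 1, 1, 2, 2],
    [0, 0, 1, 2, 2, 2],
    [0, 0, 1, 1, 2, 2],
    [0, 0, 0, 1, 2, 2],
]

f1_runs = [macro_f1(y_true, y_pred, num_classes=3) for y_pred in predictions_per_run]
print(f"F1 = {100 * np.mean(f1_runs):.2f} +/- {100 * np.std(f1_runs):.2f}")
```

Each entry of `predictions_per_run` stands for the test-set output of one of the five trainings, so the printed pair corresponds to one cell of a results table.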
The $S^{T}_{s_u}$ columns report the lower bound of classification accuracy for each experiment, i.e., training a classifier for classes in $s_u$ with samples in $S^{T}_{s_u}$ but testing on the real-world data $T^{C}_{s_u}$. Columns $T^{T}_{s_u}$ act as the upper bound, since training is done on real-world samples of $s_u$ as if they were actually known. Columns $G^{s_u}_{s_k}$ report the classification accuracies when training is done with the samples of $G^{s_u}_{s_k}$, i.e., the samples of $S^{T}_{s_u}$ transformed by a CycleGAN trained to perform image-to-image translation from $S^{T}_{s_k}$ to $T^{T}_{s_k}$. Therefore, the CycleGAN has not seen samples from classes in $s_u$ at training time. Finally, we also include the case $G^{s_u}_{s_u}$, where the CycleGAN has been trained using samples from the unknown set of classes. Obviously, this is not realistic in our application setting; however, it can be taken as an upper bound of the accuracy that can be achieved by using CycleGAN to transform the synthetic images. Figure 4 shows examples of the images involved in our experiments: Synthetic, real, and transformed by different CycleGANs.

These results based on splits confirm the observations made for H0-0 according to Table 3; i.e., training and testing (for the unknown classes) with real-world data shows high classification accuracies, while training with the pure synthetic data and testing in the real-world data shows a significant drop of accuracy. Again, ResNet101 is more robust to domain shifts than VGG16. We can see how the gap gets larger as the number of classes based on synthetic data (unknown ones) increases. For instance, the gap for H1-1 is larger than for H1-2, both for VGG16 and ResNet101. Note that H1-1 contains 35 classes and H1-2 only seven (see Table 2). If we analyze the splits of the next hierarchical level (H2-X), the same observations hold; note that H2-3 and H2-5 (10 and 16 classes, respectively) show a larger gap than H2-1 and H2-9 (six and five classes, respectively), both for VGG16 and ResNet101.
On the other hand, CycleGAN indeed helps to significantly reduce the domain shift. When using the H1-1 split as known classes to train the CycleGAN, and applying this GAN to the synthetic images of the unknown classes (i.e., those in the H1-2 split), we see ∼9 points of accuracy gain when testing on real-world images of the H1-2 split (9.21 for VGG16 and 9.60 for ResNet101). Changing the roles of these splits, the gain is 10.68 for VGG16 and 4.56 for ResNet101. However, in the two situations (H1-1/H1-2 as known/unknown and vice versa), ResNet101 reports significantly higher accuracies (more than 10 points) after the GAN-based domain adaptation of the synthetic images. In addition, for VGG16 and ResNet101, H1-2 as the split of unknown classes shows significantly higher accuracies (more than 20 points) than when it is H1-1, which is just a consequence of starting with similar accuracy differences before domain adaptation. Looking at the H2-X splits, we can see that the GAN-based domain adaptation reports significantly higher accuracy in most of the experiments. In fact, it is more interesting to analyze the cases where it does not. For instance, when split H2-1 is used to train the CycleGAN, we obtain either very low accuracy gains (e.g., for VGG16, 1.43 when the unknown classes are in H2-5 and 2.19 for H2-9) or even negative adaptation (e.g., -2.93 for H2-3 with VGG16, and for H2-3/5/9 with ResNet101). We think that, when using H2-1 to train the CycleGAN, the learned image-to-image transform is too biased towards a blue background, which is a color not present in the rest of the considered H2-X splits (in the role of unknown classes). When exchanging the roles between H2-1 and the rest of the considered H2-X splits, the conclusion is the same for VGG16. However, ResNet101 is still able to extract the most from the domain-adapted images, showing significant accuracy gains with respect to using the synthetic images as they come directly from the virtual environment.
Figure 5 presents some visual hints. For instance, when split H2-1 is used to train the CycleGAN, this adds a bluish color to the transformed images; when the CycleGAN is trained with the H2-9 split, the added color is yellowish. The former is more marked than the latter, which may be the reason behind some of the previously mentioned cases of poor domain adaptation. We can see other effects, like blue background images going to black backgrounds. According to the reviewed results, ResNet101 seems more robust to this effect than VGG16 (see the case of s u = H2-1 in Tables 4 and 5).
Tables 4 and 5 help to analyze results in scenarios where there are significant visual differences among the known/unknown classes. We are also interested in analyzing different balances between known and unknown classes. Analogously to Table 1, we will define splits denoted by a percentage of classes; e.g., 100% would be H0-0. Each of these splits also has a complementary one with the remaining classes. When forming the new splits, to make sure that we do not fall back into the previous hierarchy-based experiments, the classes are not sampled from H0-0 as a whole, but are proportionally and randomly sampled from all of the H2-X splits. For example, 50% would consider half of the H2-1, H2-3, H2-5, and H2-9 classes, added to H2-2, H2-6, and H2-8, which only have one class. Tables 6 and 7 present the corresponding results. We can see how previous observations are confirmed, namely: (1) Domain gap increases with the number of synthetic classes (the unknown ones) to be covered by the traffic sign classifier, but still, the obtained accuracies are reasonable; (2) CycleGAN is able to dramatically reduce the domain shift for the unknown classes, recovering from ∼10 to even ∼30 points of accuracy; (3) ResNet101 is able to produce the best results before and after domain adaptation.
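The proportional sampling of classes from the H2-X splits can be sketched as follows. The per-split class counts come from the text (6, 10, 16, and 5 classes for H2-1, H2-3, H2-5, and H2-9, plus five single-class splits), whereas the class identifiers, the function name, the seed handling, and the round-half-up rule are assumptions made for the illustration.

```python
import math
import random

def sample_class_split(hierarchy, fraction, seed=0):
    """Pick about `fraction` of the classes of every multi-class H2-X split,
    and the same fraction of the single-class splits treated as one pool.
    Returns the selected classes and the complementary split."""
    rng = random.Random(seed)

    def how_many(n):
        return math.floor(fraction * n + 0.5)  # round half up (assumed rule)

    selected = []
    single_class_pool = []
    for name in sorted(hierarchy):
        classes = hierarchy[name]
        if len(classes) == 1:
            single_class_pool.extend(classes)
        else:
            selected.extend(rng.sample(classes, how_many(len(classes))))
    selected.extend(rng.sample(single_class_pool, how_many(len(single_class_pool))))

    chosen = set(selected)
    complement = [c for name in sorted(hierarchy) for c in hierarchy[name] if c not in chosen]
    return sorted(selected), complement

# Class counts per split follow the text; the class IDs themselves are made up.
sizes = {"H2-1": 6, "H2-2": 1, "H2-3": 10, "H2-4": 1, "H2-5": 16,
         "H2-6": 1, "H2-7": 1, "H2-8": 1, "H2-9": 5}
hierarchy = {name: [f"{name}/class{i}" for i in range(n)] for name, n in sizes.items()}

selected, complement = sample_class_split(hierarchy, fraction=0.5)
print(len(selected), len(complement))
```

Because every H2-X split contributes classes to both sides, the resulting known/unknown pair mixes shapes and contents instead of reproducing one of the hierarchy-based pairs.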
Overall, to answer Q1, we see that using known classes to train a GAN-based transformation from synthetic to real-world domains indeed helps to dramatically reduce the classification accuracy gap due to the domain shift for synthetically generated new classes. However, there are scenarios more favorable than others, and there is still room for improvement. First, the CNN used matters. Here, ResNet101 shows significantly better classification accuracies than those of VGG16; i.e., ResNet101 is more robust to this kind of known/unknown class setting. We can see this by looking at Tables 4 and 5. Note how, for H2-X splits, when the split of classes used to train CycleGAN is the same split as that used to train the traffic sign classifier ($G^{s_u}_{s_u}$ columns), the classification accuracies of VGG16 and ResNet101 are similar, and VGG16 even outperforms ResNet101 several times. A similar effect can be appreciated in Tables 6 and 7. In this favorable but unrealistic setting, the domain-adapted images show fewer artifacts (see Figures 4 and 5). Hence, ResNet101 seems to be more robust to image imperfections introduced by CycleGAN.
Table 6. Experiments to support Q1 (see main text). All tests are done in $T^{C}_{s_u}$. Average and standard deviations of F1 scores are reported, since each experiment is performed five times. The column $G^{s_u}_{s_k} - S^{T}_{s_u}$ just stands for the subtraction of the means of the respective columns.
Second, the most adverse scenario is indeed when known and unknown classes show very different appearances combined with a low known/unknown class ratio. In a reasonable case, as for ∼(88%/12%) splits of randomly selected known/unknown classes, using ResNet101, we can see how we obtain a classification accuracy of 97.24 on average (Table 7) where training with real-world data reaches 100.00. In the most imbalanced case, ∼(12%/88%), ResNet101 reaches a classification accuracy of 77.85, still far from the 98.82 when training with real-world images.
Note that in this scenario, there is room to improve the GAN-based image-to-image translation, since even when the CycleGAN is trained on the classes being classified, the obtained accuracy is 84.93, still far from 98.82. Although in this paper these are just intermediate results on our way to address Q2, this analysis is already useful if our goal is to perform transfer learning for the traffic sign classifier; i.e., if we want to train a classifier that only needs to operate on a new set of traffic signs for which we do not have enough samples, and we want to leverage knowledge from the known classes even if these are not going to be used for classification anymore. This may happen, for instance, in particular environments with specialized traffic signs, such as some closed infrastructures or industrial facilities. In fact, the vision system does not need to be onboard a vehicle; it can even be on a humanoid or any other robotic platform. However, a probable requirement in this case would be to use the same camera sensor model to classify the new classes as that used for collecting the real-world images involved in the training of the GAN-based domain adapter.
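
The remaining gaps with respect to real-world training follow directly from the ResNet101 scores quoted above. The short sketch below only redoes this arithmetic; the dictionary keys and the helper `remaining_gap` are hypothetical names introduced for illustration, and no values other than those stated in the text are used.

```python
# ResNet101 scores quoted in the text (Table 7): score obtained with
# CycleGAN-adapted synthetic data vs. score obtained with real-world data.
reported = {
    "88%/12% known/unknown": {"adapted": 97.24, "real_world": 100.00},
    "12%/88% known/unknown": {"adapted": 77.85, "real_world": 98.82},
    "12%/88%, CycleGAN trained on the classified classes": {
        "adapted": 84.93,
        "real_world": 98.82,
    },
}


def remaining_gap(scores):
    """Points still missing to match training with real-world images."""
    return round(scores["real_world"] - scores["adapted"], 2)


gaps = {name: remaining_gap(scores) for name, scores in reported.items()}
for name, gap in gaps.items():
    print(f"{name}: {gap:.2f} points below real-world training")
```

That is, the gap is 2.76 points in the favorable case, 20.97 points in the most imbalanced one, and still 13.89 points when the CycleGAN is trained on the very classes being classified.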
Finally, in order to address Q2, we performed the experiments shown in Tables 8 and 9. In these tables, columns $S^{T}_{\text{H0-0}} + T^{T}_{s_k}$ refer to training jointly with synthetic and real-world data. $S^{T}_{\text{H0-0}}$ is the full SYNTHIA-TS set, i.e., covering known ($k$) and unknown ($u$) classes. Therefore, we use all of the synthetic data available for the classes we want to classify. On the other hand, $T^{T}_{s_k}$ refers to the training set of all known real-world classes; in these experiments, $k \cup u$ are the 42 considered classes of split H0-0. Therefore, $T^{T}_{s_k} + T^{T}_{s_u}$ equals the full Tsinghua training set ($T^{T}_{\text{H0-0}}$). Columns $G^{\text{H0-0}}_{s_k} + T^{T}_{s_k}$ are analogous to the previous ones, but in this case, rather than using the synthetic data as gathered from the virtual environment, we use it transformed by a CycleGAN. These GANs are trained using only the known classes in each experiment, i.e., in this case, $T^{T}_{s_k}$ and $S^{T}_{s_k}$. Accordingly, $G^{\text{H0-0}}_{s_k}$ is composed of $G^{s_k}_{s_k}$, used as pure data augmentation, as well as $G^{s_u}_{s_k}$, which is really needed, since we assume that we do not have real-world training samples for the split $s_u$.
Table 8. Experiments to support Q2 (see main text), all done in $T^{C}_{\text{H0-0}}$. Averages and standard deviations of F1 scores are reported, since each experiment is performed five times. This is done for the all-classes classification problem, but we also show detailed results for known and unknown classes. Table 3 shows the lower and upper bounds for these experiments, i.e., training only on either SYNTHIA-TS or Tsinghua data. In terms of average F1, these bounds are 36.05 and 97.59, respectively.
Table 9. Experiments to support Q2.
We see that the observations made in the context of Q1 apply here too: (1) the domain gap increases with the number of synthetic classes (the unknown ones); (2) CycleGAN is able to significantly reduce the domain shift; (3) ResNet101 performs better than VGG16 before and after domain adaptation, even though, when using the full Tsinghua training set, they report similar upper bounds (97.59 for VGG16, 98.76 for ResNet101). In the case of ResNet101, the best cases almost reach the upper bound: (1) when the known classes are those in split H1-1 (∼83% of classes) and the unknown ones those in split H1-2 (∼17% of classes), we obtain 97.57, which is very close to 98.76; (2) when classes are selected randomly at the H2-X hierarchy level, the case of 12% of unknown classes reaches 95.54, which is also a very high accuracy; even the 24% case still reports 92.57. Moreover, in all of these cases, the accuracy stays above 95 for the known classes and above 84 for the unknown ones (91.55 for H1-2, 88.95 for the 12% case, 84.78 for the 24% case). Overall, we can conclude that with ResNet101, the proposed method works well when the ratio of unknown/known classes is ∼1/4. In order to reach the upper bound, we can investigate whether CycleGAN can still be improved, but in this ∼1/4 regime, the last mile can probably be covered by adding a small number of real-world collected and annotated samples from the unknown classes. As the vision system keeps operating in the real world, the samples falling in the new classes can be kept; then, we can replace the synthesized and transformed samples with these self-annotated ones in a future retraining of the classifier. In fact, these self-annotation cycles can also be a good approach for more challenging ratios of unknown/known classes; note that for 76% of unknown classes, the results are over 70, and for 88% they are over 50.
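
The two training-set compositions compared for Q2 (raw synthetic data for all classes plus real-world data for the known classes, versus CycleGAN-transformed synthetic data plus the same real-world data) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' pipeline: `Sample`, `build_training_set`, and `fake_generator` are hypothetical names, and `fake_generator` is only a stand-in for a CycleGAN generator trained on the known classes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    class_id: str
    source: str  # "real", "synthetic", or "adapted"


def build_training_set(real, synthetic, known, adapt=None):
    """Join synthetic data for all classes with real data for known classes.

    real:      real-world samples; only those of known classes are used.
    synthetic: synthetic samples covering known and unknown classes.
    adapt:     optional callable that maps a synthetic sample to its
               domain-adapted version; if None, raw synthetic data is used.
    """
    real_known = [s for s in real if s.class_id in known]
    if adapt is None:
        generated = list(synthetic)
    else:
        generated = [adapt(s) for s in synthetic]
    return generated + real_known


def fake_generator(sample):
    # Placeholder for a CycleGAN generator trained only on known classes.
    return Sample(sample.class_id, "adapted")


# Toy data: two known classes and one unknown class (hypothetical ids).
known = {"k1", "k2"}
real = [Sample("k1", "real"), Sample("k2", "real")]
synthetic = [Sample(c, "synthetic") for c in ("k1", "k2", "u1")]

train_raw = build_training_set(real, synthetic, known)
train_adapted = build_training_set(real, synthetic, known, adapt=fake_generator)
print([s.source for s in train_adapted])
```

In the adapted set, the transformed samples of the known classes act as pure data augmentation, whereas those of the unknown class are the only training data available for it.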

Conclusions
There are situations where a computer vision system may need to recognize new (previously unknown) classes, but the lack of samples from such classes (i.e., raw images with annotations) may seriously delay this possibility. In this paper, we have explored how to address this situation by using synthetic data and leveraging samples from the classes already known to the system. Since there is a domain shift between the synthetic and real worlds, addressing the problem involves incorporating some kind of domain adaptation. To solve the problem of the lack of data, we have proposed to train a GAN using (the already available) samples from the known classes, and to apply it to adapt synthetic samples from the new classes. As a proof of concept, we have focused on traffic sign recognition. We have used the publicly available Tsinghua dataset, and we have created a synthetic dataset (SYNTHIA-TS) for designing the experiments presented in this paper. In particular, the experiments have been designed to address two questions. First, we addressed the intermediate question: can we reduce the synthetic-to-real domain shift by applying an image-to-image translation GAN to the unknown classes, provided that such a GAN was trained only with the known classes? After an extensive set of experiments, the answer was positive, which led us to the main question: what are the overall classification results when training the classifier using the real-world data of the known classes together with the data generated for the new classes following our proposal? Again, the obtained results allow us to conclude that the proposed method is indeed effective, provided that we use a proper CNN to perform the classification task, as well as a proper GAN to transform the synthetic images. Here, a ResNet101-based classifier and domain adaptation based on CycleGAN performed extremely well even for ratios of unknown/known classes of ∼1/4.
For more challenging ratios such as ∼4/1, the results are also very positive. As a matter of fact, rather than focusing on improving the components of the presented method, as future work we plan to augment it with complementary techniques such as self-annotation, i.e., using the classifier generated with our current method to self-annotate real-world samples of the new classes for a subsequent retraining/fine-tuning of the classifier. The datasets of synthetic traffic signs used in this paper are publicly available at www.synthia-dataset.net.
Funding: Antonio and Gabriel acknowledge the financial support by the Spanish project TIN2017-88709-R (MINECO/AEI/FEDER, UE). Joost acknowledges the financial support by the Spanish project TIN2016-79717-R. Antonio thanks ICREA under the ICREA Academia Program for the financial support. Finally, as CVC members, the authors thank the Generalitat de Catalunya CERCA Program and its ACCIO agency.