Three approaches to facilitate invariant neurons and generalization to out-of-distribution orientations and illuminations

The training data distribution is often biased towards objects in certain orientations and illumination conditions. While humans have a remarkable capability of recognizing objects in out-of-distribution (OoD) orientations and illuminations, Deep Neural Networks (DNNs) severely suffer in this case, even when large amounts of training examples are available. Neurons that are invariant to orientations and illuminations have been proposed as a neural mechanism that could facilitate OoD generalization.


Introduction
The object recognition performance of Deep Neural Networks (DNNs) dramatically degrades when the train and test distributions are not identical due to dataset bias (Torralba & Efros, 2011), i.e., when tested in out-of-distribution (OoD) conditions. There is a big gap between DNNs and humans when evaluated in OoD conditions. This issue has received much interest in recent years (Beery, Horn, & Perona, 2018; Geirhos et al., 2019; Hendrycks et al., 2021; Recht, Roelofs, Schmidt, & Shankar, 2018), as it severely compromises the safety and fairness of AI applications. One of the most prominent factors of dataset bias is that objects may appear in a constrained range of orientation and illumination conditions (Alcorn et al., 2019; Barbu et al., 2019). While generalization to OoD orientations and illumination conditions has long been studied in both biological and artificial neural networks, e.g., Anselmi, Rosasco, and Poggio (2016), Sinha and Poggio (1996) and Ullman (1996), the computational mechanisms that facilitate such generalization remain a key outstanding question. Recently, Madan et al. (2022) and Zaidi et al. (2020) have shown that DNNs are capable of overcoming bias by transferring the generalization ability obtained from objects seen in a richer set of conditions to the objects seen in biased conditions. Also, the emergence of representations at the individual neuron level in the intermediate layers of the DNN that are selective to categories and invariant to the OoD conditions has been identified as a mechanism that may facilitate such OoD generalization. Invariant neural representations have been studied for decades, e.g., Anselmi et al. (2016), and here they appear as the mechanism that allows OoD generalization. This raises the question of whether we can further encourage the emergence of invariant neural representations in DNNs in order to further improve OoD generalization.

Fig. 1. (a) InD test accuracy and OoD accuracy for late-stopping applied to the MiscGoods-Illuminations dataset (medium InD data diversity). OoD accuracy converges much later than InD accuracy. (b) Learning curves of the OoD accuracy with and without tuning the batch normalization momentum (tuned BN) on the CarsCG-Orientations dataset (medium InD data diversity). Tuning the momentum reduces the oscillation of the OoD accuracy and improves the performance. (c) Left: Conceptual diagram of the invariance loss. Pairs of images that belong to the same category are fed into the DNN. The invariance loss is based on the Euclidean distance between the last-ReLU activities of each pair. The classification loss is calculated with the network output as usual. The total loss is the weighted sum of the invariance and classification losses. Right: Learning curve of the OoD accuracy on the MiscGoods-Illuminations dataset (medium InD data diversity) when the invariance loss is applied. The OoD accuracy increases by about 20% compared to the baseline. The solid lines in the plots are mean values; the lighter semitransparent bands surrounding them indicate 95% confidence intervals.
In this paper, we investigate factors that drive the emergence of invariant neurons and, as a result, substantially boost the ability of DNNs to recognize objects in OoD orientations and illuminations. In particular, we discover that the following factors, summarized in Fig. 1, have a remarkable impact:

1. Late-stopping: DNNs are usually trained until the validation recognition accuracy (which is in-distribution) converges. We found that in many cases the OoD recognition accuracy improves slowly, yet consistently, after the validation (in-distribution) accuracy has converged. This finding is surprising as classic machine learning theory suggests early-stopping as a regularization mechanism (Yao, Rosasco, & Caponnetto, 2007b), and we found that the opposite is beneficial to improve OoD generalization in DNNs. We call this approach ''late-stopping''.
2. Tuning the batch normalization parameter: Batch normalization (BN) is known to have an impact on OoD recognition accuracy (Schneider et al., 2020). We found that tuning the only hyperparameter of BN, i.e., the momentum, yields substantial gains in OoD recognition accuracy. This approach is denoted as ''tuned BN''.

3. Neural activity invariance loss: Motivated by the aforementioned finding in previous works that invariant neural representations lead to improvements of the OoD recognition accuracy, we include an additional term in the loss function to encourage this phenomenon. This loss term takes the Euclidean distance between the neural activities of an intermediate layer corresponding to pairs of images from the same category. By minimizing this loss term, the neural activity tends to become invariant for objects of the same category, even in different viewing conditions. We do not enforce that pairs of images from different categories have distinguishable neural activity, since the classification loss term already encourages this. We refer to this approach as the ''invariance loss''.
Our results demonstrate that each of these three approaches alone leads to substantial improvements of object recognition in OoD orientations and illumination conditions. Results also corroborate that when any of the three approaches leads to an increase of invariance at the individual neuron level, OoD recognition accuracy improves in the majority of trials. Experiments are performed on four challenging benchmarks, namely modifications of the MNIST dataset (LeCun, Bottou, Bengio, & Haffner, 1998) and the iLab dataset (Borji, Izadi, & Itti, 2016), and two novel datasets we introduce, the CarsCG and MiscGoods datasets. CarsCG contains 3D-rendered cars from different orientations, and the MiscGoods dataset consists of images of objects taken with a robotic arm from different viewpoints and under controlled illumination conditions. These datasets allow evaluating the DNN's ability to recognize objects in OoD orientations and illumination conditions. They also allow analyzing the effects of different amounts of bias, and are challenging, as DNNs perform poorly on them in OoD conditions.

Summary of contributions. In the following, we summarize the three main contributions of this paper: 1. Three different approaches that substantially improve generalization to OoD viewpoints and illuminations, summarized in Fig. 1. 2. Two novel datasets, CarsCG-Orientations and MiscGoods-Illuminations, that enable bias-controlled evaluation of generalization to OoD orientations and illuminations. 3. Evidence that improvements of OoD accuracy are driven by increases of selectivity and invariance at the individual neuron level (Fig. 6 and Table 2).

Previous works
We now review related work on overcoming dataset bias in terms of orientations and illuminations. First, we review works that share the goal of our work, i.e., overcoming biased orientations and illuminations, and then we review works in other areas of OoD generalization.

OoD orientations and illuminations
Our results add to the growing body of literature on improving the generalization ability of DNNs to OoD orientations and illumination conditions. These are fundamental aspects at the core of object recognition, present in all object recognition tasks. Prior efforts leverage synthesized sources of training data (Cubuk, Zoph, Shlens, & Le, 2019; Halder, Lalonde, & Charette, 2019; Kim, Uddin, & Bae, 2021; Qiao, Zhao, & Peng, 2020), 3D models of objects (Angtian, Kortylewski, & Yuille, 2021), specific characteristics of the target domain (Chidester, Zhou, Do, & Ma, 2019; Qi et al., 2018; Sabour, Frosst, & Hinton, 2017), or sensing approaches such as omnidirectional imaging (Cohen, Geiger, Köhler, & Welling, 2018). These approaches add preconceived components to the DNN that need to be adjusted by hand for new objects and conditions. Here, we focus for the first time on a purely learning-based strategy for OoD orientations and illuminations, which is not constrained to specific objects and conditions and can be automatically adjusted to new datasets.
Our investigation builds upon theories of biological neural mechanisms for OoD generalization, namely, neural invariance (Anselmi et al., 2016; Quiroga, Reddy, Kreiman, Koch, & Fried, 2005; Riesenhuber & Poggio, 1997; Rust & DiCarlo, 2010). Recent works have shown that those mechanisms also emerge in artificial neural networks and facilitate OoD generalization (Madan et al., 2022; Zaidi et al., 2020). In this paper, we show that the emergence of invariant neurons can be encouraged during training, and that this leads to substantial improvements of OoD generalization.

Fig. 4. OoD accuracy for (a) low, (b) medium, and (c) high InD data diversity on the four datasets. Each experiment is conducted five times, and the mean and 95% confidence interval are reported. Sharp degradation of OoD accuracy is observed (e.g., between 40% and almost 80% when the InD data diversity is low). These results show the impact of a distribution shift from InD to OoD on the performance of a DNN.

Other aspects of OoD generalization
To the best of our knowledge, this paper is the first to investigate learning-based approaches to overcome bias in object orientations and illumination conditions. Yet, there are other strands of research in neighboring areas, which investigate generalization to new domains as well as overcoming spurious correlations between image features and categories. These research areas use techniques and concepts related to our work, and we review them in the following.
There is a plethora of works on learning representations from several domains that can be easily transferred to new domains, e.g., Carlucci, D'Innocente, Bucci, Caputo, and Tommasi (2019), Dou, Castro, Kamnitsas, and Glocker (2019), Ghifary, Bastiaan Kleijn, Zhang, and Balduzzi (2015), Guo et al. (2020), Jia, Zhang, Shan, and Chen (2020), Li, Pan, Wang, and Kot (2018), Li et al. (2018), Li, Yang, Song, and Hospedales (2018) and Volpi et al. (2018). The problem of domain generalization is similar to the problem of overcoming dataset bias in our study in the sense that representations that facilitate generalization to novel conditions should be learned. However, in domain generalization the learner has access to multiple domains during training that can be leveraged for generalization, while in the problem of overcoming dataset bias only one training set is available. Recently, several works in domain generalization (Chattopadhyay, Balaji, & Hoffman, 2020; Ilse, Tomczak, Louizos, & Welling, 2020; Rame, Dancette, & Cord, 2021; Xiao, Shen, Zhen, Shao, & Snoek, 2021) highlighted the need for invariant representations to obtain further improvements in generalization, which further motivates investigating invariance for dataset bias.
Many datasets are biased in a way that a specific image feature consistently appears in images of the same category. DNNs tend to learn that those features are informative of the category (Geirhos et al., 2020). This form of dataset bias is different from bias in object orientation and illumination conditions, which does not necessarily lead to spurious correlations. Recently, several works have addressed spurious correlations. These are based on automatically detecting the features that spuriously correlate with the category, and encouraging the DNN not to rely on those features (Arjovsky, Bottou, Gulrajani, & Lopez-Paz, 2019; Sagawa, Koh, Hashimoto, & Liang, 2020). Ahmed, Bengio, van Seijen, and Courville (2021) introduced PGI, a method that effectively alleviates the effect of spurious correlations caused by biased object backgrounds. This work exploits the assumption that the training distribution also contains examples without spurious correlations. CMMD is another method, which uses the idea of maximum mean discrepancy (MMD). CMMD and PGI employ EIIL (Creager, Jacobsen, & Zemel, 2021) to split the images of a category into those with and without the features that spuriously correlate with the category. Then, invariance is encouraged across these two groups of images. Thus, invariance appears once more as a facilitator of generalization.
Recently, several researchers have pointed out that no single OoD algorithm can achieve high performance for all problem domains (Gulrajani & Lopez-Paz, 2021; Hendrycks et al., 2021; Wiles et al., 2022). It has also been reported that even very simple methods can achieve performance beyond the state of the art (Djolonga et al., 2021; Wiles et al., 2022). We show that for orientations and illuminations, which are fundamental aspects of object recognition, the approaches we introduce in this paper are superior to the most successful aforementioned generic methods for OoD generalization. Also, we provide insights about the neural mechanisms that facilitate such improvements.

Performance degradation on OoD conditions
In this section, we introduce the methodology to evaluate the accuracy of the DNN in OoD conditions. First, we describe the procedure of the bias-controlled experiment. Next, we introduce the four datasets used in this study and finally, we evaluate the performance degradation that occurs in OoD conditions in these four datasets.

Bias-controlled experiments
In a dataset, multiple biasing factors can be present at the same time, each of which can cause performance degradation. In the datasets in this study, we analyze either the orientation or the illumination condition, as this allows us to more clearly understand the effect of each individual factor. Thus, the datasets that we use contain several combinations of categories and conditions. We use C to denote the set of all categories and N the set of all orientation or illumination conditions. Let $x^{(k)}$ be an image of the dataset and let $y^{(k)} := (c^{(k)}, n^{(k)})$ be a tuple representing the ground-truth category (i.e., $c^{(k)} \in C$) and the orientation or illumination condition (i.e., $n^{(k)} \in N$).
In order to evaluate the DNN's OoD generalization capabilities, we train it on a dataset that follows a distribution containing only a subset of all possible combinations, i.e., a subset of C × N. Then, the DNN is evaluated with images from combinations that were not included in the training distribution. Let I ⊂ C × N be the set of combinations used to generate the InD data. We ensure that I contains all categories and all conditions at least once (but not all combinations), such that we have images from all categories and conditions in a balanced manner.
We use $D^{(\mathrm{InD})}$ to denote the set of images that are InD, i.e., images whose label is in I, $y^{(k)} \in I$. Namely, the InD image dataset, $D^{(\mathrm{InD})}$, is defined as follows:

$D^{(\mathrm{InD})} := \{ (x^{(k)}, y^{(k)}) \mid y^{(k)} \in I \}$.    (1)

$D^{(\mathrm{InD})}$ is further divided into a training dataset and a validation dataset, which we denote as

$D^{(\mathrm{InD})} = D^{(\mathrm{InD})}_{\mathrm{train}} \cup D^{(\mathrm{InD})}_{\mathrm{val}}$.    (2)

The term OoD accuracy refers to the accuracy on the OoD dataset $D^{(\mathrm{OoD})}$, which contains the images whose combination of category and condition is not in I. Fig. 3 illustrates how the full dataset is split into the InD and OoD datasets. Appendix A elaborates on these bias-controlled experiments.
We also define the InD data diversity of a dataset as #(I)/#(C × N), where #(·) denotes the number of elements of a set. Thus, the data diversity measures the portion of combinations included in the training distribution. To directly compare the effect of the InD data diversity on the OoD accuracy, we vary the InD data diversity such that the combinations in the distributions of lower InD data diversity are included in the combinations of higher InD data diversity, while keeping the training set size constant, i.e., $\#(D^{(\mathrm{InD})}_{\mathrm{train}}(I))$ is constant across InD data diversities. These restrictions allow us to attribute differences in the performance of the DNN solely to the difference in InD data diversity, not to differences in the number of combinations or training examples.
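To make the split concrete, the following is a minimal sketch of a bias-controlled split in Python. The helper names are illustrative and not from our released code, and the nesting of lower-diversity combination sets inside higher-diversity ones is omitted for brevity.

```python
# Minimal sketch of the bias-controlled split described above.
import random
from itertools import product

def make_ind_combinations(categories, conditions, diversity, seed=0):
    """Pick a subset I of C x N that covers every category and every
    condition at least once and matches the requested InD data diversity."""
    rng = random.Random(seed)
    all_combos = list(product(categories, conditions))
    n_target = round(diversity * len(all_combos))
    while True:  # rejection sampling until the coverage conditions hold
        ind = set(rng.sample(all_combos, n_target))
        if ({c for c, _ in ind} == set(categories)
                and {n for _, n in ind} == set(conditions)):
            return ind

def split_dataset(samples, ind_combos):
    """Split (image, category, condition) samples into InD and OoD sets."""
    ind = [s for s in samples if (s[1], s[2]) in ind_combos]
    ood = [s for s in samples if (s[1], s[2]) not in ind_combos]
    return ind, ood

# Example: 5 categories x 5 conditions at medium diversity 3/5.
I = make_ind_combinations(range(5), range(5), diversity=3 / 5)
print(len(I) / 25)  # InD data diversity #(I)/#(C x N) = 0.6
```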

Datasets
There is a plethora of benchmarks for OoD generalization by now, e.g., Gulrajani and Lopez-Paz (2021), Hendrycks et al. (2021) and Koh et al. (2021). However, only a few of these datasets are useful for investigating generalization to novel orientations and illuminations, as few provide labels for category, orientation and illumination while covering a wide range of conditions. We use the following four datasets, two of which are introduced in this paper. See Appendix B for further details beyond those provided in the following.

MNIST-Positions.
It is based on the MNIST dataset (LeCun et al., 1998). We created a dataset of 42 × 42 pixel images of nine digits by resizing the original images to 14 × 14 pixels and placing them in one of nine possible positions in an empty 3 × 3 grid. We call this dataset the MNIST-Positions dataset. In our experiments, the digits are considered the category set, and the positions where the digits are placed are considered the orientation conditions. We use nine digits and nine positions. Samples are shown in Fig. 2(a). We used 54K images for $D^{(\mathrm{InD})}_{\mathrm{train}}$, 8K images for $D^{(\mathrm{InD})}_{\mathrm{val}}$ and 8K images for $D^{(\mathrm{OoD})}$. Low, medium, and high InD data diversity are set to 2/9, 4/9, and 8/9, respectively.
iLab-Orientations. iLab-2M is a dataset created from the iLab-20M dataset (Borji et al., 2016). The dataset consists of images of 15 categories of physical toy vehicles photographed in various orientations, elevations, lighting conditions, camera focus settings and backgrounds. The image size is 256 × 256 pixels. From the original iLab-2M dataset, we chose six categories (bus, car, helicopter, monster truck, plane, and tank) and six orientations. We call the resulting dataset iLab-Orientations. Samples are shown in Fig. 2(b). We resized each image to 64 × 64 pixels. We used 18K images for $D^{(\mathrm{InD})}_{\mathrm{train}}$, 8K images for $D^{(\mathrm{InD})}_{\mathrm{val}}$ and 8K images for $D^{(\mathrm{OoD})}$. Low, medium, and high InD data diversity are set to 2/6, 3/6, and 5/6, respectively.

CarsCG-Orientations.
CarsCG-Orientations is a new dataset that consists of images of ten types of cars in various conditions rendered with Unreal Engine. It includes ten orientations, three elevations, ten body colors, five locations and three time slots (daytime, twilight, night). The images are provided at 1920 × 1080 pixels and resized to 224 × 224 pixels for our experiments.
We chose the ten types of cars as categories and ten orientations for each of them. Samples are shown in Fig. 2(c); more samples are provided in Appendix B. In the experiments, we used 3400 images for $D^{(\mathrm{InD})}_{\mathrm{train}}$, 450 images for $D^{(\mathrm{InD})}_{\mathrm{val}}$ and 800 images for $D^{(\mathrm{OoD})}$. Low, medium, and high InD data diversity are set to 2/10, 5/10, and 9/10, respectively.

MiscGoods-Illuminations.
MiscGoods-Illuminations is a subset of DAISO-10, a novel dataset collected for this study. The dataset consists of ten physical miscellaneous goods photographed using a robotic arm with five controlled illumination conditions, two ways of object placement, twenty object orientations, and five camera angles. Each image is 640 × 480 pixels in size. We chose five categories (stuffed dolphin, stuffed whale, metal basket, imitation plant and cup) and five illumination conditions, as shown in Fig. 2(d). We used 400 images for $D^{(\mathrm{OoD})}$. Low, medium, and high InD data diversity are set to 2/5, 3/5 and 4/5, respectively.

OoD accuracy results
We now demonstrate that these four datasets are extremely challenging for DNNs, which achieve low accuracy in OoD conditions. We examine the performance degradation at three InD data diversities: low, medium, and high. Recall that we evaluate the InD accuracy on $D^{(\mathrm{InD})}_{\mathrm{val}}$ and the OoD accuracy on $D^{(\mathrm{OoD})}$. We use ResNet-18 (He, Zhang, Ren, & Sun, 2016) trained on $D^{(\mathrm{InD})}_{\mathrm{train}}$. The experimental setup is introduced in Section 6.1. Fig. 4 shows the OoD accuracy degradation for the four datasets, ranging from low to high InD data diversity. While the InD accuracy is more than 80% for all four datasets at almost all data diversities (except for MNIST-Positions), the OoD accuracy shows a substantial degradation when the DNN is trained with low and medium InD data diversities. Between 20% and 70% performance degradation is observed at low InD data diversity in all four datasets. At medium InD data diversity, large performance degradation ranging from 10% to 50% is observed, and at high InD data diversity, there is more than 10% performance degradation on the CarsCG-Orientations and MiscGoods-Illuminations datasets. Thus, dramatic drops of accuracy are observed in OoD conditions, which confirms that these benchmarks are very challenging for DNNs.
OoD accuracy is often overlooked in standard computer vision benchmarks, where usually only InD accuracy is reported. This is typically due to the difficulty of measuring OoD accuracy. Our datasets enable evaluating OoD accuracy in a controlled way that facilitates understanding the different factors that may affect it. Performance degradation in OoD conditions is to be expected when deploying deep learning applications. Recently, it has been reported that even a small amount of dataset bias can cause major performance degradation (Recht et al., 2018), and this is reconfirmed for our four datasets. Moreover, the drop of accuracy in our datasets is dramatic, especially at low InD data diversity. Our datasets allow us to gain an understanding of the specific biasing factors in the dataset, i.e., orientation and illumination conditions, and to analyze aspects such as the InD data diversity.

Three approaches to improve OoD accuracy
We now introduce the three approaches to address the drop of accuracy in OoD conditions, which are ''late-stopping'', ''tuning the batch normalization momentum'' and the ''invariance loss''. These three approaches are independent of each other and tackle different aspects of DNN training.

Late-stopping
The stopping criterion for training is known to have an impact on DNN performance (Caruana, Lawrence, & Giles, 2001; Cataltepe, Abu-Mostafa, & Magdon-Ismail, 1999; Yao et al., 2007b). In particular, stopping the training before convergence of the training accuracy, i.e., early stopping, is known to prevent overfitting in shallow classifiers (Prechelt, 1998). However, these results concern InD accuracy, and little is known regarding the relation between the stopping criterion and OoD accuracy. We therefore ran experiments with a large number of training epochs (up to 1000 epochs) in order to investigate any patterns. Fig. 1(a) shows the change of InD and OoD accuracy when ResNet-18 is trained with the medium InD data diversity. Surprisingly, the OoD accuracy, unlike the InD accuracy, continued to increase over a large number of epochs. While recent work by Papyan et al. has shown that continuing training long after the classification error is zero leads to important benefits such as improving robustness to adversarial attacks (Papyan, Han, & Donoho, 2020), we report for the first time that it also leads to improvements of OoD generalization. We denote the approach of continuing the training of a DNN after the convergence of the InD validation accuracy as ''late-stopping''.
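The following sketch illustrates late-stopping with a Keras-style training loop; the setup (a model compiled with a single accuracy metric, and datasets for the InD train/validation and OoD splits) is assumed for illustration.

```python
# Late-stopping sketch: unlike early stopping, training continues long after
# the InD validation accuracy has converged, while we track the OoD accuracy.
import tensorflow as tf  # assumes `model` is compiled with metrics=["accuracy"]

def train_with_late_stopping(model, train_ds, val_ds, ood_ds, max_epochs=1000):
    history = {"ind_acc": [], "ood_acc": []}
    for epoch in range(max_epochs):
        model.fit(train_ds, epochs=1, verbose=0)
        _, ind_acc = model.evaluate(val_ds, verbose=0)  # InD (validation)
        _, ood_acc = model.evaluate(ood_ds, verbose=0)  # held-out OoD split
        history["ind_acc"].append(ind_acc)
        history["ood_acc"].append(ood_acc)
        # An early-stopping rule would break here once ind_acc plateaus;
        # late-stopping deliberately keeps training until the epoch budget
        # (or the available compute) is exhausted.
    return history
```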

Tuning batch normalization
Batch normalization (BN) (Ioffe & Szegedy, 2015) is a method used to speed up and stabilize the training of DNNs through normalization of the layers' inputs by re-centering and re-scaling them. Batch normalization has also been reported to act as a regularizer (Luo, Wang, Shao, & Peng, 2019). Yet, in OoD conditions the statistics of the dataset may change, and hence the statistics used to normalize the layers may no longer be valid. Previous works have pointed out that in OoD conditions batch normalization needs to be adjusted (Schneider et al., 2020; Xie et al., 2020). Thus, it is reasonable to expect that batch normalization also needs adjustment to help improve generalization to OoD orientations and illuminations, but this has not been studied so far.
Batch normalization uses a so-called moving average to re-center the layer's input. Let $v_{\mathrm{ma}}(t)$ be the moving average at training step $t$. The moving average is updated at each training step in the following way:

$v_{\mathrm{ma}}(t) = \beta\, v_{\mathrm{ma}}(t-1) + (1 - \beta)\, v_{\mathrm{mean}}(t)$,

where $v_{\mathrm{mean}}(t)$ is the mean activity over the batch of the $t$th training step, and $\beta \in [0, 1]$ is called the momentum and balances the update of the moving average between $v_{\mathrm{mean}}(t)$ and its previous value. Note that the momentum $\beta$ is the only hyperparameter available for batch normalization, and it is the one we adjust. This value is often fixed, and we found that adjusting it is needed when the distribution of inputs changes. Usually, $\beta$ is set to 0.9 or 0.99, the defaults in standard deep learning frameworks; the baseline uses the default value of 0.99 employed by the TensorFlow library (Abadi et al., 2016). We investigated how the OoD generalization performance behaves depending on the value of the batch normalization momentum, $\beta$. Fig. 1(b) shows the learning curves of ResNet-18 trained on MiscGoods-Illuminations with the medium InD data diversity. Experimentally, we found that tuning the momentum parameter $\beta$ can have a significant positive impact on the OoD generalization performance. Generally, the default value of $\beta = 0.99$ was too large in almost all cases in our experiments. We call this approach ''tuning BN''.
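In tf.keras, this update rule corresponds directly to the momentum argument of the BatchNormalization layer, so tuning β only requires rebuilding the model with a different value; the snippet below is a minimal illustration.

```python
# One step of the BN moving-average update, and the corresponding Keras knob.
import tensorflow as tf

def moving_average_update(v_ma, v_mean, beta):
    """v_ma(t) = beta * v_ma(t-1) + (1 - beta) * v_mean(t); a smaller beta
    tracks the batch statistics faster."""
    return beta * v_ma + (1.0 - beta) * v_mean

# The values we grid-search over in Section 6.1; tf.keras defaults to 0.99.
for beta in [0.01, 0.1, 0.5, 0.9, 0.99]:
    bn_layer = tf.keras.layers.BatchNormalization(momentum=beta)
```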

Invariance loss
The ''invariance loss'' approach is intended to increase the invariance score introduced in Madan et al. (2022), which we explain in Section 5. This invariance score measures the degree of invariance in the neural activity of intermediate layers, and previous works have shown that DNNs that generalize better to OoD conditions have developed larger degrees of invariance in the intermediate layers.
Concretely, we encourage the emergence of invariant representations by taking pairs of images that belong to the same category and enforcing that their neural activities are as similar as possible. To do so, we use the Euclidean distance between the activities of neurons in an intermediate layer caused by the pairs of images, and add this as an additional loss term to the classification loss. Let $h(x)$ denote the activity of the chosen intermediate layer for an image $x$ sampled from the training data $D^{(\mathrm{InD})}_{\mathrm{train}}$, and let $x'$ be another image that belongs to the same category as $x$, sampled from $D^{(\mathrm{InD})}_{\mathrm{train}}$ according to some sampling strategy (in our experiments, we use random sampling with uniform distribution across the training images of the same category). Thus, the invariance loss is expressed as

$L_{\mathrm{inv}}(x, x') = \| h(x) - h(x') \|_2$.

This term is added to the categorical cross-entropy loss weighted with a hyperparameter that we call $\lambda$, such that the invariance loss term acts as a regularization term. Note that the invariance loss is equivalent to the contrastive loss (Hadsell, Chopra, & LeCun, 2006) for positive examples in the context of metric learning, but it has not been used so far to improve generalization to OoD orientations and illumination conditions.
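A minimal sketch of the combined loss follows, assuming the network is split into a `backbone` that maps images to the intermediate-layer activity $h(x)$ and a `head` that maps that activity to class logits; the same-category pair sampling is assumed to happen in the data pipeline.

```python
# Sketch of the total loss: cross entropy plus the lambda-weighted invariance
# term. `backbone`/`head` are assumed splits of the network around the layer
# on which invariance is enforced (the last ReLU layer in our experiments).
import tensorflow as tf

cross_entropy = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

def total_loss(backbone, head, x, x_pair, labels, lam):
    h = backbone(x, training=True)            # activity for the anchor images
    h_pair = backbone(x_pair, training=True)  # same-category partner images
    logits = head(h, training=True)
    cls_loss = cross_entropy(labels, logits)
    # Euclidean distance between paired activities, averaged over the batch.
    inv_loss = tf.reduce_mean(tf.norm(h - h_pair, ord="euclidean", axis=-1))
    return cls_loss + lam * inv_loss
```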

Invariant neurons for OoD generalization
We now revisit the mechanism at the individual-neuron level of intermediate layers that previous works have suggested facilitates OoD generalization, i.e., individual neurons being invariant to OoD conditions. This mechanism has been shown to explain the improvement in OoD accuracy with increased InD data diversity (Madan et al., 2022; Zaidi et al., 2020).
Neurons are interpreted as feature detectors. A neuron is selective to some features when the neuron's output value is high only when those features are present in the image. Invariance of neurons that are selective can be helpful in OoD conditions. Note that a neuron can be trivially invariant by always outputting the same value, which is not helpful for generalization. We refer to invariant neurons as those neurons that have a high degree of selectivity to some image features and invariance to OoD conditions, i.e., selectivity is assumed as a precondition to invariance.
For a given intermediate layer of the DNN, let $\alpha^{j}_{cn}$ be the average activity of the $j$th neuron over all images with the $c$th category and the $n$th orientation or illumination condition.
For each neuron $j$, the activity is 0-1 normalized. Let $c^{*}_{j}$ be the category for which neuron $j$ is most active on average, i.e., $c^{*}_{j} := \arg\max_{c} \sum_{n} \alpha^{j}_{cn}$. This is called the preferred category. The selectivity score $S_j$ is defined as

$S_j = \dfrac{\bar{\alpha}^{j}_{+} - \bar{\alpha}^{j}_{-}}{\bar{\alpha}^{j}_{+} + \bar{\alpha}^{j}_{-}}$,

where $\bar{\alpha}^{j}_{+}$ and $\bar{\alpha}^{j}_{-}$ denote the average activity for the preferred category and for the remaining categories, respectively. This selectivity score ranges from zero to one and takes its maximum value in the case that the neuron average activity, $\alpha^{j}_{cn}$, is 0 for all categories except for the preferred category, i.e., the neuron is only active for the preferred category. The invariance score $I_j$ is defined as

$I_j = 1 - \left( \max_{n} \alpha^{j}_{c^{*}_{j} n} - \min_{n} \alpha^{j}_{c^{*}_{j} n} \right)$,

and it also ranges from zero to one and takes its maximum in the case that the average activity, $\alpha^{j}_{cn}$, takes the same value for the preferred category regardless of the orientation and illumination conditions.
Finally, we define the SI score of a neuron as the geometric mean of the selectivity and invariance scores, i.e., $\sqrt{S_j I_j}$. Neurons with a larger SI score are active for specific categories independently of the orientation and illumination conditions. Networks whose neurons have larger SI scores have been observed to generalize better in OoD conditions. In order to provide a score that summarizes the SI score across all neurons in a layer, we use the upper 20th percentile of the scores among all neurons. This is because not all neurons are required to have a large SI score to improve OoD generalization, and we only take into account the portion of neurons with the highest SI scores. In the experiments, we use this summary of the SI score across neurons to assess whether the three approaches we introduce yield improved OoD accuracy through improving selectivity and invariance.
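The scores can be computed directly from a per-neuron activity tensor; the sketch below assumes the range-based invariance score written above and an array `alpha` of shape (neurons, categories, conditions), already 0-1 normalized per neuron.

```python
# Sketch of the selectivity, invariance, and SI scores, following the
# definitions above (the invariance score is the range-based reconstruction).
import numpy as np

def si_scores(alpha, eps=1e-8):
    per_cat = alpha.mean(axis=2)              # (J, C): mean over conditions
    pref = per_cat.argmax(axis=1)             # preferred category c*_j
    j = np.arange(alpha.shape[0])
    a_pref = per_cat[j, pref]                 # mean activity, preferred category
    mask = np.ones_like(per_cat, dtype=bool)
    mask[j, pref] = False
    a_rest = per_cat[mask].reshape(alpha.shape[0], -1).mean(axis=1)
    S = (a_pref - a_rest) / (a_pref + a_rest + eps)          # selectivity
    pref_act = alpha[j, pref, :]                             # (J, N)
    I = 1.0 - (pref_act.max(axis=1) - pref_act.min(axis=1))  # invariance
    return np.sqrt(np.clip(S, 0.0, 1.0) * I)                 # SI per neuron

def si_summary(alpha):
    """Layer-level summary: upper 20th percentile of SI scores across neurons."""
    return np.percentile(si_scores(alpha), 80)
```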
Finally, note that there are other ways to analyze neural activity, such as the popular t-SNE visualization (van der Maaten & Hinton, 2008). Our neural activity analysis is unique with respect to previous visualization works in that it quantitatively assesses the neural activity and directly relates it to OoD accuracy.

Experiments and analysis
We first introduce the experimental setup, and then report the OoD accuracy gains facilitated by the three approaches explained in Section 4. Finally, we analyze whether this boost of OoD accuracy is driven by the selectivity and invariance mechanism revisited in Section 5.

Experimental setting
We apply the three approaches to improve OoD accuracy to ResNet-18 (He et al., 2016) and evaluate their effectiveness in the aforementioned datasets (MNIST-Positions, iLab-Orientations, CarsCG-Orientations, and MiscGoods-Illuminations). The standard ResNet-18 is adopted as the network for all experiments, and we train it in the standard manner. Namely, all neurons employ the ReLU activation function g(z) = max{0, z} (Dahl, Sainath, & Hinton, 2013), and the Glorot uniform initializer (Glorot & Bengio, 2010) is adopted for weight initialization in all experiments. Adam (Kingma & Ba, 2015) is employed as the optimization algorithm. The pixel values of the images are normalized to the range 0 to 1 as preprocessing for all datasets.
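For concreteness, the following sketch shows this setup in tf.keras; `build_resnet18` is a hypothetical helper (tf.keras.applications does not ship a ResNet-18), while the Adam optimizer, the Keras-default Glorot uniform initializer and ReLU activations, and the 0-1 pixel normalization match the setting described above.

```python
# Experimental-setup sketch; `build_resnet18` is a hypothetical helper.
import tensorflow as tf

def compile_model(build_resnet18, num_classes, lr):
    model = build_resnet18(num_classes=num_classes)
    # ReLU activations and Glorot uniform initialization are the Keras
    # defaults for Conv2D/Dense layers, matching the setting described above.
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=lr),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

def preprocess(image):
    return tf.cast(image, tf.float32) / 255.0  # normalize pixels to [0, 1]
```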
We run five trials in all cases and report the mean accuracy and its 95% confidence interval. In each trial, the InD combinations are chosen randomly as long as they satisfy the conditions explained in Section 3.1, and the OoD combinations are created accordingly. Each of the four approaches, including the baseline, is subjected to a hyper-parameter search before performing the five trials. We select the hyper-parameters in a different trial from the ones used to report OoD accuracy. In this reserved trial, we select the hyper-parameters with the highest OoD accuracy by grid search. For all tested approaches, we selected a learning rate in {0.1, 0.01, 0.001, 0.0001, 0.00001}, and other hyper-parameters depending on the approach. In Appendix C, we show that the results are not very sensitive to the hyper-parameter choice. In the following, we detail the experimental setting of the different approaches.
Late-stopping. The number of epochs is set to 1000 for late-stopping, and 100 for the other approaches, including the baseline. We confirmed in preliminary experiments that 100 epochs are sufficient for convergence of the InD accuracy. For late-stopping, we run as many epochs as computing resources allow (about a week of training).
Tuning batch normalization. For tuning batch normalization, we perform a grid search over β ∈ {0.01, 0.1, 0.5, 0.9, 0.99} in addition to the learning rate. For the other approaches, we use 0.99 as the momentum parameter β of the batch normalization layers, which is the default value in TensorFlow.
Invariance loss. The invariance loss is applied to the last ReLU activation layer, ''activation_17'', which has 512 neurons. We keep the pairs of images on which invariance is enforced fixed, and randomize the pairs from time to time. We perform a grid search to determine how frequently we randomize the pairs of images (the choices are randomizing every {10, 20, 50, 100} epochs). The weight of the invariance loss term, λ, is also selected via a grid search among the following values: λ ∈ {1.0, 0.1, 0.01, 0.001, 0.0001}. For more details, we refer the reader to Appendix D.

Improvement of OoD accuracy

Our results show that the three approaches increase the mean OoD accuracy at almost all data diversities. Comparing the three approaches, late-stopping and invariance loss each achieve the best improvement rate in some combinations, while tuning the batch normalization momentum does not achieve the best improvement in any combination. The highest improvement of 22.2% is achieved by late-stopping with high InD data diversity. The performance improvement across datasets and data diversities is remarkable. Only in the iLab-Orientations dataset is the improvement relatively small, but for high InD data diversity in this dataset, all three approaches achieve better OoD accuracy than the baseline. For MNIST-Positions, all three approaches showed an improvement in performance at medium InD data diversity. In Appendix E, we report the learning curves and the InD accuracy for a more detailed depiction of the effects of the three approaches during training.

We also investigated whether the three approaches combined are more effective than the best of the three approaches applied alone. Thus, we trained networks using late-stopping, tuned BN and invariance loss together. We call this approach ''three approaches together''. Another way of combining the three approaches is training networks with each approach alone and then selecting the best of the approaches using a validation set. We call this approach ''best of three approaches alone''. The hyper-parameter tuning method of these combined approaches is detailed in Appendix F. Table 1 shows the comparison between these two combination approaches and also the baseline, i.e., the network trained without any approach to improve the OoD accuracy. The table reports the number of times a method outperformed another method across all datasets and InD data diversities. The results show that using the best of the three approaches alone obtains the best results in the vast majority of experiments. Interestingly, the three approaches together perform worse than the baseline in more than half of the experiments. This indicates that the three approaches interfere with each other and should not be used together.
Finally, we compare the three approaches with state-of-the-art methods for OoD generalization, namely PGI (Ahmed et al., 2021) and CMMD. Note that these methods were not introduced for OoD orientations and illuminations, but as generic approaches to OoD generalization. Recently, several researchers have pointed out that no single OoD algorithm can achieve high performance for all problem domains (Gulrajani & Lopez-Paz, 2021; Hendrycks et al., 2021; Wiles et al., 2022). In Appendix G, we show that our three approaches outperform PGI and CMMD in OoD orientations and illuminations. We also show that PGI and CMMD can be combined with our approaches, leading to substantial improvements of PGI's and CMMD's accuracy. This result shows that our three approaches tackle aspects complementary to state-of-the-art methods for OoD generalization.

Fig. 6 shows the relationship between the SI score of the last ReLU layer and the OoD accuracy for all combinations of dataset, InD data diversity, and approach (details are provided in Fig. E.9). We can see that there is a large correlation between the SI score and the OoD accuracy (Pearson's correlation coefficient is 0.891). While it has already been shown in Madan et al. (2022) that increasing the InD data diversity improves the OoD accuracy and the SI score, here we show for the first time that approaches that target improving the OoD accuracy also yield increases of the SI score.

Table 1
Comparison of ways to combine the three approaches. We compare the best of the three approaches alone (i.e., training a network several times, each time with one of the three approaches alone, and then selecting the best of the three on a validation set), training with the three approaches together (i.e., training a network using the three approaches simultaneously), and the baseline (i.e., training the network without any of the approaches). Results report how many times each of these strategies outperformed another strategy, across InD data diversities in each of the four datasets.

Table 2
Analysis of the dependency between improvements of OoD accuracy and SI score. This table shows the relative frequency of improvement (+) or degradation (−) of the mean OoD accuracy $\Delta_{\mathrm{acc}}$ or mean SI score $\Delta_{\mathrm{SI}}$. The relative frequency $P(x)$ is calculated by counting the number of cases that satisfy the condition $x \in \{\Delta^{+}_{\mathrm{acc}}, \Delta^{+}_{\mathrm{SI}}\}$ and normalizing it by the total number of cases (i.e., 12 = 3 InD data diversities × 4 datasets). The conditional relative frequency $P(y \mid x)$ is calculated by counting the number of combinations satisfying $y \in \{\Delta^{+}_{\mathrm{acc}}, \Delta^{-}_{\mathrm{acc}}\}$ under the condition $x \in \{\Delta^{+}_{\mathrm{SI}}, \Delta^{-}_{\mathrm{SI}}\}$, and dividing it by the number of combinations satisfying $x$. The first and second columns show the proportion of cases where the mean OoD accuracy ($\Delta^{+}_{\mathrm{acc}}$) and the mean SI score ($\Delta^{+}_{\mathrm{SI}}$) increased, respectively. The third column shows the proportion of cases where the mean OoD accuracy increased ($\Delta^{+}_{\mathrm{acc}}$) when the mean SI score increased ($\Delta^{+}_{\mathrm{SI}}$). The fourth column shows the proportion of cases where the mean OoD accuracy increased ($\Delta^{+}_{\mathrm{acc}}$) when the mean SI score decreased ($\Delta^{-}_{\mathrm{SI}}$).

Approach             | P(Δ+acc)     | P(Δ+SI)      | P(Δ+acc | Δ+SI) | P(Δ+acc | Δ−SI)
Late-stopping (%)    | 75.0 (9/12)  | 50.0 (6/12)  | 83.3 (5/6)      | 66.6 (4/6)
Tuned BN (%)         | 75.0 (9/12)  | 83.3 (10/12) | 80.0 (8/10)     | 50.0 (1/2)
Invariance loss (%)  | 91.7 (11/12) | 83.3 (10/12) | 100.0 (10/10)   | 50.0 (1/2)
Total (%)            | 80.6 (29/36) | 72.2 (26/36) | 88.4 (23/26)    | 60.0 (6/10)

Next, we analyze the relationship between improvements of OoD accuracy and increases of the SI score. We investigate whether increases of the SI score always precede improvements of OoD accuracy, which serves to assess whether invariant representations drive OoD generalization in a more stringent way than the correlational analysis presented before. Let $P(\Delta^{+}_{\mathrm{acc}})$ be the probability that the OoD accuracy increases when using one of the three approaches to train the network, compared to not using it. Also, let $P(\Delta^{+}_{\mathrm{SI}})$ be the probability that the SI score increases when using one of the three approaches, compared to not using it. The conditional probabilities between these two events provide insights regarding whether increases of the SI score precede improvements of the OoD accuracy. We estimate the probabilities by evaluating the frequency of the events across datasets and InD data diversities, and report them in Table 2.
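These relative frequencies can be reproduced from per-case booleans; in the sketch below (illustrative names), acc_up[i] and si_up[i] record whether the mean OoD accuracy and the mean SI score improved in case i (12 cases per approach).

```python
# Sketch of the (conditional) relative frequencies reported in Table 2.
import numpy as np

def table2_row(acc_up, si_up):
    acc_up = np.asarray(acc_up, dtype=bool)
    si_up = np.asarray(si_up, dtype=bool)
    p_acc = acc_up.mean()                      # P(delta+_acc)
    p_si = si_up.mean()                        # P(delta+_SI)
    p_acc_given_si = acc_up[si_up].mean()      # P(delta+_acc | delta+_SI)
    p_acc_given_no_si = acc_up[~si_up].mean()  # P(delta+_acc | delta-_SI)
    return p_acc, p_si, p_acc_given_si, p_acc_given_no_si
```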
We observe by analyzing $P(\Delta^{+}_{\mathrm{acc}})$ that the OoD accuracy increases very often with the three approaches, in at least 75% of the cases. In particular, the OoD accuracy increased in 91.7% of the cases for the invariance loss. The analysis of $P(\Delta^{+}_{\mathrm{SI}})$ shows that tuned BN and the invariance loss increase the SI score in 83.3% of the cases. This suggests that these two approaches tend to improve the SI score. For late-stopping, this trend is not as strong. Yet, when analyzing $P(\Delta^{+}_{\mathrm{acc}} \mid \Delta^{+}_{\mathrm{SI}})$, we observe that for the three approaches, increases of the SI score precede improvements of OoD accuracy (in 83.3% (5/6), 80.0% (8/10) and 100% (10/10) of the cases for late-stopping, tuned BN and invariance loss, respectively). Note that the invariance loss directly encourages increasing the SI score, and whenever the SI score in fact increased, the OoD accuracy improved. Late-stopping and tuning the batch normalization momentum do not directly encourage increasing the SI score, but we observe that they do increase the SI score in most cases, and when this happens, the OoD accuracy also improves in at least 80.0% of the cases. Thus, these results suggest that the improvement of OoD accuracy is strongly driven by the increase of the SI score.
Finally, we observe by analyzing $P(\Delta^{+}_{\mathrm{acc}} \mid \Delta^{-}_{\mathrm{SI}})$ that when the SI score has not increased after applying one of the three approaches, the OoD accuracy still improves in a non-negligible number of cases. This suggests the existence of another mechanism that can improve the OoD accuracy even if the selectivity and invariance mechanisms did not emerge. However, one possible limitation of this interpretation is that selectivity and invariance may have emerged but not been captured by the SI score, because the SI score may not quantify the emergence of these mechanisms in the most precise way. Thus, we cannot make any assertion beyond the fact that it is unclear what neural mechanisms facilitate OoD generalization when the three approaches do not manage to increase the SI score. This result motivates follow-up investigations.
In summary, this study provides evidence that the invariance and selectivity mechanism drives OoD generalization. We also found cases in which improvements of OoD generalization may not be preceded by the strengthening of the selectivity and invariance mechanism in the neural representations, which calls for future work proposing novel mechanisms to explain these cases. We believe our experimental framework will facilitate such future discoveries.
Finally, in Appendix H we visualize the neural activity using t-SNE (van der Maaten & Hinton, 2008). This serves to illustrate the advantages of our analysis over this popular visualization tool. We observe that t-SNE displays neural invariance by placing samples of the same object category close to each other. The trend is that for higher data diversity the samples are closer to each other, which is consistent with our analysis of neural invariance. Yet, our analysis of invariance provides more granular insights at the individual-neuron level, rather than for an entire DNN layer as in t-SNE. Also, our analysis provides a quantitative assessment that directly relates to the OoD accuracy, unlike t-SNE, which only provides a qualitative assessment.

Conclusion
We have shown that late-stopping, tuning the batch normalization momentum parameter, and optimizing the invariance loss during learning lead to substantial improvements of the DNN recognition accuracy for objects in OoD orientations and illuminations (in some cases more than 20%). These improvements are consistent across four datasets and different degrees of dataset bias. We also corroborated that the neural mechanisms of selectivity to a category and invariance to orientations and illuminations, at the individual-neuron level, lead to the aforementioned improvements of OoD recognition accuracy. Namely, we found that the majority of trials in which any of the three approaches yielded an increase of selectivity and invariance also resulted in improvements of the OoD recognition accuracy.
Nonetheless, our analysis also revealed that mechanisms other than selectivity and invariance may exist, as we observed that gains of OoD recognition accuracy were not preceded by an increase of the SI score in some trials. What neural mechanisms drive OoD generalization in these cases remains an open question for future work. Furthermore, there are other novel questions derived from our results that motivate future work: Is there an effective way of combining the three approaches investigated in this paper that leads to even larger improvements of OoD generalization? Are these approaches applicable to other factors beyond orientations and illumination conditions? How do these approaches relate to biological learning systems? This paper is limited in providing answers to these fascinating questions, and we hope that the substantial improvements of OoD recognition accuracy demonstrated in this paper motivate new research to address them.
Finally, we would like to highlight that poor OoD generalization is one of the issues of machine learning that needs to be urgently addressed in order to allow for safe and fair AI applications. We hope that this research serves as a basis for further improvements of OoD generalization.

Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability
The source code used in this study is publicly available in the following GitHub repository: https://github.com/FujitsuResearch/three-approaches-ood. All datasets used in this study are publicly available (see Appendix B).

Acknowledgments
We are grateful to Tomaso Poggio and Hisanao Akima for their insightful advice and warm encouragement. We thank Shinichi Matsumoto and Shioe Kuramochi for their assistance to create CarsCG and DAISO-10 datasets, respectively. This work was supported by Fujitsu Limited (Contract No. 40008819 and 40009105) and by the Center for Brains, Minds and Machines funded by NSF (National Science Foundation) STC (Science and Technology Centers) award CCF-1231216. PS and XB are partially supported by the R01EY020517 grant from the National Eye Institute (NIH), and SM and HP are partially funded by NSF grant IIS-1901030.

Appendix A. Formulation of bias-controlled experiments
In the bias-controlled experiments, we prepare the InD dataset $D^{(\mathrm{InD})}$, which consists of images belonging to certain categories (C) and orientation or illumination conditions (N); we denote the complement set of $D^{(\mathrm{InD})}$ as $D^{(\mathrm{OoD})}$. Let $x^{(k)}$ be an image and $y^{(k)} = (c^{(k)}, n^{(k)})$ be the corresponding label. We divide our dataset as follows. First, we select certain combinations of category and condition, I ⊂ C × N. The InD dataset is sampled as

$D^{(\mathrm{InD})} := \{ (x^{(k)}, y^{(k)}) \mid y^{(k)} \in I \}$,

while the sampling process to make the combination set I is forced to satisfy the conditions described below. Let us define $I|_{C}$ and $I|_{N}$ as follows:

$I|_{C} := \{ c \mid (c, n) \in I \}, \qquad I|_{N} := \{ n \mid (c, n) \in I \}$.

We impose the conditions $I|_{C} = C$ and $I|_{N} = N$ to ensure that $D^{(\mathrm{InD})}$ contains all categories and all conditions, as in the bottom left part of Fig. 3. The training dataset $D^{(\mathrm{InD})}_{\mathrm{train}}$ is sampled from $D^{(\mathrm{InD})}$ so that each combination of category and condition has the same number of images. The validation dataset $D^{(\mathrm{InD})}_{\mathrm{val}}$ is sampled from the rest of $D^{(\mathrm{InD})}$ in the same way.
The OoD dataset is defined as the complement:

$D^{(\mathrm{OoD})} := \{ (x^{(k)}, y^{(k)}) \mid y^{(k)} \in (C \times N) \setminus I \}$.
We also define the InD data diversity as #(I)/#(C × N). To examine the dependency of generalization on data diversity, we increase the data diversity of I by creating a set I′ with data diversity #(I′)/#(C × N), so that I is contained in I′, i.e., I ⊂ I′. For each z ∈ N, we define $I|_{C}(z) \subset C$ as follows:

$I|_{C}(z) := \{ c \mid (c, z) \in I \}$.    (A.7)

Also, for each z ∈ C, we define $I|_{N}(z) \subset N$:

$I|_{N}(z) := \{ n \mid (z, n) \in I \}$.

In our experiments, we only treat the cases where $I|_{C}(z)$ and $I|_{N}(z)$ are non-empty for all z, and keep the following additional condition to balance the combinations: $\#(I|_{N}(z)) = \#(I|_{N}(z'))$ and $\#(I|_{C}(z)) = \#(I|_{C}(z'))$, for all z and z′.
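A small sketch that checks these conditions on a candidate combination set I (a set of (category, condition) tuples); the balance check follows the reconstruction given above.

```python
# Sketch of the Appendix A conditions on a combination set I.
def satisfies_conditions(I, categories, conditions):
    covers_all_c = {c for c, _ in I} == set(categories)  # I|_C = C
    covers_all_n = {n for _, n in I} == set(conditions)  # I|_N = N
    per_category = [sum(1 for c, _ in I if c == z) for z in categories]
    per_condition = [sum(1 for _, n in I if n == z) for z in conditions]
    balanced = len(set(per_category)) == 1 and len(set(per_condition)) == 1
    return covers_all_c and covers_all_n and balanced

def is_nested(I, I_prime):
    return I <= I_prime  # lower-diversity I must be a subset of I'
```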

B.2. iLab-Orientations
iLab-2M is a dataset created from the iLab-20M dataset (Borji et al., 2016); it is freely and publicly available at https://bmobear.github.io/projects/viva/ (last access: Dec. 15, 2021). The dataset consists of images of 15 categories of physical toy vehicles photographed in various orientations, elevations, lighting conditions, camera focus settings and backgrounds. It has 1.2M training images, 270K validation images, and 270K test images, and each image is 256 × 256 pixels. From the original iLab-2M dataset, we chose six categories (bus, car, helicopter, monster truck, plane, and tank) as C and six orientations as N. We call the resulting dataset iLab-Orientations. Fig. B.2 shows samples of all categories and orientations of the iLab-Orientations dataset.

B.3. CarsCG-Orientations
CarsCG-Orientations is a new dataset that consists of images of ten models of cars in various conditions rendered with Unreal Engine version 4.25.3; this dataset is publicly available at http://dataset.jp.fujitsu.com/data/carscg/index.html. The conditions consist of ten orientations, three elevations, ten body colors, five locations and three time slots. Fig. B.3 shows all car models (categories) and orientations (conditions) in grid form. The details are as follows.

• Categories: The ten car models (Fig. B.4). We used all car models as the categories C. Therefore, the number of categories is #(C) = 10 in the experiments conducted in this study.

• Orientations: The ten orientations, which we used as the conditions N, so that #(N) = 10 in the experiments conducted in this study.

To create a variety of samples for each combination of the categories (car models) and conditions (orientations), we added other conditions as follows.
• Elevations: The virtual camera was located at three elevation angles, namely, 10, 15, and 30 degrees, during the rendering process. Sample images taken from each angle are shown in Fig. B.6.
• Body colors: Each car model is rendered with ten colors, namely, black, light blue, green, red, white, beige, dark blue, orange, plum, and silver by using Automotive Materials (a library for Unreal Engine). Fig. B.7 shows sample images of Nissan Rouge rendered with these colors.
• Locations: We used a sample environment of an urban park contained in City Park Environment Collection. We chose five locations from the sample environment and modified them for our experiments. Sample images taken at each location are shown in Fig. B.8.
• Time slots: We used the Ultra Dynamic Sky 3D model set to synthesize the three different time slots, namely daytime, twilight, and night. Fig. B.9 shows samples of these three time slots.
The number of images and the image size are as follows.
• Number of images and image size: The total number of images in this dataset is 45K = 10 (categories) × 10 (orientations) × 3 (elevations) × 10 (body colors) × 5 (locations) × 3 (time slots). The images are rendered at 3840 × 2160 pixels and then resized to 1920 × 1080 pixels for the sake of anti-aliasing.

B.4. MiscGoods-Illuminations
MiscGoods-Illuminations is a subset of DAISO-10, a novel dataset constructed for this study; this dataset is publicly available at http://dataset.jp.fujitsu.com/data/daiso10/index.html. The dataset consists of images of ten physical miscellaneous goods taken under five illumination conditions, with two ways of object placement, twenty object orientations, and five camera angles. Images were taken with a robot arm (Fig. B.10). Fig. B.11 shows all miscellaneous goods (categories) and illumination conditions in grid form. The details are as follows.
• Categories: As shown in Fig. B.11, DAISO-10 has ten types of miscellaneous goods: stuffed dolphin, stuffed whale, metal basket, imitation plant, cup, cleaning brush, winding tape, lace yarn, bottled imitation tomatoes, and bottled imitation green apples. In this study, we selected the following five miscellaneous goods from DAISO-10 as the categories C: stuffed dolphin, stuffed whale, metal basket, imitation plant and cup. Therefore, the number of categories is #(C) = 5 in the experiments conducted in this study.
• Illumination conditions: As the conditions, we created five illumination (lighting) conditions; one is created with ceiling lights, and the rest with a colored spotlight. All illumination conditions are shown in Fig. B.11. For the spotlight conditions, the light source (PIXEL G1S™ RGB Video Light) was placed 23 cm in front of the object (see Fig. B.10).
The parameters of the light source were H217/S141 = 8500 K (white light), H0/S100 (red light), H120/S100 (green light), and H240/S100 (blue light). These parameters were set so that the illumination conditions make a sufficient difference in the learning experiments. We used all illumination conditions as N. Thus, the number of conditions is #(N) = 5 in the experiments conducted in this study.
As we did for CarsCG-Orientations, we added other conditions to create a variety of samples for each combination of the categories and illumination conditions, as follows.
• Object poses (ways of object placement and orientations): In this dataset, we placed each object in two representative ways of object placement for each lighting condition. Fig. B.12 shows the two ways of object placement for all objects. For additional diversity, we rotated the object every 18 degrees from 0 to 342 degrees (Fig. B.13). In total, there are 40 patterns of object pose conditions.
• Camera angles: To capture the images automatically, we created a robotic image capture system (see Fig. B.10). A camera (Intel® RealSense D435) was attached to a robot arm (COBOTTA®), and the system captured images from five camera angles for each lighting and object pose condition (Fig. B.14). The camera postures were defined so that the acquired image shows the entire object. The series of operations from robot control to image acquisition is automated using ROS Kinetic.
The number of images and the image size are as follows.
• Number of images and image size: The number of images in the whole dataset is 10K = 10 (categories) × 5 (illuminations) × 2 (ways of object placement) × 20 (orientations) × 5 (camera angles), and each image is 640 × 480 pixels.

[…] the learning process and reducing the number of training epochs. We do not use any data augmentation. The invariance loss is applied to the last ReLU activation layer, ''activation_17'', which has 512 neurons, shown in Fig. D.1.

Appendix E. Additional results of experiments
InD and OoD accuracy learning curves for all datasets and all InD data diversities, corresponding to Fig. 1(a). Overall results relating the SI scores to all combinations of (data diversity, dataset, approach).

Appendix H. Visualization of the latent space
In this appendix, we visualize the latent spaces obtained by the baseline method and the three approaches. Table H.1 shows the results of applying t-SNE to the latent space of the last fully-connected layer of ResNet-18 trained on the CarsCG dataset. The three approaches are confirmed to increase cluster concentration. This result is consistent with the improvement of the SI score by the three approaches shown in Section 6.3.

Table H.1. Latent space visualization using t-SNE. Each cell of this table shows the t-SNE visualization of the activity of the latent layer of the network to which each approach has been applied (row), trained at each data diversity (column). Each color represents a car model.