A too-good-to-be-true prior to reduce shortcut reliance

Highlights

• Challenging machine learning problems are unlikely to have trivial solutions.
• Solutions from low-capacity models are likely shortcuts that won't generalize.
• One inductive bias for robust generalization is to avoid overly simple solutions.
• A low-capacity model can identify shortcuts to help train a high-capacity model.


Introduction
"If you would only recognize that life is hard, things would be so much easier for you." - Louis D. Brandeis

Deep convolutional neural networks (DCNNs) have achieved notable success in image recognition, sometimes matching or even surpassing human-level performance (He et al., 2015). However, DCNNs often suffer when out-of-distribution (o.o.d.) generalization is needed, that is, when training and test data are drawn from different distributions (Beery et al., 2018; Geirhos et al., 2019, 2020). This limitation has multiple consequences, such as susceptibility to adversarial interventions (Szegedy et al., 2013; Goodfellow et al., 2014) or to previously unseen types of noise (Geirhos et al., 2019; Hendrycks & Dietterich, 2019).
Failure to generalize o.o.d. may reflect the tendency of modern network architectures to discover simple, so-called "shortcut" features (Geirhos et al., 2020; Shah et al., 2020). While the perils of overly complex solutions are well appreciated, overly simplistic solutions should be viewed with equal skepticism. In this work, we assume that features that are easy to learn are likely too good to be true. For instance, a green background may be highly correlated with the "horse" category, but green grass is not a central feature of horses. A horse detector relying on such simplistic features, i.e. shortcuts, may perform well when applied in Spain, where the training set originates, but will fail when deployed in snow-covered Siberia. In effect, shortcuts are easily discovered by a network but may be inappropriate for classifying items in an independent set where superficial features are distributed differently than in the training set. Thus, sensitivity to shortcuts may have far-reaching and dangerous consequences in applications, as when a system's pneumonia predictions were based on a metal token placed in radiographs (Zech et al., 2018).
In general, one cannot a priori know whether shortcuts will be helpful or misleading, nor can shortcut learning be reduced to overfitting the training data. While overfitting can be estimated using an available test set from the same distribution, assessing shortcuts depends on all possible unseen data. A model relying on shortcuts can show remarkable human-level results on test sets where shortcut features are distributed identically to the training set (i.i.d.), but fail dramatically on o.o.d. test sets where shortcuts are missing or misleading (Recht et al., 2019).
Shortcuts can adversely affect generalization even when they are not perfectly predictive. Because shortcuts are easily learned by DCNNs, they can be misleading even in the presence of more reliable but complex features (Hermann & Lampinen, 2020). To illustrate, shape may be perfectly predictive of class membership but networks may rely on color or other easily accessed features like texture (Geirhos et al., 2018;Brendel & Bethge, 2019) when tested on novel cases (see Figure 1A).
Although what is and is not a shortcut cannot be known with perfect confidence, all shortcuts are simple. We find it unlikely that difficult learning problems will have trivial solutions when they have not been fully solved by brains with billions of neurons shaped by millions of years of natural selection, nor by engineers working diligently for decades. Based on this observation, we are skeptical of very simple solutions to complex problems and believe they will show poor o.o.d. generalization. This inductive bias, which we refer to as the "too-good-to-be-true prior", can be incorporated into the training of DCNNs to reduce shortcut reliance and promote o.o.d. generalization. At its heart, the too-good-to-be-true prior is a belief about the relationship between the world, models, and machine learning problems, one that places limits on Occam's razor.
Several recent contributions on o.o.d. generalization are consistent with the too-good-to-be-true prior (Clark et al., 2020;Nam et al., 2020;Sanh et al., 2020). In various ways, these authors suggest that simple solutions should be treated with caution and avoided.
How does one identify solutions that are probably too good to be true? Here we suggest making use of a learning system that is deliberately too simplistic for the problem at hand and thus capable of only trivial solutions, which include shortcuts. First, we show that such low-capacity systems can be used to detect shortcuts in a dataset. Second, we suggest a simple and general method for discounting training examples that are suspected of containing shortcuts. We hypothesize that, in order to prevent shortcut learning by a high-capacity network (HCN), the predictions of a much simpler, low-capacity network (LCN) can be used to guide the training of the target network. Namely, a trained LCN should produce high-probability predictions precisely for the training items containing shortcuts. Such probabilistic predictions can be transformed into importance weights (IWs) for training items, and these IWs can then be used in the loss function when training an HCN, downweighting the shortcut items (Figure 1B). We demonstrate our method's efficiency by applying it to all possible CIFAR-10-based binary classification problems with synthetic shortcuts, permitting well-controlled experiments.
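The two-stage procedure described above can be sketched schematically as follows. This is a minimal illustration, not the paper's released code: `train_lcn`, `lcn_prob_true_class`, and `train_hcn` are hypothetical placeholder callables standing in for real training and inference routines.

```python
def two_stage_training(examples, labels, train_lcn, lcn_prob_true_class, train_hcn):
    """Sketch of the two-stage LCN-HCN procedure.

    Stage 1 trains the low-capacity network; its confident predictions are
    then converted into importance weights (IWs) that downweight suspected
    shortcut items during Stage 2, the training of the high-capacity network.
    """
    # Stage 1: train the low-capacity network on the full training set.
    lcn = train_lcn(examples, labels)

    # IW = LCN's probability of misclassifying the item, so items the LCN
    # classifies confidently (likely shortcut items) receive low weight.
    iws = [1.0 - lcn_prob_true_class(lcn, x, y) for x, y in zip(examples, labels)]

    # Stage 2: train the high-capacity network with an IW-weighted loss.
    return train_hcn(examples, labels, iws)
```

Any concrete training and inference code can be dropped in for the three placeholders; the structure of the procedure is what the sketch is meant to convey.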

Related Work
Shortcut learning and robust generalization. Multiple approaches have been suggested for preventing shortcut reliance and increasing generalization robustness in deep neural networks (Geirhos et al., 2020). We succinctly summarize the eleven studies we find most relevant to our work in Table 1, using two criteria: whether an approach (1) assumes that simple solutions are probably shortcuts and (2) requires a priori knowledge of the shortcut. We also specify the task domains considered.

Figure 1: The standard and too-good-to-be-true prior approaches to learning. (A) In the standard approach, a single high-capacity network (HCN) is trained and is susceptible to shortcuts, in this case relying on color as opposed to shape. Such a network will generalize well to i.i.d. test items but fail on o.o.d. test items (the last item for each class; shown in red). (B) In contrast, implementing the too-good-to-be-true prior by pairing a low-capacity network (LCN) with an HCN leads to successful i.i.d. and o.o.d. generalization. Items that the LCN can master, which may contain shortcuts, are downweighted when the HCN is trained, which should reduce shortcut reliance and promote use of more complex and invariant features by the HCN.

Table 1: Overview of approaches to preventing shortcut reliance most relevant to the present study. Task abbreviations: IR - image recognition, AR - action recognition, NLI - natural language inference, QA - question answering, VQA - visual question answering.

Approach | Simple solutions are shortcuts | Requires knowledge of a shortcut | Task
Shape-based representations (Geirhos et al., 2018) | No | Yes | IR
DRiFt (He et al., 2019) | Yes | Yes | NLI
Don't take the easy way out (Clark et al., 2019) | Yes | Yes | QA, VQA, NLI
REPAIR (Li & Vasconcelos, 2019) | No | No | IR, AR
Learning not to learn (Kim et al., 2019) | No | Yes | IR
RUBi (Cadene et al., 2019) | No | Yes | VQA
ReBias (Bahng et al., 2020) | No | Yes | IR, AR
LfF (Nam et al., 2020) | Yes | No | IR
DIBS (Sinha et al., 2020) | No | No | IR
Learning from others' mistakes (Sanh et al., 2020) | Yes | No | QA, NLI
MCE (Clark et al., 2020) | Yes | No | IR, VQA, NLI
Our approach | Yes | No | IR

In contrast to Nam et al. (2020), we use an LCN, not a full-capacity target model, to identify shortcuts, and we train the LCN separately from the target HCN. In comparison to Clark et al. (2020), our approach is less demanding computationally and, again, the LCN and HCN are trained separately. We demonstrate that, in our particular implementation of the too-good-to-be-true prior, the limited capacity of the secondary model plays a key role, thus complementing the results of Sanh et al. (2020). We also extend the findings of Sanh et al., who introduced a similar de-biasing approach in the language domain, to the domain of image recognition. In contrast to all of the aforementioned studies, we show that an LCN can be employed to detect the presence of a shortcut in a dataset. Further, we empirically examine the relationship between the difficulty of a classification problem and the effectiveness of shortcut-avoiding training (our two-stage LCN-HCN procedure).

Huang et al. (2020) suggested a heuristic, Representation Self-Challenging (RSC), to improve o.o.d. generalization in image recognition. This method impedes predicting class from the features most correlated with it and thus encourages a DCNN to rely on more complex combinations of features.
RSC, however, is not directly designed to prevent shortcut learning but rather attempts to expand the set of features learned.
Sample weighting. Re-weighting of data samples is a well-known approach to guiding the training of DCNNs and machine learning models in general, and the corresponding methods differ in terms of which examples are downweighted or emphasized. Some authors suggest mitigating the impact of easy examples and focusing on hard ones (Malisiewicz et al., 2011; Shrivastava et al., 2016). In contrast, other research directions, such as curriculum learning (Bengio et al., 2009; Hacohen & Weinshall, 2019; Wu et al., 2020) and self-paced learning (Kumar et al., 2010; Meng et al., 2015), recommend stressing easy examples early in training. It has also been shown that self-paced and curriculum learning can be combined.
Although in our two-stage LCN-HCN procedure we assign weights to the training items, this method is fundamentally different from typical re-weighting schemes. Stemming from the too-good-to-be-true prior, our approach exploits not the predictions of the target network itself but those of an independent, simpler network (the LCN). In other words, we are not interested in the difficulty of an item per se, but in whether this item can be mastered through simple means.
3 Example applications of the too-good-to-be-true prior

Below, in the context of image recognition tasks, we illustrate the too-good-to-be-true prior with two example applications: (1) detecting the presence of a shortcut in a dataset and (2) training a de-biased model. Both examples rely on an LCN being limited to learning a superficial shortcut as opposed to a deeper invariant.

3.1 The performance of a low-capacity network as an early warning signal
Considering that an LCN is only able to discover simple, and thus probably shortcut, solutions, high performance of an LCN on a dataset may indicate the presence of a shortcut. When an LCN achieves a performance level comparable to that of an HCN, this should serve as a warning signal that the HCN may have succumbed to a shortcut (Hermann & Lampinen, 2020). In such cases, the HCN will likely fail to generalize robustly.
Here we present an illustrative example of this application of the too-good-to-be-true prior: we trained an LCN (softmax regression) and an HCN (56-layer ResNet; He et al., 2016) to classify the colored MNIST dataset. The latter was implemented exactly as in Li & Vasconcelos (2019), with the standard deviation of color set to 0.1. Both networks were trained with stochastic gradient descent for 50 epochs (initial learning rate 0.1, mini-batch size 256). We also observed the same pattern for a stylized version of Tiny ImageNet (Wu et al., 2017), into which we introduced a texture shortcut: the LCN and HCN achieved relatively close, high accuracies on the stylized data (0.701 and 0.880, respectively) while performing dramatically differently on the regular data (0.085 and 0.414, respectively). The complete results for both datasets, as well as the training details and architectures used for Tiny ImageNet, are in Appendix A.
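This early warning heuristic can be made concrete with a small sketch: flag a dataset as shortcut-suspect when the LCN captures most of the above-chance accuracy achieved by the HCN. The function and its `ratio_threshold` value are illustrative assumptions of ours, not quantities from the experiments.

```python
def shortcut_warning(lcn_accuracy, hcn_accuracy, chance_level, ratio_threshold=0.75):
    """Flag a dataset as shortcut-suspect when an LCN captures most of the
    above-chance accuracy achieved by an HCN on the same data.

    The 0.75 ratio threshold is an illustrative choice.
    """
    lcn_gain = lcn_accuracy - chance_level
    hcn_gain = hcn_accuracy - chance_level
    if hcn_gain <= 0:
        return False  # the HCN itself learned nothing above chance
    return lcn_gain / hcn_gain >= ratio_threshold

# Stylized Tiny ImageNet (texture shortcut; 200 classes, chance = 1/200):
print(shortcut_warning(0.701, 0.880, chance_level=1 / 200))  # True
# Regular Tiny ImageNet (no synthetic shortcut):
print(shortcut_warning(0.085, 0.414, chance_level=1 / 200))  # False
```

On the accuracies reported above, the heuristic raises the warning for the stylized dataset but not for the regular one, matching the qualitative pattern in the text.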
3.2 Utilizing predictions of a low-capacity network to navigate the training of a high-capacity network

Next, we demonstrate that an LCN can be used to prevent an HCN from learning a shortcut. Reliable features necessary for robust generalization are relatively high-level, whereas shortcuts are usually low-level characteristics of an image. Given this assumption, the LCN will produce accurate and confident predictions primarily for images containing shortcuts.
Given a training dataset D = {x_i, y_i}, the corresponding IW w_i for a training image x_i is its probability of misclassification as given by a trained LCN. IWs are then employed while training an HCN: for every training image, the corresponding loss term is multiplied by the IW of this image. We normalize IWs with respect to a mini-batch: the IWs of the samples in a mini-batch are divided by the sum of all IWs in that mini-batch. The mini-batch training loss L_B is thus

L_B = Σ_k w̃_k L_k,  where  w̃_k = w_k / Σ_j w_j

is the mini-batch-normalized IW and L_k indicates the loss of the kth sample in the mini-batch.

Overview of experiments. Generally, whether a dataset contains shortcuts is not known beforehand.
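The IWs and the normalized mini-batch loss can be expressed compactly in code. This is a minimal sketch using plain Python lists; in practice both quantities would be computed on tensors inside the training loop.

```python
def importance_weights(lcn_probs, labels):
    """IW of an item = the LCN's probability of misclassifying it,
    i.e. 1 minus the probability the LCN assigns to the true class."""
    return [1.0 - probs[y] for probs, y in zip(lcn_probs, labels)]

def weighted_minibatch_loss(losses, iws):
    """Mini-batch loss L_B: per-sample losses L_k weighted by IWs
    normalized to sum to 1 within the mini-batch."""
    total = sum(iws)
    return sum(w / total * loss for w, loss in zip(iws, losses))

# The LCN is confident on the first (likely shortcut) item, so that item
# receives a near-zero weight in the HCN's loss:
lcn_probs = [[0.99, 0.01], [0.55, 0.45]]  # LCN softmax outputs
labels = [0, 1]
iws = importance_weights(lcn_probs, labels)  # approximately [0.01, 0.55]
```

Because the weights are renormalized per mini-batch, the loss remains on the same scale as an unweighted average, only redistributed toward items the LCN could not master.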
In order to overcome this issue and test the too-good-to-be-true prior, we introduced synthetic shortcuts into a well-known dataset (cf. Malhotra et al., 2020). We then applied our approach and investigated whether it was able to avoid reliance on these shortcuts while learning the deeper structure. This testing strategy allowed us to run well-controlled experiments and quantify the effects of our method.
We ran a set of experiments on all possible pairs of classes from the CIFAR-10 dataset (Krizhevsky et al., 2009). In every classification problem, a synthetic shortcut was introduced in each of the two classes. In order to have a better understanding of our method's generalizability, we investigated two opposite types of shortcuts as well as two HCN architectures, ResNet (He et al., 2016) and VGG-11 (Simonyan & Zisserman, 2015). Note that our too-good-to-be-true prior is readily applicable to multi-class problems.
For both shortcut types and both HCN architectures, we expected the two-stage LCN-HCN procedure to discard the majority of shortcut images. Therefore, compared to the ordinary training procedure, better performance should be observed when shortcuts in a test set are misleading (i.e., o.o.d. test set). We also expected that the two-stage LCN-HCN procedure may suppress some non-shortcut images. Thus, a slightly worse performance was expected for a test set without shortcuts as well as for a test set with helpful shortcuts (i.e., i.i.d. test set).
The main objective of these experiments was to compare an ordinary and a weighted training procedure in terms of the susceptibility of resulting models to the shortcuts. However, crucially for our idea of the too-good-to-be-true prior, it was also important to validate our reasoning concerning the key role of a network's low capacity in the derivation of useful IWs. For this purpose, we introduced another training condition where IWs were obtained from probabilistic predictions of the same HCN architecture as the target network. We refer to the IWs obtained from an HCN as HCN-IWs and to the IWs obtained from an LCN as LCN-IWs. We expected HCN-IWs either to fail to suppress shortcut images, resulting in poor performance on a test set with misleading shortcuts (o.o.d. test set), or to equally suppress both shortcut and non-shortcut images, resulting in poor performance on any test data. Using HCN-IWs mirrors approaches that place greater emphasis on challenging items.
Shortcuts. For the sake of generality, we introduced two shortcut types: the "local" was salient and localized, and the "global" was subtle and diffuse. The local shortcut was intended to capture real-world cases such as a marker in the corner of a radiograph (Zech et al., 2018) and the global was intended to capture such situations as subtle distortions in the lens of a camera.
The local shortcut was a horizontal line of three pixels, red for one class and blue for the other (Figure 2, left). The location of the line was the same for all images: the upper-left corner. The shortcut was present in a randomly chosen 30% of the training and validation images of each class.
The global shortcut was a mask of Gaussian noise, one per class (Figure 2, right). The mask was sampled from a multivariate normal distribution with zero mean and an isotropic covariance matrix with variance 25 × 10^-4, and then added to a randomly chosen 30% of the training and validation images of the corresponding class.

LCN and HCN architectures. The LCN consisted of a single convolutional layer followed by a fully-connected softmax classification layer. The convolutional layer included 4 channels with 3-by-3 kernels, a linear activation function, and no downsampling.
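The injection of the local shortcut can be sketched as follows. Images are represented as H x W x 3 nested lists; the three-pixel line, the class colors, and the 30% fraction follow the description above, while the function and variable names (and the exact corner coordinates) are our illustrative choices, not code from the experiments.

```python
import random

CLASS_COLORS = {0: (255, 0, 0), 1: (0, 0, 255)}  # red line for one class, blue for the other

def add_local_shortcut(image, label, row=0, col=0):
    """Paint a horizontal three-pixel line of the class color into the
    upper-left corner of an image given as H x W x 3 nested lists."""
    color = CLASS_COLORS[label]
    for dx in range(3):
        image[row][col + dx] = list(color)

def inject_shortcuts(images, labels, fraction=0.3, seed=0):
    """Apply the local shortcut to a randomly chosen `fraction` of the
    images, in place; returns the indices of the modified images."""
    rng = random.Random(seed)
    chosen = rng.sample(range(len(images)), int(fraction * len(images)))
    for i in chosen:
        add_local_shortcut(images[i], labels[i])
    return chosen
```

The global shortcut would be injected analogously, with the line replaced by a per-class additive noise mask.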
In two separate sets of simulations, we tested two different HCN architectures: the 56-layer ResNet for CIFAR-10 (He et al., 2016) and VGG-11 (Simonyan & Zisserman, 2015). The first two fully-connected layers of VGG-11 had 1024 units each, and no dropout was used.
Training details. Network weights were initialized according to Glorot and Bengio (2010). We used stochastic gradient descent to train both the LCN and the HCN. The initial learning rate was set to 0.01 for the LCN. The HCN's initial learning rate was set to 0.01 for VGG (Simonyan & Zisserman, 2015) and to 0.1 for ResNet (He et al., 2016). The HCNs were trained with a momentum of 0.9 and a weight decay of 5×10^-4 for 150 epochs. To avoid overfitting, the HCN's performance on the validation data (see below) was tested at each epoch, and the best-performing parameters were chosen as the result of training. The LCN was trained for 40 epochs. For both the LCN and HCN, the learning rate was decreased by a factor of 10 on epochs corresponding to 50% and 75% of the total duration of the network's training. The mini-batch size for both networks was set to 256.
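The step schedule above can be written out explicitly. This is our reading of the schedule (the rate drops once the epoch reaches 50% of the total duration and again at 75%); the exact boundary handling at those epochs is an assumption.

```python
def learning_rate(epoch, total_epochs, initial_lr):
    """Step schedule: the learning rate is divided by 10 once the epoch
    reaches 50% of the total duration, and by 10 again at 75%."""
    lr = initial_lr
    if epoch >= 0.5 * total_epochs:
        lr /= 10
    if epoch >= 0.75 * total_epochs:
        lr /= 10
    return lr

# ResNet HCN: 150 epochs, initial learning rate 0.1.
assert learning_rate(10, 150, 0.1) == 0.1                  # before the first drop
assert abs(learning_rate(80, 150, 0.1) - 0.01) < 1e-12    # after epoch 75
assert abs(learning_rate(120, 150, 0.1) - 0.001) < 1e-12  # after epoch 112.5
```

The same function covers the LCN and the Tiny ImageNet runs by changing `total_epochs` and `initial_lr`.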
For each class, the original 5,000 images from the CIFAR-10 training set were divided into 4,500 training images and 500 validation images. Thus, the training set of every class pair included 9,000 images and the validation set included 1,000 images.
IWs were introduced to the training process as described in the beginning of this section, and for every mini-batch a weighted-average loss was calculated. During ordinary training without IWs, a simple average loss was calculated.
All the results reported below are the averages from 10 independent runs on all class pairs and shortcut types.
Results. The overall pattern of results was in accord with our predictions: downweighting training items that could be mastered by a low-capacity network reduced shortcut reliance in a high-capacity network. The lowest HCN-IWs correspond to shortcut images as well as to non-shortcut images depicting highly typical examples, whereas the lowest LCN-IWs correspond almost exclusively to shortcut images. As predicted, the LCN has the capacity to master images containing shortcuts but few other images, providing IWs for an HCN that reduce shortcut reliance, thereby implementing the too-good-to-be-true prior.
Effects of the training condition (ordinary, HCN-IWs, and LCN-IWs) on how well the HCN performs on every test set (incongruent, neutral, and congruent) are presented in Figure 5. The general patterns of results are the same for ResNet and VGG-11, so here we focus on ResNet; the analogous results for VGG-11 can be found in Appendix C. HCNs are prone to rely on our shortcuts, as evidenced by low incongruent accuracies and very high congruent accuracies after the ordinary training. Incongruent accuracies are improved after the LCN-IWs training, compared to those after the ordinary training. Importantly, after LCN-IWs training, incongruent, neutral, and congruent accuracies are all similarly high. Together, these results suggest that LCN-IWs are successful in reducing shortcut reliance in the target network.
Although higher than after the ordinary training condition, incongruent accuracies after HCN-IWs training are substantially lower than after LCN-IWs training. The neutral and congruent accuracies are lower than in both the ordinary and LCN-IWs training conditions. At the same time, incongruent accuracies are still noticeably lower than neutral and congruent accuracies. These results indicate that HCN-IWs are not effective in resisting shortcut learning, at least in part because they suppress typical class examples containing useful and well-generalizable features (Figure 4).

Figure 5: Accuracies on incongruent, neutral, and congruent test sets after ordinary and HCN-/LCN-weighted training; ResNet as the HCN. Across shortcut types, LCN-IWs result in almost identically high accuracy on all three test sets and thus are successful in avoiding shortcut reliance. HCN-IWs consistently result in accuracies inferior to LCN-IWs; moreover, on neutral and congruent test sets, accuracies after HCN-weighted training are lower than after ordinary training. HCN-IWs, thus, are not as effective as LCN-IWs in avoiding shortcut reliance and also suppress useful features. Together, these results indicate that the LCN-HCN two-stage approach is a valid representative of the too-good-to-be-true prior.

Figure 6: Effects of the LCN-/HCN-IWs training procedure for individual class pairs depending on their respective difficulty; ResNet as the HCN. The effects of training are represented by the Overall Benefit measure (gain + loss; see text); the difficulty of a pair is represented by the neutral test accuracy after ordinary training. Recapitulating previous results, LCN-IWs are more effective than HCN-IWs. Furthermore, the easier the learning problem, the less Overall Benefit from IWs, because the relatively higher capacity of the IW network leads to downweighting non-shortcut items.
The main results shown in Figure 5 indicate that LCN-IWs reduce shortcut reliance with little cost to performance on other items, whereas HCN-IWs are less effective because they downweight non-shortcut items as well (see Figure 4). Key to the LCN-IW results is properly matching network capacity to the learning problem. Across the 45 classification pairs considered, there should be natural variation in problem difficulty that affects target network performance. In particular, we predict that the overall benefit of our procedure will be lower when the LCN performs better on a class pair, indicating that its capacity is sufficient to learn non-shortcut information.
We compute the average Overall Benefit (OB) for each class pair and contrast it against the corresponding neutral test accuracy after ordinary training. The latter reflects the default classification difficulty of each class pair. These comparisons are shown in Figure 6. Two trends are evident. First, recapitulating the previous results, LCN-IWs result in greater OB than HCN-IWs: OB for LCN-IWs is almost always positive, while OB for HCN-IWs is often negative. Second, OB is negatively correlated with the neutral test accuracy after ordinary training; that is, as the difficulty of a classification problem increases, the benefit of using IWs generally increases as well. One possibility is that for easy-to-discriminate pairs, such as frog and ship, the LCN was able to learn non-shortcut information, which reduced the overall benefit of the LCN-IWs.

Discussion
In general, using Occam's razor to favor simple solutions is a sensible policy. We certainly do not advocate for adding unnecessary complexity. However, for difficult problems that have evaded a solution, it is unlikely that a trivial solution exists. The problems of interest in machine learning have taken millions of years for nature to solve and have puzzled engineers for decades. It seems implausible that trivial solutions to such problems would exist and we should be skeptical when they appear.
For such difficult problems, we suggest adopting a too-good-to-be-true prior that shies away from simple solutions. Simple solutions to complex problems are likely to rely on superficial features that are reliable within the particular training context but unlikely to capture the more subtle invariants central to a concept. To use a historical example, people had great hopes that the Perceptron (Rosenblatt, 1958), a one-layer neural network, would master computer vision, only to have their hopes dashed (Minsky & Papert, 1969). When such simple systems appear successful, including on held-out test data, they are most likely relying on shortcuts that will not generalize out of sample on somewhat different test distributions, such as when a system is deployed.
We proposed and evaluated two simple applications of the too-good-to-be-true inductive bias. First, we made use of a low-capacity network (LCN) to detect the presence of a shortcut in a dataset. Second, we used an LCN to establish importance weights (IWs) to help train a high-capacity network (HCN). The idea was that the LCN would not have the capacity to learn subtle invariants but instead be reduced to relying on superficial shortcuts. For the second application, by downweighting the items that LCN could master, we found that the HCN was less susceptible to shortcuts and showed better o.o.d. generalization at little cost when misleading shortcuts were not present.
Although we evaluated the de-biasing application of the too-good-to-be-true prior on CIFAR-10 images, the basic method of using an LCN to establish IWs for an HCN is broadly applicable. We considered two network architectures for the HCN, ResNet and VGG-11, which both showed the same overall pattern of performance. Interestingly, ResNet appeared more susceptible to shortcuts, perhaps because its architecture contains skip connections that are themselves a type of shortcut allowing lower-level information in the network to propagate upwards absent intermediate processing stages.
One key challenge in our approach is matching the complexity of the LCN to the learning problem.
When the LCN has too much capacity, it may learn more than shortcuts and downweight information useful to o.o.d. generalization (see Figure 6). It is for this reason that LCN-IWs are much more effective than HCN-IWs (see Figure 5). Unfortunately, there is no simple procedure that guarantees selecting the appropriate LCN. The choice depends on one's beliefs about the structure of the world, the susceptibility of models to misleading shortcuts, and the nature of the learning problem. Nevertheless, reasonable decisions can be made. For example, we would be skeptical of a Perceptron that successfully classified medical imagery, so it could serve as an LCN.
Since the too-good-to-be-true prior is a general inductive bias, our two-stage LCN-HCN approach is just one specific implementation of it, and other techniques may be developed.

A Details of the early warning signal experiments

We constructed a stylized version of Tiny ImageNet by following a generalization of the procedure described in Geirhos et al. (2018). Each of the 200 classes was assigned its own unique style and thus had a prominent texture shortcut.
The LCN was represented by a single 4-channel convolutional layer (3-by-3 kernels, linear activation function, no downsampling) followed by a fully-connected softmax classification layer. The HCN was represented by a 10-layer ResNet designed for Tiny ImageNet (Wu et al., 2017). Both networks were trained for 40 epochs with stochastic gradient descent, a momentum of 0.9, and a weight decay of 5×10^-4. The mini-batch size was set to 256. The initial learning rate was set to 0.001 and 0.1 for the LCN and HCN, respectively, and was decreased by a factor of 10 on epochs corresponding to 50% and 75% of the total duration of the network's training.

Figure 7: Typical observed distributions of HCN-IWs (A) and LCN-IWs (B). The lowest HCN-IWs correspond to shortcut images and non-shortcut images depicting examples of high typicality; the lowest LCN-IWs correspond almost exclusively to shortcut images. As predicted, the LCN has the capacity to master images containing shortcuts but few other images, providing IWs for an HCN that reduce shortcut reliance, thereby implementing the too-good-to-be-true prior.

B Example images for various IWs
C Results for VGG-11 as the HCN

Figure 8: Accuracies on incongruent, neutral, and congruent test sets after ordinary and HCN-/LCN-weighted training; VGG-11 as the HCN. Across shortcut types, LCN-IWs result in almost identically high accuracy on all three test sets and thus are successful in avoiding shortcut reliance. HCN-IWs consistently result in accuracies inferior to LCN-IWs; moreover, on neutral and congruent test sets, accuracies after HCN-weighted training are lower than after ordinary training. HCN-IWs, thus, are not as effective as LCN-IWs in avoiding shortcut reliance and also suppress useful features. Together, these results indicate that the LCN-HCN two-stage approach is a valid representative of the too-good-to-be-true prior.

Figure 9: Effects of the LCN-/HCN-IWs training procedure for individual class pairs depending on their respective difficulty; VGG-11 as the HCN. The effects of training are represented by the Overall Benefit measure (gain + loss; see text); the difficulty of a pair is represented by the neutral test accuracy after ordinary training. Recapitulating previous results, LCN-IWs are more effective than HCN-IWs. Furthermore, the easier the learning problem, the less Overall Benefit from IWs, because the relatively higher capacity of the IW network leads to downweighting non-shortcut items.