The Notorious Difficulty of Comparing Human and Machine Perception

With the rise of machines to human-level performance in complex recognition tasks, a growing amount of work is directed towards comparing information processing in humans and machines. Such work has the potential to deepen our understanding of the inner mechanisms of human perception and to improve machine learning. Drawing robust conclusions from comparison studies, however, turns out to be difficult. Here, we highlight common shortcomings that can easily lead to fragile conclusions. First, even if a model achieves high performance on a task, similar to humans, its decision-making process is not necessarily human-like, and further analyses can reveal differences. Second, the performance of neural networks is sensitive to training procedures and architectural details. Thus, generalizing conclusions from specific architectures is difficult. Finally, when comparing humans and machines, equivalent experimental settings are crucial in order to identify innate differences. Addressing these shortcomings alters or refines the conclusions of previous studies. We show that, despite their ability to solve closed-contour tasks, our neural networks use different decision-making strategies than humans. We further show that there is no fundamental difference between same-different and spatial tasks for common feed-forward neural networks and, finally, that neural networks do experience a "recognition gap" on minimal recognizable images. All in all, care has to be taken to not impose our human systematic bias when comparing human and machine perception.


Introduction
How biological brains infer environmental states from sensory data is a long-standing question in neuroscience and psychology. In recent years, a new tool to study human visual perception has emerged: artificial deep neural networks (DNNs). They perform complex perceptual inference tasks like object recognition (Krizhevsky, Sutskever, & Hinton, 2012) or depth estimation (Eigen & Fergus, 2015) at human-like accuracies. These artificial networks may therefore encapsulate some key aspects of the information processing in human brains and thus invite the enticing possibility that we may learn from one system by studying the other (Hassabis, Kumaran, Summerfield, & Botvinick, 2017; Jozwik, Kriegeskorte, Cichy, & Mur, 2018; Kar, Kubilius, Schmidt, Issa, & DiCarlo, 2019; Kriegeskorte & Douglas, 2018; Lake, Ullman, Tenenbaum, & Gershman, 2017).
However, many subtle issues exist when comparing humans and machines, which can substantially alter or even invert the conclusions of a study. To demonstrate these difficulties, we discuss and analyze three case studies:

1. Closed Contour Detection

Distinct visual elements can be grouped together by the human visual system to appear as a "form" or "whole", as described by the Gestalt principles of prägnanz or good continuation. As such, closed contours are thought to be prioritized by the human perceptual system and to be important in perceptual organization (Elder & Zucker, 1993; Koffka, 2013; Kovacs & Julesz, 1993; Ringach & Shapley, 1996; Tversky, Geisler, & Perry, 2004). Starting from the hypothesis that contour integration is difficult for DNNs, we here test how well humans and neural networks can separate closed from open contours. Surprisingly, we find that both humans and our DNN reach high accuracies (see also X. Zhang, Watkins, and Kenyon (2018)).
However, our further analyses reveal that our model performs this task in ways very different from humans and that it does not actually understand the concept of closedness. This case study highlights that several types of analyses are crucial to investigate the strategies learned by a machine model and to understand differences in inferential processes when comparing humans and machines.

2. Synthetic Visual Reasoning Test
The Synthetic Visual Reasoning Test (SVRT) (Fleuret et al., 2011) consists of problems that require abstract visual reasoning (cf. Figure 2A). Several studies compared humans against varying machine learning algorithms on these tasks (Ellis, Solar-Lezama, & Tenenbaum, 2015; Fleuret et al., 2011; J. Kim, Ricci, & Serre, 2018; Stabinger, Rodríguez-Sánchez, & Piater, 2016). A key result was that DNNs could solve tasks involving spatial arrangements of objects but struggled to learn the comparison of shapes (so-called same-different tasks). This led J. Kim et al. (2018) to argue that feedback mechanisms including attention and perceptual grouping would be key computational components underlying abstract visual reasoning. We show that the large divergence in task difficulty is fairly specific to the minimal networks chosen in the latter study, and that common feed-forward DNNs like ResNet-50 experience little to no difference in task difficulty under common settings. While certain differences do exist in the low-data regime, we argue that this regime is not suited for drawing conclusions about differences between human and machine visual systems, given the large divergence in prior visual experience and many other confounding factors like regularization or training procedures. In other words, care has to be taken when drawing general conclusions that reach beyond the tested architectures and training procedures.

3. Recognition Gap

Ullman et al. (2016) investigated the minimally necessary visual object information by successively cropping or reducing the resolution of a natural image until humans failed to identify the object. The study revealed that recognition performance dropped sharply if the minimal recognizable image patches were reduced any further. They refer to this drop in performance as the recognition gap. This recognition gap was much smaller in the tested machine vision algorithms, and the authors concluded that machine vision algorithms would not be able to "explain [humans'] sensitivity to precise feature configurations" (Ullman et al., 2016). In a similar study, Srivastava et al. (2019) identified "fragile recognition images" with a machine-based procedure and found a larger recognition gap for the machine algorithms than for humans. We here show that the differences in recognition gaps identified by Ullman et al. (2016) can at least in part be explained by differences in the experimental procedures for humans and machines, and that a large recognition gap does exist for our DNN. Put differently, this case study emphasizes that humans and machines should be exposed to equivalent experimental settings.

Related work
Comparative psychology and psychophysics have a long history of studying mental processes of nonhuman animals and performing cross-species comparisons. For example, they investigate what can be learned about human behavior and perception by examining model systems such as monkeys or mice and describe challenges of comparing different systems (Boesch, 2007; Haun, Jordan, Vallortigara, & Clayton, 2011; Koehler, 1943; Köhler, 1925; Romanes, 1883; Tomasello & Call, 2008). With the wave of excitement about DNNs as a new model of the human visual system, it may be worthwhile to transfer lessons from this long comparative tradition.
A growing body of work discusses this on a higher level. Majaj and Pelli (2018) provide a broad overview of how machine learning can help vision scientists study biological vision, while Barrett, Morcos, and Macke (2019) review methods for analyzing representations of biological and artificial networks. From the perspective of cognitive science, Cichy and Kaiser (2019) stress that Deep Learning models can serve as scientific models that provide not only helpful predictions and explanations but can also be used for exploration. Furthermore, from the perspective of psychology and philosophy, Buckner (2019) emphasizes often-neglected caveats when comparing humans and DNNs, such as human-centered interpretations, and calls for discussions regarding how to properly align machine and human performance. Chollet (2019) proposes a general Artificial Intelligence benchmark and suggests evaluating intelligence as "skill-acquisition efficiency" rather than focusing on skills at specific tasks.
In the following, we give a brief overview of studies that compare human and machine perception.
A great number of studies focus on manipulating tasks and/or models. Researchers often use generalization tests on data dissimilar to the training set (Wu et al., 2019; X. Zhang et al., 2018) to test whether machines understood the underlying concepts. In other studies, the degradation of object classification accuracy is measured with respect to image degradations (Geirhos, Temme, et al., 2018) or with respect to the type of features that play an important role for human or machine decision-making (Brendel & Bethge, 2019; Geirhos, Rubisch, et al., 2018; Kubilius, Bracci, & de Beeck, 2016; Ritter, Barrett, Santoro, & Botvinick, 2017; Ullman et al., 2016). A lot of effort is being put into investigating whether humans are vulnerable to small, adversarial perturbations in images (Dujmović, Malhotra, & Bowers, 2020; Elsayed et al., 2018; Han et al., 2019; Zhou & Firestone, 2019), as DNNs are shown to be (Szegedy et al., 2013). Similarly, in the field of Natural Language Processing, a trend is to manipulate the data set itself, by, for example, negating statements, to test whether a trained model gains an understanding of natural language or whether it only picks up on statistical regularities (McCoy, Pavlick, & Linzen, 2019; Niven & Kao, 2019).
Further work takes inspiration from biology or uses human knowledge explicitly in order to improve DNNs. Spoerer, McClure, and Kriegeskorte (2017) found that recurrent connections, which are abundant in biological systems, allow for higher object recognition performance than pure feed-forward networks, especially in challenging situations such as in the presence of occlusions. Furthermore, several researchers suggest (J. Kim et al., 2018; X. Zhang et al., 2018) or show (Barrett et al., 2018; Santoro et al., 2017; Wu et al., 2019) that designing network architectures or features with human knowledge is key for machine algorithms to successfully solve abstract (reasoning) tasks.
Despite a multitude of studies, comparing human and machine perception is not straightforward. An increasing number of studies assess other comparative studies: Dujmović et al. (2020), for example, show that human and computer vision are less similar than claimed by Zhou and Firestone (2019), as humans cannot truly decipher adversarial examples: their judgments depend on the experimental settings, specifically on the choice of stimuli and labels. Another example is the study by Srivastava et al. (2019), which performs an experiment similar to Ullman et al. (2016) but with swapped roles for humans and machines. In this case, a large recognition gap is found for machines but only a small one for humans.

Methods
In this section, we summarize the required data sets as well as the procedures for the three case studies: (1) Closed Contour Detection, (2) Synthetic Visual Reasoning Test, and (3) Recognition Gap. All code is available at https://github.com/bethgelab/notorious_difficulty_of_comparing_human_and_machine_perception.

Data sets
Closed Contour Detection We created a data set with images of size 256 × 256 px that each contained either one open or one closed contour, which consisted of 3 to 9 straight line segments, as well as several flankers with either one or two line segments (Figure 1A). The lines were black and the background was uniformly gray. More details on the stimulus generation can be found in Appendix A.1.
Additionally, we constructed 15 variants of the data set to test generalization performance (Figure 1A). Nine variants consisted of contours with straight lines. Six of these featured varying line styles like changes in line width (1, 2, 3) and/or line color (4, 5). For one variant (6), we increased the number of edges in the main contour. Another variant (7) had no flankers, and yet another variant (8) featured asymmetric flankers. For variant 9, the lines were binarized (only black or gray pixels instead of different gray tones).
In another six variants, the contours as well as the flankers were curved, meaning that we modulated a circle with a radial frequency function. The first four variants did not contain any flankers; in three of them, the main contour had a fixed size of 50 px (10), 100 px (11), or 150 px (12), and in the fourth (13), the contour was a dashed line. Finally, we tested the effect of different flankers by adding one additional closed, yet dashed contour (14) or one to four open contours (15).

Synthetic Visual Reasoning Test
The SVRT (Fleuret et al., 2011) consists of 23 different abstract visual reasoning tasks. We used the original C-code provided by Fleuret et al. (2011) to generate the images. The images had a size of 128 × 128 pixels. For each problem, we used up to 28,000 images for training, 5,600 images for validation and 11,200 images for testing.

Recognition Gap
We used two data sets for this experiment. One consisted of ten natural, color images whose grayscale versions were also used in the original study by Ullman et al. (2016). We discarded one image from the original data set as it does not correspond to any ImageNet class. For our ground truth class selection, please see Appendix C.3. The second data set consisted of 1000 images from the ImageNet (Deng et al., 2009) validation set. All images were pre-processed as in standard ResNet training (i.e., resizing to 256 × 256 pixels, cropping centrally to 224 × 224 pixels, and normalizing).
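For reference, this pipeline can be written with torchvision; a minimal sketch, where the normalization constants are assumed to be the standard ImageNet statistics (the text above only states that images were normalized):

```python
# Sketch of the preprocessing described above (torchvision).
# The normalization statistics are the common ImageNet values and
# are an assumption; the text only states that images were normalized.
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize((256, 256)),   # resize to 256 x 256 px
    transforms.CenterCrop(224),      # central 224 x 224 crop
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
```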

Closed Contour Detection
Fine-tuning and Generalization tests We fine-tuned a ResNet-50 (He, Zhang, Ren, & Sun, 2016), pre-trained on ImageNet (Deng et al., 2009), on the closed contour task. We replaced the last fully connected, 1000-way classification layer by a layer with only one output neuron to perform binary classification with a decision threshold of 0. The weights of all layers were fine-tuned using the Adam optimizer (Kingma & Ba, 2014) with a batch size of 64. All images were pre-processed to have the same mean and standard deviation and were randomly mirrored horizontally and vertically for data augmentation. The model was trained on 14,000 images for 10 epochs with a learning rate of 0.0003. We used a validation set of 5,600 images.
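A minimal PyTorch sketch of this setup; the loss function is not specified above, so a binary cross-entropy loss on the single logit is assumed here, and `train_loader` is a placeholder for the data pipeline:

```python
import torch
import torch.nn as nn
from torchvision import models

model = models.resnet50(pretrained=True)
model.fc = nn.Linear(model.fc.in_features, 1)  # one output neuron, threshold 0

optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)
criterion = nn.BCEWithLogitsLoss()  # assumed loss; logit > 0 means "closed"

for epoch in range(10):
    for images, labels in train_loader:  # batches of 64 images
        optimizer.zero_grad()
        logits = model(images).squeeze(1)
        loss = criterion(logits, labels.float())
        loss.backward()
        optimizer.step()
```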
To determine the generalization performance, we evaluated the model on the test sets without any further training. Each of the test sets contained 5,600 images. To account for the distribution shift between the original training images and the generalization tasks, we optimized the decision threshold (a single scalar) for each data set (see Appendix A.3).
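The threshold optimization (detailed in Appendix A.3) amounts to a simple line search; a sketch, where `logits` and `labels` stand for the model outputs and binary ground truth on a given generalization set:

```python
import numpy as np

def optimal_threshold(logits, labels, n_points=100):
    # interval containing 95% of all logits
    lo, hi = np.percentile(logits, [2.5, 97.5])
    candidates = np.linspace(lo, hi, n_points)
    accuracies = [np.mean((logits > t) == labels) for t in candidates]
    return candidates[int(np.argmax(accuracies))]
```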
Adversarial Examples Loosely speaking, an adversarial example is an image that, to humans, appears very similar to a correctly classified image, but is misclassified by a machine vision model. We used the python package foolbox (Rauber, Brendel, & Bethge, 2017) to find adversarial examples on the closed contour data set (parameters: CarliniWagnerL2Attack, max iterations=1000, learning rate=10e-3).
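A sketch of the attack, assuming the foolbox 2.x API (which matches the attack name and parameters above) and a two-class readout, since foolbox expects per-class logits:

```python
import foolbox

# wrap the trained PyTorch model; bounds depend on the preprocessing
fmodel = foolbox.models.PyTorchModel(model.eval(), bounds=(0, 1), num_classes=2)
attack = foolbox.attacks.CarliniWagnerL2Attack(fmodel)
# image: numpy array, label: ground-truth class index
adversarial = attack(image, label, max_iterations=1000, learning_rate=10e-3)
```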

BagNet-based Model and Heatmaps
We fine-tuned the weights of an ImageNet-pre-trained BagNet-33 (Brendel & Bethge, 2019). This network is a variant of ResNet-50 in which most 3 × 3 kernels are replaced by 1 × 1 kernels, so that the receptive field size at the top-most convolutional layer is restricted to 33 × 33 pixels. We replaced the final layer to map to one single output unit and used the RAdam optimizer (Liu et al., 2019) with an initial learning rate of 1 × 10^−4. The training images were generated on-the-fly, meaning that new images were produced for each epoch. In total, the fine-tuning lasted 100 epochs. Since BagNet-33 yields log-likelihood values for each 33 × 33 pixel patch in the image, which can be visualized as a heatmap, we could identify exactly how each patch contributed to the classification decision. Such a straightforward interpretation of the contributions of single image patches is not possible with standard DNNs like ResNet (He et al., 2016) due to their large receptive field sizes in the top layers.
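A sketch of how such a heatmap can be read out. It assumes the reference implementation from bethgelab/bag-of-local-features-models and that the convolutional trunk (here `trunk`, a placeholder) yields one feature vector per 33 × 33 patch, to which the final linear layer is applied before spatial averaging; the exact module names are an assumption:

```python
import torch
from bagnets.pytorchnet import bagnet33  # reference implementation

model = bagnet33(pretrained=True)
model.fc = torch.nn.Linear(model.fc.in_features, 1)  # single output, as above

def heatmap(trunk, fc, x):
    # trunk(x): (1, C, H', W'), one spatial position per 33 x 33 image patch
    with torch.no_grad():
        feats = trunk(x)
        logits = fc(feats.permute(0, 2, 3, 1))  # one logit per patch
    return logits[0, ..., 0]  # (H', W') map of patch-wise evidence
```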

Synthetic Visual Reasoning Test
For each of the SVRT problems, we fine-tuned the ResNet-50-based model (as described in Section 3.2.1). The same pre-processing, data augmentation, optimizer, and batch size as for the closed contour data set were used.
Varying Number of Training Images To fine-tune the models, we used subsets containing either 28,000, 1,000, or 100 images. The number of epochs depended on the size of the training set: the model was fine-tuned for 10, 280, or 2,800 epochs, respectively. For each training set size and SVRT problem, we used the best learning rate after a hyper-parameter search on the validation set, where we tested the learning rates [6 × 10^−5, 1 × 10^−4, 3 × 10^−4].
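The hyper-parameter search is a plain grid search; a sketch, with `fine_tune` and `validate` as placeholders for the training and evaluation loops described above:

```python
best_lr, best_acc = None, 0.0
for lr in [6e-5, 1e-4, 3e-4]:
    model = fine_tune(lr=lr)   # train on the given subset
    acc = validate(model)      # accuracy on the validation set
    if acc > best_acc:
        best_lr, best_acc = lr, acc
```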

Initialization with Random Weights
As a control experiment, we also initialized the model with random weights and again performed a hyper-parameter search over the learning rates [3 × 10^−4, 6 × 10^−4, 1 × 10^−3].

Recognition Gap
Model In order to evaluate the recognition gap, the model had to be able to handle small input images. With standard networks like ResNet (He et al., 2016), there is no clear way to do this. In contrast, BagNet-33 (Brendel & Bethge, 2019) allows one to straightforwardly analyze images as small as 33 × 33 pixels and hence was our model of choice for this experiment. For more details on BagNet-33, see Section 3.2.1.

Minimal recognizable images Similar to Ullman et al. (2016), we defined minimal recognizable images or configurations (MIRCs) as those patches of an image for which an observer - by which we mean an ensemble of humans or one or several machine algorithms - reaches ≥ 50% accuracy, but for which any additional 20% cropping of the corners or 20% reduction in resolution would lead to an accuracy < 50%. MIRCs are thus inherently observer-dependent. The original study only searched for MIRCs in humans. We implemented the following procedure to find MIRCs in our DNN: We passed each preprocessed image through BagNet-33 and selected the most predictive crop according to its probability (see Appendix C.2 on how we handled cases where the probability saturates at 100%, and Appendix C.1 for different treatments of ground-truth class selections). If the probability of the full-size image for the ground-truth class was ≥ 50%, we again searched for the 80% subpatch with the highest probability. We repeated this search procedure until the class probability for all subpatches fell below 50%. If an 80% subpatch would have been smaller than 33 × 33 pixels, which is BagNet-33's smallest natural patch size, the crop was increased to 33 × 33 pixels using bilinear sampling. We evaluated the recognition gap as the difference in accuracy between the MIRC and the best-performing sub-MIRC. This definition is more conservative than the one from Ullman et al. (2016), who considered the maximum difference between a MIRC and its sub-MIRCs. Please note that one difference between our machine procedure and the psychophysics experiment by Ullman et al. (2016) remained: the former was greedy, whereas the latter corresponded to an exhaustive search under certain assumptions.
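The greedy search can be summarized as follows; a sketch, where `descendants` is a placeholder enumerating the 80% crops (and the 20%-reduced-resolution version) of a patch and `prob` returns BagNet-33's probability for the ground-truth class:

```python
def find_mirc(image, prob, descendants, threshold=0.5):
    if prob(image) < threshold:
        return None, None            # image possesses no MIRC
    patch = image
    while True:
        best = max(descendants(patch), key=prob)
        if prob(best) < threshold:
            return patch, best       # MIRC and best-performing sub-MIRC
        patch = best                 # descend to the most predictive patch
```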

Closed Contour Detection
In this case study, we compared humans and machines on a closed contour detection task. For humans, a closed contour flanked by many open contours perceptually stands out. In contrast, detecting closed contours might be difficult for DNNs, as it presumably requires long-range contour integration.
Humans identified the closed contour stimulus very reliably in a two-interval forced choice task. Specifically, participants achieved a performance of 88.39% (SEM = 2.96%) on stimuli whose generation procedure was identical to the training set. For stimuli with white instead of black lines, the performance was 90.52% (SEM = 1.58%). The psychophysical experiment is described in Appendix A.2.
Our ResNet-50-based model also performed well on the closed contour task, reaching an accuracy of 99.95% on the test set (cf. Figure 1A).

To gain a better understanding of the strategies and features used by our ResNet-50-based model to solve the task, we performed three additional experiments: First, we tested how well the model generalized to modifications of the data set such as different line widths. Second, we looked at the minimal modifications necessary to flip the decision of our model. And third, we employed a BagNet-33-based model to understand whether the task could be solved without global contour integration.

Generalization We found that our trained model generalized well to many but not all modified stimulus sets (cf. Figure 1A and B). Despite the severe transition from straight-lined polygons in the training data to curvy contours in the test sets, the model generalized perfectly to curvy contours (11) as long as the contour remained below a diameter of 100 px. Also, adding a dashed, closed contour (14) as a flanker did not lower performance. The classification ability of the model remained similarly high for the no flankers (7) and the asymmetric flankers condition (8). When testing our model on main contours that consisted of more edges than the ones presented during training (6), the performance was also hardly impaired. It remained high as well when multiple curvy open contours were added as flankers (15).
The following variations seemed more difficult for our model: If the size of the contour became too large, a moderate drop in accuracy was observed (12). For binarized images, our model's performance was also reduced (9). And finally, (almost) chance performance was observed when varying the line width (1, 2, 3), when changing the line color (4, 5), or when using dashed curvy lines (13).

Minimal adversarial modifications
We found that small changes to the image, which are hardly recognizable to humans, were sufficient to change the decision of the model (Figure 1B). These small changes did not alter the perception of the contours for humans, suggesting that the model does not use the same features as humans to classify closed contours.
BagNet A BagNet-33-based model, which by construction cannot integrate contours larger than 33 × 33 pixels, still reached close to 90% performance. In other words, contour integration was not necessary to perform well on the task. The heatmaps of the model (cf. Figure 1C) further reveal which local image patches provided evidence for or against the presence of a closed contour.

Synthetic Visual Reasoning Test
For each SVRT subtask, we fine-tuned a pre-trained ResNet-50-based model on 28,000 training images (in contrast to one million images as used by J. Kim et al. (2018)) and reached above 90% accuracy on all sub-tasks, including tasks that required same-different judgments (Figure 2B). This finding is contrary to the original result by J. Kim et al. (2018), which showed a gap of around 33% between same-different and spatial reasoning tasks.
Our model's performance on the test set decreased when we reduced the number of training images. In particular, we found that the performance on same-different tasks dropped more rapidly than on spatial reasoning tasks. If the ResNet-50 was trained from scratch (i.e., weights were randomly initialized instead of loaded from pre-training on ImageNet), the performance dropped only slightly on all but one spatial reasoning task. Larger drops were found on same-different tasks.

Recognition Gap
We tested our model on machine-selected minimal recognizable patches (MIRCs) to evaluate the recognition gap. In this setting, in which the MIRCs were selected by the machine algorithm itself rather than by humans, our DNN did exhibit a large recognition gap.

Discussion
We examined three case studies comparing human and machine visual perception. Each case study illustrates a potential pitfall in these comparisons.

Closed Contour Detection - Human-biased judgment might lead to wrong conclusions
We find that both humans and our ResNet-50-based model can reliably tell apart images containing a closed contour from images containing only open contours. However, our generalization tests, adversarial examples and BagNet analyses indicate that local features are sufficient to solve the task. Whether the ResNet-50-based model exploits the same local shortcut is unclear, but the shortcut's existence highlights the possibility that our previously stated assumption, namely that this task would only be solvable with contour integration, is misleading. In fact, as humans, we might easily miss the many statistical subtleties by which a given task could be solved. In this respect, BagNets proved to be a useful tool to test purportedly "global" visual tasks for the presence of local artifacts.
Altogether, we applied three methods to analyze the classification process adopted by a machine learning model in this case study: (1) testing the generalization of the model to non-i.i.d. data sets involving the same visual inference task; (2) generating adversarial example images; and (3) training and testing a model architecture (BagNet) that is designed to be interpretable. These techniques provide complementary ways to investigate the strategies learned by a machine learning model and to better understand differences in inferential processes compared to humans. To avoid premature conclusions about what models did and did not learn, we advocate for the routine use of such analysis techniques.

Synthetic Visual Reasoning Test - Generalizing conclusions from specific architectures and training procedures is difficult
Previous studies (J. Kim et al., 2018;Stabinger et al., 2016) explored how well deep neural networks can learn visual relations by testing them on the Synthetic Visual Reasoning Test (Fleuret et al., 2011).
Both studies found a dichotomy between two task categories: While a high accuracy was reached on spatial problems, the performance on same-different problems was poor. In order to compare the two types of tasks more systematically, J. Kim et al. (2018) developed a parameterized version of the SVRT data set called PSVRT. Using this data set, they found that for same-different problems, an increase in the complexity of the data set could quickly strain their model. From these results, the authors concluded that same-different problems would be more difficult to learn than spatial problems. More generally, these papers have been perceived and cited with the broader claim that feed-forward DNNs are not able to learn same-different relationships between visual objects (Schofield, Gilchrist, Bloj, Leonardis, & Bellotto, 2018; Serre, 2019).
The previous findings of J. Kim et al. (2018) were based on rather small neural networks: they consisted of up to six layers. However, typical network architectures used for object recognition consist of more layers and have larger receptive fields. When testing a representative of such DNNs, namely ResNet-50, we find that feed-forward models can in fact perform well on same-different tasks (see also the concurrent work of Messina, Amato, Carrara, Falchi, and Gennaro (2019)). In total, we used fewer images (28,000) than J. Kim et al. (2018) (one million) and Messina et al. (2019) (400,000) to train the model. Although our experiments in the very low-data regime (with 1,000 samples) show that same-different tasks require more training samples than spatial reasoning tasks, this cannot be taken as evidence for systematic differences between feed-forward neural networks and the human visual system. In contrast to the neural networks used in this experiment, the human visual system has extensive prior experience with abstract visual reasoning tasks, making the low-data regime an unfair testing scenario from which it is almost impossible to draw solid conclusions about differences in internal information processing. In other words, it might very well be that the human visual system, trained from scratch on the two types of tasks, would exhibit a similar difference in sample efficiency as a ResNet-50.
Furthermore, the performance of a network in the low-data regime is heavily influenced by many factors other than architecture, including regularization schemes or the optimizer, making it even more difficult to reach conclusions about systematic differences in the network structure between humans and machines.

Recognition Gap - Humans and machines should be exposed to equivalent experimental settings

Ullman et al. (2016) showed that humans are sensitive to small changes in minimal images. More precisely, humans exhibit a large recognition gap between minimal recognizable images - so-called MIRCs - and sub-MIRCs. For machine algorithms, in contrast, these authors identified only a small recognition gap. However, they tested machines on the patches found in humans, despite the fact that the very definition of MIRCs is inherently observer-dependent. This means that MIRCs look different depending on whether an ensemble of humans or one or several machine algorithms selects them. Put another way, an observer will likely rely on different features for recognition and will thus have a lower recognition rate, and hence a smaller recognition gap, on MIRCs identified by a different observer. The same argument applies to a follow-up study (Srivastava et al., 2019), which selected "fragile recognition images" (defined similarly but not identically to the human-selected MIRCs of Ullman et al. (2016)) in machines and found a moderately high recognition gap for machines but a low one for humans. Unfortunately, the selection procedures used in Ullman et al. (2016) and Srivastava et al. (2019) were not equivalent for humans and machines.

These results highlight the importance of testing humans and machines on the exact same footing and of avoiding a human bias in the experiment design. All conditions, instructions and procedures should be as close as possible between humans and machines in order to ensure that all observed differences are due to inherently different decision strategies rather than differences in the testing procedure.

Conclusion
We described notorious difficulties that arise when comparing humans and machines. Our three case studies illustrated that confirmation bias can lead to misinterpreting results, that generalizing conclusions from specific architectures and training procedures is difficult, and finally that unequal testing procedures can confound decision behaviors. Addressing these shortcomings altered the conclusions of previous studies. We showed that, despite their ability to solve closed-contour tasks, our neural networks use different decision-making strategies than humans. In addition, there is no fundamental difference between same-different and spatial tasks for common feed-forward neural networks, and they do experience a "recognition gap" on minimal recognizable images.
The overarching challenge in comparison studies between humans and machines seems to be the strong internal human interpretation bias. Not only our expectations of whether or how a machine algorithm might solve a task, but also the human reference point, can confound what we read into results. Appropriate analysis tools and extensive cross-checks - such as variations in the network architecture, alignment of experimental procedures, generalization tests, adversarial examples and tests with constrained networks - help to rationalize the interpretation of findings and put this internal bias into perspective. All in all, care has to be taken to not impose our human systematic bias when comparing human and machine perception.

Author contributions

Appendix A Closed Contour Detection

A.1 Stimulus Generation

A.1.1 Line-drawing with Straight Lines as Main Contour

The main contour was generated as illustrated in Figure 4A. To get a contour with n edges, we generated n points, each defined by a randomly sampled angle α_n and a randomly sampled radius r_n (between 0 and 128 px). By connecting the resulting points, we obtained the closed contour. We used the python PIL library (PIL 5.4.1, python3) to draw the lines that connect the endpoints.

For the corresponding open contour, we sampled two radii for one of the angles such that they had a distance of 20 px to 50 px from each other. When connecting the points, a gap was created between the points that share the same angle. This generation procedure could allow for very short lines with edges very close to each other. To avoid this, we excluded all shapes with corner points closer than 10 px to non-adjacent lines.

The position of the main contour was random, but we ensured that the contour did not extend over the border of the image.
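A sketch of this generation procedure using numpy and PIL; the opening of the gap, the rejection of degenerate shapes and the flanker placement are omitted, and the gray value is taken from the background described in this appendix:

```python
import numpy as np
from PIL import Image, ImageDraw

def straight_closed_contour(n_edges, size=256, max_radius=128):
    # n points in polar coordinates, sorted by angle to avoid self-intersections
    angles = np.sort(np.random.uniform(0, 2 * np.pi, n_edges))
    radii = np.random.uniform(0, max_radius, n_edges)
    cx = cy = size // 2  # actual placement is random within the image
    points = [(cx + r * np.cos(a), cy + r * np.sin(a))
              for a, r in zip(angles, radii)]
    img = Image.new("RGB", (size, size), (118, 118, 118))  # uniform gray
    ImageDraw.Draw(img).polygon(points, outline="black")
    return img
```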

A.1.2 Line-drawing with Curvy Lines as Main Contour
For some of the generalization sets, the contours consisted of curvy instead of straight lines. These were generated by modulating a circle of a given radius r_c with a radial frequency function defined by two sinusoidal functions. The radius of the contour as a function of the polar angle φ was thus given by

r(φ) = r_c + A_1 sin(f_1 φ + θ_1) + A_2 sin(f_2 φ + θ_2),

with frequencies f_1 and f_2 (integers between 1 and 6), amplitudes A_1 and A_2 (random values between 15 and 45) and phases θ_1 and θ_2 (between 0 and 2π). Unless stated otherwise, the diameter (diameter = 2 r_c) was a random value between 50 and 100 px, and the contour was positioned in the center of the image. The open contours were obtained by removing a circular segment of size φ_o = π/3 at a random phase (see Figure 4B).
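A sketch of this construction; for open contours, a random circular segment of size π/3 is skipped before evaluating the radius:

```python
import numpy as np

def radial_frequency_contour(r_c, closed=True, n_samples=1000):
    f1, f2 = np.random.randint(1, 7, size=2)          # frequencies
    A1, A2 = np.random.uniform(15, 45, size=2)        # amplitudes
    t1, t2 = np.random.uniform(0, 2 * np.pi, size=2)  # phases
    phi = np.linspace(0, 2 * np.pi, n_samples)
    if not closed:
        start = np.random.uniform(0, 2 * np.pi)       # remove a pi/3 segment
        phi = phi[(phi - start) % (2 * np.pi) > np.pi / 3]
    r = r_c + A1 * np.sin(f1 * phi + t1) + A2 * np.sin(f2 * phi + t2)
    return r * np.cos(phi), r * np.sin(phi)           # contour coordinates
```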
For two of the generalization data sets, we used dashed contours, which were obtained by masking out 20 equally distributed circular segments, each of size φ_d = π/20.

A.1.3 More Details on Generalization Data Sets
As described in the methods (Section 3.1), we used 15 variants of the data set as generalization data sets. Here, we provide more details on some of them. Black-white-black lines (5): two black lines enclosed a white line in the middle. Each of these three lines had a thickness of 1.5 px, which resulted in a total thickness of 4.5 px.

A.2 Psychophysical Experiment: Closed Contour Detection
To estimate how well humans would be able to distinguish closed and open stimuli, we performed a psychophysical experiment in which observers reported which of two sequentially presented images contained a closed contour (two-interval forced choice ("2-IFC") task).

A.2.1 Stimuli
The images of the closed contour data set were used as stimuli for the psychophysical experiments.
Specifically, we used the images from the test sets that were used to evaluate the performance of the models. For our psychophysical experiments, we used two different conditions: the images contained either black (i.i.d. to the training set) or white contour lines. The latter was one of the generalization test sets.

A.2.4 Procedure
On each trial, one closed and one open contour stimulus were presented to the observer (cf. Figure 5 A).
The images used for each trial were randomly picked, but we ensured that the open and closed images shown in the same trial were not the ones that were almost identical to each other (see "Generation of Image Pairs" in Appendix A.1). Thus, the number of edges of the main contour could differ between the two images shown in the same trial. Each image was shown for 100 ms, separated by a 300 ms interstimulus interval (blank gray screen). We instructed the observer to look at the fixation spot in the center of the screen. The observer was asked to identify whether the image containing a closed contour appeared first or second. The observer had 1200 ms to respond and was given feedback after each trial.
The inter-trial interval was 1000 ms. Each block consisted of 100 trials, and observers performed five blocks. Trials with different line colors and varying background images (contrasts including 0, 0.4 and 1) were blocked. Here, we only report the results for black and white lines with background contrast 0. Before the first block with a new line color, observers performed a practice session of 48 trials with the corresponding line color.

A.3 Optimized Decision Criterion
In our tests of generalization, poor accuracy could simply result from a sub-optimal decision criterion rather than from the network being unable to tell the stimuli apart. To find the optimal threshold for each data set, we subdivided the interval in which 95% of all logits lie into 100 points and picked the threshold that led to the highest performance.

A.4 Additional Experiment: Increasing the Task Difficulty by Adding a Background Image

We performed an additional experiment in which we tested whether the model would become more robust, and thus generalize better, if we trained it on a more difficult task. This was achieved by adding an image to the background, such that the model had to learn to separate the lines from the task-irrelevant background.

Varying Contrast of Background An image from the ImageNet data set was added as background to the line drawing. We converted the image into LAB color space and linearly rescaled its pixel intensities to produce a normalized contrast value between 0 (gray image with the RGB values [118, 118, 118]) and 1 (original image) (cf. Figure 6A).
In our experiment, we fine-tuned our ResNet-50-based model on images with a background image of a uniformly sampled contrast. For each data set, we evaluated the model separately on six discrete contrast levels {0, 0.2, 0.4, 0.6, 0.8, 1} (cf. Figure 6A). We found that the generalization performance did not increase substantially compared to the experiment in the main body (cf. Figure 6B).

Appendix B SVRT
B.1 Model Accuracy on the Individual Problems

Figure 7 shows the accuracy of the models for each problem of the SVRT data set. Problem 8 is a mixture of a same-different task and a spatial task; in Figure 2, this problem was assigned to the spatial tasks.

C.1 Analysis of Different Class Selections and Different Number of Descendants
Treating the ten stimuli from Ullman et al. (2016) in our machine algorithm setting required two design choices: We needed to pick suitable ground truth classes from ImageNet for each stimulus as well as to choose if and how to combine them. The former is subjective, and we used relationships from the WordNet hierarchy (Miller, 1995) to guide the selection; for the latter, we compared single ground-truth classes to joint classes, and stride-1 crops to corner crops. Finally, besides analyzing the recognition gap, we also analyzed the sizes of MIRCs and the fractions of images that possess MIRCs for the mentioned conditions. Figure 8A shows that all options result in similar values for the recognition gap. The trend of smaller MIRC sizes for stride-1 compared to four corner crops shows that the search algorithm can find even smaller MIRCs when all crops are possible descendants (cf. Figure 8B). The final analysis of how many images possess MIRCs (cf. Figure 8C) shows that our model cannot reliably classify three out of the nine stimuli from Ullman et al. (2016); this can partly be traced back to the oversimplification of single-class attribution in ImageNet as well as to the overconfidence of deep learning classification algorithms (Guo, Pleiss, Sun, & Weinberger, 2017): they often attribute a lot of evidence to one class, while the remaining ones share only very little evidence.

C.2 Selecting Best Crop when Probabilities Saturate
We observed that many crops had very high probabilities and therefore used the "logit"-measure logit(p(x_c)) (Ashton, 1972), where p(x_c) is the probability of the correct class c. Note that this measure is different from what the deep learning community usually refers to as "logits", which are the values before the softmax layer. The logit logit(p(x_c)) is monotonic w.r.t. the class probabilities, meaning that the higher the probability p(x_c), the higher the logit logit(p(x_c)). However, while p(x_c) saturates at 100%, logit(p(x_c)) is unbounded and thus yields a more sensitive discrimination measure between image patches whose probabilities all saturate at p(x_c) ≈ 1.
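This measure can be computed in a numerically stable way directly from the network outputs x_i, without evaluating the saturating probability first (see the derivation below); a sketch:

```python
import numpy as np
from scipy.special import logsumexp

def class_logit(x, c):
    # logit(p(x_c)) = x_c - log(sum_{i != c} e^{x_i}), stable via logsumexp
    others = np.delete(x, c)
    return x[c] - logsumexp(others)
```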
This is a short derivation for the logit logit(p(x_c)): The probability of the correct class c can be obtained by plugging the logits x_i into the softmax formula:

p(x_c) = e^{x_c} / Σ_i e^{x_i}.

Since we are interested in the probability of the correct class, it holds that p(x_c) ≠ 0. Thus, in the regime of interest, we can invert both sides of the equation. After simplifying, we get:

1 / p(x_c) − 1 = Σ_{i≠c} e^{x_i − x_c}.

And finally, when taking the negative logarithm on both sides, we obtain:

logit(p(x_c)) = log( p(x_c) / (1 − p(x_c)) ) = −log( Σ_{i≠c} e^{x_i − x_c} ).

The above formulations regarding one correct class also hold when adjusting the experimental design to accept several classes as correct predictions. In brief, with C denoting the set of correct classes, the logit logit(p(x_C)) then reads:

logit(p(x_C)) = −log( Σ_{i∉C} e^{x_i} / Σ_{c∈C} e^{x_c} ).

C.3 Selection of ImageNet Classes for Stimuli of Ullman et al. (2016)

Note that this selection is different from the one used by Ullman et al. (2016). We went through all classes for each image and selected the ones that we considered sensible. The tenth image, the eye, does not have a sensible ImageNet class; hence, only nine stimuli from Ullman et al. (2016) were included in our analysis.