ImageNet-trained deep neural network exhibits illusion-like response to the Scintillating Grid

Deep neural network (DNN) models for computer vision are now capable of human-level object recognition. Consequently, similarities in the performance and vulnerabilities of DNN and human vision are of great interest. Here we characterize the response of the VGG-19 DNN to images of the Scintillating Grid visual illusion, in which white dots are perceived to be partially black. We observed a significant deviation from the expected monotonic relation between VGG-19 representational dissimilarity and dot whiteness in the Scintillating Grid. That is, a linear increase in dot whiteness leads to a non-linear increase and then, remarkably, a decrease (non-monotonicity) in representational dissimilarity. In control images, mostly monotonic relations between representational dissimilarity and dot whiteness were observed. Furthermore, the dot whiteness level corresponding to the maximal representational dissimilarity (i.e. onset of non-monotonic dissimilarity) matched closely with that corresponding to the onset of illusion perception in human observers. As such, the non-monotonic response in the DNN is a potential model correlate for human illusion perception.

Non-monotonicity of non-normalized representational dissimilarity in Scintillating Grid and controls. (a-h) Non-normalized representational dissimilarity as a function of disk luminance for the same experiments described in Fig. 2.

Figure S3: Deviation from monotonicity when using L2 representational dissimilarities. In all other analyses in this work, we measured representational dissimilarity as the L1 distance between two VGG-19 representations. The L2 distance (Euclidean norm) is another way to quantify representational distance (see Methods). Panels (a-c) follow Fig. 3, showing the results when representational distances are measured using the L2 metric instead of the L1 metric. The L2 results were similar to the L1 results (compare with Fig. 3): the Scintillating Grid images showed the greatest non-monotonicity compared with the reduced-illusion and non-illusion control images.

Figure S4: Deviation from monotonicity for the weight-scrambled model. Panels (a-c) follow Fig. 3, showing the results when model connection weights were randomly permuted within each layer. Unlike the results with the ImageNet-trained VGG-19 (Fig. 3), these results show deviations from monotonicity for all stimulus sets, i.e. the deviations are not specific to the Scintillating Grid images that cause illusion perception in humans.

Figure S8: Representational dissimilarity with disks changed from white to middle gray to black. The changes progress with each of the white disks being changed to middle gray (µ = 0.5) and then each of the middle gray disks being changed to black.
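The two distance measures named in the Figure S3 caption, and the within-layer weight permutation named in the Figure S4 caption, can be illustrated with a minimal NumPy sketch. The function names are hypothetical, and treating a representation as a flattened activation array is an assumption made here for illustration; the exact procedure is the one given in the paper's Methods.

```python
import numpy as np


def l1_dissimilarity(rep_a, rep_b):
    """L1 distance: sum of absolute differences between two representations."""
    a = np.asarray(rep_a, dtype=float).ravel()
    b = np.asarray(rep_b, dtype=float).ravel()
    return float(np.abs(a - b).sum())


def l2_dissimilarity(rep_a, rep_b):
    """L2 (Euclidean) distance between two representations."""
    a = np.asarray(rep_a, dtype=float).ravel()
    b = np.asarray(rep_b, dtype=float).ravel()
    return float(np.sqrt(((a - b) ** 2).sum()))


def permute_within_layer(weights, rng):
    """Return a copy of one layer's weights with all entries randomly permuted.

    Assumption: "permuted within layer" is read as shuffling every entry of a
    single layer's weight array while keeping the array shape.
    """
    w = np.asarray(weights)
    return rng.permutation(w.ravel()).reshape(w.shape)
```

Both distances are zero for identical representations and grow as the representations diverge; the L1 distance weights all differences linearly, while the L2 distance emphasizes large differences.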

APPENDIX B: CONTROLLING FOR A NON-MONOTONICITY DUE TO CONTRAST
As seen in the main Results section, the relation between representational dissimilarity and disk luminance in the Scintillating Grid shows a peak in deviation magnitudes around µ = 0.5, which is the cause of the observed non-monotonicity (Fig. 2a). A plausible explanation of this peak is that around µ = 0.5 the disk luminance is most similar to the bar luminance, leading to a loss of contrast and/or disk contours and hence to maximal dissimilarity with the reference image, which has intact disk contrast and contour. To dissociate this hypothesis from the predictions based on illusion perception, we considered manipulations of disk luminance in images with a gray (luminance 0.5) background and no bars (Fig. S10a). In these Gray Background images, there is no illusion effect when µ = 1, and there is complete loss of shape/contrast when µ = 0.5 in the absence of a disk border. Therefore, if the non-monotonicity is mediated by shape/contrast loss, we expect stronger non-monotonicity for the Gray Background than for the Scintillating Grid. If, however, the non-monotonicity indicates an illusion-like response, we expect the opposite. In this manipulation we also varied the width of the border around the disks: 0, 1, 2, or 3 pixels (in the main text, all experiments used a one-pixel disk border). With increased border width, there should be less loss of shape around µ = 0.5. This is consistent with the results, which show a sharp peak in the representational dissimilarity at µ = 0.5 for Gray Background images with no disk border and a less pronounced peak for larger border widths (see Fig. S9). Consequently, the manipulation of border width provides an additional tool to dissociate between the illusion perception and the shape loss hypotheses.
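The logic of the Gray Background control can be made concrete with a small stimulus sketch. Everything beyond the gray (0.5) background, the disk luminance µ, and the border width in pixels is an assumption for illustration: the image size, disk spacing, disk diameter, black border value, and the choice to draw the border outside the disk are hypothetical and not taken from the paper.

```python
import numpy as np


def gray_background_image(mu, border_px=1, size=64, spacing=16,
                          disk_diameter=9, background=0.5, border_value=0.0):
    """Draw a grid of disks with luminance `mu` on a uniform gray background.

    Assumed (hypothetical) layout: square image, disks centered on a regular
    grid, and a ring of `border_px` pixels with luminance `border_value`
    drawn just outside each disk.
    """
    img = np.full((size, size), background, dtype=float)
    yy, xx = np.mgrid[0:size, 0:size]
    radius = disk_diameter / 2.0
    centers = np.arange(spacing // 2, size, spacing)
    for cy in centers:
        for cx in centers:
            dist = np.hypot(yy - cy, xx - cx)
            if border_px > 0:
                # Paint the border ring first, then the disk on top of it.
                img[dist <= radius + border_px] = border_value
            img[dist <= radius] = mu
    return img


# With no border, a disk of luminance 0.5 vanishes into the background.
blank = gray_background_image(0.5, border_px=0)
# With a border, the disk outline survives even at mu = 0.5.
outlined = gray_background_image(0.5, border_px=1)
```

Under these assumptions the unbordered image at µ = 0.5 is a uniform gray field, which is the complete shape/contrast loss the control is designed to produce, while any non-zero border keeps the disk contours visible.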

Gray Background images with no disk border produced significantly larger deviation magnitudes than the corresponding images with the standard one-pixel disk border in both VGG-19 and ResNet-101 (see gray line in Fig. S10bc). This suggests that the presence of a disk border greatly reduces the effect of contrast or shape loss. In comparison, the Scintillating Grid images produced slightly higher deviation magnitudes in the presence of the one-pixel disk border (see green line in Fig. S10bc).

This difference in the dependence of deviation magnitude on disk border indicates that the non-monotonic deviation observed for the Scintillating Grid cannot be fully explained by loss of contrast or shape contours (as in the Gray Background control). Furthermore, these results suggest that a one-pixel disk border is able to preserve the illusion-like representation in DNNs while greatly reducing the non-monotonic effect of contrast or shape loss.

Figure S9: Mean representational dissimilarity for the Gray Background control image set when disk border width is (a) 0 pixels, (b) 1 pixel, (c) 2 pixels, (d) 3 pixels. Insets show example Gray Background images with µ = 0. A sharply non-monotonic peak is observed near µ = 0.5 for the unbordered gray background image in panel a, and the peak flattens with increasing disk border width. Shaded region corresponds to the interquartile range. Note that the y-axis scaling differs between panels.

... in the Gray Background than in the Scintillating Grid. For VGG-19 (Fig. S11a), the Gray Background showed deviation magnitudes that were first positive at layer conv3_1 and soon reached a maximum at conv3_3, while the Scintillating Grid was first positive at conv3_2, with a gradual increase thereafter (we refer to the layer outputs after ReLU). Importantly, the conv3_1, conv3_2, and conv3_3 convolutions have receptive field sizes of 24×24, 32×32, and 40×40 pixels, respectively. Therefore, considering that the disks used were 9, 11, or 13 pixels in size, the initial increase for both Gray Background and Scintillating Grid occurs at receptive field sizes of roughly double the disk size. For ResNet-101, measured deviation magnitudes roughly resembled those of VGG-19 (compare Fig. S11a to Fig. S11c).
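The receptive field sizes quoted above can be checked with the standard recurrence for stacked convolution and pooling layers: each layer grows the receptive field by (kernel − 1) times the accumulated stride. The sketch below assumes the usual VGG-19 layout (3×3 convolutions with stride 1, 2×2 max pooling with stride 2); the layer list and function name are written for illustration.

```python
# VGG-19 layers up to conv3_3 as (name, kernel_size, stride).
VGG19_PREFIX = [
    ("conv1_1", 3, 1), ("conv1_2", 3, 1), ("pool1", 2, 2),
    ("conv2_1", 3, 1), ("conv2_2", 3, 1), ("pool2", 2, 2),
    ("conv3_1", 3, 1), ("conv3_2", 3, 1), ("conv3_3", 3, 1),
]


def receptive_fields(layers):
    """Return the receptive field size (in input pixels) after each layer."""
    rf = 1    # receptive field of a single input pixel
    jump = 1  # distance in input pixels between adjacent units
    sizes = {}
    for name, kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
        sizes[name] = rf
    return sizes


sizes = receptive_fields(VGG19_PREFIX)
for name in ("conv3_1", "conv3_2", "conv3_3"):
    print(name, sizes[name])
```

For the 11-pixel disk, the 24-pixel receptive field at conv3_1 is a little over twice the disk size, which matches the "roughly double" relation described in the text.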

For the one-pixel disk border images, the relationship is reversed: the Scintillating Grid images have higher deviation magnitudes than the Gray Background images (Fig. S11bd). In both DNNs, the deviation magnitude of one-pixel border images gradually increases across deeper computational stages for Scintillating Grid images and is constantly low for Gray Background images. For VGG-19 (Fig. S11b), the first layer to have positive deviation magnitude was conv3_2 for Scintillating Grid and conv3_1 for Gray Background (for the border-absent case). The computational stage with maximum deviation magnitude was conv5_2 for Scintillating Grid and conv3_3 for Gray Background, similar to the border-absent case. High deviation magnitudes for the Gray Background were found almost exclusively at the convolutions with receptive field sizes two to four times the disk size, arguably in line with contrast detectors. For ResNet-101, results were roughly similar to VGG-19 (compare Fig. S11b to Fig. S11d).

These differences in the deviation magnitudes across the hierarchy of DNN computational stages suggest that the one-pixel disk border significantly reduces the contribution of contrast and shape loss to deviation magnitude and either preserves or augments the illusion-specific deviation magnitude. Altogether, these findings support the conclusion that the observed non-monotonicity in DNN representations of the Scintillating Grid cannot be fully explained by contrast/shape loss.

In this experiment, the accuracy saturated for most tasks for µ < 0.4 (see Fig. S12bc). As a result, we focused on the median response times for correct discrimination trials as a proxy for representational distance (i.e. longer response times correspond to greater discrimination difficulty and thus smaller representational distance from the unaltered Scintillating Grid). For both subjects, the response times typically decreased monotonically with decreased disk luminance in the No Bars control experiment (Fig. S12e). In the Scintillating Grid experiment, response times varied clearly non-monotonically with disk luminance for one subject and showed no significant non-monotonicity for the other subject. In the latter case, the lack of non-monotonicity is likely due to a low signal-to-noise ratio resulting from the low difficulty of the discrimination task (as evidenced by the saturating response time). To resolve these effects better, we computed the normalized response ...

... Grid, unaltered No Bars, altered No Bars. In plots (b-c), the points correspond to accuracy at each luminance level; in plots (d-e), they correspond to the median response times for correct answers at each luminance level. (f) Normalized response time difference between Scintillating Grid and No Bars tasks across different luminance levels. In all plots, the lines correspond to smoothed trends and the shaded regions correspond to the smoothed interquartile range.
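The response-time summary used in this experiment, the median response time over correct trials at each luminance level, can be sketched as follows. The trial format (a tuple of luminance, correctness flag, and response time) and the function name are hypothetical choices for illustration, and the normalization step is not shown because its definition is not recoverable from the text above.

```python
from collections import defaultdict
from statistics import median


def median_correct_rt(trials):
    """Median response time of correct trials, grouped by luminance level.

    `trials` is an iterable of (luminance, correct, response_time) tuples;
    this record layout is an assumption made for the example.
    """
    by_level = defaultdict(list)
    for luminance, correct, response_time in trials:
        if correct:
            by_level[luminance].append(response_time)
    return {level: median(rts) for level, rts in sorted(by_level.items())}


# Made-up example trials, for illustration only.
example_trials = [
    (0.2, True, 0.5), (0.2, True, 1.0), (0.2, False, 2.0),
    (0.4, True, 0.9),
]
summary = median_correct_rt(example_trials)
```

Restricting the median to correct trials keeps error trials, which may reflect guessing, from distorting the response-time estimate at luminance levels where accuracy is not saturated.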