A neural mechanism for contextualizing fragmented inputs during naturalistic vision

With every glimpse of our eyes, we sample only a small and incomplete fragment of the visual world, which needs to be contextualized and integrated into a coherent scene representation. Here we show that the visual system achieves this contextualization by exploiting spatial schemata, that is, our knowledge about the composition of natural scenes. We measured fMRI and EEG responses to incomplete scene fragments and used representational similarity analysis to reconstruct their cortical representations in space and time. We observed a sorting of representations according to the fragments' place within the scene schema, which occurred during perceptual analysis in the occipital place area and within the first 200 ms of vision. This schema-based coding operates flexibly across visual features (as measured by a deep neural network model) and different types of environments (indoor and outdoor scenes). This flexibility highlights the mechanism's ability to efficiently organize incoming information under dynamic real-world conditions.


IMPACT STATEMENT
In scene-selective occipital cortex and within 200ms of processing, visual inputs are sorted according to their typical spatial position within a scene.

INTRODUCTION
During natural vision, the brain continuously receives incomplete fragments of information that need to be integrated into meaningful scene representations. Here, we propose that this integration is achieved through contextualization: the brain uses prior knowledge about where information typically appears in a scene to meaningfully sort incoming information.

A format in which such prior knowledge about the world is represented in the brain is provided by schemata. First introduced to philosophy to explain how prior knowledge enables perception of the world (Kant, 1781), schemata were later adapted by psychology (Bartlett, 1932; Piaget, 1926) and computer science (Minsky, 1975) as a means to formalize mechanisms enabling natural and artificial intelligence, respectively. In the narrower context of natural vision, scene schemata represent knowledge about the typical composition of real-world environments (Mandler, 1984).

We tested two hypotheses about this sorting process. First, we hypothesized that this sorting occurs during perceptual scene analysis, which can be spatiotemporally pinpointed to scene-selective cortex (Baldassano et al., 2016; Epstein, 2014). Second, because schema effects (Mandler and Parker, 1976) are more robustly observed along the vertical dimension, where the scene structure is more rigid (i.e., the sky is almost always above the ground), we hypothesized that the cortical sorting of information should primarily occur along the vertical dimension.

To test these hypotheses, we used a novel visual paradigm in which participants were exposed to fragmented visual inputs, and recorded fMRI and EEG data to resolve brain activity in space and time.

In our study, we experimentally mimicked the fragmented nature of naturalistic visual inputs by dissecting scene images into position-specific fragments. Six natural scene images (Fig.
1a) were each split into six equally-sized fragments (3 vertical × 2 horizontal), resulting in 36 conditions (6 scenes × 6 fragments). In separate fMRI (n=30) and EEG (n=20) experiments, participants viewed these fragments at central fixation while performing an indoor/outdoor categorization task to ensure engagement with the stimulus (Fig. 1b). Critically, this design allowed us to investigate whether the brain sorts the fragments with respect to their place in the schema in the absence of explicit location differences (Fig. 1c). To quantify the sorting of fragments during cortical processing, we used representational similarity analysis.

Figure 1. a, The stimulus set consisted of six natural scenes (three indoor, three outdoor). Each scene was split into six rectangular fragments. b, During the fMRI and EEG recordings, participants performed an indoor/outdoor categorization task on individual fragments. Notably, all fragments were presented at central fixation, removing explicit location information. c, We hypothesized that the visual system sorts sensory input by spatial schemata, resulting in a cortical organization that is explained by the fragments' within-scene location, predominantly in the vertical dimension: fragments stemming from the same part of the scene should be represented similarly. Here we illustrate the hypothesized sorting in a two-dimensional space. A similar organization was observed in multi-dimensional scaling solutions for the fragments' neural similarities (see Figure 1 - Figure Supplement 1 and Video 1). In subsequent analyses, the spatiotemporal emergence of the schema-based cortical organization was precisely quantified using representational similarity analysis (Fig. 2).

We constructed location model RDMs, which reflected whether pairs of fragments shared the same within-scene location or not. We additionally constructed a category model RDM, which reflected whether pairs of fragments stemmed from the same scene or not.
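Such model RDMs can be sketched directly from the design described above (6 scenes × 6 fragments, binary dissimilarities). This is illustrative code, not the authors' implementation; variable names and the condition ordering are assumptions.

```python
import numpy as np

# One entry per condition: 6 scenes x 6 fragments (3 rows x 2 columns).
scenes = np.repeat(np.arange(6), 6)             # scene of origin, 0..5
rows = np.tile(np.repeat(np.arange(3), 2), 6)   # vertical position, 0..2
cols = np.tile(np.arange(2), 18)                # horizontal position, 0..1

# Model RDMs: 1 = pair differs on the property, 0 = pair shares it.
vertical_rdm = (rows[:, None] != rows[None, :]).astype(float)
horizontal_rdm = (cols[:, None] != cols[None, :]).astype(float)
category_rdm = (scenes[:, None] != scenes[None, :]).astype(float)
```

Each matrix is 36 × 36, with one row and column per fragment condition.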
Critically, if cortical information is indeed sorted with respect to scene schemata, we should observe a neural clustering of fragments that stem from the same within-scene location: in this case, the location RDM should predict a significant proportion of the representational organization in visual cortex.

To test this, we modeled neural RDMs as a function of the model RDMs using general linear models, separately for the fMRI and EEG data. The resulting beta weights indicated to which degree location and category information accounted for cortical responses in the three ROIs and across time.

The key observation was that the fragments' vertical, but not horizontal, location predicted neural responses, consistent with the more rigid spatial scene structure in the vertical dimension (Mandler and Parker, 1976). This result provides a first characterization of where and when incoming information is organized in accordance with scene schemata: in OPA and rapidly after stimulus onset, scene fragments are sorted according to their origin within the

environment.
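The GLM step described above can be sketched as an ordinary least-squares fit of the vectorized neural RDM onto the vectorized model RDMs. This is a generic sketch of the approach, not the authors' pipeline; `fit_rdm_glm` is an illustrative name.

```python
import numpy as np

def fit_rdm_glm(neural_rdm, model_rdms):
    """Fit model RDMs to a neural RDM with ordinary least squares.

    neural_rdm: (n, n) symmetric dissimilarity matrix.
    model_rdms: dict mapping predictor name -> (n, n) model RDM.
    Returns a dict of beta weights (including an intercept), indicating
    to which degree each predictor accounts for the neural dissimilarities.
    """
    n = neural_rdm.shape[0]
    tri = np.tril_indices(n, k=-1)  # unique condition pairs (lower triangle)
    y = neural_rdm[tri]
    names = list(model_rdms)
    X = np.column_stack([np.ones_like(y)] + [model_rdms[m][tri] for m in names])
    betas, *_ = np.linalg.lstsq(X, y, rcond=None)
    return dict(zip(['intercept'] + names, betas))
```

Applied per ROI (fMRI) or per time point (EEG), the returned betas correspond to the location and category information traces reported in the results.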
The schema-based organization co-exists with a prominent scene-category organization, in line with previous findings (Lowe et al.).

Figure 2. To test where and when the visual system sorts incoming sensory information by spatial schemata, we first extracted spatially (fMRI) and temporally (EEG) resolved neural representational dissimilarity matrices (RDMs). a, In the fMRI, we extracted pairwise neural dissimilarities of the fragments from response patterns across voxels in the occipital place area (OPA), parahippocampal place area (PPA), and early visual cortex (V1). b, In the EEG, we extracted pairwise dissimilarities from response patterns across electrodes at every time point from -200ms to 800ms with respect to stimulus onset. c, We modelled the neural RDMs with three predictor matrices, which reflected the fragments' vertical and horizontal positions within the full scene, and their category (i.e., their scene of origin). d, The fMRI data revealed a vertical-location organization in OPA, but not in V1 and PPA. Additionally, the EEG data showed that both vertical location and category predicted cortical responses rapidly, starting from around 100ms. These results suggest that the fragments' vertical position within the scene schema determines rapidly emerging representations in scene-selective occipital cortex. Significance markers represent p<0.05 (corrected for multiple comparisons). Error margins reflect standard errors of the mean. In further analyses, we probed the flexibility of this schematic coding mechanism (Fig. 3).
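The extraction of neural RDMs from multivariate response patterns can be sketched with correlation distance, one common RSA choice; the specific dissimilarity measure is an assumption here, not confirmed by the text.

```python
import numpy as np

def neural_rdm(patterns):
    """Neural RDM as 1 minus the Pearson correlation between patterns.

    patterns: (n_conditions, n_features) array, e.g. fragment x voxel
    responses (fMRI) or fragment x electrode values at one time point (EEG).
    """
    return 1.0 - np.corrcoef(patterns)

# Example with random data: 36 fragment conditions, 128 measurement channels.
rng = np.random.default_rng(0)
rdm = neural_rdm(rng.standard_normal((36, 128)))
```

The resulting symmetric 36 × 36 matrix (zero diagonal) is what the model RDMs are then fit to.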

To efficiently support vision in dynamic natural environments, schematic coding needs to be flexible with respect to the visual properties of specific scenes. The absence of vertical location effects in V1 indeed highlights that schematic coding is not tied to the analysis of simple visual features. To more thoroughly probe this flexibility, we additionally conducted three complementary analyses (Fig. 3).

First, we tested whether schematic coding is tolerant to stimulus features relevant for visual categorization. Categorization-related features were quantified using a deep neural network (DNN; ResNet50), which extracts such features similarly to the brain.

With V1 housing precise low-level feature representations, this measure should capture very well the features extracted during the early processing of simple visual features. However, removing the V1 dissimilarity structure did not abolish the schematic coding effects in either OPA or the EEG data (see Figure 3 - Figure Supplement 3). This shows that even if we had control models that approximated V1 representations extremely well (that is, as well as the V1 representations approximate themselves), these models could not explain vertical location effects in downstream processing. Together, these results provide converging evidence that low-level feature processing cannot explain the schematic coding effects reported here.

Further work is needed to clarify which of these computations mediate the schema-based coding described here. As the current study is limited to a small set of scenes, more research is also needed to explore whether schema-based coding generalizes to more diverse contents.
It is conceivable that schema-based coding constitutes a more general coding strategy. Under this view, formatting perceptual information according to real-world structure may allow cognitive and motor systems to efficiently read out visual information that is needed for different real-world tasks (e.g., immediate action versus future navigation). As the schema-based sorting of scene information happens already during early scene analysis, many high-level processes have access to this information.

To conclude, our findings provide the first spatiotemporal characterization of a neural mechanism for contextualizing fragmented visual inputs. By rapidly organizing visual information according to its typical role in the world, this mechanism may contribute to the optimal use of perceptual information for guiding efficient real-world behaviors.

Stimuli

The stimulus set comprised three images of indoor scenes (bakery, classroom, kitchen) and three images of outdoor scenes (alley, house, farm). Each image was split horizontally into two halves, and each of the halves was further split vertically into three parts, so that six fragments were obtained for each scene. Participants were not shown the full scene images prior to the experiment.
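The splitting procedure just described can be sketched as follows. The function name and dummy image are illustrative, and even divisibility of the image dimensions is assumed.

```python
import numpy as np

def split_into_fragments(image, n_rows=3, n_cols=2):
    """Split an image array (H, W, ...) into n_rows x n_cols equal fragments.

    Returns a dict mapping (row, col) grid positions to fragment arrays,
    mirroring the two horizontal halves x three vertical parts described above.
    """
    h, w = image.shape[:2]
    assert h % n_rows == 0 and w % n_cols == 0, "image must divide evenly"
    fh, fw = h // n_rows, w // n_cols
    return {
        (r, c): image[r * fh:(r + 1) * fh, c * fw:(c + 1) * fw]
        for r in range(n_rows)
        for c in range(n_cols)
    }

# Example: a dummy 600x400 RGB scene yields six 200x200 fragments.
scene = np.zeros((600, 400, 3), dtype=np.uint8)
fragments = split_into_fragments(scene)
```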

Experimental design
The fMRI and EEG designs were identical, unless otherwise noted. Stimulus presentation was controlled using the Psychtoolbox (Brainard, 1997; RRID:SCR_002881). In each trial, one of the 36 fragments was presented at central fixation (7° horizontal visual angle) for 200ms (Fig. 1b). Participants were instructed to maintain central fixation and to categorize each stimulus as an indoor or outdoor scene image by pressing one of two

buttons.
In the fMRI experiment, the inter-trial interval was kept constant at 2,300ms, irrespective of the participant's response time. In the EEG experiment, after each response a green or red fixation dot was presented for 300ms to indicate response correctness; participants were instructed to only blink after the feedback had occurred.

For this analysis, we correlated the patterns across the two sets, both within-condition (i.e., patterns from the same fragment in the two different sets) and between-conditions (i.e., patterns from different fragments in the two different sets).

In an additional analysis, we sought to eliminate properties specific to either the indoor or the outdoor scenes, respectively. We therefore constructed RDMs for horizontal and vertical location information which only contained comparisons between the indoor and outdoor scenes. These RDMs were constructed in the same way as explained above, but all comparisons within the same type of scene were removed (Fig. 3d).
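Restricting the location RDMs to between-type comparisons can be sketched as masking out same-type pairs. This is an illustrative helper, assuming NaN entries are excluded from subsequent fits.

```python
import numpy as np

def between_type_only(model_rdm, scene_types):
    """Restrict a model RDM to comparisons between different scene types.

    Entries comparing two scenes of the same type (indoor-indoor or
    outdoor-outdoor) are set to NaN so they drop out of subsequent fits.
    """
    t = np.asarray(scene_types)
    same_type = t[:, None] == t[None, :]
    return np.where(same_type, np.nan, model_rdm)

# 36 conditions: three indoor scenes, then three outdoor scenes, 6 fragments each.
types = np.repeat(np.array(['indoor'] * 3 + ['outdoor'] * 3), 6)
masked = between_type_only(np.ones((36, 36)), types)
```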

Statistical testing
For the fMRI data, we tested the regression coefficients against zero, using one-tailed, one-sample t-tests (i.e., testing the hypothesis that coefficients were greater than zero). Multiple-comparison correction was based on Bonferroni-corrections across ROIs. A complete report of all tests performed on the fMRI data can be found in Supplementary file 1. For the EEG data, we used a threshold-free cluster enhancement procedure (Smith and Nichols, 2009).

Vertical location effects across experiment halves. We interpret the vertical location organization in the neural data as reflecting prior schematic knowledge about scene structure. Alternatively, however, the vertical location organization could in principle result from learning the composition of the scenes across the experiment. In the latter case, one would predict that vertical location effects should primarily occur late in the experiment (e.g., in the second half), and less so towards the beginning (e.g., in the first half). To test this, we split both the fMRI data (three runs each) and the EEG data (first versus second half of trials) into halves, and for each half modeled the neural data as a function of the vertical and horizontal location and category predictors.

Controlling for task difficulty. a, To control for task difficulty effects in the indoor/outdoor classification task, we computed paired t-tests between all pairs of fragments, separately for their associated accuracies and response times. We then constructed two predictor RDMs that contained the t-values of the pairwise tests between the fragments: for each pair of fragments, these t-values corresponded to dissimilarity in task difficulty (e.g., comparing two fragments associated with similarly short categorization response times would yield a low t-value, and thus low dissimilarity).
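The pairwise t-value construction just described can be sketched as follows, using toy per-participant response times. This is illustrative code, not the authors' implementation.

```python
import numpy as np

def paired_t(a, b):
    """Paired t-statistic between two equal-length samples."""
    d = a - b
    return d.mean() / (d.std(ddof=1) / np.sqrt(d.size))

def difficulty_rdm(rts_per_fragment):
    """Pairwise |t|-value RDM over per-participant mean response times.

    rts_per_fragment: list of equal-length 1-D arrays, one per fragment
    (one mean RT per participant). Similar difficulty yields a small |t|,
    and thus a small dissimilarity entry.
    """
    n = len(rts_per_fragment)
    rdm = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            rdm[i, j] = rdm[j, i] = abs(paired_t(rts_per_fragment[i],
                                                 rts_per_fragment[j]))
    return rdm

# Toy data: fragment 1 is categorized more slowly than fragments 0 and 2.
f0 = np.array([0.50, 0.52, 0.48, 0.51, 0.49])
f1 = np.array([0.60, 0.63, 0.58, 0.61, 0.59])
f2 = np.array([0.50, 0.53, 0.47, 0.52, 0.48])
rdm = difficulty_rdm([f0, f1, f2])
```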
This was done separately for the fMRI and EEG experiments (matrices from the EEG experiment are shown). The accuracy and response time RDMs were mildly correlated with the category RDM (fMRI: accuracy: r=0.10, response time: r=0.15; EEG: accuracy: r=0.17, response time: r=0.16), but not with the vertical location RDM (fMRI: both r<0.01; EEG: both r<0.01). After regressing out the task difficulty RDMs, we found highly similar vertical location and category information as in the previous analyses (Fig. 3b/c). b, In the fMRI, only category information in OPA was significantly reduced when task difficulty was accounted for. c, In the EEG, towards the end of the epoch, when participants responded, location and category information were decreased. This shows that the effects of schematic coding, emerging around 200ms after onset, cannot be explained by differences in task difficulty. The dashed significance markers represent significantly reduced information (compared to the main analyses, Fig. 3b/c) at p<0.05 (corrected for multiple comparisons).

Categorical versus Euclidean vertical location predictors. We defined our vertical location predictor as categorical, assuming that top, middle, and bottom fragments are coded distinctly in the human brain. An alternative way of constructing the vertical location predictor is in terms of the fragments' Euclidean distances, where fragments closer together along the vertical axis (e.g., top and middle) are represented more similarly than fragments further apart (e.g., top and bottom). a, For the fMRI data, we found that the categorical and Euclidean predictors similarly explained the neural data, with no statistical differences between them (all t[29]<1.15, p>0.26). b, For the EEG data, we found that both predictors explained the neural data well.
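The two predictor variants compared here can be sketched as follows (illustrative helper; the row coding follows the 3 × 2 fragment grid).

```python
import numpy as np

def vertical_rdms(rows):
    """Categorical and Euclidean vertical-location predictor RDMs.

    rows: per-condition vertical position (0=top, 1=middle, 2=bottom).
    Categorical: 1 if two fragments come from different rows, else 0.
    Euclidean: absolute distance between the row indices.
    """
    rows = np.asarray(rows)
    categorical = (rows[:, None] != rows[None, :]).astype(float)
    euclidean = np.abs(rows[:, None] - rows[None, :]).astype(float)
    return categorical, euclidean

# Row layout of the six fragments of one scene (two columns per row).
cat_rdm, euc_rdm = vertical_rdms([0, 0, 1, 1, 2, 2])
```

Note the key difference: the categorical predictor treats top-middle and top-bottom pairs as equally dissimilar, whereas the Euclidean predictor makes top-bottom pairs twice as dissimilar.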
However, the categorical predictor revealed significantly stronger vertical location information from 75ms to 340ms, suggesting that, at least in the EEG data, the differentiation along the vertical axis is more categorical in nature. Significance markers represent p<0.05 (corrected for multiple comparisons). Error margins reflect standard errors of the mean.

Regressing out this control model rendered category information non-significant in fMRI and EEG signals. However, we still found vertical location information in OPA and from 65ms to 375ms. c-e, When additionally restricting the analysis to comparisons between indoor and outdoor scenes, the fragments' vertical location still predicted neural activations in OPA and from 95ms to 375ms. In sum, these results are highly similar to the results obtained with the ResNet50 model (Fig. 3b/c/h/i). Significance markers represent p<0.05 (corrected for multiple comparisons).

Low-level control models. We used three control models that explicitly account for low-level visual features: a pixel-dissimilarity model, GIST descriptors, and the fragments' neural dissimilarity in V1. Critically, none of the three models accounted for the fragments' vertical location organization. Moreover, unlike the DNN models, the low-level models were also unable to account for the fragments' categorical organization. a/b, Results after regressing out the pixel dissimilarity model, which captured the fragments' pairwise dissimilarity in pixel space (i.e., 1 minus the correlation of their pixel values). c/d, Results after regressing out the GIST model, which captured the fragments' pairwise dissimilarity in GIST descriptors (i.e., in their global spatial envelope). e/f, Results after regressing out the V1 model, which captured the fragments' pairwise neural dissimilarity in V1 (i.e., the averaged RDM across participants) and thereby provides a brain-derived measure of low-level feature similarity.
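The regressing-out used for these control analyses can be sketched as simple residualization of the dissimilarity vectors (generic sketch; `regress_out` is an illustrative name, not the authors' code).

```python
import numpy as np

def regress_out(y, control):
    """Residualize y with respect to a control predictor (plus intercept).

    y, control: 1-D vectors of pairwise dissimilarities (lower-triangle
    entries of an RDM). The residuals are what remains of y after the
    control model (e.g. pixel, GIST, or V1 dissimilarity) is removed.
    """
    X = np.column_stack([np.ones_like(control), control])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta
```

By construction, the residuals are orthogonal to the control predictor, so any remaining vertical location effect cannot be attributed to the low-level model.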
Significance markers represent p<0.05 (corrected for multiple comparisons). Error margins reflect standard errors of the mean.