Using population receptive field models to elucidate spatial integration in high-level visual cortex

While spatial information and biases have been consistently reported in high-level face regions, the functional contribution of this information toward face recognition behavior is unclear. Here, we propose that spatial integration of information plays a critical role in a hallmark phenomenon of face perception: holistic processing, or the tendency to process all features of a face concurrently rather than independently. We sought to gain insight into the neural basis of face recognition behavior by using a voxelwise encoding model of spatial selectivity to characterize the human face network with both typical face stimuli and stimuli thought to disrupt normal face perception. We mapped population receptive fields (pRFs) with 3T fMRI in 6 participants using upright as well as inverted faces, the latter of which are thought to disrupt holistic processing. Compared to upright faces, inverted faces yielded substantial differences in measured pRF size, position, and amplitude. Further, these differences increased in magnitude along the face network hierarchy, from IOG- to pFus- and mFus-faces. These data suggest that pRFs in high-level regions reflect complex stimulus-dependent neural computations that underlie variations in recognition performance.


Introduction
High-level visual processing in the ventral 'what' pathway is thought to be abstracted from and invariant to spatial properties of objects, like their location or size in the visual field; this allows us to efficiently recognize objects across the near-infinite number of 2D images they might project onto the retina (DiCarlo & Cox, 2007). However, this classic account is challenged by consistent findings of spatial biases and information in the ventral temporal cortex (VTC), the end-stage of the ventral stream (Hasson, Levy, Behrmann, Hendler, & Malach, 2002; Kobatake & Tanaka, 1994; Levy, Hasson, Avidan, Hendler, & Malach, 2001; Sayres & Grill-Spector, 2008). Recently, our group has used population receptive field (pRF) methods to quantify voxel-wise spatial selectivity in face-selective regions of the human VTC. However, it is unknown if and how pRF properties in VTC contribute to visual perception. That is, do these representations of space exert only a general influence on where in the visual field high-level recognition is optimal, or do they play a specific role in determining how we recognize objects?
In the current work, we used the compressive spatial summation (CSS) pRF model (Kay, Weiner, & Grill-Spector, 2015; Kay, Winawer, Mezer, & Wandell, 2013) to relate spatial representations in face-selective areas to a hallmark of face recognition behavior: holistic processing. A prominent theory of face perception holds that efficient and accurate recognition requires the holistic processing of an entire face at once, rather than processing individual facial features. While human behavior supports this theory (Richler, Cheung, & Gauthier, 2011; Richler, Palmeri, & Gauthier, 2012; Yin, 1969; Young, Hellawell, & Hay, 1987), the neural mechanisms underlying holistic processing have proven elusive and computationally underspecified. We posit that, at a minimum, holistic processing requires spatial integration of information across features, and that pRF measurements may reflect such integration. To evaluate the relationship between spatial representations and face recognition behavior, we mapped pRFs in early visual cortex (V1-hV4) and three constituent regions of the human face network: one region on the inferior occipital gyrus, IOG-faces (sometimes termed OFA), and two regions on the fusiform gyrus, pFus-faces (posterior fusiform; FFA-1) and mFus-faces (mid fusiform; FFA-2).
This work is licensed under the Creative Commons Attribution 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by/3.0

Scanning protocol
Six experienced fMRI participants (3 women), ages 24-30, participated in the experiment. All fMRI experiments were conducted on a 3T GE MRI scanner equipped with an eye tracker, which monitored subjects' fixation. We used a 32-channel head coil and single-shot EPI with 2.2 mm isotropic voxels and a 2 s TR. Standard preprocessing (motion correction, linear trend removal) was performed using mrVista tools. Retinotopic and face-selective regions were defined in a separate mapping session using standard phase-encoded retinotopic mapping (Engel, Glover, & Wandell, 1997) and a functional localizer (Stigliani, Weiner, & Grill-Spector, 2015), respectively.

Population receptive field (pRF) mapping
Background. The pRF model (Dumoulin & Wandell, 2008; Wandell & Winawer, 2015) is a method for estimating the location and area in the visual field to which a particular voxel responds. Our model assumes a circular 2D Gaussian pRF with a compressive nonlinearity (Kay et al., 2013). pRF properties for a voxel reflect the combined receptive fields of the neural population in the voxel and are well-aligned with single-neuron RF properties (Wandell & Winawer, 2015).
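As an illustration only (this is not the authors' code), the forward model described above can be sketched as a circular 2D Gaussian applied to a stimulus image, followed by a power-law nonlinearity with exponent n, following the general form given in Kay et al. (2013). The grid extent, the square-patch stimuli, and all function names and parameter values below are assumptions introduced for the example.

```python
import numpy as np

# Visual-field grid in degrees (extent and resolution are illustrative assumptions).
coords = np.linspace(-10.0, 10.0, 81)
xx, yy = np.meshgrid(coords, coords)

def gaussian_2d(x0, y0, sigma):
    """Circular 2D Gaussian centred at (x0, y0) with standard deviation sigma."""
    return np.exp(-((xx - x0) ** 2 + (yy - y0) ** 2) / (2.0 * sigma ** 2))

def css_response(stimulus, x0, y0, sigma, gain, n):
    """CSS prediction: gain * (Gaussian-weighted sum of the stimulus) ** n."""
    drive = np.sum(stimulus * gaussian_2d(x0, y0, sigma))
    return gain * drive ** n

def prf_size(sigma, n):
    """Effective pRF size under the CSS model, sigma / sqrt(n)."""
    return sigma / np.sqrt(n)

def square_patch(cx, cy, half_width=2.0):
    """Hypothetical stand-in stimulus: a unit-contrast square centred at (cx, cy)."""
    inside = (np.abs(xx - cx) <= half_width) & (np.abs(yy - cy) <= half_width)
    return inside.astype(float)

# Arbitrary example parameters; n < 1 makes spatial summation compressive.
params = dict(x0=0.0, y0=0.0, sigma=3.0, gain=1.0, n=0.5)
left = css_response(square_patch(-3.0, 0.0), **params)
right = css_response(square_patch(3.0, 0.0), **params)
both = css_response(square_patch(-3.0, 0.0) + square_patch(3.0, 0.0), **params)
far = css_response(square_patch(8.0, 8.0), **params)

print("left + right:", left + right, "both together:", both)
print("far from pRF centre:", far)
```

With n below 1, the response to two patches shown together is smaller than the sum of the responses to each patch alone, which is the sense in which summation is compressive.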
Mapping procedure. Each pRF mapping run presented participants with randomly intermixed upright and inverted faces at 25 spatial locations, as shown in Figure 1b. Timing for each run and participant was randomized, and faces appeared in 4 s blocks at 2 Hz at each spatial location. Participants completed 8-12 mapping runs, performing a difficult 1-back task on letters at fixation throughout.
Model implementation. Prior to comparing pRF properties mapped with upright vs. inverted faces, we sought to account for low-level visual differences between the two stimulus conditions. To do so, we coded the stimulus location as an absolute contrast image reflecting visual features of the faces, rather than as a simple binary mask indicating face location (see Figure 1c). The implementation of the CSS model is depicted in Figure 1c and described further in Kay et al. (2013). To fit the CSS model, we first estimated betas for the inverted and upright conditions at each of the 25 spatial locations via standard general linear modeling. We then optimized pRF parameters to best fit the betas for each voxel, separately for the upright and inverted conditions. This yielded, for each condition, estimates of the voxel's pRF position (X, Y), size (σ), gain, and exponent n.
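The two-stage procedure above (betas per location, then pRF parameters fit to those betas separately per condition) can be sketched for a single simulated voxel as follows. This is a minimal sketch under several assumptions that are not in the original: a 5 x 5 layout for the 25 locations, unit-contrast squares standing in for the contrast images, a coarse grid search in place of whatever optimizer was actually used, gain solved in closed form by least squares, and the ordinary coefficient of determination as r². The "true" parameters are arbitrary values chosen to exercise the code and are not estimates from the study.

```python
import numpy as np

# Visual-field grid in degrees (hypothetical extent and resolution).
coords = np.linspace(-8.0, 8.0, 33)
xx, yy = np.meshgrid(coords, coords)

def make_stimuli(centers, half_width=1.5):
    """One flattened image per location (squares stand in for contrast images)."""
    images = []
    for cx, cy in centers:
        inside = (np.abs(xx - cx) <= half_width) & (np.abs(yy - cy) <= half_width)
        images.append(inside.astype(float).ravel())
    return np.array(images)

def css_drive(stimuli, x0, y0, sigma, n):
    """Gaussian-weighted stimulus energy per location, raised to the exponent n."""
    g = np.exp(-((xx - x0) ** 2 + (yy - y0) ** 2) / (2.0 * sigma ** 2)).ravel()
    return (stimuli @ g) ** n

def fit_css(stimuli, betas, xy_grid, sigma_grid, n_grid):
    """Coarse grid search over position, sigma and n; gain by least squares."""
    best = None
    for x0 in xy_grid:
        for y0 in xy_grid:
            for sigma in sigma_grid:
                for n in n_grid:
                    drive = css_drive(stimuli, x0, y0, sigma, n)
                    denom = drive @ drive
                    if denom == 0.0:
                        continue
                    gain = (drive @ betas) / denom
                    sse = np.sum((betas - gain * drive) ** 2)
                    if best is None or sse < best["sse"]:
                        best = dict(x0=x0, y0=y0, sigma=sigma, n=n,
                                    gain=gain, sse=sse)
    ss_tot = np.sum((betas - betas.mean()) ** 2)
    best["r2"] = 1.0 - best["sse"] / ss_tot
    best["size"] = best["sigma"] / np.sqrt(best["n"])
    return best

# Assumed 5 x 5 layout of the 25 mapping locations.
centers = [(cx, cy) for cy in (-4, -2, 0, 2, 4) for cx in (-4, -2, 0, 2, 4)]
stimuli = make_stimuli(centers)

# Arbitrary synthetic ground truth per condition (NOT results from the study).
true_params = {
    "upright": dict(x0=1.0, y0=-1.0, sigma=3.0, n=0.5, gain=3.0),
    "inverted": dict(x0=1.0, y0=-2.0, sigma=2.0, n=0.5, gain=2.0),
}

xy_grid = np.arange(-4.0, 4.5, 1.0)
fits = {}
for condition, p in true_params.items():
    # Noise-free simulated betas, one per location, for this condition.
    betas = p["gain"] * css_drive(stimuli, p["x0"], p["y0"], p["sigma"], p["n"])
    fits[condition] = fit_css(stimuli, betas, xy_grid,
                              sigma_grid=[1.0, 2.0, 3.0, 4.0],
                              n_grid=[0.25, 0.5, 1.0])
    print(condition, fits[condition])
```

Because the simulated betas are noise-free and the true parameters lie on the search grid, each fit recovers the generating parameters; with real data a continuous optimizer seeded from such a grid would be the more usual choice.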

Results & Discussion
In early visual areas V1-hV4, we found no differences in pRF properties mapped with upright vs. inverted faces (Table 1). However, at late stages of the face-processing hierarchy, we observed for inverted faces: (a) substantial downward shifts of pRF centers, (b) reductions in pRF size, (c) reductions in gain, and (d) reductions in model goodness-of-fit (r²). Moreover, the magnitude of these differences increased hierarchically from IOG-faces, where differences between upright and inverted mapping were not significant, to pFus- and mFus-faces (Figure 3; Table 1). A simulation analysis revealed that changes in signal-to-noise ratio could not account for the observed shift and size changes; that is, reducing the gain of responses in the upright condition and adding noise until the r² matched that of the inverted condition did not produce systematic changes in position or size. The observed changes in size, position, and gain are not present in any retinotopic visual areas, nor are they seen at the earliest node of the face network, IOG-faces. This implies that the currently implemented model, which utilizes the spatial location of visual features in the mapping stimulus (e.g. Figure 4, top left), is sufficient to account for spatial responses to both upright and inverted faces in early regions. However, a modification to the model is needed to characterize spatial representations in fusiform regions: ideally, a good candidate model should produce convergent pRF estimates across both the upright and inverted conditions. Prior work indicates the importance of internal face features, and in particular the eyes, in both driving responses in face-selective cortex (Issa & DiCarlo, 2012) and in determining behavioral performance (Royer et al., 2018).
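The degradation step of the signal-to-noise control described above can be sketched as follows. This is a hedged illustration, not the authors' simulation: the function names, the Gaussian noise model, the step size, and the use of the scaled model prediction as the r² reference are all assumptions, and the subsequent refit of the pRF model to the degraded responses is omitted. The input values are synthetic.

```python
import numpy as np

def r_squared(observed, predicted):
    """Ordinary coefficient of determination (an assumed definition of r2)."""
    ss_res = np.sum((observed - predicted) ** 2)
    ss_tot = np.sum((observed - observed.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def degrade(betas, gain_scale, noise_sd, rng):
    """Scale responses down and add zero-mean Gaussian noise."""
    return gain_scale * betas + rng.normal(0.0, noise_sd, size=betas.shape)

def match_r2(betas, predicted, gain_scale, target_r2, rng,
             step=0.01, max_sd=10.0):
    """Raise the noise SD until r2 against the scaled prediction reaches target_r2."""
    noise_sd = 0.0
    while noise_sd <= max_sd:
        noisy = degrade(betas, gain_scale, noise_sd, rng)
        if r_squared(noisy, gain_scale * predicted) <= target_r2:
            return noisy, noise_sd
        noise_sd += step
    raise ValueError("target r2 not reached within max_sd")

rng = np.random.default_rng(0)

# Synthetic stand-ins: a model prediction at 25 locations and "upright" betas
# that follow it closely. None of these numbers come from the study.
predicted = np.linspace(0.0, 2.0, 25)
upright_betas = predicted + rng.normal(0.0, 0.05, size=25)

# Hypothetical targets: lower gain and a lower r2, as in a noisier condition.
degraded, noise_sd = match_r2(upright_betas, predicted,
                              gain_scale=0.6, target_r2=0.5, rng=rng)
print("noise SD needed:", noise_sd)
print("r2 after degradation:", r_squared(degraded, 0.6 * predicted))
```

The degraded responses would then be refit with the pRF model, and the resulting position and size compared with the estimates from the original upright responses.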
To evaluate whether the location of the eyes or of external features better predicts spatial responsiveness in mFus- and pFus-faces, we re-fit the CSS pRF model using these alternative codings of the mapping stimuli (Figure 4).
Notably, the internal features coding (right column) produces equivalent estimates of pRF position across the mapping stimuli. This implies that spatial responses in fusiform face-selective regions are selectively driven by the location of internal face features, rather than all contrast in the image (left column) or simply the location of the eyes (middle column). However, this model does not provide a fully parsimonious account of pRFs across upright and inverted faces, as differences in size and gain persist. Additional work is needed to refine this model and to evaluate whether differential weighting of facial features may yield more equivalent estimates across upright and inverted stimuli; human recognition has been shown to differentially weight certain facial features, e.g. the eyes and mouth are more informative than the nose for recognition (Schyns, Bonnar, & Gosselin, 2002).
The current work points to a neural mechanism for a prominent, but computationally underspecified, aspect of high-level face recognition: holistic processing across features, which is thought to underlie normal face recognition. We mapped pRFs in the face network with stimuli thought to disrupt holistic processing, namely inverted faces, and observed substantial differences in their position, size, and gain relative to those mapped with upright faces. As this occurred even when participants did not actively engage in face perception, performing a challenging task away from the mapping stimuli, spatial representations in the VTC may functionally constrain recognition behavior, rather than simply reflect it.