Stereo Viewing Modulates Three-Dimensional Shape Processing During Object Recognition: A High-Density ERP Study

The role of stereo disparity in the recognition of three-dimensional (3D) object shape remains an unresolved issue for theoretical models of the human visual system. We examined this issue using high-density (128-channel) recordings of event-related potentials (ERPs). A recognition memory task was used in which observers were trained to recognize a subset of complex, multipart, 3D novel objects under conditions of either (bi-) monocular or stereo viewing. In a subsequent test phase, they discriminated previously trained targets from untrained distractor objects that shared either local parts, 3D spatial configuration, or neither dimension, across both previously seen and novel viewpoints. The behavioral data showed a stereo advantage for target recognition at untrained viewpoints. ERPs showed early differential amplitude modulations to shape similarity defined by local part structure and global 3D spatial configuration. This occurred initially during an N1 component around 145–190 ms poststimulus onset, and subsequently during an N2/P3 component around 260–385 ms poststimulus onset. For mono viewing, amplitude modulation during the N1 was greatest between targets and distractors with different local parts, for trained views only. For stereo viewing, amplitude modulation during the N2/P3 was greatest between targets and distractors with different global 3D spatial configurations and generalized across trained and untrained views. The results show that image classification is modulated by stereo information about both the local part structure and the global 3D spatial configuration of object shape. The findings challenge current theoretical models that do not attribute functional significance to stereo input during the computation of 3D object shape.

Although binocular disparity has been shown to contribute to the perception of surface properties such as slant, tilt, and curvature (e.g., Ban & Welchman, 2015; Norman et al., 1995; Norman et al., 2009; Welchman et al., 2005; Wexler & Ouarti, 2008; Wismeijer, Erkelens, van Ee, & Wexler, 2010), its role in the recognition of complex 3D object shape remains unclear. Indeed, it has been argued that although stereo information (i.e., local depth disparity) facilitates the processing of 3D surface properties, this does not, in itself, establish a functional link between stereo vision and the perception (and recognition) of complex (i.e., multipart) 3D object shape per se (Li et al., 2009; Pizlo, 2008; Pizlo et al., 2010). This issue has been investigated in previous studies by assessing the effects of stereo disparity on the perceptual matching of object shape across changes in viewpoint. The results provide a mixed picture, with stereo advantages reported in some studies (e.g., Bennett & Vuong, 2006; Burke, 2005; Burke, Taubert, & Higman, 2007; Chan et al., 2006; Edelman & Bülthoff, 1990; Hong Liu, Ward, & Young, 2006; Lee & Saunders, 2011; Rock & DiVita, 1987; Simons, Wang, & Roddenberry, 2002) but not in others (Humphrey & Khan, 1992; Pasqualotto & Hayward, 2009). Recently, Cristino et al. (2015) proposed that stereo information computed during the visual perception of object shape is most likely to be used to supplement shape information derived from monocular cues when object recognition (i.e., target/nontarget discrimination or view generalization) is facilitated by the derivation of 3D object structure. In support of this hypothesis, they showed that stereo input facilitates the classification of complex multipart 3D objects across large, but not small, changes in depth rotation. In other recent work, Pegna et al. (2016) found early perceptual sensitivity to stereo versus mono input in a perceptual matching task using event-related potentials (ERPs). In that study, ERPs were recorded while observers made shape equivalence judgments about pairs of sequentially presented novel 3D objects under conditions of stereo or mono viewing. The results showed an early perceptual sensitivity to the mode of input, reflected in a negative amplitude modulation between 160 and 220 ms poststimulus onset. The results also showed later modulation of ERP amplitude during an N2 component between 240 and 370 ms for stereo and mono input that was linked to the perceptual matching of shape.

The aim of the current study was to determine whether stereo disparity contributes to object processing during the recognition of 3D object shape. The rationale was based on recent work by Leek et al. (2016), who found evidence for early differential sensitivity of ERP amplitudes to the local part structure and global shape configuration of complex 3D objects in mono displays. In that study, ERPs were recorded while observers made shape matching judgments on sequentially presented pairs of novel objects under conditions of mono viewing. Different object pairs could share local parts but differ in global shape configuration, share global shape configuration but have different local parts, or share neither. The results showed differential N1 sensitivity to local and global shape similarity between stimulus pairs occurring around 170 ms poststimulus onset.
These findings provide evidence that mental representations of complex 3D object shape comprise both local higher-order parts and the global spatial configuration of these parts, consistent with theoretical models, and other empirical evidence, supporting this distinction (e.g., Arguin & Saumier, 2004; Behrmann, Peterson, Moscovitch, & Suzuki, 2006; Behrmann & Kimchi, 2003; Biederman, 1987; Hummel, 2013; Hummel & Stankiewicz, 1996; Marr & Nishihara, 1978). We hypothesized that one way in which stereo disparity may contribute to recognition is by facilitating the computation of 3D object representations via depth information. These representations could augment a range of shape information, including surface depth gradients and curvature, higher-order part boundaries, and the 3D spatial configuration of (volumetric) object parts. Of relevance to the current study is whether stereo input might differentially modulate the sensitivity of object recognition processes to local part and global 3D spatial configuration information. For example, under some structural description accounts, object parts are computed directly from 2D image-based input derived from local edge relations (e.g., nonaccidental properties, or NAPs; Biederman, 1987). This level of representation may be sufficient where object recognition can be based on a parts-based description of object identity, or where the discrimination of target and nontarget objects can be achieved on the basis of part composition. In other situations, it may be beneficial to compute a global 3D object model that specifies (among other attributes) the spatial configuration of local object parts, for example, where recognition depends on discrimination among objects with similar parts but different spatial configurations.
To test this prediction we used ERPs, which Leek et al. (2016) have shown to exhibit differential amplitude sensitivity to local and global shape structure. Unlike earlier work, we also wanted to examine this issue in the context of an object recognition task rather than the perceptual matching of sequentially presented stimuli. Object recognition differs from perceptual matching in that the former requires indexing a (stored) long-term memory representation of object shape. We used a recognition memory task in which observers first memorized a subset of complex novel 3D objects (targets) and subsequently discriminated them from visually similar, not previously memorized, nontarget objects. We then contrasted effects of target/nontarget similarity defined by local part structure and global 3D shape configuration under conditions of stereo and mono viewing. We predicted that stereo presentation would enhance ERP modulations related to object discrimination weighted toward perceptual analysis of 3D global shape configuration.

Method

Participants
Forty Bangor University students (24 female, mean age = 21.46 years, SD = 3.16; 3 left-handed) participated for course credit. The sample was recruited through an online participation portal. All participants had normal or corrected-to-normal visual acuity. Ethics approval was granted by the Local Ethics Committee and in accordance with British Psychological Society guidelines. Informed consent was obtained and participants were free to withdraw from the study at any time without penalty.

Apparatus and Stimuli
The stimuli comprised a set of 48 novel computer-generated 3D objects: 12 target objects and 36 nontargets (distractors) varying in visual similarity to the targets (see Figure 1). Each stimulus comprised a unique spatial configuration of four different volumetric parts. The parts were defined by variation among nonaccidental properties (NAPs) comprising edges (straight vs. curved), symmetry of the cross section, tapering (colinearity), and aspect ratio (Biederman, 1987). The object models were produced using Strata 3D CX software (Strata, U.S.A.) and then rendered in Matlab using a stereo camera rig programmed with custom code. To create the stereo images, left- and right-eye images were rendered without 'toeing in', using an interpupillary distance (IPD) of 62 mm. In both the mono and stereo viewing conditions, participants wore polarized 3D glasses to view the stimuli, which were presented on a passive interleaved 3D stereo monitor (60 Hz, 27-in. AOC 3D monitor, model D2769VH; resolution = 1920 × 1080 pixels). In the stereo condition, participants viewed objects rendered from two viewpoints (left eye and right eye). In the (bi-) mono condition, the same (right-eye) rendered image was presented to both eyes.
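To illustrate the rendering geometry, a minimal sketch follows (in Python rather than the Matlab rig actually used; the function name and coordinate conventions are our own): a parallel-axis stereo pair keeps both cameras pointing in the same direction and offsets them horizontally by half the IPD, so the two half-images differ only by horizontal parallax.

```python
import numpy as np

def stereo_camera_positions(fixation, distance, ipd=0.062):
    """Left/right camera positions for a parallel-axis (no toe-in) stereo rig.

    fixation: 3D point the rig faces (e.g., the object center), in meters.
    distance: viewing distance along +z from the fixation point.
    ipd:      interpupillary distance (62 mm, as used here).

    Both cameras share the same viewing direction (toward -z); only a
    horizontal offset of ipd/2 distinguishes the two half-images.
    """
    center = np.asarray(fixation, dtype=float) + np.array([0.0, 0.0, distance])
    offset = np.array([ipd / 2.0, 0.0, 0.0])
    return center - offset, center + offset  # (left eye, right eye)

left_cam, right_cam = stereo_camera_positions(fixation=[0, 0, 0], distance=0.6)
```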
The stimuli were normalized in size across objects to maintain an average on-screen size of approximately 17° × 17°. All stimuli were rendered in a mustard yellow color (R = 227, G = 190, B = 43) and presented on a white background to facilitate figure/ground segmentation. Object models were rendered with shading from a single top-left light source but without (internal or external) cast shadows.
For each of the 12 target objects, three corresponding nontargets were designed: the first was composed of the same parts arranged in a different spatial configuration (SD: same parts/different spatial configuration; 'locally similar'); the second was composed of different parts arranged in the same configuration as the target (DS: different parts/same spatial configuration; 'globally similar'); and the third comprised different parts in a different spatial configuration (DD: different parts/different spatial configuration; 'dissimilar'). Each object was rendered at six different viewpoints varying by 60° rotations in depth around a vertical axis perpendicular to the line of sight.
Measures of target/nontarget image similarity based on three models, (a) pixel overlap, (b) a Gabor filter bank, and (c) the HMAX C1 output layer (Serre, Oliva, & Poggio, 2007), were computed on the 2D mono stimulus images using the Matlab Image Similarity Toolbox (Seibert & Leeds; https://github.com/daseibert/image_similarity_toolbox). In the toolbox, the pixel overlap model computes the sum of squared differences in pixel intensity values between images. The Gabor filter bank model projects the image onto a Gabor wavelet pyramid as a model of V1 orientation selectivity (Kay, Naselaris, Prenger, & Gallant, 2008), using filters spanning eight orientations, four sizes (specified as a proportion of image size), and varying x, y positions; the Euclidean distance between the resulting vectors of filter responses is then computed between images. The HMAX model is based on the C1 output layer of the hierarchical feed-forward image classification model of Serre et al. (2007). We used these models to provide estimates of image-based stimulus similarity between target and nontarget conditions. Table 1 shows the mean normalized similarity values of the three models for the target versus SD (locally similar), DS (globally similar), and DD (dissimilar) distractor image contrasts at trained and untrained viewpoints. A 2 (Viewpoint: trained, untrained) × 3 (Stimulus type: SD, DS, DD) × 3 (Model: pixel overlap, HMAX, Gabor) repeated measures ANOVA showed no significant main effects. However, there was an interaction between Stimulus type and Model, F(4, 44) = 3, p = .029. Post hoc analyses showed no differences between stimulus types for the pixel overlap or Gabor models. For HMAX, there was a significant difference between the SD (locally similar) and DS (globally similar) stimulus types (p = .02), driven by lower mean (normalized) similarity values for trained views of the target/DS (globally similar) contrast relative to either the target/SD (locally similar) or the target/DD (dissimilar) contrast.
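To make the first two similarity models concrete, the following simplified sketch (ours, not the toolbox's code) computes pixel overlap as summed squared intensity differences, and a Gabor-bank distance as the Euclidean distance between vectors of rectified filter responses. For brevity, responses are pooled over image positions, whereas the toolbox retains x, y positions.

```python
import numpy as np
from scipy.ndimage import convolve

def pixel_ssd(img_a, img_b):
    """Pixel-overlap model: sum of squared differences in pixel intensity."""
    return float(np.sum((img_a.astype(float) - img_b.astype(float)) ** 2))

def gabor_kernel(size, wavelength, theta, sigma):
    """A single odd-phase Gabor filter (size must be odd)."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)
    envelope = np.exp(-(x ** 2 + y ** 2) / (2 * sigma ** 2))
    return envelope * np.sin(2 * np.pi * xr / wavelength)

def gabor_distance(img_a, img_b, sizes=(7, 15, 31, 63), n_orient=8):
    """Gabor-bank model: Euclidean distance between vectors of rectified
    filter responses (pooled over position in this simplified sketch)."""
    feats = []
    for img in (img_a.astype(float), img_b.astype(float)):
        v = []
        for size in sizes:                      # four filter sizes
            for k in range(n_orient):           # eight orientations
                kern = gabor_kernel(size, wavelength=size / 2,
                                    theta=k * np.pi / n_orient,
                                    sigma=size / 4)
                v.append(np.abs(convolve(img, kern)).mean())
        feats.append(np.array(v))
    return float(np.linalg.norm(feats[0] - feats[1]))
```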
A 2 (Display: mono, stereo) × 4 (Stimulus type: target, SD [locally similar], DS [globally similar], DD [dissimilar]) mixed factorial design was used, with Display as a between-subjects factor and Stimulus type as a within-subjects factor. Participants were randomly allocated to either the mono or the stereo display group, with 20 participants in each group. The stereo display group completed a verification task to assess their ability to fuse stereo images on the interleaved polarized display. During this task they were seated 60 cm from the screen and shown a random-dot stereogram containing an embedded figure eight that was perceivable only with stereo fusion through the polarized glasses. Participants were asked to report what they saw, and all correctly reported the embedded stereo figure. The main study comprised two phases: learning and test. One group completed both the learning and test phases in mono; the other group completed both phases in stereo. This aspect of the design ensured that any observed differences between the viewing conditions during the test phase could not be due to mismatches in stimulus presentation formats between the learning and test phases. During the learning phase, both groups memorized the same 12 target objects. In the learning phase, each target was seen at three viewpoints distinguished by rotations of 120° around a vertical (y) axis defined with reference to the object (see Figure 2), and each target was shown at each of the three viewpoints three times. In the test phase, each target and nontarget was seen from six different viewpoints distinguished by 60° rotations around the y axis: the 12 targets were presented at each of six viewpoints three times (216 target trials in total), and each of the 36 nontargets (three distractors for each of the 12 targets) was presented once at each of the six test viewpoints (six trials per nontarget = 216 nontarget trials in total; 72 trials per nontarget condition). In total there were 432 trials in the test phase, comprising equal numbers of target and nontarget trials. Trial order was randomized.

Procedure
Learning phase. During the learning phase, participants in both the stereo and mono groups wore polarized glasses but viewed stereo or mono images according to their group assignment. The learning phase comprised three identical training sessions conducted on three separate days. The purpose of the learning phase was for participants to memorize each of the 12 targets and an associated unique stimulus number. Only participants who identified targets at a criterion level of 80% after the three training sessions proceeded to the test phase. Each training session comprised a memorization stage and a verification stage. During the memorization stage, target objects were presented centrally (duration = 3 s) on the monitor, sequentially at three different training viewpoints denoted 0°, 120°, and 240° (see Figure 2). Each target presentation was preceded by its identification number (1-12). Target identification numbers were randomly assigned across the target set but were the same for all participants. There were 36 trials (12 objects × 3 viewpoints) in each block of memorization trials. After the memorization stage, participants completed a verification task in which the 12 targets were shown randomly, one at a time and for unlimited duration (until response), at each of the three viewpoints. After each stimulus, participants entered the associated target number via key press on a standard PC keyboard. Feedback was given via a 'Correct' or 'Incorrect' message displayed centrally on the monitor. The memorization and verification tasks were repeated three times per training session (nine times across the three training sessions). All participants completed all three training sessions, regardless of whether they reached criterion accuracy earlier.

Test phase. During the test phase, participants in both the stereo and mono groups again wore polarized glasses and viewed stereo or mono images according to their group assignment. The test phase, involving a recognition memory task, was completed immediately after the final training session of the learning phase, once participants had achieved the criterion level of performance. EEG was recorded during the test phase (see below). Each trial involved presentation of one stimulus (either a target or a nontarget) at one of six viewpoints. At the start of each trial, a small fixation cross subtending 0.7° of visual angle was presented at the center of the monitor. The duration of the fixation cross was jittered randomly in 50-ms increments between 500 and 800 ms. Following the fixation marker, the test stimulus was shown for 750 ms and was then replaced by a response screen (a centrally presented question mark). All trial events were separated by an interstimulus interval of one screen refresh (17 ms). Participants responded via button press on a standard PC keyboard ("1" for target, "2" for nontarget, using the fore and middle fingers of the right hand, respectively, for all participants), indicating whether the stimulus was one of the 12 memorized objects, regardless of its orientation. They were alerted to the fact that stimuli could appear at both previously seen and novel viewpoints. Participants could respond only following onset of the response screen, not during presentation of the stimulus.
This was done to help reduce potential motor response artifacts in the EEG. The response screen remained until a response was given (see Figure 3). The intertrial interval was a blank screen presented for 1000 ms. For the behavioral data, the dependent measure was response accuracy; RTs were not collected because keyboard responses were acquired only from the onset of the response screen.
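As a minimal illustration of the trial timing described above (function names are our own), the jittered fixation duration and the sequence of trial events can be sketched as:

```python
import random

REFRESH_MS = 17  # one screen refresh at 60 Hz

def fixation_jitter(low=500, high=800, step=50):
    """Fixation duration jittered in 50-ms increments between 500 and 800 ms."""
    return random.randrange(low, high + step, step)

def trial_timeline():
    """Event durations (ms) for one test trial, as described above."""
    return [
        ("fixation", fixation_jitter()),
        ("isi", REFRESH_MS),
        ("stimulus", 750),
        ("isi", REFRESH_MS),
        ("response_screen", None),  # remains until the participant responds
        ("iti_blank", 1000),
    ]
```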
Electrophysiological recording and processing. The electroencephalogram (EEG) was recorded continuously from 128 electrodes placed on an ECI cap (Electro-Cap International, Ohio, U.S.A.) using the BioSemi ActiveTwo system (BioSemi, Amsterdam, Netherlands). Activity from all electrodes was sampled at a rate of 1024 Hz. Offline, 0.1 Hz high-pass and 30 Hz low-pass filters were applied to the data. Eye movements and blinks were corrected using the ICA protocol in Analyzer 2 software, and segmented data were then visually inspected, with trials containing artifacts rejected. Epochs that contained muscle or skin potential artifacts were rejected. Only trials on which participants gave a correct response were included. The mean number of correct trials per subject after artifact rejection was 189.25 (SS/target), 62.61 (SD/locally similar), 67.61 (DS/globally similar), and 67.82 (DD/dissimilar). Data were re-referenced to an average reference, which was then used to generate the grand averages. We used a 100-ms prestimulus interval for baseline correction. Continuous recording took place during the test phase, and trials were epoched from −100 ms to stimulus offset (750 ms). All ERP data acquired from onset of the response prompt were discarded.
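The preprocessing was carried out in Analyzer 2; purely for illustration, a roughly equivalent pipeline can be sketched in MNE-Python (the file name and event code below are hypothetical, and the ICA-based ocular correction step is omitted):

```python
import mne

raw = mne.io.read_raw_bdf("subject01.bdf", preload=True)  # BioSemi format
raw.filter(l_freq=0.1, h_freq=30.0)   # 0.1 Hz high-pass, 30 Hz low-pass
raw.set_eeg_reference("average")      # common average reference

events = mne.find_events(raw)
epochs = mne.Epochs(
    raw, events, event_id={"target": 1},  # hypothetical trigger code
    tmin=-0.1, tmax=0.75,                 # -100 ms to stimulus offset (750 ms)
    baseline=(None, 0),                   # 100-ms prestimulus baseline
    preload=True,
)
evoked = epochs["target"].average()       # per-condition average
```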
EEG analyses. Four early visual evoked potential components (P1, N1, P2, and an N2/P3 complex) were identified based on the topography, global field power (GFP) deflections, and latency characteristics of the respective grand average ERPs time-locked to stimulus presentation. Preliminary epochs of interest for each component were defined based on deflection extrema in the mean GFP (e.g., Brunet, Murray, & Michel, 2011; Lehmann & Skrandies, 1980; Murray, Brunet, & Michel, 2008). Symmetrical clusters of nine spatially adjacent posterior electrodes were extracted over the left (LH) and right (RH) hemispheres (RH: A32, B3, B4, B5, B6, B7, B8, B10, B11; LH: A5, A6, A7, A8, A9, A10, A11, D31, D32), which correspond with electrode locations CP2, P4, P6, P8, PO8 and CP1, P3, P5, P7, PO7 of the extended 10–20 system. These electrode clusters formed the regions of interest (ROIs) for the subsequent analyses of contrasts between stimulus conditions. Standard waveform analyses were based on the amplitude data as a measure of differential ERP sensitivity to 3D shape similarity between mono and stereo viewing. Mean amplitudes were analyzed using the general linear model by way of ANOVA. Greenhouse-Geisser corrections were applied to all analyses of ERP data, and corrected degrees of freedom are reported where applicable. An a priori alpha level of .05 (two-tailed) was adopted. Exact p values are reported except where p < .001.
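For reference, GFP is simply the spatial standard deviation of the (average-referenced) potential across electrodes at each time frame (Lehmann & Skrandies, 1980); a minimal sketch:

```python
import numpy as np

def global_field_power(erp):
    """GFP per time frame.

    erp: array of shape (n_electrodes, n_times), average-referenced.
    Deflection extrema in the GFP time course were used to define the
    preliminary component epochs described above.
    """
    return np.asarray(erp).std(axis=0)
```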
Mass univariate analyses. Mass univariate analyses (Groppe, Urbach, & Kutas, 2011; Guthrie & Buchwald, 1991; Murray et al., 2008) were used to complement the standard waveform analyses. These involved pairwise, frame-by-frame, repeated measures t tests across all 128 electrodes. An a priori criterion for significance was adopted in which a threshold of p < .01 (two-tailed) had to be attained for at least 12 consecutive time frames in at least five neighboring electrodes over time windows of 150 ms (Guthrie & Buchwald, 1991). For this purpose, the mass univariate analyses were conducted on 150-ms bins (0–150 ms, 151–300 ms, 301–450 ms) encompassing the P1, N1, P2, and N2/P3 components.
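A minimal sketch of this procedure follows (paired t tests per electrode and time frame with the consecutive-frames criterion; the neighboring-electrodes requirement is omitted for brevity, and the function name is our own):

```python
import numpy as np
from scipy import stats

def mass_univariate(cond_a, cond_b, alpha=0.01, min_frames=12):
    """Frame-by-frame paired t tests with a consecutive-frames criterion
    (after Guthrie & Buchwald, 1991).

    cond_a, cond_b: arrays of shape (n_subjects, n_electrodes, n_times).
    Returns a boolean (n_electrodes, n_times) mask marking frames that
    reach p < alpha for at least min_frames consecutive frames.
    """
    _, p = stats.ttest_rel(cond_a, cond_b, axis=0)  # per electrode/frame
    sig = p < alpha
    mask = np.zeros_like(sig)
    for e in range(sig.shape[0]):
        run = 0
        for t in range(sig.shape[1]):
            run = run + 1 if sig[e, t] else 0
            if run >= min_frames:               # mark the whole run
                mask[e, t - min_frames + 1:t + 1] = True
    return mask
```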

Behavioral Results
Accuracy data were log transformed prior to statistical analyses.

Learning Phase
A 3 (Training day) × 2 (Display: mono, stereo) mixed ANOVA, with Display as a between-subjects factor, showed a significant main effect of Training day, F(2, 60) = 58.06, p < .001, with accuracy (% correct) increasing from day one (M = 69.48, SD = 17.38) to day two (M = 94.71, SD = 8.06), p < .001, and from day two to day three (M = 98.09, SD = 4), p = .006. There were no differences between the mono and stereo display groups, and all participants passed criterion by the end of the third training session.

Test Phase

Figure 4 shows mean percent correct responses per condition. The data were analyzed using a 4 (Stimulus type: target, SD [locally similar], DS [globally similar], DD [dissimilar]) × 2 (Stimulus viewpoint: trained, untrained) × 2 (Display: mono, stereo) mixed ANOVA, with Display as a between-subjects factor. There were significant main effects of Stimulus type, F(3, 90) = 13.5, p < .001, and Stimulus viewpoint, F(1, 30) = 10.41, p = .003, with higher overall accuracy for trained (M = 97.05%, SD = 2.65) than untrained (M = 95.4%, SD = 3.42) viewpoints. There was also a significant three-way interaction, F(3, 87) = 3.19, p = .027. To investigate this further we analyzed the mono and stereo data separately using 4 (Stimulus type) × 2 (Stimulus viewpoint) repeated measures ANOVAs. For the mono viewing group, there was an interaction between Stimulus type and Stimulus viewpoint, F(3, 45) = 5.9, p = .002, deriving from significantly higher accuracy for trained than untrained viewpoints for target stimuli, p = .003 (see Figure 4). In contrast, for the stereo viewing group there were no significant main effects or interactions. Finally, accuracy for targets presented at untrained views was higher for stereo (M = 94.68%, SD = 5.09) than mono (M = 85.19%, SD = 14.46) displays, p = .035. This pattern of results is consistent with a stereo advantage in view generalization for targets between trained and untrained views.

Analyses of ERP Data
The aims of these analyses were as follows: (a) to determine whether the ERP showed sensitivity to the manipulation of stereo and mono viewing; (b) to establish whether the ERPs were differentially sensitive to target/nontarget shape similarity defined by either shared local parts or global 3D spatial configuration; and (c) to determine whether differential perceptual sensitivity to these shape attributes was modulated by mono versus stereo viewing.
ERP analyses I: Perceptual sensitivity to stereo/mono presentation. We first sought to determine whether our manipulation of stereo versus mono presentation was sufficient to induce measurable early perceptual sensitivity in the visual evoked potentials. Mass univariate analyses were used to identify a temporal marker defining the earliest time point of differential ERP sensitivity to mono versus stereo viewing. A point-wise mass univariate contrast between mono and stereo viewing across all conditions revealed differences in the ERP from around 50 ms poststimulus onset over a large group of posterior, temporal-occipital, and anterior leads. This difference was sustained during the P1 component over left occipital and some frontal electrodes (see Figure 5). These analyses confirm an early perceptual sensitivity to mono versus stereo viewing.
ERP analyses II: Perceptual sensitivity to 3D shape similarity as a function of mono/stereo viewing. Our next goal was to establish whether perceptual processing of object shape resulted in differential sensitivity to local parts and global 3D shape configuration as a function of mono versus stereo viewing. To do so we conducted both standard waveform analyses and mass univariate contrasts. For the N1 in the mono condition (see Figure 6a), there was a main effect of Stimulus type, F(2.27, 34.05) = 3.85, p = .03, driven by a significant difference between targets and DS (globally similar) nontargets, p = .02, with greater negativity for targets (M = −0.75, SD = 0.26) than for DS (globally similar) stimuli (M = −0.23, SD = 0.25). No other main effects or interactions were significant. In contrast, for the N1 in the stereo condition (see Figure 6b), differential amplitude modulation was greatest between targets and SD (locally similar) nontargets over left hemisphere electrodes. For the N2/P3 in the mono viewing group (Figure 7a), there were no significant main effects or interactions. In contrast, for the stereo viewing group (Figure 7b) there was a significant interaction between Stimulus type and Laterality, F(2.76, 41.47) = 4.51, p = .009. Planned comparisons showed no differences between stimulus types in the left hemisphere, but in the right hemisphere mean amplitude for targets was lower than for SD (p = .022), DS (p = .024), and DD (p = .002) nontargets. No other main effects or interactions were significant.

Further Analyses I: Mass Univariate Contrasts Across All 128 Electrodes
Mass univariate analyses were used to complement our standard waveform analyses of the effects of mono and stereo viewing on the discrimination between targets and the critical SD (locally similar) and DS (globally similar) nontargets. Unlike the standard analysis, the mass univariate approach allowed us to examine the patterns of contrasts between conditions across all 128 electrodes (rather than restricting the analysis to the nine-electrode cluster in each hemisphere). The temporal distributions of these contrasts across all 128 electrodes for mono viewing are shown in Figures 8a–8h.
These mass univariate contrasts show the differential sensitivity between targets and SD/DS nontargets for mono and stereo viewing in the N1, P2, and N2/P3 components. A time series plot of the frequency distribution of significant differences is shown in Figure 9. These data were analyzed as a nonparametric time series using the Friedman test. For the N1 during mono viewing, there was a higher frequency of significant differences between targets and DS (globally similar) nontargets in both the left, χ2(1) = 4, p = .046, and right hemispheres, χ2(1) = 5, p = .025. For stereo viewing, there was a higher frequency of significant differences between targets and SD (locally similar) nontargets in the left hemisphere only, χ2(1) = 4, p = .046. The same pattern for stereo viewing was also found during the P2, χ2(1) = 4, p = .046, but there were no significant differences for the mono group. The N2/P3 component also showed a striking contrast in perceptual sensitivity to SD (locally similar) and DS (globally similar) nontargets between mono and stereo viewing. For mono viewing, there was a higher frequency of significant differences between targets and DS (globally similar) nontargets in the right hemisphere, χ2(1) = 10, p = .002. The opposite pattern was found for stereo viewing, with a higher frequency of significant differences between targets and SD (locally similar) nontargets in the left hemisphere, χ2(1) = 6.4, p = .011.
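For illustration, the frequency series entering these tests can be tallied from a significance mask such as the one produced by the mass_univariate() sketch above (names are our own):

```python
import numpy as np

def sig_frequency(mask, window, electrodes):
    """Number of significant electrodes per time frame within a component
    epoch (e.g., the N1), restricted to one hemisphere's electrode cluster.

    mask:       boolean (n_electrodes, n_times) array of significant samples.
    window:     (start, stop) frame indices of the component epoch.
    electrodes: indices of the left- or right-hemisphere cluster.
    """
    start, stop = window
    return mask[np.asarray(electrodes), start:stop].sum(axis=0)
```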

Further Analyses II: Effects of Training Viewpoint
The analyses so far show differential sensitivity to SD (locally similar) and DS (globally similar) nontargets between mono and stereo viewing. In brief, during mono viewing there is greater response modulation to targets versus DS (globally similar) nontargets in both the left and right hemispheres that begins during the N1 and continues into the later N2/P3 component. During stereo viewing, there is greater response modulation to targets versus SD (locally similar) nontargets that is predominant in the left hemisphere and that begins during the N1 but peaks only during the later N2/P3. In a final analysis, we examined whether these differential response patterns are modulated by viewpoint familiarity; that is, whether they generalize across image classification at trained and untrained views. Figure 10 shows a time series plot of the frequency distribution of significant differences between target and nontarget conditions for trained and untrained viewpoints. The data were analyzed as a nonparametric time series using the Friedman test. For the mono viewing group, the higher frequency of significant differences between targets and DS (globally similar) distractors in the left and right hemispheres during the N1 was found for trained viewpoints but did not generalize to untrained viewpoints (LH: χ2(1) = 4, p = .046; RH: χ2(1) = 4, p = .046). In contrast, for the stereo viewing group, there were no differences between trained and untrained viewpoints at the N1. For the mono group at the N2/P3, however, there was a higher frequency of significant differences between targets and SD (locally similar) distractors for trained than untrained viewpoints in the left hemisphere, χ2(1) = 6.4, p = .011. There was also a higher frequency of differences between targets and DS distractors in the left and right hemispheres for trained than untrained viewpoints (LH: χ2(1) = 10, p = .002; RH: χ2(1) = 10, p = .002). For the stereo group, there was a higher frequency of significant differences between targets and SD (locally similar) distractors for trained than untrained viewpoints in the left hemisphere, χ2(1) = 6.4, p = .011, and a higher frequency of differences between targets and DS (globally similar) distractors for trained than untrained viewpoints in the right hemisphere, χ2(1) = 6.4, p = .011.

Discussion
The main findings can be summarized as follows. First, the behavioral data provided evidence for an advantage in view generalization for stereo over mono displays, shown by higher accuracy in target classification at untrained views for stereo displays. Second, the ERP data showed differential amplitude responses to mono versus stereo viewing as early as 50–100 ms poststimulus onset, with higher P1 amplitudes for stereo displays. Third, we observed differential amplitude modulations of evoked potentials to targets and nontargets defined by shared parts (SD; locally similar) or shared spatial configuration (DS; globally similar) starting at the N1 component between 145 and 200 ms poststimulus onset. N1 amplitudes for mono displays showed greater differential sensitivity to DS (globally similar) nontargets; for stereo displays, there was greater differential amplitude modulation for SD (locally similar) nontargets at left hemisphere electrodes. Fourth, a pattern of differential amplitude modulation was also found at the later N2/P3 component, around 260–385 ms poststimulus onset. This was shown most clearly in the mass univariate analysis: for mono viewing, there was a higher frequency of significant differences between targets and DS (globally similar) nontargets; for stereo viewing, between targets and SD (locally similar) nontargets. Fifth, under mono viewing, the differential sensitivity to DS (globally similar) nontargets was found for trained but not untrained views. In contrast, the amplitude sensitivity in stereo viewing to SD (locally similar) nontargets was found for both trained and untrained views.
These new empirical findings have several important implications for models of object recognition. First, the results provide new evidence that the representation of complex 3D object shape involves the specification of higher-order part structure and 3D part configuration. This is shown by the differential sensitivity in the ERPs to shape differences between targets and nontargets defined by either shared local parts or shared 3D shape configuration. These differences emerged during the N1 component between approximately 145 and 200 ms poststimulus onset, and were also found during the N2/P3 component around 260–385 ms poststimulus onset. This finding is consistent with theoretical models, and other supporting empirical evidence, holding that the perceptual representation of complex 3D object shape involves the specification of higher-order part structure and global 3D spatial configuration (e.g., Arguin & Saumier, 2004; Behrmann et al., 2006; Behrmann & Kimchi, 2003; Biederman, 1987; Hummel & Stankiewicz, 1996; Marr & Nishihara, 1978). The results challenge theoretical models that do not attribute functional significance to these properties of object shape representations, including the hierarchical, feedforward HMAX deep (i.e., multilayer) network architecture (e.g., Riesenhuber & Poggio, 1999; Serre et al., 2007) and others (e.g., Bülthoff & Edelman, 1992; Chan et al., 2006; Khaligh-Razavi & Kriegeskorte, 2014; Krizhevsky et al., 2012; Li & Pizlo, 2011; Li et al., 2009; Pizlo, 2008).
Second, the results provide new evidence that the recognition of complex 3D object shape can be modulated by stereo visual input. This was shown in both the behavioral and the ERP data. Behaviorally, we found an advantage for object recognition under stereo viewing in classification accuracy for targets presented at previously untrained views. This observation adds to a growing body of behavioral evidence that stereo input can facilitate 3D object recognition, at least under some conditions (e.g., Bennett & Vuong, 2006; Burke, 2005; Burke et al., 2007; Chan et al., 2006; Edelman & Bülthoff, 1990; Hong Liu et al., 2006; Lee & Saunders, 2011; Rock & DiVita, 1987; Simons et al., 2002). According to Cristino et al. (2015), stereo input provides additional cues to 3D object shape including, for example, the specification of surface slant, curvature polarity, and 3D part configuration. We also found differential modulation of ERP amplitudes during mono and stereo viewing as a function of target/nontarget shape similarity: specifically, differential modulation under mono and stereo viewing for DS (globally similar) and SD (locally similar) distractors. This shows that stereo viewing can modulate perceptual processing of different attributes of 3D shape, contrary to the predictions of theoretical models that do not attribute functional significance to stereo information in the derivation of 3D object representations (e.g., Bülthoff & Edelman, 1992; Chan et al., 2006; Li & Pizlo, 2011; Li et al., 2009; Pizlo, 2008; Riesenhuber & Poggio, 1999; Serre et al., 2007). One interpretation of the results is that stereo viewing enhances processing of information about the 3D spatial configuration of object parts, and that this information facilitates the classification of SD (locally similar) distractors as nontargets on the basis of their distinct global 3D spatial configuration. In contrast, under mono viewing we found early differential sensitivity to DS (globally similar) distractors that shared spatial configuration but not local parts (that is, where targets and distractors can be differentiated on the basis of distinct local parts). This raises the possibility that, in the absence of stereo input (as is the case in most previous empirical studies of object processing), the perceptual analysis of 3D object shape is weighted toward differences in 2D local shape attributes. Furthermore, the enhanced processing of local part structure did not generalize to untrained views, suggesting that under monocular viewing conditions object shape processing may be weighted toward an 'image-based' processing strategy. Taken together, these findings suggest that mental representations of 3D object shape in human vision are rich in structure, encoding both 2D image-based local features and 3D shape properties, broadly consistent with a 'hybrid' approach to object recognition mediated by representations combining both 2D and 3D object structure (Foster & Gilson, 2002; Hummel, 2013; Hummel & Stankiewicz, 1996).

A recent study by Leek et al. (2016), using a sequential novel object matching task under conditions of mono viewing only, also reported early differential perceptual sensitivity to shape differences defined by either shared parts or global spatial configuration.
In that work, as in the current study, differential sensitivity in the perceptual matching of novel 3D objects was found to emerge earliest in amplitude modulations during the N1 component over posterior electrodes between objects sharing either local parts or global spatial configuration. The current data extend these findings in several important ways. First, we have shown that this differential perceptual sensitivity extends to an object recognition task in which observers are required to match a perceptual description of 3D object shape to a (previously learned) long-term memory representation. Second, the results show that this differential perceptual sensitivity is modulated by mono versus stereo input: mono viewing enhances sensitivity to local differences in part structure, while stereo viewing enhances sensitivity to differences in global 3D spatial configuration. Third, the stereo viewing effect generalizes across changes in 3D object viewpoint, whereas the perceptual sensitivity to local differences in part structure found under mono viewing was restricted to trained viewpoints.
Finally, one other issue merits brief discussion. Although our primary goal was to examine whether mono versus stereo visual input differentially modulates the perceptual processing of 3D object shape during recognition, we also observed an early perceptual sensitivity, and a lateral asymmetry, related to stereo disparity. We found the earliest differential responses to mono versus stereo input from around 50 ms poststimulus onset over a large group of posterior, temporal-occipital, and anterior leads. This difference was sustained during the P1 component over left occipital and some frontal electrodes. Additionally, we found greater P1 amplitudes at right than left hemisphere electrode sites. We take this to reflect early perceptual sensitivity to mono versus stereo input in our design. One might argue that these differences do not reflect the resolution of stereo disparity per se, but rather sensitivity to the presentation of different images to the left and right eye in the stereo condition. However, if this were the case, we would expect differences between mono and stereo presentation in all conditions regardless of target-distractor similarity. The observed interactions between stimulus type and viewing condition show that this was not the case.
In summary, we investigated whether stereo viewing modulates perceptual processing of 3D object shape. A recognition memory task was used in which observers were trained to recognize a subset of 3D novel objects under conditions of either mono or stereo viewing. In a subsequent test phase, they discriminated trained objects from nontargets that shared either local parts, 3D spatial configuration, or neither dimension, across both previously trained and novel viewpoints. The behavioral data showed a stereo advantage for generalization between trained and untrained views. ERP amplitudes also showed early differential sensitivity to similarity between targets and distractors defined by local parts and by 3D spatial configuration. This occurred during an N1 component from 145 to 200 ms poststimulus onset and during an N2/P3 component from 260 to 385 ms poststimulus onset. For mono viewing, amplitude modulation during the N1 was greatest between targets and distractors with different local parts, for trained views only. For stereo viewing, amplitude modulation during the N2/P3 was greatest between targets and distractors with different global 3D spatial configurations and generalized across trained and untrained views. The results show that image classification is modulated by stereo information about both the local part structure and the global 3D spatial configuration of object shape. The findings challenge current theoretical models that do not attribute functional significance to stereo input during the computation of 3D object shape.