Viewpoint dependence and scene context effects generalize to depth rotated three-dimensional objects

Viewpoint effects on object recognition interact with object-scene consistency effects. While recognition of objects seen from “noncanonical” viewpoints (e.g., a cup from below) is typically impeded compared to processing of objects seen from canonical viewpoints (e.g., the string-side of a guitar), this effect is reduced by meaningful scene context information. In the present study, we investigated whether these findings, established using photographic images, generalize to strongly noncanonical orientations of three-dimensional (3D) models of objects. Using 3D models allowed us to probe a broad range of viewpoints and empirically establish viewpoints with very strong noncanonical and canonical orientations. In Experiment 1, we presented 3D models of objects from six different viewpoints (0°, 60°, 120°, 180°, 240°, 300°) in color (1a) and in grayscale (1b) in a sequential matching task. Viewpoint had a significant effect on accuracy and response times. Based on the viewpoint effect in Experiments 1a and 1b, we could empirically determine the most canonical and noncanonical viewpoints from our set of viewpoints to use in Experiment 2. In Experiment 2, participants again performed a sequential matching task; this time, however, the objects were paired with scene backgrounds that could be either consistent (e.g., a cup in the kitchen) or inconsistent (e.g., a guitar in the bathroom) with the object. Viewpoint interacted significantly with scene consistency in that object recognition was less affected by viewpoint when consistent scene information was provided, compared to inconsistent information. Our results show that scene context supports object recognition even when using extremely noncanonical orientations of depth rotated 3D objects. This supports the important role object-scene processing plays for object constancy, especially under conditions of high uncertainty.


Introduction
Object recognition happens fast, automatically, and in most cases seems effortless to us. Because our environment is highly dynamic, especially when interacting with it, one and the same object will produce a range of different images on the retina. In fact, it is very unlikely that an object would produce the same retinal image twice owing to changes in viewpoint, lighting, reflections, or viewing distance. Still, our visual system is able to flexibly transform this variable visual input in a way that object identity can successfully be read out from the resulting abstract representations in higher areas of visual cortex (see DiCarlo & Cox, 2007).
Past research has made great advances toward understanding the mechanisms that underlie invariant object recognition when objects are presented in isolation (e.g., DiCarlo & Cox, 2007). More recently, however, researchers have started to investigate the viewpoint problem in the context of object-scene processing. Object recognition rarely occurs in isolation, where the only available information is the object's features. In our everyday lives, we encounter objects within certain contexts, which provide us with a pool of complex visual and multimodal information that is integrated during object recognition. Past research has shown that context facilitates object recognition (Biederman, Mezzanotte, & Rabinowitz, 1982; Oliva & Torralba, 2007; for a recent review see Lauer et al., 2021). Evidence from behavioral as well as neurophysiological studies (e.g., Brandman & Peelen, 2017) suggests an interactive processing of objects and scenes. For instance, objects placed in semantically consistent contexts are recognized faster and more accurately, often referred to as the scene-consistency effect (Davenport & Potter, 2004; Palmer, 1975). Accordingly, models of object recognition have been updated to incorporate the integration of contextual information (Bar, 2004). Further, frameworks incorporating object-scene and object-object relations (e.g., the so-called scene grammar) describe a set of internalized rules based on regularities found in real-world scenes that facilitate scene and object perception and guide our attention during different visual cognitive tasks (Draschkow & Võ, 2017; Josephs, Draschkow, & Võ, 2016; Võ, 2021; Võ, Boettcher, & Draschkow, 2019; Võ & Henderson, 2009; Võ & Wolfe, 2013a; Võ & Wolfe, 2013b).
Recent work has also looked at influences of object and scene orientation on the scene consistency effect (Lauer, Schmidt, & Võ, 2020; Sastyin et al., 2015). Sastyin et al. (2015) conducted a series of experiments investigating the interaction between viewpoint and scene consistency on object and scene recognition. They used photographic images of objects from canonical and noncanonical viewpoints and paired them with consistent and inconsistent scenes. They evaluated viewpoints in a relative manner with canonical viewpoints containing more canonical characteristics than noncanonical viewpoints as determined by rating the stimuli. Others have defined canonical viewpoints as the viewpoint from which one would photograph an object or the viewpoint from which one sees the object when imagining it, mostly finding off-axis views to be preferred (Blanz, Tarr, & Bülthoff, 1999; Cutzu & Edelman, 1994; Palmer, Rosch, & Chase, 1981). It has been shown that using these criteria leads to relatively consistent results between participants. Sastyin et al. (2015) found a significant interaction between viewpoint and consistency, where the viewpoint effect was weaker when consistent scene information was provided. The authors concluded that object recognition relied more on context information if the object was presented from a noncanonical viewpoint.
These results are an impressive example of how contextual scene information can support object recognition. Here, we investigated whether the contextual modulation of viewpoint effects generalizes to strongly noncanonical object orientations. That is, we investigated object-scene processing under conditions that produce high uncertainty. This is an important test of the visual system's ability to flexibly rely more on recurrent top-down modulation from scene context when objects are difficult to recognize. In our study, we used three-dimensional (3D) models of objects to create our stimulus set. The use of 3D models to test conditions of object constancy has led to valuable insights such as uncovering the stages of shape- and size-invariant object recognition in the visual system (Isik, Meyers, Leibo, & Poggio, 2014), as well as investigating the features and computational transformations that support 3D object recognition (Biederman & Gerhardstein, 1993; Gauthier et al., 2002; Isik et al., 2014; Logothetis, Pauls, Bülthoff, & Poggio, 1994; Poggio & Edelman, 1990; Ratan Murty & Arun, 2018; Zisserman et al., 1995). In our case, the use of 3D models is motivated by the ability to create a set of highly noncanonical viewpoints in a controlled manner while retaining naturalistic properties, such as the 3D structure of the objects from each viewpoint. Recent work using 3D immersive environments has highlighted the importance of studying vision under more naturalistic constraints in order to investigate cognitive processes in the context of natural behavior (Draschkow, Kallmayer, & Nobre, 2021; Helbing, Draschkow, & Võ, 2020; Helbing, Draschkow, & Võ, 2022; Kristjánsson & Draschkow, 2021).
In the present study, we conducted three behavioral experiments. In our first two experiments (Experiments 1a and 1b), we presented 3D models of real-world objects from six different angles (0°, 60°, 120°, 180°, 240°, and 300°) rotated around the pitch axis in a word-picture verification task. Because rotating the objects around the pitch axis results in highly atypical viewpoints, we expected to find viewpoint-dependent recognition indicated by lower accuracy and slower RTs. In Experiment 1b, we wanted to replicate Experiment 1a with grayscale versions of the images, expecting similar effects of viewpoint as for Experiment 1a (Hayward & Williams, 2000). Experiments 1a and 1b also served to identify viewpoints that produced highest (canonical) and lowest (noncanonical) recognition performance, which we then used in Experiment 2.
In Experiment 2, we paired 3D objects presented in canonical (0° rotation) and noncanonical (120° rotation) viewpoints with semantically consistent and inconsistent scenes. Our aim was to test if viewpoint dependence and object-scene processing effects (Sastyin et al., 2015) generalize to depth rotated 3D models of objects.

Participants
Participants were recruited at Goethe-University Frankfurt am Main. The sample consisted of 12 participants who completed Experiment 1a (6 women, M age = 23.92 years, range = 19-29 years), 12 different participants who completed Experiment 1b (8 women, M age = 19 years, range = 18-22 years), and another set of 32 participants who completed Experiment 2 (25 women, M age = 24.28 years, range = 18-51 years). The sample size of Experiment 2 was a priori chosen to be higher compared to previous studies which found reliable effects across multiple experiments with 20 participants (e.g., Sastyin et al., 2015). In Experiment 1a, all except for six participants were psychology students who were compensated with course credits; the remaining participants volunteered for the experiment without any compensation. All had normal or corrected-to-normal vision, were native German speakers, and were unfamiliar with the stimulus materials. Written informed consent was obtained before participation. Data collection and analyses were carried out according to guidelines approved by the Human Research Ethics Committee of the Goethe University Frankfurt.

Stimulus material
For Experiments 1a and 1b, we collected 100 3D models of objects from a broad range of categories such as furniture, foods, vehicles, plants, and electrical devices. Eighty-two of the 3D models were purchased from CG Axis Complete packages I, II, III, and V, and 18 additional models were obtained free of charge from sources like TurboSquid and free3D. Each model was rotated around its pitch axis by 0°, 60°, 120°, 180°, 240°, and 300° and sized to fit a 60 cm × 60 cm × 60 cm box using the free 3D animation software Blender. Importantly, we chose the most frontal view for the 0° label, not necessarily because it was the most canonical out of all possible views (usually off-axis views are perceived as more canonical; e.g., Palmer et al., 1981), but because it did not include any additional in-plane rotations or rotations around other cardinal axes. Crucially, it still allowed us to determine the most canonical and noncanonical views out of the chosen set of views. A snapshot from each angle was systematically recorded in front of a gray background using the virtual reality software Vizard5 to create our final stimulus set of 600 images. Additionally, we created grayscale versions of these images for Experiment 1b using the GrayscaleEffect function in Vizard5 (https://docs.worldviz.com/vizard/latest/postprocess_color.htm).
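The pitch rotations described above can be illustrated with a minimal sketch. It assumes a coordinate system in which the pitch axis is the x-axis and y points up; the actual axis convention in Blender or Vizard5 may differ, and the example vertex is hypothetical.

```python
import math

# The six viewpoints used in Experiments 1a and 1b (degrees of pitch rotation)
VIEWPOINTS = [0, 60, 120, 180, 240, 300]


def rotate_pitch(point, degrees):
    """Rotate a 3D point (x, y, z) about the x-axis.

    Assumption: the x-axis plays the role of the pitch axis here.
    """
    x, y, z = point
    theta = math.radians(degrees)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    return (x, y * cos_t - z * sin_t, y * sin_t + z * cos_t)


# Hypothetical vertex on top of an object, 30 cm above its center
# (half the edge of the 60 cm bounding box)
top = (0.0, 30.0, 0.0)

for angle in VIEWPOINTS:
    x, y, z = rotate_pitch(top, angle)
    print(f"{angle:3d} deg -> ({x:.1f}, {y:.1f}, {z:.1f})")
```

At 180° the vertex ends up directly below the center, which corresponds to an upside-down view of the object seen from behind.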
For Experiment 2, we used the same 3D models as in Experiment 1 and added 56 models collected from the CGAxis packages, resulting in a total of 156 models. Instead of creating snapshots of all six angles, we chose the two viewpoints that had previously produced the highest (canonical viewpoint; 0°) and lowest (noncanonical viewpoint; 120°) recognition performance averaged over Experiments 1a and 1b. We converted the images to grayscale using the previously described method.
Additionally, we collected 312 photographic images of scenes, one consistent and one inconsistent scene for each object. We defined a consistent scene as one in which we would expect the object to appear naturally. In both cases, the target object was not present in the scene. Most of the photographs were obtained from the SCEGRAM database (Öhlschläger & Võ, 2017), as well as from Google images. In Experiment 2, objects were presented as templates superimposed on scenes. This was done in line with previous work investigating the influence of object and scene orientation on scene-consistency effects (Lauer et al., 2020; Lauer & Võ, 2022).

Procedure
To investigate the speed and accuracy of object recognition, while keeping the procedure comparable with previous studies, a word-picture verification task was used for all experiments (Figure 1). Participants were instructed on screen as well as through standardized verbal instructions to decide as quickly and accurately as possible whether the object on screen matched the basic level category label presented to them at the beginning of the trial using a corresponding match or mismatch key. Participants were not made aware of the different viewpoint conditions beforehand.

Design
Experiments 1a and 1b consisted of six blocks with 100 trials each. In each block, each object was presented from a different angle (0°, 60°, 120°, 180°, 240°, or 300°) chosen randomly and counterbalanced between participants. The order of objects within each block was randomized. Each object appeared three times in the match condition (object image matched basic level category label) and three times in the mismatch condition (object image did not match basic level category label), randomized between blocks.
In the mismatch condition, the basic level category label stemmed from a different superordinate category than the object image (e.g., the label "chair" as part of the superordinate category "furniture" was paired with an image of a "car" as part of the superordinate category "vehicle").
Because there was no effect of viewpoint in the mismatch condition in Experiments 1a and 1b, most trials in Experiment 2 were match trials (n = 120), with 23% mismatch trials (n = 36) that were later excluded from analysis. In Experiment 2, each object was presented to each participant once, and we counterbalanced consistency (consistent vs. inconsistent) and viewpoint (canonical vs. noncanonical) between participants.

Data analysis
In Experiments 1a and 1b, we were interested in the effects of viewpoint (how far the object was rotated away from its canonical 0° angle) and match (whether the object matched the basic level category label as part of the experimental design) on reaction times (time between the onset of the object image and keypress response) and accuracy. In Experiment 2, we were interested in the interaction between viewpoint (canonical vs. noncanonical viewpoint) and scene consistency (consistent vs. inconsistent scene) on reaction times and accuracy.
Raw data were preprocessed and analyzed using R (R Core Team, 2021). Objects that produced accuracy ratings that deviated more than 2.5 standard deviations (SD) from the mean (computed for each condition separately) were excluded from analysis. Based on this criterion, we excluded four objects in Experiment 1a, one in Experiment 1b, and two in Experiment 2. We based our reaction time analysis on correctly matched trials only (percent trials removed: Experiment 1a = 4.45%, Experiment 1b = 10.16%, Experiment 2 = 8.55%).
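The exclusion criterion can be sketched as follows for a single condition. This is an illustration in Python rather than the R code used for the analysis; the function name, the object labels, and the accuracy values are hypothetical, and the use of the sample standard deviation is an assumption.

```python
from statistics import mean, stdev


def flag_outlier_objects(accuracy_by_object, criterion=2.5):
    """Return objects whose accuracy deviates more than `criterion` SDs
    from the mean accuracy within one condition.

    Assumption: the sample standard deviation (n - 1) is used.
    """
    values = list(accuracy_by_object.values())
    m, sd = mean(values), stdev(values)
    return sorted(
        name
        for name, acc in accuracy_by_object.items()
        if abs(acc - m) > criterion * sd
    )


# Hypothetical per-object accuracies for one condition:
# 19 objects recognized well, one recognized poorly
accuracies = {f"object_{i:02d}": 0.95 for i in range(1, 20)}
accuracies["object_20"] = 0.40

print(flag_outlier_objects(accuracies))
```

Because the criterion is computed for each condition separately, the function would be applied once per condition, and objects flagged in this way would then be removed.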
In our data analysis, we used (generalized) linear mixed-effects models ((G)LMMs) using the lme4 package (Bates, Mächler, Bolker, & Walker, 2015). We chose this approach because of its potential advantages over analysis of variance, notably that it allows us to simultaneously estimate by-participant and by-stimulus variance (Baayen, Davidson, & Bates, 2008; Bates, Mächler, Bolker, & Walker, 2014; Kliegl, Wei, Dambacher, Yan, & Zhou, 2011). The random effects structure of each model was determined using a drop-one procedure starting with the full model including by-participant and by-stimulus varying intercepts and slopes for the main effects in our design. We then removed random slopes that did not contribute significantly to the goodness of fit as determined by likelihood ratio tests. This strategy allowed us to avoid overparameterization and produce converging models that are supported by the data. Details about the individual analysis and models are described in the Data analysis sections of each experiment. For each GLMM, we report β regression coefficients together with the z statistic and apply a two-tailed 5% error criterion for significance testing. The p values for the binary accuracy variable are based on asymptotic Wald tests. Additionally, reaction times were transformed following the Box-Cox procedure (Box & Cox, 1964) to correct for deviation from normality so as to better meet LMM assumptions (see individual Data analysis sections for further details). For the LMMs, regression coefficients are reported with the t statistic, and p values were calculated with the lmerTest package (Kuznetsova, Brockhoff, & Christensen, 2017). We defined sum contrasts for match (match vs. mismatch) and consistency (consistent vs. inconsistent), where slope coefficients represent differences between factor levels and the intercept is equal to the grand mean.
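A single step of the drop-one procedure compares a model with a given random slope against the same model without it. The sketch below shows the likelihood ratio test for such a comparison. It is a Python illustration, not the lme4 workflow itself; the log-likelihood values are hypothetical, and only the closed-form chi-square tail probabilities for one and two degrees of freedom are implemented.

```python
import math


def likelihood_ratio_test(loglik_full, loglik_reduced, df):
    """Likelihood ratio test for two nested models.

    `df` is the number of parameters removed from the full model.
    Only df = 1 and df = 2 are supported in this sketch, because the
    chi-square survival function has a simple closed form in those cases.
    """
    statistic = max(0.0, 2.0 * (loglik_full - loglik_reduced))
    if df == 1:
        p_value = math.erfc(math.sqrt(statistic / 2.0))
    elif df == 2:
        p_value = math.exp(-statistic / 2.0)
    else:
        raise ValueError("this sketch only supports df = 1 or df = 2")
    return statistic, p_value


# Hypothetical log-likelihoods: dropping a random slope and its
# correlation with the intercept removes two parameters
stat, p = likelihood_ratio_test(loglik_full=-1520.4, loglik_reduced=-1521.1, df=2)
keep_slope = p < 0.05
print(f"chi2 = {stat:.2f}, p = {p:.3f}, keep slope: {keep_slope}")
```

In this hypothetical case the slope would be dropped, because removing it does not significantly reduce the goodness of fit. Note that likelihood ratio tests on variance components tend to be conservative, since the null value lies on the boundary of the parameter space.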

Apparatus
All experimental sessions were carried out in the same six experimental cabins of the department of psychology at Goethe-University Frankfurt am Main, containing the same experimental setup (computers running Windows 10). Stimulus presentation was controlled, and RTs and accuracy were recorded, by OpenSesame (Mathôt, Schreij, & Theeuwes, 2012). Stimuli were presented on a 19-in monitor (resolution = 1,680 × 1,050, refresh rate = 60 Hz, viewing distance = approximately 65 cm, subtending approximately 11.13° × 9.28° of visual angle for the object images and approximately 19.0° × 15.84° of visual angle for the background images).
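The relation between on-screen stimulus size, viewing distance, and visual angle follows the standard formula, angle = 2 · atan(size / (2 · distance)). The sketch below applies it in both directions. The physical sizes it prints are derived from the reported angles and the approximate 65 cm viewing distance; they are not values stated in the text.

```python
import math


def visual_angle_deg(size_cm, distance_cm):
    """Visual angle (degrees) subtended by a stimulus of a given size."""
    return math.degrees(2.0 * math.atan(size_cm / (2.0 * distance_cm)))


def size_from_angle_cm(angle_deg, distance_cm):
    """On-screen size (cm) that subtends a given visual angle."""
    return 2.0 * distance_cm * math.tan(math.radians(angle_deg) / 2.0)


VIEWING_DISTANCE_CM = 65.0  # approximate viewing distance reported above

# Reported angular extents (width, height) in degrees
stimuli = {
    "object image": (11.13, 9.28),
    "background image": (19.0, 15.84),
}

for name, (width_deg, height_deg) in stimuli.items():
    width_cm = size_from_angle_cm(width_deg, VIEWING_DISTANCE_CM)
    height_cm = size_from_angle_cm(height_deg, VIEWING_DISTANCE_CM)
    print(f"{name}: about {width_cm:.1f} cm x {height_cm:.1f} cm")
```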

Experiments 1a and 1b
In Experiments 1a and 1b, we investigated the effect of viewpoint on object recognition RT and accuracy using 3D models of objects rotated around the pitch axis (0°, 60°, 120°, 180°, 240°, and 300°). The only difference between the experiments was that 3D models were presented either in color (Experiment 1a) or a grayscale version of the model was used (Experiment 1b). Participants had to indicate whether the object matched the previously presented basic level category label.

Procedure
Participants were presented with a fixation point in the middle of the screen followed by a basic level object category label (in German; font: Droid Sans Mono; font size: 26; color: black). This presentation was followed by the target object presented in the middle of the screen, which could either match or mismatch the label, until the participant gave

Data analysis
After data preprocessing, we used a binomial GLMM to examine the effects of viewpoint and match on accuracy. As fixed effects we included viewpoint (0°, 60°, 120°, 180°, 240°, or 300°) as a first- and second-degree polynomial, the match versus mismatch comparison, and the interactions between these terms. The second-degree polynomial viewpoint term was added because we expected viewpoint to affect recognition in a nonlinear manner (symmetry around 180°). Our final model included random intercepts for participants and stimuli, as well as a by-stimuli random slope for the match versus mismatch effect for Experiment 1a, and random intercepts for participants and stimuli, as well as a by-stimuli and by-participant random slope for the match effect for Experiment 1b.
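To make the first- and second-degree polynomial terms concrete, the sketch below builds orthogonal linear and quadratic predictors for the six viewpoints by Gram-Schmidt orthogonalization. Whether orthogonal or raw polynomials were used in the models is not stated, so the orthogonal coding is an assumption of this illustration.

```python
import math


def orthogonal_poly(levels, degree=2):
    """Orthonormal polynomial predictors (degree 1..degree) for numeric levels.

    Each column is orthogonal to the constant term and to all lower-degree
    columns, and is scaled to unit length.
    """
    n = len(levels)
    basis = [[1.0] * n]  # constant term
    columns = []
    for d in range(1, degree + 1):
        col = [float(x) ** d for x in levels]
        # Remove the components along all lower-degree terms
        for b in basis:
            proj = sum(c * v for c, v in zip(col, b)) / sum(v * v for v in b)
            col = [c - proj * v for c, v in zip(col, b)]
        norm = math.sqrt(sum(c * c for c in col))
        col = [c / norm for c in col]
        basis.append(col)
        columns.append(col)
    return columns


viewpoints = [0, 60, 120, 180, 240, 300]
linear, quadratic = orthogonal_poly(viewpoints, degree=2)

for angle, lin, quad in zip(viewpoints, linear, quadratic):
    print(f"{angle:3d} deg  linear = {lin:+.3f}  quadratic = {quad:+.3f}")
```

The quadratic predictor on its own is symmetric around the midpoint of the sampled angles. In a fitted model, the combination of linear and quadratic coefficients determines where the extremum of the curve falls.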
Based on the power coefficient output of the Box-Cox procedure (λ = 0.22), RTs were log transformed. We used the same fixed effects structure for the RT-LMMs as for the accuracy-GLMMs. As random effects, we entered random intercepts for participants and stimuli, as well as by-participant and by-stimuli random slopes for the effect of match for Experiments 1a and 1b.
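The link between the Box-Cox power coefficient and the log transformation can be sketched as follows. The Box-Cox family is (y^λ − 1) / λ for λ ≠ 0 and log(y) for λ = 0, so a small estimated λ is often taken to favor the log transformation; the exact decision rule applied to λ = 0.22 is not stated, and the RT value below is hypothetical.

```python
import math


def box_cox(y, lam):
    """Box-Cox power transformation of a single positive value."""
    if y <= 0:
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return math.log(y)
    return (y ** lam - 1.0) / lam


rt_ms = 650.0  # hypothetical reaction time in milliseconds

# As lambda shrinks toward 0, the transform approaches the natural log
for lam in (1.0, 0.5, 0.22, 0.01, 0.0):
    print(f"lambda = {lam:<4}  transformed RT = {box_cox(rt_ms, lam):.3f}")
```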

Discussion
In Experiment 1a, we found viewpoint-dependent object recognition for objects rotated around the pitch axis. This effect can best be described by a quadratic curve that approximates symmetry around 120° rotation. We also found that in our sequential matching task, only the match condition produced viewpoint-dependent behavior, whereas mismatch trials seemed unaffected by viewpoint. Finding a mismatch might rely more on the analysis of global, viewpoint-invariant features, whereas matching might be more dependent on the analysis of local, viewpoint-dependent features (e.g., Jolicoeur, 1990a); for example, deciding that a shape is not a car might require less viewpoint-dependent information than identifying the shape as a chair. In Experiment 1b, we were able to replicate our results from Experiment 1a. Grayscaling the images seemed to have made the overall task slightly more difficult, while still producing similarly viewpoint-dependent behavior. Although some studies report mirror confusion effects for rotations around 180° (e.g., Gregory & McCloskey, 2010), we did not encounter this phenomenon in our study. In our case, rotating around the pitch axis produced views such as "upside-down, from behind," which is atypical of images that usually produce mirror confusions. The canonical (0°) and noncanonical (120°) viewpoints we used in Experiment 2 represented viewpoints that produced the best and worst recognition performance derived from average accuracy ratings obtained from Experiments 1a and 1b.

Experiment 2
In Experiment 2, we paired canonical (0°) and noncanonical (120°) viewpoints with consistent and inconsistent scene contexts. We were specifically interested in the interaction between viewpoint and consistency, with the expectation that meaningful scene context information would decrease the effect of viewpoint on object recognition.

Procedure
In Experiment 2, we used the same word-picture verification task as in Experiments 1a and 1b (Figure 1B). Scene context was provided by first previewing the consistent or inconsistent scene for 300 ms and then overlaying the target object on top of the scene background until a response was given.

Data analysis
For both the accuracy-GLMM and the RT-LMM, we entered viewpoint, consistency, and their interaction as fixed effects. The GLMM included random intercepts for participants and stimuli, as well as a by-stimuli random slope for the effect of viewpoint. RT data were log transformed.
The RT-LMM included random intercepts for participants and stimuli, a by-participant random slope for the effect of viewpoint, and by-stimuli random slopes for the effects of viewpoint and consistency.

Accuracy
Accuracy was significantly higher for canonical viewpoints than for noncanonical viewpoints as revealed by the GLMM (β = 0.68, SE = 0.14, z = 4.82, p < 0.001), but there was no significant main effect for consistency (β = 0.06, SE = 0.07, z = 0.75, p = 0.45). Critically, there was a significant interaction between viewpoint and consistency (β = −0.21, SE = 0.07, z = −2.84, p = 0.004) (Figure 3A). Post hoc interaction contrasts revealed that the viewpoint-dependence effect was significantly stronger in the inconsistent scene condition compared to the consistent scene condition (β = −0.84, SE = 0.3, z = −2.84, p = 0.005). This finding is in line with our hypothesis that providing meaningful scene context can decrease the effects of viewpoint on object recognition. Additionally, the scene-consistency effect was only significant in the noncanonical condition (β = 0.53, SE = 0.15, z = 3.45,

Discussion
In general, object recognition accuracy was viewpoint dependent; however, there was a significant interaction between viewpoint and consistency. In line with our hypothesis, the viewpoint effect was significantly weaker for consistent scenes and the scene consistency effect was only observed for noncanonical viewpoints (Figure 3A). Objects shown from noncanonical viewpoints were recognized significantly more slowly than objects shown from canonical viewpoints. However, this result was unaffected by scene consistency.
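The relation between the interaction coefficient and the post hoc contrast reported for accuracy can be checked with a short calculation. The sketch assumes that both factors were sum coded as +1 and −1, with canonical and consistent as the +1 levels; this coding is an assumption, although it reproduces the reported contrast (4 × −0.21 = −0.84 on the log-odds scale).

```python
def simple_effects(b_viewpoint, b_consistency, b_interaction):
    """Simple effects (log-odds) implied by a 2 x 2 model with +1/-1 sum coding.

    Assumption: canonical = +1, noncanonical = -1,
    consistent = +1, inconsistent = -1.
    """
    # Viewpoint effect (canonical minus noncanonical) within each scene type
    viewpoint_consistent = 2.0 * (b_viewpoint + b_interaction)
    viewpoint_inconsistent = 2.0 * (b_viewpoint - b_interaction)
    # Consistency effect (consistent minus inconsistent) within each viewpoint
    consistency_canonical = 2.0 * (b_consistency + b_interaction)
    consistency_noncanonical = 2.0 * (b_consistency - b_interaction)
    return {
        "viewpoint | consistent": viewpoint_consistent,
        "viewpoint | inconsistent": viewpoint_inconsistent,
        "interaction contrast": viewpoint_consistent - viewpoint_inconsistent,
        "consistency | canonical": consistency_canonical,
        "consistency | noncanonical": consistency_noncanonical,
    }


# Coefficients reported for the accuracy GLMM in Experiment 2
effects = simple_effects(b_viewpoint=0.68, b_consistency=0.06, b_interaction=-0.21)
for name, value in effects.items():
    print(f"{name}: {value:+.2f}")
```

Under this coding, the implied viewpoint effect is larger for inconsistent than for consistent scenes, and the implied consistency effect at the noncanonical viewpoint is close to the reported value of 0.53. The small discrepancy is plausibly due to rounding of the published coefficients.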

General discussion
In the present study, we investigated how scene context information modulates viewpoint-dependent object recognition under conditions of high uncertainty using 3D models of everyday objects. Although providing meaningful context did not eradicate the viewpoint effect fully, it significantly decreased recognition accuracy costs. By extending previous findings (Sastyin et al., 2015), we provide further support for a model of object recognition that incorporates context (e.g., Bar, 2004), while dynamically adapting to the amount of available information based not only on visual features of the object (Burgund & Marsolek, 2000; Hayward & Tarr, 1997; Jolicoeur, 1990), but also context.
It is assumed that, when objects are presented in context, rapidly accessed low spatial frequency information is fed back to the occipito-temporal cortex facilitating high spatial frequency based analysis during object recognition (Bar, 2004; Kauffmann, Ramanoël, & Peyrin, 2014; Peyrin, Chauvin, Chokron, & Marendaz, 2003; Peyrin, Baciu, Segebarth, & Marendaz, 2004). The highly noncanonical viewpoints we used in our experiments produce high uncertainty in the initial set of possible target objects. We show that, under conditions where low spatial frequency analysis of the object leads to very ambiguous target candidates, the visual system relies more on top-down regulation modulated by recurrent processing of low spatial frequency information from the scene (Bar, 2004).
This finding further motivates models of object constancy (the visual system's ability to produce representations that are robust to changes in, for example, viewpoint or lighting; e.g., DiCarlo & Cox, 2007) that efficiently integrate contextual information and can lead to both viewpoint-dependent and invariant behavior based on available information and the task at hand.
A key component of the present study was to generalize previous findings on object-scene processing effects and viewpoint dependence to depth-rotated 3D objects. We want to highlight the importance of generalizing findings from traditional two-dimensional settings to more naturalistic settings and stimuli. Kristjánsson and Draschkow (2021) have shown very illustratively for a variety of phenomena that, given more naturalistic constraints, a system is able to circumvent, for example, capacity limits by drawing on the rich visual experience of natural environments. Although we did not use fully immersive environments, using 3D models offers a more realistic encounter of everyday objects and, therefore, a more precise measure of viewpoint dependence in real-world object recognition. It should be noted, however, that there is a trade-off between naturalistic-looking stimuli (i.e., photographs) and stimuli that more precisely capture naturalistic properties (i.e., 3D structure of objects from different viewpoints) in a highly controlled manner, while not looking as naturalistic. Here, we opted for providing more naturalistic 3D properties of the displayed objects.
From the present study, it is unclear what kind of information contained in the scenes was responsible for decreasing the viewpoint costs. Rapidly accessed global information such as the gist of the scene (Oliva & Torralba, 2007) could be the main factor. At the same time, more local information such as the detection and recognition of certain objects in the scene preview could provide information about related possible target objects based on internalized scene-object and object-object regularities (Võ et al., 2019). Revealing the time course of when what kind of contextual information is integrated to buffer viewpoint effects would provide new insights into how the visual system so effortlessly achieves invariant object recognition.
Varying what information is presented during the task (i.e., providing meaningful context vs. showing objects in isolation) is one way to probe the visual system's ability to overcome processing limitations in viewpoint-dependent object recognition. Alternatively, one could keep the visual input constant, but vary the level at which participants have to perform the matching task (Hamm & McMullen, 1998). If there are object representations that contain more or less viewpoint-dependent or invariant information, how does this factor interact with the integration of contextual information in the form of scene context?
Finally, we would like to note that, on average, performance was high in the matching task throughout all our experiments. These ceiling effects are probably due to the type of task we chose, which differs from the tasks usually used to study scene consistency effects (Davenport & Potter, 2004; Sastyin et al., 2015). Despite these differences in difficulty, we were able to demonstrate a significant decrease in viewpoint costs by providing meaningful scene context.
Past research has made strong advances toward understanding the computations that underlie invariant object recognition (DiCarlo & Cox, 2007). Understanding these mechanisms in isolation is key to understanding object recognition in general. We argue that understanding how the visual system is able to make use of richly structured naturalistic environments to circumvent computational bottlenecks will ultimately lead to better, more robust models of object recognition and inspire approaches in fields such as computer vision (e.g., Bomatter et al., 2021).
To conclude, in the present study we show that scene context supports object recognition, even when using extremely noncanonical orientations of depth rotated 3D objects. We highlight the importance of testing capacity limits of object recognition in more naturalistic frameworks to build more robust and flexible models and move toward a better understanding of vision under naturalistic constraints.

Figure 1. Exemplary overview of a subselection of stimuli used in Experiment 1a and the viewpoints used when presenting them (A). Trial procedures for the matching task in Experiments 1a and 1b (B) and Experiment 2 (C). The object was presented in color in Experiment 1a and in grayscale in Experiment 1b. Note that the depicted labels are in English for visualization purposes (German in the original experiment). Participants had to press the "c" key on their keyboard to indicate a match between label and image, and the "m" key to indicate a mismatch.