Effects of natural scene inversion on visual-evoked brain potentials and pupillary responses: a matter of effortful processing of unfamiliar configurations

The inversion of a picture of a face hampers the accuracy and speed at which observers can perceptually process it. Event-related potentials and pupillary responses, successfully used as biomarkers of face inversion in the past, suggest that the perception of visual features that are organized in an unfamiliar manner recruits additional, demanding processes. However, it remains unclear whether such inversion effects generalize beyond face stimuli and whether more mental effort is indeed needed to process inverted images. Here we aimed to study the effects of natural scene inversion on visual evoked potentials and pupil dilations. We simultaneously measured responses of 47 human participants to presentations of images showing upright or inverted natural scenes. For inverted scenes, we observed relatively stronger occipito-temporo-parietal N1 peak amplitudes and larger pupil dilations (on top of an initial orienting response) than for upright scenes. This study revealed neural and physiological markers of natural scene inversion that are in line with inversion effects of other stimulus types and demonstrates the robustness and generalizability of the phenomenon that unfamiliar configurations of visual content require increased processing effort.


Introduction
The visual world is too rich in detail to be fully captured and processed by the brain. To prevent overload and exuberant energy consumption, neural networks across the visual hierarchy employ several tricks, such as adaptation, sparse coding, predictive coding, and other spatiotemporal information reduction mechanisms, to aid rapid yet energy-efficient subjective perception (Barlow, 1990; Huang and Rao, 2011). These mechanisms ensure that sensory information, when organized in a familiar (and predictable) manner, is processed faster and more accurately. Conversely, observers process unfamiliar organizations of visual content more slowly and less accurately (e.g., Itier et al., 2006; McLaren, 1997; Yin, 1969). Here we investigate effects of image inversion, a popular method, especially in studies on face perception (Rossion, 2009; Valentine, 1988; Yovel, 2016), to substantially decrease an observer's familiarity with the organization of visual features. While behavioral measures (e.g., recognition performance and reaction times) provide valuable insights, we will use physiological and neural measurements, as these provide alternative, reliable, and objective insights into the underlying mechanisms (e.g., Frässle et al., 2014). More specifically, we will measure pupillary responses and event-related potentials (ERPs) to upright and inverted natural scenes as (neuro-)physiological markers of the degree of (additional) effort required to process (un)familiar scenes (Minnebusch and Daum, 2009; Strauch et al., 2022).

Face, body, and object inversion
Inverted images and their effects on visual and emotional processing have received much scientific attention, especially in the field of face perception. Human faces are disproportionately more difficult to recognize or memorize when presented upside down (e.g., Yin, 1969). Several studies have attributed this detrimental effect to configural distortions (Bartlett and Searcy, 1993; Freire et al., 2000) that presumably lead to the inability to form a global, holistic percept (Murray et al., 2000; Valentine, 1988) based on local face features that are unaffected by inversion (Farah et al., 1995a). Conversely, others argued that face inversion (or face distortion) effects are not caused by global configural distortions (Konar et al., 2010; Riesenhuber et al., 2004) but by reduced experience with processing local facial features for recognition (Sekuler et al., 2004). Whatever process is altered by face inversion, researchers commonly associate such effects with the N170 electroencephalography (EEG) component, likely stemming from increased activity in brain areas involved in object processing, including (right-lateralized) occipito-temporal and temporo-parietal regions (Aguirre et al., 1999; Bentin et al., 1996; Eimer, 2011; Haxby et al., 1999; Itier and Taylor, 2004; Jacques et al., 2019; Rossion et al., 1999; Rousselet et al., 2004; Yovel and Kanwisher, 2005).
Some researchers demonstrated even earlier inversion effects, around 70-100 ms after picture onset, using magnetoencephalography (Liu et al., 2002). In addition to such neural signatures, pupil size also serves as a marker of face inversion, showing stronger dilations to inverted than to upright faces, supposedly reflecting the allocation of more mental effort to process the unfamiliar facial configuration (Conway et al., 2008; Falck-Ytter, 2008). Interestingly, it is not yet clear whether inversion effects on ERPs and pupil dilation generalize beyond face stimuli to bodies, houses, and other objects. Some studies find evidence in favor of object inversion effects on behavior and brain potentials (Eimer, 2000; Epstein et al., 2006; Minnebusch and Daum, 2009; Mohamed et al., 2011; Reed et al., 2003; Righart and de Gelder, 2007; Stekelenburg and de Gelder, 2004), while others find much weaker or no such evidence (Bentin et al., 1996; Diamond and Carey, 1986; Itier et al., 2007; Rossion et al., 2000; Rousselet et al., 2007).
Beyond these inconsistencies, the effects of inverting complex images that display multiple or indistinct objects, as is often the case in images of natural scenes (landscapes), are even less clear.

Scene inversion
Effects of scene inversion on behavior are in line with face inversion effects. The presentation of scenes, here defined as pictures of landscapes or complex objects with cluttered backgrounds, results in impaired recognition performance and delayed reaction times for inverted pictures (Epstein et al., 2006; Scapinello and Yarmey, 1970; Walther et al., 2009). Note that scene inversion effects may be weaker than those of face inversion (Rousselet et al., 2003). Besides behavior, the peripheral nervous system is also affected by scene inversion. Inverted as compared to upright images of scenes evoke relatively weaker pupil constrictions (or stronger pupil dilations on top of an initial pupil constriction) (Castellotti et al., 2020), suggesting either enhanced attention for upright images (Binda et al., 2013; Mathôt et al., 2014; Mathôt et al., 2013; Portengen et al., 2021) or increased mental effort to process inverted images (Binda and Murray, 2015; Joshi and Gold, 2020; Laeng et al., 2012; Mathôt, 2018). Less is known about neural markers of (natural) scene inversion in the central nervous system. Inverted versus upright artificial and natural scenes evoke distinct patterns of fMRI-based activity, measured mostly in extrastriate areas and the parahippocampal place area (Epstein et al., 2006; Kaiser et al., 2020a; Walther et al., 2009). The inversion-evoked pattern of increased activity across areas implicated in visual processing points to the possibility that inverted scenes require more effort to process, as is the case with face inversion (Sadeh and Yovel, 2010). However, as far as we know, no EEG studies have examined how natural scene inversion affects ERP components.
Only a few publications on (natural) scene-evoked ERPs exist (Bastin et al., 2013; Cichy et al., 2017; Groen et al., 2016; Groen et al., 2013; Rivolta et al., 2012; Sato et al., 1999) and none have reported effects of inversion (but see Harel and Al Zoubi, 2019 for a conference abstract). So far it is only known that scene inversion starts to alter brain signals that reflect the decoding of a scene's category (e.g., roads vs. houses) around 170 ms (Kaiser et al., 2020b). Taken together, areas in extrastriate regions and further up the visual hierarchy appear to be affected by image inversion in general, but the question remains which ERP components are affected by the inversion of natural scenes. Natural scenes are exceptionally well suited as stimuli for investigating inversion effects because of the diversity of features varying across stimuli and stimulus locations. If an inversion effect is found, it cannot be related to a single image statistic, such as ordinal edges (e.g., houses) that suddenly become overrepresented in the upper image regions after inversion.

Current study
To summarize our research goals, we aim to examine the effects of image inversion on pupil responses and ERPs. We will specifically examine (i) the timing of effects on pupil size, to investigate whether additional effort is required to process inverted natural scenes, and (ii) ERP amplitudes, to investigate whether natural scene inversion affects components similar to those affected by face/object inversion. Based on previous studies, we expect to confirm stronger relative pupil dilations (on top of pupil constrictions) and stronger amplitudes of early ERP components recorded from occipital, temporal, and parietal sites in response to inverted as compared with upright natural scenes.

Experimental procedures
Participants
Fifty-five participants from Utrecht University were recruited and received course credit or money for participating in the current study. Half of this sample size suffices to find significant inversion effects in a pupillometry study, but we doubled the sample to ensure that, in light of the inconsistencies of object and scene inversion effects in the EEG literature, any null results would not be due to low statistical power. Eight subjects were excluded because they did not follow instructions (broke fixation in the majority of trials, or did not respond during catch trials), were tired, or because of technical issues that led to incomplete data. The remaining forty-seven participants (31 females; age: M = 22.9, SD = 2.6; 44 right-handed) were included for further analysis. All participants were healthy, had normal or corrected-to-normal vision, were naïve with respect to the purpose of the experiment, and signed informed consent. The study conformed to the ethical principles of the Declaration of Helsinki.

Apparatus and stimuli
The experiment and stimuli were generated in MATLAB (Mathworks, Natick, MA, USA) using Psychtoolbox (Brainard, 1997). We displayed stimuli on an Asus ROG Swift PG278Q monitor (Beitou District, Taipei, Taiwan) with a resolution of 1920x1080 pixels (60 Hz) against a grey background with a luminance of 58.5 cd/m2. One hundred and twenty stimuli were gathered from the web and consisted of images showing a natural scene (mostly landscapes with a meadow and sky). Images were transformed to grayscale and then histogram equalized to remove global luminance and contrast differences across stimuli. The adjusted images had a mean luminance of 81.8 cd/m2 (SD = 0.4). Pictures were presented either upright or inverted (Figure 1A) at the center of the screen with a resolution of 1070x669 pixels (corresponding to 34.9° by 22.3° of visual angle). A fixation dot with a diameter of 20 pixels (0.68°) was presented on top of each image. We randomly interleaved ten additional presentations of colored images of scenes (either upright or inverted) as catch trials (participants had to press a button whenever a colored scene was observed; for details, see Procedure). A 64-channel +8 BioSemi ActiveTwo EEG system (Amsterdam, Noord-Holland, The Netherlands) in combination with a dedicated computer running BioSemi ActiView (version 7.05) was used to record EEG data at a sample rate of 512 Hz with a bandwidth of 104 Hz (3 dB). An Eyelink 1000 Plus (Ottawa, Ontario, Canada; version 5.09), connected to another computer for separate recordings, tracked the gaze of the right eye at 1000 Hz.
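The grayscale conversion and histogram equalization step described above can be sketched as follows. This is an illustrative Python port (the original stimuli were prepared in MATLAB); the function name `equalize_histogram` and the 8-bit image assumption are ours, not taken from the authors' code.

```python
import numpy as np

def equalize_histogram(gray):
    """Map an 8-bit grayscale image onto an approximately uniform
    intensity histogram, removing global luminance and contrast
    differences between images (illustrative sketch)."""
    hist, _ = np.histogram(gray.ravel(), bins=256, range=(0, 256))
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0][0]
    # Classic histogram-equalization lookup table: stretch the CDF
    # so the lowest occupied intensity maps to 0 and the highest to 255.
    lut = np.round((cdf - cdf_min) / (cdf[-1] - cdf_min) * 255).astype(np.uint8)
    return lut[gray]
```

Applied to every stimulus, this guarantees that differences in evoked responses cannot be attributed to global luminance or contrast.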

Procedure
After the EEG cap was fitted, electrodes were placed according to the 10-20 system. We placed the reference electrodes at the mastoids behind each ear and additional electrodes for electrooculography (EOG) around the eyes (superior and inferior to the left eye and temporal to each eye). Participants then placed their head in the Eyelink chinrest, 55 cm from the screen, and a 13-point (re-)calibration procedure was performed at the start of the experiment and after every quarter of it.
Participants were instructed to maintain fixation on the dot in the center of the screen throughout the experiment. Trials started with the presentation of the fixation dot for a random duration chosen from the range of 500-1500 ms. An image then appeared on the background for 3000 ms, with the fixation dot superimposed on top of the image. The image disappeared thereafter to automatically start a new trial, making the total duration of one trial 3500-4500 ms (Figure 1B). Participants had the opportunity to take long breaks during the eye-tracker re-calibrations and three additional self-paced breaks between calibration sessions to prevent fatigue.
The experiment consisted of 250 trials in total with the following conditions: upright (120 images), inverted (120 images), and catch (10 color images). Participants were asked to press a keyboard button (spacebar) whenever a colored instead of gray-scale picture was shown on the screen. These catch trials were not analyzed but served to motivate participants to pay attention to the images. The experiment lasted approximately 45 minutes in total.

Analysis
We processed and analyzed the pupillometry and EEG data using a custom pupillometry toolbox in MATLAB and the FieldTrip toolbox (Oostenveld et al., 2011), respectively.

Pupillometry and event-related pupillary responses (ERPR)
The ERPRs were obtained after applying a series of processing steps per participant. We first removed blinks from the continuous recordings of pupil size by detecting sudden, extreme changes in pupil size, removing episodes starting with such a sudden decrease followed by an increase (typically caused by a blink), and filling the removed episodes with data simulated by MATLAB's cubic spline interpolation algorithm. We then converted each continuous recording to an event-related data structure with segments of 0-3000 ms after image onset. This resulted in a data point matrix with 136 rows and 3000 columns. To remove baseline (steady-state) effects of individual differences, we z-normalized each pupil trace per trial (i.e., each row) by first subtracting the average pupil size in the initial 10 ms period of each trial and then dividing all matrix data points by the overall standard deviation across all matrix data points. Lastly, we extracted the amplitude of each pupil constriction per trial, which reflects the degree of visual processing of image content (Naber et al., 2018), by calculating the minimum z-normalized pupil size within a window of 400-1200 ms (i.e., during a pupil constriction episode) per trial (analyses of average pupil size in a window after pupil constriction produced similar results; data not shown). We did not compute pupil response latencies because latency was not affected by image inversion in a previous study.
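The per-trial normalization and amplitude extraction can be sketched as below. This is a minimal Python sketch of the steps described above (the original pipeline was a custom MATLAB toolbox); the function names and the boolean blink mask are our own illustrative choices, and blink detection itself is omitted.

```python
import numpy as np
from scipy.interpolate import CubicSpline

def interpolate_blinks(trace, blink_mask):
    """Replace blink samples (True in blink_mask) by cubic-spline
    interpolation over the surrounding clean samples."""
    t = np.arange(trace.size)
    good = ~blink_mask
    cs = CubicSpline(t[good], trace[good])
    out = trace.copy()
    out[blink_mask] = cs(t[blink_mask])
    return out

def normalize_and_score(trials, fs=1000):
    """Z-normalize per-trial pupil traces and extract constriction amplitudes.

    trials: (n_trials, n_samples) array of pupil size, 0-3000 ms after onset
    at fs Hz. Per trial, the mean of the first 10 ms is subtracted; the whole
    matrix is then divided by its overall standard deviation. The constriction
    amplitude is the minimum z-score in the 400-1200 ms window.
    """
    base = trials[:, : int(0.010 * fs)].mean(axis=1, keepdims=True)
    z = trials - base
    z = z / z.std()
    lo, hi = int(0.400 * fs), int(1.200 * fs)
    amp = z[:, lo:hi].min(axis=1)
    return z, amp
```

Dividing by the matrix-wide standard deviation (rather than per trial) preserves amplitude differences between trials while removing between-participant scale differences.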

EEG and event-related potentials (ERP)
The ERPs were obtained using processing steps similar to those for the ERPRs. After referencing the data to the electrodes placed at the mastoids, we bandpass filtered the EEG voltage recordings with the FieldTrip toolbox, leaving only frequencies within a band of 0.3-30 Hz intact. We windowed the ERPs between 100 ms before and 500 ms after image onset and applied baseline corrections using a window of 100 ms before image onset. We removed EOG artifacts through manual inspection of the first 20 independent components (runica method), following the guidelines provided on the FieldTrip documentation pages.
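The filtering, epoching, and baseline-correction steps can be sketched as follows. This is an illustrative Python approximation of the FieldTrip operations described above, assuming a zero-phase Butterworth filter (FieldTrip's default filter family may differ); function names and the filter order are our assumptions.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

FS = 512  # EEG sample rate (Hz), as reported in Apparatus and stimuli

def bandpass(data, low=0.3, high=30.0, fs=FS, order=4):
    """Zero-phase Butterworth bandpass approximating the 0.3-30 Hz filter."""
    sos = butter(order, [low, high], btype="band", fs=fs, output="sos")
    return sosfiltfilt(sos, data, axis=-1)

def epoch_and_baseline(signal, onsets, fs=FS, pre=0.1, post=0.5):
    """Cut -100..+500 ms epochs around image onsets (sample indices) and
    subtract the mean of the 100 ms pre-stimulus window from each epoch."""
    n_pre, n_post = int(pre * fs), int(post * fs)
    epochs = np.stack([signal[o - n_pre : o + n_post] for o in onsets])
    baseline = epochs[:, :n_pre].mean(axis=1, keepdims=True)
    return epochs - baseline
```

The 0.3 Hz high-pass edge removes slow drifts while leaving the ERP components of interest (all below 30 Hz) intact.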
We inspected variances per electrode and trial, and we manually removed electrodes (1 or 2 electrodes in 29 of 47 participants) and trials with outliers (percentage of trials removed, averaged across participants: M = 4%, SD = 2%). Next, we calculated relative amplitudes (i.e., the increase or decrease in potential as compared to the preceding trough or peak, respectively) per ERP component from ERPs averaged across a group of parietal electrodes (i.e., all BioSemi electrodes including the letter P, which also includes occipital and temporal electrodes). Component peaks and troughs were automatically detected using MATLAB's findpeaks function for the components N1 (60-120 ms window), P1 (90-170 ms), N2/N170 (120-200 ms), P2 (150-300 ms), N2 (200-350 ms), and P3 (250-400 ms) per participant. Manual (and subjective) detection of components (using a mouse cursor to select peaks and troughs) produced qualitatively similar results (data not shown).
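The windowed peak/trough detection can be sketched as below. This is an illustrative Python analogue of the MATLAB findpeaks approach described above (using SciPy's `find_peaks`), applied to a single averaged ERP; the dictionary of windows follows the text, while the function name and the strategy of picking the highest peak per window are our assumptions.

```python
import numpy as np
from scipy.signal import find_peaks

FS = 512  # Hz; epochs run from -100 to +500 ms around image onset

# Component search windows in ms after image onset, as reported above.
WINDOWS = {"N1": (60, 120), "P1": (90, 170), "N170": (120, 200),
           "P2": (150, 300), "N2": (200, 350), "P3": (250, 400)}

def component_amplitudes(erp, pre_ms=100):
    """Detect one peak/trough per component window in an averaged ERP.

    Negative components (names starting with 'N') are found as peaks of
    the sign-flipped trace. Returns absolute potentials per component;
    relative amplitudes (versus the preceding peak/trough) follow as
    successive differences between consecutive components.
    """
    out = {}
    for name, (lo, hi) in WINDOWS.items():
        i0 = int((pre_ms + lo) / 1000 * FS)
        i1 = int((pre_ms + hi) / 1000 * FS)
        seg = erp[i0:i1]
        sign = -1.0 if name.startswith("N") else 1.0
        peaks, props = find_peaks(sign * seg, height=-np.inf)
        if peaks.size:  # a window may contain no strict local extremum
            best = peaks[np.argmax(props["peak_heights"])]
            out[name] = seg[best]
    return out
```

Running this per participant on the parietal-electrode average, and taking differences between consecutive component potentials, yields the relative amplitudes entered into the statistical comparisons.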

Data availability
The stimuli, stimulus presentation code, and analysis code are publicly available online at https://osf.io/2jxe3. Because of the large size of the EEG and eye-tracking data set, these data are stored on a local server of Utrecht University rather than in an open repository. Access to the data can be requested by sending an e-mail to m.naber@uu.nl.

Results
The results section is organized in the following manner: we first examined the image inversion effect on event-related pupillary responses (ERPR) and the associated amplitudes (PupilAmp). Then we investigated the same effect on event-related potentials (ERP) and the relative amplitudes of the underlying components to examine which electrodes and components marked inverted image processing best.

Effect of image inversion on ERPR
The ERPR to the images (Figure 2A)

Effects of image inversion on ERP
To inspect the image inversion effect on brain potentials, we first plotted the ERPs of parietal electrodes as, in line with the literature (e.g., Itier and Taylor, 2004), we expected the effects to occur in these areas.
The ERPs evoked by the presentation of natural scenes, first averaged across presentation trials and then across participants (Figure 3A), consisted of a complex pattern with several components with distinct latencies and amplitudes for upright and inverted images. The timing pattern of the components appeared to be similar to that of a previous ERP study that presented scenes to participants (Harel et al., 2016). The gradual evolvement of negativity before 100 ms is likely caused by a contingent negative variation, which is typically observed in preparation for stimulus presentations that occur every couple of seconds (Kononowicz and Penney, 2016). Upright images appeared to evoke an early positive peak (P80) around 80 ms, an N1 around approximately 110 ms, a P1 around 130 ms, an N2 around 150 ms, a P2 around 220 ms, an N3 around 280 ms, and a P3 around 320 ms. The components of the ERP traces evoked by inverted images showed comparable timings, with the only exception that the early P80 was covered up by the relatively stronger superimposed N1 component. The ERPs showed the strongest inversion effects around the occurrence of the N1 and N2 components, as confirmed by ERP difference plots (Figure 3B; for post-hoc comparisons per time point, see Figure S1B; for scalp maps per component and per condition, see Figure S2; for differences in potentials and component amplitudes between conditions, see Figure S3) and statistical comparison of the difference in potentials (Figure 3C; N1: t(46) = 5.77, p < 0.001; P1: t(46) = 5.16, p < 0.001; N2: t(46) = 4.35, p = 0.001; P2: t(46) = 6.45, p = 0.001; latencies did not differ, data not shown). Note, however, that the relative amplitudes (i.e., the change in potential as compared to the preceding positive peak; as an exception, the N1 amplitude was compared to baseline because it has no preceding component) only differed between the upright and inverted condition for N3 (Figure 3D; t(46) = 5.44, p = 0.001; for scalp maps of N1 and N2, see Figure 3E-F).
The latter suggests that the baseline difference between upright and inverted conditions for later components (i.e., after N1) was likely driven by the relative amplitude difference of the N1. In sum, we found that natural scene inversion evokes an N1 component with a relatively strong amplitude, which continuously changed brain potentials up to the P2 component, followed by an additional N3 component with a relatively weaker amplitude (see also Figure S3).

Asterisks and tildes indicate when the uncorrected p-values of t-test comparisons fell below significance levels (~ p < 0.100; * p < 0.050; ** p < 0.010; *** p < 0.001; most asterisks in panels E-F indicate p < 0.001, but only one asterisk is shown for aesthetic reasons).

Discussion
The first finding reported in this paper concerns the pupillary image inversion effect. We replicated the finding that the pupil constricts more strongly to upright images as compared to inverted images (Castellotti et al., 2020), and a thorough examination of the time traces of pupil size suggests that scene inversion evokes an enhanced dilatory alerting response. This effort-related dilation was superimposed on an initial orienting-related constriction, which is a typical phenomenon in pupillometry (Mathôt, 2018; Naber et al., 2012; Naber and Murphy, 2020; Strauch et al., 2022).
Besides these pupillometric results, we also reported on a neural marker specific to natural scene inversion. A considerable number of studies reported face inversion effects on the amplitude and latency of the N170 component (Bentin et al., 1996; Eimer, 2011; Itier and Taylor, 2004; Minnebusch and Daum, 2009; Rossion et al., 1999; Stekelenburg and de Gelder, 2004). However, no study has examined whether these inversion effects generalize to images of natural scenes. At a recent conference, Harel and Al Zoubi (2019) reported interesting scene inversion results that hint at an effect of natural scene inversion on the P2 component, and we look forward to the full research report to compare the results in more detail. Another EEG decoding study suggests that the categorization of upright versus inverted natural scenes is possible after 170 ms (Kaiser et al., 2020a, b), so we had some expectations as to the timing of a potential natural scene inversion effect. Here we find that already the N1 component in occipito-temporo-parietal regions showed more pronounced troughs of activity for inverted as compared to upright natural scenes. Although we did not find any inversion effects on latency (data not shown), we show for the first time that the previously found face inversion effects on ERP amplitudes generalize to natural scenes, although the effect occurs around 100 ms, earlier than the typical face inversion effect around 170 ms. Nonetheless, this means that, in general, brain potentials occurring around 100-200 ms likely reflect a process evoked by stimulus inversion.
A number of studies have related the N170 face inversion component to clinical populations, including autism (for a review, see Tang et al., 2015), schizophrenia (Tsunoda et al., 2012), prosopagnosia (Farah et al., 1995b), and Alzheimer's disease (Lavallée et al., 2016). In light of the current findings, it would be interesting to see whether natural scene inversion produces similar results in these populations as face inversion does; if so, this would suggest that the visual processing deficits observed in these clinical populations extend to complex stimuli in general, rather than being exclusive to social stimuli.
When interpreting the EEG and pupillometry results together, inversion-evoked processes likely reflect a state of alerting (discomfort, unease, or unfamiliarity) caused by the effortful processing of stimuli with unusual layouts that the visual system is not trained to process and interpret (Conway et al., 2008; Falck-Ytter, 2008). The neural network involved in alerting and mental effort is well known (Petersen and Posner, 2012), and recent evidence from pupillometry studies suggests the involvement of noradrenergic pathways and neural loci such as the locus coeruleus (for reviews, see Joshi and Gold, 2020; Laeng et al., 2012; Mathôt, 2018; Strauch et al., 2022). It will be interesting to study the exact nature of this additional process in future studies, as well as its relation to other brain potentials such as the visual mismatch negativity evoked by violations of sensory regularity (Berti and Schröger, 2001; Czigler et al., 2006; Heslenfeld, 2003; Horimoto et al., 2002; Kremláček et al., 2016; Maekawa et al., 2005; Pazo-Alvarez et al., 2004; Tales et al., 1999). Stimulus inversion may evoke a whole sequence of cognitive states, including changes in processing efficiency (Sekuler et al., 2004) and a heightened state of arousal, but it may also draw more attention to the stimulus. Whether an increase in attention explains the stronger N1 remains to be investigated, but an ERP study by Groen and colleagues (2016) suggests that attentional effects during natural scene processing emerge only after 250 ms, meaning that the N1 effect reported here is probably not driven by changes in attentional resources.
In conclusion, we demonstrate stronger N1 peak amplitudes in the occipito-temporo-parietal area and larger pupil dilations in response to inverted as compared with upright natural scenes, extending inversion effects beyond previously associated stimulus categories such as faces and objects. These neurophysiological markers are likely related to the effortful processing of stimuli in general.