Context effects on object recognition in real-world environments: A study protocol

Abstract
Background: The environments that we live in impact on our ability to recognise objects, with recognition being facilitated when objects appear in expected locations (congruent) compared to unexpected locations (incongruent). However, these findings are based on experiments where the object is isolated from its environment. Moreover, it is not clear which components of the recognition process are impacted by the environment. In this experiment, we seek to examine the impact real-world environments have on object recognition. Specifically, we will use mobile electroencephalography (mEEG) and augmented reality (AR) to investigate how the visual and semantic processing aspects of object recognition are changed by the environment. Methods: We will use AR to place congruent and incongruent virtual objects around indoor and outdoor environments. During the experiment a total of 34 participants will walk around the environments and find these objects while we record their eye movements and neural signals. We will perform two primary analyses. First, we will analyse the event-related potential (ERP) data using paired samples t-tests in the N300/400 time windows in an attempt to replicate congruency effects on the N300/400. Second, we will use representational similarity analysis (RSA) and computational models of vision and semantics to determine how visual and semantic processes are changed by congruency. Conclusions: Based on previous literature, we hypothesise that scene-object congruence will facilitate object recognition. For ERPs, we predict a congruency effect in the N300/N400, and for RSA we predict that higher level visual and semantic information will be represented earlier for congruent scenes than incongruent scenes. By collecting mEEG data while participants are exploring a real-world environment, we will be able to determine the impact of a natural context on object recognition, and the different processing stages of object recognition.



Introduction
Our visual environment has a powerful impact on many cognitive processes, one of which is the ease of, and ability to, recognise objects. Recognition is thought to be both faster and more accurate when objects appear in expected environmental contexts, compared to when they are situated in unexpected contexts. While these studies establish that visual contexts impact the processing of objects, they have done so using carefully controlled paradigms and situations, which necessarily involved removing the participant from 'the wild', instead presenting participants with visual depictions of the world. However, examining cognitive faculties in an ecologically valid manner is important, as there are differences in how fundamental neural processes function in natural compared to controlled situations. Further, as these findings are largely based on experiments where the object is isolated from an environment, either in the sense that the item is presented without a spatial context, or where the item is in context but the natural spatiotemporal structure is broken, we know little about what aspects of the recognition process are modulated by context. Here we seek to determine which aspects of recognition, namely visual and semantic processing, are modulated by context using a novel application of representational similarity analysis (RSA; Kriegeskorte et al., 2008) to mEEG signals during the perception of virtual objects. The use of RSA will allow us to determine when visual and semantic features of objects are represented in the mEEG data, and how this is impacted by the environment.
The current protocol aims to further develop methods and approaches for mEEG and combine this with emerging AR technology to effectively study real-world neurocognitive processes. We will use this protocol to fulfil two research goals, the first being to determine whether congruency effects on the N300/400 replicate in the real world. The second goal is to expand our knowledge of the congruency effect by examining how context impacts the visual and semantic processing of objects.

Amendments from Version 2
A document in the extended data titled 'matched_stimulus_list' with a list of the objects in the experiment and their scores for each of the above listed features has been added. We have also updated the text in the manuscript in paragraph two of the Stimuli subsection of the methods to describe the matching more clearly (page 4). This was done in response to Reviewer 3 comments 1 and 2.
We have added additional justification for why we will perform the exploratory analyses across all time points to the second paragraph of the ERP analyses subsection of the statistical analysis section of the methods (page 6) in response to Reviewer 3 comment 3.
We have updated the manuscript to expand on the hypotheses for the eye tracking analyses (page 7) in response to Reviewer 3 comment 4.
Any further responses from the reviewers can be found at the end of the article.

Protocol
Participants
A total of 34 young adult participants (18-35 years old) with no history of neurological conditions, and with normal or corrected-to-normal vision, will be recruited from the University of Cambridge and local areas. Participants will be compensated at a rate of £15/hour for their time. A sample size of 34 was determined using power analyses on data from Draschkow et al. (2018) that examined congruency effects on object-scene processing. This was calculated using the 'sampsizepwr()' function in MATLAB (MATLAB, 2021, RRID:SCR_001622), based on the mean difference between the amplitude of the N400 on congruent and incongruent conditions, to obtain a power of 0.8, at an alpha of 0.05. This study has been approved by the Department of Psychology ethics committee at the University of Cambridge (PRE2020.007, date of approval: 08/04/2020). Written informed consent will be obtained from all participants prior to taking part in the study. The study will be performed in accordance with all appropriate institutional and international guidelines and regulations, in line with the principles of the Declaration of Helsinki.
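The sample-size logic above can be sketched outside MATLAB as well. The following Python snippet is an illustration, not the protocol's actual 'sampsizepwr()' calculation: it finds the smallest n for a two-tailed paired-samples t-test using the noncentral t distribution. The effect size d = 0.5 is a hypothetical stand-in, as the effect size estimated from the Draschkow et al. (2018) data is not reproduced here.

```python
from math import sqrt
from scipy import stats

def paired_ttest_sample_size(d, power=0.8, alpha=0.05, n_max=1000):
    """Smallest n whose two-tailed paired t-test reaches the target power."""
    for n in range(2, n_max):
        df = n - 1
        t_crit = stats.t.ppf(1 - alpha / 2, df)
        nc = d * sqrt(n)  # noncentrality parameter for a within-subject effect d
        achieved = (1 - stats.nct.cdf(t_crit, df, nc)
                    + stats.nct.cdf(-t_crit, df, nc))
        if achieved >= power:
            return n
    raise ValueError("n_max too small")

# With a hypothetical within-subject effect size of d = 0.5:
print(paired_ttest_sample_size(0.5))
```

Under these assumed inputs (d = 0.5, power 0.8, alpha 0.05) the calculation returns 34, the same as the planned sample size.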

Apparatus
Participants will be presented with AR stimuli using a Hololens 2 device. The Hololens 2 has a horizontal field of view of 42°, a vertical field of view of 29°, and presents images at 60Hz. Eye movements will be tracked using the Hololens 2 with a sampling rate of 30Hz and an average gaze position error of about 1.5°. Eye tracking calibration will be performed using a nine-point calibration procedure. EEG will be recorded using the Brainvision LiveAmp 64 mobile system (Brain Products GmbH, Gilching, Germany, RRID:SCR_009443). In all sessions we will record 64-channel EEG through ActiCap Slim active Ag/AgCl electrodes, with a reference electrode placed at FCz and a sampling rate of 500Hz. The EEG electrodes are placed on the participant with the Hololens 2 placed on top of the electrodes, and the LiveAmp and electrode cables placed in a backpack that the participant wears while performing the experiment. A custom-built button box is plugged into the Hololens 2 USB-C port and the LiveAmp trigger port, meaning that when the button is pressed, it simultaneously sends a signal to the Hololens 2 and a 5V signal to the LiveAmp. The LiveAmp then converts that signal to a trigger that is marked in the EEG recording. The signal to the Hololens 2 triggers the appearance of an object. An example of the setup can be seen in Figure 1.

Stimuli
In this experiment, participants will walk through two different real environments, one indoor environment (a space consisting of 3 offices, a corridor and a meeting room) and one outdoor environment (a space consisting of a parking lot and picnic area), where they will see virtual objects in various locations. A total of 85 images of concepts from a property norming study will be used (Hovhannisyan et al., 2021). Five of the 85 images will be presented in the practice trials, and the remaining 80 will be presented in the main experiment.
The following process was used to choose the 85 stimuli: The 995 images were narrowed down to 412 by the lead author. This was done based on which objects could plausibly appear in the locations used in the experiment. These objects were then rated as either congruent or incongruent with each of the environments, and verified by a second experimenter. From this we extracted 40 objects that were rated as expected in the environments, and 40 objects that were rated as unexpected in the environments. For the unexpected objects we made sure that they were highly unexpected, but not impossible. For example, a reindeer in a parking lot in the UK is highly unexpected but not impossible. However, if an object would be highly unexpected and impossible, we would not use that object (for example a dinosaur in the picnic area). The 80 objects were selected from the 412 using the Match software (version 2.09.02, Van Casteren & Davis, 2007) to ensure objects in congruent and incongruent conditions, and those found in indoor or outdoor environments, were matched for the following measures taken from Hovhannisyan et al. (2021): category, domain, hit rate on a visual recognition memory task, hit rate on a lexical recognition memory task, false alarm rate for a lexical recognition memory task, false alarm rate for a visual recognition memory task, image energy, image size, proportion of non-white space in the image, hue of the image, saturation of the image, frequency in COCA database, proportion of participants identifying the modal name of the object, and the number of non-taxonomic features.
While this process creates matched sets of congruent and incongruent objects, it is based on our ratings of congruency. However, the participants may have different ideas of what is congruent or incongruent with each environment and location. Therefore, we will capture any differences by asking participants whether they think the object is congruent or incongruent with the environment on a five-point scale. These ratings will then be used to label congruent and incongruent trials for our analyses.
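As an illustration of how trials could be relabelled from participants' own ratings, the sketch below applies a simple threshold rule. The rule itself (ratings of 4-5 labelled congruent, 1-2 incongruent, and midpoint ratings of 3 assigned to neither condition) is an assumption for this example; the protocol does not specify how midpoint ratings will be handled.

```python
def label_trials(ratings):
    """Relabel trials from a participant's 1-5 congruency ratings.

    Hypothetical rule: 4-5 -> 'congruent', 1-2 -> 'incongruent';
    midpoint ratings of 3 are assigned to neither condition.
    """
    labels = {}
    for trial, rating in ratings.items():
        if rating >= 4:
            labels[trial] = "congruent"
        elif rating <= 2:
            labels[trial] = "incongruent"
    return labels

# 'bench' (rated 3) is assigned to neither condition
labels = label_trials({"kettle": 5, "reindeer": 1, "bench": 3})
```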
The stimuli will be placed in the environment using Experimenter (beta testing version), an experiment design software package built for the Unity engine and presented to participants with the Hololens 2. The locations of the stimuli will be the same for all participants. A list of the stimuli used and their locations in the experiment can be found in the Extended data (Nicholls, 2022).

Procedure
The 80 images presented in the main experiment will be presented in four blocks of 20 images.
To familiarise the participants with the Hololens and the appearance of AR stimuli, they will first perform five practice trials in an indoor environment. The practice trials follow the same procedure as the experimental trials described below. During the experiment participants will walk through two different environments, one indoor and one outdoor. In each environment participants will complete two blocks of 20 trials (40 total per environment). In each trial participants will need to find an arrow indicating the location of the object (Figure 2A). When participants are close enough to the arrow, the arrow will change colour (Figure 2B), indicating that the participants can press the button on their button box (Figure 1). Once the button is pressed an object will appear for five seconds (Figure 2C). At the same time a trigger is sent to the LiveAmp. Participants will be instructed to look at the object for the entire time it is visible, and to keep as still as possible. After the object disappears participants will be presented with a question asking them how expected the appearance of the object was on a scale from one to five (Figure 2D), where one is unexpected, three is neither, and five is expected. Once they have responded the next trial will begin. This process is repeated until all 80 objects are found. A total of 40 objects will be congruent with the environment, and 40 objects will be incongruent. During this time participants' EEG activity and eye movements will be recorded. Overall, we expect the experiment to take one hour to complete, with 15 minutes per block.

EEG preprocessing
For the preprocessing of the EEG data we will use the BeMoBIL pipeline. To start with, we will run the 'bemobil_process_all_EEG_preprocessing()' function, which downsamples the data to 250Hz and removes frequency artefacts using the ZapLine toolbox (de Cheveigne, 2020). The 'bemobil_process_all_EEG_preprocessing()' function will then find, and spherically interpolate, bad channels using 'clean_artifacts()' and 'pop_interp()' before re-referencing the data to the common average. We will then bandpass filter the data using a Blackman filter with limits between 1 and 20Hz, with a transition width of 0.1Hz. The data will then be epoched to -1 and +2 seconds around the appearance of the object, and baseline corrected to the data between -100ms and 0ms relative to the appearance of the object. Once the data is epoched and baseline corrected we will run the 'bemobil_process_all_AMICA()' function from the BeMoBIL pipeline. This function applies a high-pass zero-phase Hanning window FIR filter with a cutoff at 1.75Hz to the data before decomposing the data into statistically independent components using AMICA (Palmer et al., 2008), and dipoles will be fitted for each component. After this, the 'bemobil_process_all_AMICA()' function will copy the independent components onto the bandpass filtered and epoched data (the data inputted to the function, prior to high-pass filtering) and reject any components not originating from the brain, such as eye movements, using the ICLabel algorithm.
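The epoching and baseline-correction step can be illustrated with a minimal NumPy sketch using the parameters stated above (epochs of -1 to +2 s, baseline -100 to 0 ms, 250 Hz after downsampling). This is not the BeMoBIL/EEGLAB implementation the protocol uses, just a self-contained version of the arithmetic, applied here to random data.

```python
import numpy as np

def epoch_and_baseline(data, triggers, sfreq=250):
    """Cut continuous EEG (channels x samples) into epochs of -1 to +2 s
    around each trigger sample and subtract the -100 to 0 ms baseline mean
    per channel."""
    pre, post = int(1.0 * sfreq), int(2.0 * sfreq)
    base = int(0.1 * sfreq)
    epochs = []
    for t in triggers:
        ep = data[:, t - pre : t + post].astype(float)
        baseline = ep[:, pre - base : pre].mean(axis=1, keepdims=True)
        epochs.append(ep - baseline)
    return np.stack(epochs)  # trials x channels x samples

rng = np.random.default_rng(0)
eeg = rng.standard_normal((64, 5000))        # simulated continuous recording
epochs = epoch_and_baseline(eeg, triggers=[1000, 2500])
print(epochs.shape)                          # (2, 64, 750)
```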
Statistical analysis of EEG data
EEG analysis will be done in two parts. The first part will investigate the ERP data for congruency effects, with the aim of determining whether previous congruency effects on ERPs can be replicated when participants are in real-world environments. The second part of the analysis will expand our knowledge of the congruency effect by determining which object recognition processes are impacted by context and how. To do this we will use RSA to determine when the visual and semantic features of objects are represented in the EEG data, and how these time courses are impacted by contextual congruency. For all analyses, congruent and incongruent trials will be determined by participants' congruency ratings.

ERP analyses.
We will split the ERP analysis into two parts. The first part will determine whether we can replicate previous N300/400 congruency effects. To do this we will perform the same ERP analysis as in Draschkow et al. (2018). We will epoch the data from -100 to 1000ms relative to stimulus appearance. We will calculate the mean amplitude for two consecutive time windows after the appearance of the object: 250-350ms (N300) and 350-500ms (N400), across mid-central electrodes (FC1, FC2, C1, Cz, C2, CP1, CPz, and CP2). We will do this by averaging across trials, electrodes, and time points within subjects, separately for congruent and incongruent trials, and assessing differences with paired-samples t-tests in the N300 and N400 time windows.
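The window-averaging step above can be sketched as follows. This is an illustrative Python/SciPy version run on random data, not the protocol's analysis code, and the ROI channel indices are hypothetical placeholders for the mid-central electrodes.

```python
import numpy as np
from scipy import stats

def window_means(epochs, times, chan_idx, t_start, t_end):
    """Mean amplitude per trial across an electrode ROI and time window."""
    mask = (times >= t_start) & (times < t_end)
    return epochs[:, chan_idx][:, :, mask].mean(axis=(1, 2))

rng = np.random.default_rng(1)
times = np.linspace(-0.1, 1.0, 276)           # seconds, 250 Hz epochs
epochs = rng.standard_normal((40, 64, 276))   # trials x channels x time
roi = [4, 5, 10, 11, 12, 20, 21, 22]          # hypothetical mid-central indices
n300 = window_means(epochs, times, roi, 0.25, 0.35)   # 250-350 ms
n400 = window_means(epochs, times, roi, 0.35, 0.50)   # 350-500 ms
# In the real analysis these per-trial values are averaged within subject and
# condition; here the first/second 20 trials stand in for two conditions.
t_val, p_val = stats.ttest_rel(n300[:20], n300[20:])
```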
The second part will be an exploratory analysis to investigate congruency effects across all electrodes, and from zero to 1000ms after the presentation of the object image. Examining time points other than the N300/400 allows us to determine whether there are any additional congruency effects that occur outside of these ERPs. We would also be able to determine whether the N300/400 indicates the middle or end of a processing step, as done by Schyns, Petro, and Smith (2007) for the N170. By examining all electrodes we can further determine whether there are congruency effects occurring in regions beyond the central electrodes.
We will do this using hierarchical linear modelling with LIMO EEG (LIMO EEG, RRID:SCR_009592, Pernet et al., 2011). LIMO analyses the data in two steps: the first level consists of estimating the parameters of a general linear model (GLM) for each subject, at each time point and each electrode individually. The second level takes the beta coefficients obtained from each subject in the first level and analyses them across subjects to test for statistical significance, after which LIMO applies a bootstrap cluster correction for multiple comparisons. The second level offers five statistical test options that can be performed on the beta coefficients across subjects; here we will use a regression with a fixed factor of congruency.
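The two-level structure can be sketched in miniature. The snippet below is an illustration of the general idea (per-subject least-squares fits, then a group-level test on the congruency betas), not LIMO itself, and it omits LIMO's bootstrap cluster correction; the data and the 0/1 congruency regressor are simulated.

```python
import numpy as np
from scipy import stats

def first_level_betas(epochs, design):
    """Level 1: per-subject GLM fit at every electrode and time point.
    epochs: trials x channels x time; design: trials x regressors."""
    n_tr, n_ch, n_t = epochs.shape
    Y = epochs.reshape(n_tr, -1)                  # trials x (channels*time)
    betas, *_ = np.linalg.lstsq(design, Y, rcond=None)
    return betas.reshape(design.shape[1], n_ch, n_t)

def second_level_ttest(subject_betas):
    """Level 2: one-sample t-test on the congruency betas across subjects."""
    return stats.ttest_1samp(subject_betas, 0.0, axis=0)

rng = np.random.default_rng(2)
all_betas = []
for _ in range(10):                               # 10 simulated subjects
    epochs = rng.standard_normal((80, 64, 100))
    congruency = np.repeat([0.0, 1.0], 40)        # hypothetical 0/1 regressor
    design = np.column_stack([np.ones(80), congruency])
    all_betas.append(first_level_betas(epochs, design)[1])  # congruency beta
t_map, p_map = second_level_ttest(np.stack(all_betas))
print(t_map.shape)                                # (64, 100)
```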
RSA analysis. The second part of the analysis will use RSA and computational models of vision and semantics (Clarke et al., 2018) to ask how visual and semantic processes are changed by congruency, by comparing the size and latency of RSA effects between congruent and incongruent conditions. Each object is associated with EEG activity at the 64 electrodes that can be characterised as a spatial pattern of neural activity, which varies over time. For each time point within the epoch of interest (0 to 1000ms), we will calculate the dissimilarity between activity patterns for each pair of objects using 1-Pearson's correlation. This will result in an 80×80 dissimilarity matrix showing how each object is more or less similar to each other object. These representational dissimilarity matrices (RDMs) will be calculated at each time-point.
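The per-timepoint RDM construction described above can be sketched directly in NumPy. This is an illustration on random data rather than the protocol's analysis code; it follows the stated recipe of 1 minus Pearson's correlation between spatial activity patterns.

```python
import numpy as np

def eeg_rdms(patterns):
    """patterns: objects x electrodes x time. Returns a time x objects x
    objects array of RDMs, each computed as 1 - Pearson correlation
    between the spatial activity patterns of every pair of objects."""
    n_obj, n_ch, n_t = patterns.shape
    rdms = np.empty((n_t, n_obj, n_obj))
    for t in range(n_t):
        rdms[t] = 1.0 - np.corrcoef(patterns[:, :, t])
    return rdms

rng = np.random.default_rng(3)
rdms = eeg_rdms(rng.standard_normal((80, 64, 10)))   # 80 objects, 64 channels
print(rdms.shape)                                    # (10, 80, 80)
```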
To determine whether the RDMs capture patterns associated with visual or semantic processes, the EEG-based RDMs are subsequently correlated with model RDMs based on visual and semantic properties. The RDMs based on visual properties will be extracted from an artificial neural network model, CORnet (Kubilius et al., 2019). Node activations for the 80 objects will be extracted from the visual stimuli using THINGSvision (Muttenthaler & Hebart, 2021) and RDMs will be constructed for model layers 'V1', 'V2', 'V4' and 'IT'. The RDM for semantic information will use the semantic feature lists from a concept property norming study that used the same images (Hovhannisyan et al., 2021). The property norming study contains lists of object-feature pairs with 995 objects and 5520 features. Each object has a feature vector of length 5520, indicating if each feature is associated with the object. The semantic feature RDM is constructed from the dissimilarity between feature vectors for the 80 objects.
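A model RDM built from feature vectors can be sketched the same way. Here a random sparse binary matrix stands in for the Hovhannisyan et al. (2021) object-by-feature norms (which are not reproduced here), and the 1 minus Pearson's correlation distance is an assumption carried over from the EEG RDMs; the CORnet-based visual RDMs would be built analogously from layer activations.

```python
import numpy as np

def feature_rdm(features):
    """features: objects x features matrix -> 1 - Pearson-correlation RDM."""
    return 1.0 - np.corrcoef(features)

rng = np.random.default_rng(4)
# Stand-in for the 80 x 5520 binary object-feature matrix from the norms
feats = (rng.random((80, 5520)) < 0.05).astype(float)
semantic_rdm = feature_rdm(feats)
```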
The visual and semantic RDMs (n=5) will be correlated with the EEG-based RDMs at each time-point using Spearman's correlation, giving a correlation time-course for each model RDM and each participant. Significant positive effects of each model RDM will be established using one-sample t-tests against zero (p < 0.01, i.e. a 5% alpha corrected for the five model RDMs) and cluster-based permutation testing (Maris & Oostenveld, 2007) to control for multiple comparisons over time. This analysis will be conducted for (1) all 80 objects together, (2) the 40 congruent object-context trials and (3) the 40 incongruent object-context trials.
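The model-to-EEG comparison step can be sketched as below: for each time point, the off-diagonal entries of the EEG RDM are rank-correlated with the corresponding entries of a model RDM. This illustration on random data omits the group-level t-tests and cluster-based permutation correction described above.

```python
import numpy as np
from scipy import stats

def rsa_timecourse(eeg_rdms, model_rdm):
    """Spearman correlation between a model RDM and each time point's
    EEG RDM, computed over the off-diagonal upper-triangle entries."""
    iu = np.triu_indices(model_rdm.shape[0], k=1)
    model_vec = model_rdm[iu]
    return np.array([stats.spearmanr(rdm[iu], model_vec)[0] for rdm in eeg_rdms])

rng = np.random.default_rng(5)
rdms = np.stack([1.0 - np.corrcoef(rng.standard_normal((20, 16)))
                 for _ in range(5)])          # 5 time points, 20 objects
tc = rsa_timecourse(rdms, rdms[2])            # model identical to time point 2
```

By construction, the correlation at time point 2 is exactly 1, which is a quick sanity check on the pipeline.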
Statistical comparisons between the congruent and incongruent RSA effects will use paired t-tests and cluster-based permutation testing. Differences in onset and peak times will be statistically tested using 1000 bootstrap resamples to create a distribution of onset and peak times for each condition.
Onset latencies and distributions will be calculated as the first timepoint with a significant p-value in a one-sample t-test against zero (α=0.01) for each resampled dataset. To evaluate potential differences in the peak latencies between conditions, group average RSA time-courses will be calculated for each resampled dataset, and the timepoint of the maximum effect will be extracted, creating a distribution of peak times for each model RDM and condition. The distribution of peak and onset latencies will then be used to define 95% confidence intervals (CIs) of pairwise differences.
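The bootstrap latency procedure above can be sketched as follows. This is an illustrative version on simulated RSA time-courses (34 subjects, an effect injected from 200 ms), not the protocol's analysis code, and the noise and effect magnitudes are arbitrary.

```python
import numpy as np
from scipy import stats

def bootstrap_latencies(rsa, times, n_boot=1000, alpha=0.01, seed=0):
    """rsa: subjects x time RSA correlations. For each bootstrap resample
    of subjects, take the first significant time point (one-sample t-test
    against zero) as the onset and the time of the maximum group-average
    effect as the peak, building distributions of both latencies."""
    rng = np.random.default_rng(seed)
    n_sub = rsa.shape[0]
    onsets, peaks = [], []
    for _ in range(n_boot):
        sample = rsa[rng.integers(0, n_sub, n_sub)]
        _, p = stats.ttest_1samp(sample, 0.0, axis=0)
        sig = np.where(p < alpha)[0]
        onsets.append(times[sig[0]] if sig.size else np.nan)
        peaks.append(times[sample.mean(axis=0).argmax()])
    return np.array(onsets), np.array(peaks)

times = np.linspace(0.0, 1.0, 101)
rng = np.random.default_rng(6)
rsa = rng.standard_normal((34, 101)) * 0.05
rsa[:, times >= 0.2] += 0.3                    # simulated effect from 200 ms
onsets, peaks = bootstrap_latencies(rsa, times, n_boot=200)
ci = np.percentile(peaks, [2.5, 97.5])         # 95% CI of the peak latency
```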

Eye tracking preprocessing
To separate fixations from saccades we will use the identification by two-means clustering algorithm (I2MC; Hessels et al., 2017). This algorithm automatically labels fixations across a wide range of noise levels and amounts of data loss, making it appropriate for the Hololens 2 gaze data's low sampling rate.
For this algorithm we will run the I2MC.m script in MATLAB with the default settings. This script contains an automatic algorithm for labelling fixations and saccades which requires no input from the user other than importing the data. The algorithm follows three main steps: interpolation of missing data, two-means clustering, and fixation labelling.
For the interpolation, periods of missing coordinates in the gaze signal will be interpolated, provided that the missing period is shorter than 100ms and that valid data are available for at least two samples at each end of the missing window. Once the interpolation is complete, two-means clustering is performed. For the clustering, a moving window of 200ms width slides over the gaze position signal. If the current window contains no missing data, the gaze position data will be forced into two clusters. A clustering weight is then computed for the current window from the cluster memberships. The window is then moved one sample along the gaze position signal, a weight is assigned to this new window, and so on until all windows in the data have been assigned a weight. If a gaze sample is included in multiple windows, the weights for that sample are averaged.
The final step of the I2MC algorithm is to label the fixations. A cut-off is used to determine fixation candidates from the clustering-weight signal. All periods of the clustering-weight signal below this cut-off are labelled as fixation candidates, and thereafter consecutive fixation candidates are merged. Finally, short fixation candidates are excluded from the output. We will define fixations based on gaze direction: if the head rotates relative to the objects but the direction of gaze stays constant, we count that as a fixation.
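The interpolation step, which is fully specified above (gaps shorter than 100ms, at least two valid samples at each end), can be sketched as follows. This is an illustrative Python version of that one step, not the I2MC.m script the protocol will actually run, and it treats a single coordinate channel for simplicity.

```python
import numpy as np

def interpolate_gaps(x, sfreq=30, max_gap_ms=100, valid_edge=2):
    """Linearly interpolate NaN gaps in a gaze-coordinate signal, but only
    when the gap is shorter than max_gap_ms and at least `valid_edge`
    valid samples exist on both sides (as in the I2MC preprocessing)."""
    x = x.astype(float).copy()
    max_gap = int(max_gap_ms / 1000 * sfreq)   # 3 samples at 30 Hz
    isnan = np.isnan(x)
    n = len(x)
    i = 0
    while i < n:
        if isnan[i]:
            j = i
            while j < n and isnan[j]:
                j += 1
            gap = j - i
            left_ok = i >= valid_edge and not isnan[i - valid_edge:i].any()
            right_ok = j + valid_edge <= n and not isnan[j:j + valid_edge].any()
            if gap < max_gap and left_ok and right_ok:
                x[i:j] = np.interp(np.arange(i, j), [i - 1, j], [x[i - 1], x[j]])
            i = j
        else:
            i += 1
    return x

sig = np.array([1.0, 1.0, np.nan, np.nan, 2.0, 2.0,
                np.nan, np.nan, np.nan, np.nan, 3.0, 3.0])
out = interpolate_gaps(sig)
# The 2-sample gap is filled; the 4-sample gap (>= 100 ms) is left missing.
```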

Statistical analysis of eye tracking data
To get an indication of how long participants process the objects, and how this is impacted by congruency, we will analyse participants' fixation durations. We will do this using linear mixed models with a fixed effect of congruency and random intercepts of participant and item. We will start with random slopes of congruency; if the model does not converge, we will prune the random-effects structure using the procedure described in Bates et al. (2015). In addition to congruency effects, the previous literature also indicates that N300/400 effects relate to semantic processes (Kutas & Federmeier, 2011). Combined, these findings suggest that additional semantic processing is required to recognise objects in incongruent contexts compared to congruent contexts. Therefore, we predict that RSA effects for semantic object properties will be larger for objects in incongruent contexts than for objects in congruent contexts. Further, as objects in incongruent contexts are recognised more slowly than objects in congruent contexts, we predict longer fixation durations for incongruent objects in the current study.
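As a much simpler stand-in for the planned mixed-model analysis, the sketch below aggregates simulated fixation durations to per-participant condition means and compares them with a paired t-test. This deliberately ignores the item-level random effects the protocol specifies, and the 30 ms effect size is invented for illustration.

```python
import numpy as np
from scipy import stats

def condition_means(durations, labels, participants):
    """Collect fixation durations per (participant, condition) cell."""
    cells = {}
    for d, lab, p in zip(durations, labels, participants):
        cells.setdefault((p, lab), []).append(d)
    return cells

# Simulated data: 34 participants, 40 fixations per condition, with a
# hypothetical 30 ms congruency effect on mean fixation duration.
rng = np.random.default_rng(7)
rows = []
for p in range(34):
    for lab, mu in [("congruent", 250.0), ("incongruent", 280.0)]:
        rows += [(rng.normal(mu, 30.0), lab, p) for _ in range(40)]
dur, lab, part = zip(*rows)
cells = condition_means(dur, lab, part)
cong = [np.mean(cells[(p, "congruent")]) for p in range(34)]
incong = [np.mean(cells[(p, "incongruent")]) for p in range(34)]
t_val, p_val = stats.ttest_rel(incong, cong)   # positive t: longer incongruent
```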

Dissemination of findings
The study will be published in an open-access peer reviewed journal and the associated data and scripts for the study will be placed in an open access data repository.

Discussion
Previous research has shown that object recognition involves the rapid transformation of visual information into semantic representations that allow objects to be accurately identified. It has also been shown that object recognition is influenced by the environment, with a congruent environment facilitating recognition. However, what has not been established is how object recognition is influenced by the environment when objects are embedded in the real world, and how this impacts the temporal dynamics of object recognition. Through the combination of mEEG, AR, and advanced statistical and computational techniques, we aim to elucidate the impact of real-world environments on the temporal dynamics of object recognition.
This research can also provide a basis for more exploratory questions concerning real-world contexts and exploration, such as how contextual gists form over time as we move into a new environment, how these relate to event boundaries, and whether congruency effects relate to how well such contextual gists have formed.

Study status
The study is currently at the start of the data collection phase.

Data availability
Underlying data
No data are associated with this article.

Extended data
The extended data can be found under Associated Projects on the project registration website.
Overall, I thought this was an interesting proposal that aims to further our understanding of how objects are processed in a real-world environment. While I have a few suggestions for improving the design, the authors should be commended for their ambitious plan. Studying the brain under naturalistic conditions is challenging and this proposal attempts to balance experimental control (i.e., preselected objects, controlled onset times, manipulation of context) with an immersive experience that provides improved external validity. Below, I outline my suggestions for the design.
1. One of the biggest issues is the lack of counterbalancing of objects across conditions. It is my understanding that one set of objects will serve as congruent objects, while the other set will serve as incongruent. However, any differences observed between the conditions could be attributed to differences between the objects, rather than their congruence with the environment. A tighter design would counterbalance objects across conditions (e.g., ½ the participants see a kettle in the kitchen, while the other ½ see a kettle in the parking lot).
2. Related to this, the authors say that the images in each block will be matched on a number of predictors, but it is unclear why the blocks need to be matched. Shouldn't the conditions be matched (congruent vs incongruent)? Or are the conditions presented in a blocked fashion?
3. Some more justification could be provided for the exploratory analysis to investigate congruency effects across all electrodes from zero to 1000ms. I think this analysis is warranted, but the authors should explicitly say what will be gained.
4. The purpose of the eye-tracking analyses is unclear since participants are instructed to maintain fixation on the object until prompted to rate how unexpected it was. If participants are supposed to maintain fixation, it seems unlikely that fixation will vary with congruency. The authors may want to consider using pupil dilation at object onset instead, as this may provide a more meaningful measure. Increased pupil dilation is often associated with increased attention and arousal, and thus, you might expect greater pupil dilation for incongruent than congruent trials.
5. Finally, the difference between congruent and incongruent object processing could be due to a speed-up for congruent objects and/or a slow-down for incongruent objects, relative to a context-free condition (more akin to standard laboratory tasks). The authors may want to consider running a control experiment using the same objects appearing on a computer screen while EEG and eye movements are recorded, against which they could compare their AR data. This would help emphasize what is gained from measuring object processing within a real-world environment and determine whether the difference between congruent and incongruent conditions is due to a benefit from the congruent context or a cost from the incongruent one (and these effects may dissociate for visual and semantic information).
Is the rationale for, and objectives of, the study clearly described? Yes
Is the study design appropriate for the research question?
I confirm that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

Assaf Harel
Department of Psychology, Wright State University, Dayton, OH, USA
The current study aims to investigate how the semantic background ('context') in which objects are embedded impacts their processing, and specifically how this process is substantiated at the neural level. This is not a novel question in and of itself, yet the main conceptual advance of the study is it asking how the impact of semantic context is achieved in real-world settings, that is, with freely behaving participants in naturalistic environments. To address this question, the study proposes to record EEG and eye movements from participants who will be moving in real-world environments while presented with AR-objects varying in their semantic congruency with the surrounding backgrounds. The authors hypothesize that similar to ERP studies using static 'snapshot'-like stimuli, the current study would also result in N300-400 effects of congruency, including similar behavioral and eye-tracking results.
This is a sound research project, strongly grounded in previous research and theory and it presents a robust analytical and methodological approach. The study will resolve a concrete theoretical question, namely, whether object-background context effects repeatedly reported in the last forty years can also be observed in naturalistic environments and in the process, will contribute to the advancement of the nascent field of real-world cognitive neuroscience. I do not have any major comments about the study's experimental design or its data analysis approach.
However, I do have a small number of comments and suggestions that the authors should consider as ways of improving the study's overall validity and implications for future research.
First and foremost, the authors should think about the greater significance of the current study beyond just the question of "will the same effects observed in passive vision also be observed in active vision conditions?". I do not mean this as a disparagement, as it is very important to establish the link between lab-based findings and those from more realistic settings. However, if, as the authors suggest, the same results are expected in both modalities, wouldn't that actually establish the usefulness and veracity of the passive approach to visual perception? Of course, I understand the need to base hypotheses for novel paradigms on previous literature, but the authors should also emphasize what new, unique perspectives could be gained from their novel design. What could differ between the two settings? In other words, the authors should consider, a priori, what free exploration of the environment adds to the phenomenon of semantic context facilitation. It is not just the fact that it's "real". As various embodied theories of cognition suggest, movement through the world and information accumulated over different viewpoints should all add to the formation of a contextual gist. Why not use the high temporal resolution of the current approach (combining the strengths of controlled environments, EEG, and eye-tracking) to provide novel insights into context effects? How does context form? How does it break down? These avenues should be further explored.
Second, if I understand correctly, the participants will be moving around real-world environments while looking at AR objects that will be either congruent or incongruent with the semantic content of the surrounding environments. In that case, I have a few comments regarding the experimental design.
How is the background-object context defined here? Usually, studies of semantic congruency use specific semantic scene categories (often basic-level scene categories), which have distinct objects related to them (e.g., a kitchen and a pan). However, if the background here is defined as a general manmade background (I could tell an office/lab space from Figure 2), how will specific objects be related to the background? Indeed, it may be argued that two manmade tools are more related to a general manmade background than natural objects, but the level of relatedness here is quite coarse, does not provide a sufficient level of information, and is definitely not comparable to previous studies of the topic. The authors should acknowledge this major limitation, or try to think of ways of circumventing it, such as using full VR environments or having participants explore real spaces (e.g., kitchens, bedrooms, etc.). Also, how are the outdoor (I am assuming mostly natural) contexts defined? Those may prove to be quite different from manmade (indoor) scenes. The authors should consider explicitly testing the specific effect of having superordinate scene categories, and how those would, in turn, differ from basic-level ones.
Another limitation that should be acknowledged by the authors is that AR objects are in themselves segregated from their background; that is, they are probably NOT perceived as real, authentic objects. Thus, even if semantic congruency effects are observed, the conclusions should be qualified and limited to AR-like circumstances. Otherwise, the authors would first have to establish that real-world objects are perceived and recognized in a similar fashion to AR objects.

Third, the task the authors use at the end of each trial may induce a semantic consistency effect artifactually. Asking participants at the end of each trial whether the object is consistent with its background will raise the participants' awareness of the experimental manipulation, and thus the observed effects may be due to the task rather than anything else. I would strongly suggest the authors either use this task as is (to obtain potential congruency ratings) but then test another sample of participants without such an explicit task, or simply revert to a paradigm that does not require an explicit judgment of object congruency.

Is the rationale for, and objectives of, the study clearly described? Yes

Moreno I. Coco
Sapienza University of Rome, Rome, Italy

This prospective study aims to investigate the impact of more realistic contexts on object-scene integration through the combined use of augmented reality, mobile electroencephalography, and eye-tracking. Participants (34) will navigate two possible environments (1 indoor, 1 outdoor), which will contain 80 objects, half of which are congruent and the other half incongruent with the environment (e.g., a pen vs. a flower on a desk). Based on the literature, the authors expect a larger negative shift of brain activity to be associated with an incongruent compared to a congruent object at a latency of 300/400 ms (N300/N400) after it is first inspected. Congruent objects should also be better recognized.
Overall, I like the study and I believe that it is indeed worth extending research on object recognition, and phenomena of context integration, to more ecological and naturalistic scenarios. There are still some obscurities, though, in aspects of the rationale/background/objectives, the design of the study, and the methods of analysis that I believe the authors could clarify/strengthen.
In the rationale, the authors claim that "little is known about what aspects of the recognition process are modulated by the context". They then cite a couple of studies, most using stimuli that are photoshopped. On continuing to read the protocol, the novelty seems to be the use of RSA, but it remains rather vague what insights we are supposed to learn and what aspects of object-scene integration it would help explain. So, in a way, the objectives could be much better stressed and identified. Some ideas seem to be present in the hypotheses section, but they certainly need expansion and greater clarity (e.g., why would visual and semantic information be represented earlier for congruent objects? And earlier with respect to what exactly? A precise time-point of the exploration?). The design also seems a little weird to me, and there are aspects of the analysis that need to be clarified.
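To make concrete what "represented earlier" could mean in an RSA analysis, the authors might report, per condition, the latency at which a model RDM begins to correlate with the neural RDM after first fixation. A minimal sketch of such a time-course (function name and data shapes are illustrative, not taken from the protocol):

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def rsa_timecourse(eeg, model_rdm):
    """Spearman-correlate a model RDM with the neural RDM at each time
    point of fixation-locked epochs. The latency at which the correlation
    peaks (or first becomes reliable) can then be compared between
    congruent and incongruent objects.

    eeg:       array (n_items, n_channels, n_times)
    model_rdm: condensed pairwise dissimilarities, length n_items*(n_items-1)/2
    """
    n_times = eeg.shape[2]
    rhos = np.empty(n_times)
    for ti in range(n_times):
        # neural RDM at this time point: 1 - correlation across channels
        neural_rdm = pdist(eeg[:, :, ti], metric="correlation")
        rho, _ = spearmanr(neural_rdm, model_rdm)
        rhos[ti] = rho
    return rhos
```

Running this separately on congruent and incongruent epochs, with a visual-model RDM and a semantic-model RDM, would turn the "earlier" hypothesis into a measurable latency difference.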
Only two environments will be used, onto which 80 object pictures (40 congruent/40 incongruent) taken from the Hovhannisyan et al. (2021) dataset will be superimposed.
If there are only two environments, how will they make sure that objects truly are congruent or incongruent with them? How many objects would fit an office desktop, for example, and how is the magnitude of the semantic association with it established? What about location? Will it be counterbalanced, or will an object always appear at the same location? And if objects change position, how will they account for the effect of spatial location on the likelihood of an object being there? In a nutshell, it is unclear how the semantic consistency/inconsistency effect will be operationalized and properly normed.

The study the power analysis is based on had 40 participants, but what about the number of items? The present design seems to have twice as many items and, as the authors are certainly aware, EEG is very noisy; I can only imagine how much noisier mEEG data would be. So why so few items per participant? This also connects to the question of why so few environments. Note, also, that the power analysis should be run according to the analysis they wish to do: they propose to use linear mixed-effects models, so they may want to estimate their effect size and power accordingly (e.g., via simulation; Green & MacLeod, 2016).
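The simulation-based approach that Green & MacLeod (2016) recommend (their simr package is in R) can be sketched in Python as follows. All parameter values here are placeholders, and the model is simplified to a subject-level random intercept; a full analysis would also model crossed item effects:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def simulate_power(n_subj=34, n_items=80, effect=0.1, sd_subj=0.2,
                   sd_item=0.2, sd_noise=0.5, n_sim=100, alpha=0.05, seed=0):
    """Estimate power for a congruency effect by simulating data from a
    mixed design and refitting a mixed model on each simulated dataset."""
    rng = np.random.default_rng(seed)
    subj = np.repeat(np.arange(n_subj), n_items)   # subject index per trial
    item = np.tile(np.arange(n_items), n_subj)     # item index per trial
    cong = (item % 2).astype(float)                # half the items congruent
    hits = 0
    for _ in range(n_sim):
        # subject and item random intercepts plus trial-level noise
        y = (effect * cong
             + rng.normal(0.0, sd_subj, n_subj)[subj]
             + rng.normal(0.0, sd_item, n_items)[item]
             + rng.normal(0.0, sd_noise, subj.size))
        df = pd.DataFrame({"y": y, "cong": cong, "subj": subj})
        # simplified model: random intercept for subjects only
        fit = smf.mixedlm("y ~ cong", df, groups="subj").fit()
        if fit.pvalues["cong"] < alpha:
            hits += 1
    return hits / n_sim
```

Varying `n_items` (and the assumed effect size) in such a simulation would directly answer the "why so few items" question in power terms.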
○ Eye-tracking: if the sampling rate is 30 Hz, how could a velocity-based saccade detector be used reliably? The rate is simply too low to be confident about it. Better to use methods that estimate fixations and saccades from spatial information instead (e.g., dispersion-based algorithms; Hessels et al., 2017).
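For illustration, a classic dispersion-threshold (I-DT) detector, one family of such spatial methods, can be sketched as below; the thresholds are placeholders that would need tuning to the headset's units and noise level:

```python
import numpy as np

def idt_fixations(x, y, t, disp_thresh=1.0, min_dur=0.1):
    """Dispersion-threshold (I-DT) fixation detection. Unlike velocity
    thresholds, dispersion over a minimum-duration window remains usable
    at low sampling rates such as 30 Hz.

    x, y: gaze coordinates; t: timestamps in seconds.
    Returns (start_time, end_time, mean_x, mean_y) per fixation.
    """
    x, y, t = np.asarray(x), np.asarray(y), np.asarray(t)

    def dispersion(i, j):
        # spread of the inclusive window [i, j]: (max-min in x) + (max-min in y)
        return (x[i:j + 1].max() - x[i:j + 1].min()
                + y[i:j + 1].max() - y[i:j + 1].min())

    fixations, i, n = [], 0, len(t)
    while i < n:
        j = i
        while j < n and t[j] - t[i] < min_dur:   # grow to minimum duration
            j += 1
        if j >= n:
            break
        if dispersion(i, j) <= disp_thresh:
            while j + 1 < n and dispersion(i, j + 1) <= disp_thresh:
                j += 1                           # extend while still compact
            fixations.append((t[i], t[j], x[i:j + 1].mean(), y[i:j + 1].mean()))
            i = j + 1
        else:
            i += 1                               # slide the window forward
    return fixations
```

At 30 Hz a 100 ms minimum duration corresponds to only ~3 samples, so even this approach should be validated against the noise characteristics of the AR headset's tracker.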
○ Is the analysis of the EEG time-locked to eye-movement information? If so, when and how? The study protocol is very vague about the alignment between all the different signals being collected. In general, I felt that the different paragraphs stand alone, so the reader misses the logical flow connecting the data being analyzed.