Journal of Experimental Psychology: General

Prospective Search Time Estimates Reveal the Strengths and Limits of Internal Models of Visual Search

Having an internal model of one's attention can be useful for effectively managing limited perceptual and cognitive resources. While previous work has hinted at the existence of an internal model of attention, it is still unknown how rich and flexible this model is, whether it corresponds to one's own attention or to a generic person-invariant schema, and whether it is specified as a list of facts and rules or alternatively as a probabilistic simulation model. To this end, we tested participants' ability to estimate their own behavior in a visual search task with novel displays. In six online experiments (four pre-registered), prospective search time estimates reflected accurate metacognitive knowledge of key findings in the visual search literature, including the set-size effect, higher efficiency of color over conjunction search, and the asymmetric contributions of target and distractor identities to search difficulty. In contrast, estimates were biased to assume serial search, and demonstrated little to no insight into sizeable effects of search asymmetries for basic visual features, and of target-distractor similarity. Together, our findings reveal a complex picture, where internal models of visual search are sensitive to some, but not all, of the factors that make some searches more difficult than others.


In order to efficiently interact with the world, agents construct mental models: simplified representations of the environment and of other agents that are accurate enough to generate useful predictions and handle missing data (Forrester, 1971; Friston, 2010; Tenenbaum et al., 2011). For example, participants' ability to predict the temporal unfolding of physical scenes has been attributed to an "intuitive physics engine": a simplified model of the physical world that uses approximate, probabilistic simulations to make rapid inferences (Battaglia et al., 2013). Similarly, having a simplified model of planning and decision-making allows humans to infer the beliefs and desires of other agents based on their observed behavior (Baker et al., 2011). Finally, in motor control, an internal model of one's motor system and body allows subjects to monitor and control their body (Wolpert et al., 1995). This internal forward model has also been proposed to play a role in differentiating self and other (Blakemore et al., 1998). In recent years, careful experimental and computational work has advanced our understanding of these internal models: their scope, the abstractions that they make, and the consequences of these abstractions for faithfully and efficiently modeling the environment.

Agents may benefit from having a simplified model not only of the environment, other agents, and their motor system, but also of their own perceptual, cognitive, and psychological states. For example, it has been suggested that knowing which items are more subjectively memorable is useful for making negative recognition judgments ("I would have remembered this object if I saw it," Brown et al., 1977). Similarly, children guided their decisions and evidence accumulation based on model-based expectations about the perception of hidden items (Siegel et al., 2021). In the context of perception and attention, Graziano and Webb (2015) argued that having an Attention Schema (a simplified model of attention and its dynamics) is crucial for monitoring and controlling one's attention, similar to how a body schema supports motor control.

Indeed, people are not only capable of predicting the temporal unfolding of physical scenes, or the behavior of other agents, but also the workings of their own attention under hypothetical scenarios. In one study, participants held accurate beliefs about the serial nature of visual search for a conjunction of features, and the parallel nature of visual search for a distinct color (Levin & Angelone, 2008). Similarly, the majority of third graders knew that the addition of distractors makes finding the target harder, particularly if the distractors and target are of the same color (Miller & Bigi, 1977). These and similar studies established the existence of metacognitive knowledge about visual search, as a result raising new questions about its structure, limits, and origins. We identify three such open questions. First, do internal models of visual search represent search difficulty along a gradient, or alternatively classify search displays as being either parallel or serial? Second, to what extent is knowledge about visual search learned or calibrated based on first-person experience? And third, are internal models of visual search structured as a list of facts and laws, or as an approximate probabilistic simulation?

Here we take a first step toward providing answers to these three questions, using visual search as our model test case for internal models of perception and attention more generally. Participants estimated their prospective search times in visual search tasks and then performed the same searches. Similar to using colliding balls (Smith & Vul, 2013) and falling blocks (Battaglia et al., 2013) to study intuitive physics, here we chose visual search for being thoroughly studied and for following robust behavioral laws. In Experiments 1 and 2, we used simple colored shapes as our stimuli, and compared participants' internal models to scientific theories of attention that distinguish parallel from serial processing. We found that participants represented the relative efficiency of different search tasks, but had a persistent bias to assume serial search. In Experiments 3 and 4, we used unfamiliar stimuli from the Omniglot dataset (Lake et al., 2011) with the purpose of testing the richness and compositional nature of participants' internal models, and their reliance on person-specific knowledge. We find that participants are capable of predicting their search times, even for novel stimuli. Furthermore, we show that for complex stimuli, internal models of visual search are better fitted to one's own search behavior compared with the search behavior of other participants. Finally, in Experiments 5 and 6, we find important limitations of these models: they fail to represent asymmetries between searching for the presence or absence of basic visual features, and are blind to the effects of semantic target-distractor similarity on search difficulty. Together, we find that people are capable of estimating the relative search difficulty of previously unseen searches, but that this ability is limited by having only partial insight into the many factors that affect visual search.

Experiments 1 and 2: Shape, Orientation, and Color

An internal model of visual search may take a similar form to that of a scientific theory, by specifying an ontology of concepts and a set of causal laws that operate over them (Gerstenberg & Tenenbaum, 2017; Gopnik & Meltzoff, 1997). For example, participants may hold an internal model of visual search that is similar to Anne Treisman's Feature Integration Theory. According to this theory, visual search comprises two stages: a pre-attentive parallel stage, and a serial focused attention stage (Treisman, 1986; Treisman & Sato, 1990). In the first stage, visual features (such as color, orientation, and intensity) are extracted from the display to generate spatial "feature maps." Targets that are defined by a single feature with respect to their surroundings can be located based on these feature maps alone (feature search; for example, searching for a red car in a road full of yellow taxis). Since the extraction of a feature map is pre-attentive, in these cases search can be completed immediately. In contrast, sometimes the target can only be identified by integrating over multiple features (conjunction search; for example, if the road has not only yellow taxis, but also red buses). In such cases, attention must be serially deployed to items in the display until the target is identified.

A simplifying assumption of Feature Integration Theory is that there is no transfer of information between the pre-attentive and focused attention stages. In other words, observers cannot selectively direct their focused attention to items that produced strong activations in the pre-attentive stage. Guided Search models (Wolfe, 1994, 2021; Wolfe et al., 1989) assume instead that participants use these pre-attentive guiding signals in their serial search. Compared to Feature Integration Theory, Guided Search models provide a much better fit to empirical data, at the expense of being more complex and rich in detail. To date, it is unknown where internal models of visual search fall on this performance-complexity trade-off: do people differentiate between parallel and serial searches like in Feature Integration Theory, or do they represent search difficulty on a continuum, more like Guided Search?

In Experiments 1 and 2, we used stimuli that lend themselves to a categorical distinction between parallel and serial search: simple geometrical shapes of different colors and orientations. We asked whether participants' internal models of visual search predict which search displays demand serial deployment of attention and which do not. Critically, participants gave their search time estimates before they were asked to perform searches involving these or similar stimuli, so their search time estimates reflected prior beliefs about search efficiency. Experiment 2 was designed to replicate and generalize the results of Experiment 1 to a new stimulus dimension (orientation) and distractor set sizes. Our hypotheses and analysis plan for Experiment 2, based on the results of Experiment 1, were pre-registered prior to data collection (pre-registration document: https://doi.org/10.17605/OSF.IO/2DPQ9). Raw data, experiment demos, and full analysis scripts are available at https://github.com/matanmazor/metaVisualSearch.


Participants

Experiments were approved by the Massachusetts Institute of Technology Committee on the Use of Humans as Experimental Subjects under protocol 0812003014. All participants gave their informed consent prior to participating. For Experiment 1, 100 participants were recruited from Amazon's crowdsourcing web service Mechanical Turk. Experiment 1 took about 20 min to complete. Each participant was paid $2.50. The highest performing 30% of participants received an additional bonus of $1.50. For Experiment 2, 100 participants were recruited from the Prolific crowdsourcing web service. The experiment took about 15 min to complete. Each participant was paid £1.5. The highest performing 30% of participants received an additional bonus of £1.


Procedure

The study was built using the Lab.js platform (Henninger et al., 2019) and hosted on a JATOS server (Lange et al., 2015). Demo versions of all six experiments are available at https://github.com/matanmazor/metaVisualSearch.


Familiarization

First, participants were acquainted with the visual search task. The instructions for this part were as follows:

In the first part, you will find a target hidden among distractors. First, a gray cross will appear on the screen. Look at the cross. Then, the target and distractors will appear. When you spot the target, press the spacebar as quickly as possible. Upon pressing the spacebar, the target and distractors will be replaced by up to five numbers. To move to the next trial, type in the number that replaced the target.

The instructions were followed by four trials of an example visual search task (searching for a T among 7 Ls). Feedback was delivered on speed and accuracy. The purpose of this part of the experiment was to familiarize participants with the task.


Estimation

After familiarization, participants estimated how long it would take them to perform various visual search tasks involving novel stimuli and various set sizes. On each trial, they were presented with a target stimulus and a display of distractors and were asked to estimate how long it would take them to find the target if it was hidden among the distractors (see Figure 1).

To motivate accurate estimates, we explained that these visual search tasks would be performed in the last part of the experiment, and that bonus points would be awarded for trials in which participants detect the target as fast as or faster than their search time estimate. The number of points awarded for a successful search changed as a function of the estimate given for the same search, such that more points were offered for riskier estimates. In order to meaningfully compare estimates for different searches, it was important that any tendency to produce risky or conservative estimates be conserved across all searches. To achieve that, the number of points offered for a successful search was set to 10/√estimate. We chose this rule because, for right-skewed log-normal reaction time distributions, an optimal strategy is to consistently choose an estimate that is aligned with the 70th quantile of the estimated RT distribution (see Appendix). The report scale ranged from 0.1 to 4 s.
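The incentive rule described above can be explored numerically. The following is a minimal sketch, not the authors' analysis code: it assumes a participant who believes their RT is log-normally distributed, and it searches the 0.1 to 4 s report scale for the estimate that maximizes expected points. The function names and the parameter values (a believed median of 0.8 s and a log-scale SD of 1.0) are hypothetical choices made for illustration. Under these assumed parameters the maximizing estimate falls close to the 70th quantile; other assumed spreads give other quantiles.

```python
import math

def points_for(estimate_s, exponent=0.5):
    """Points offered for a successful search: 10 / estimate**exponent."""
    return 10.0 / estimate_s ** exponent

def lognormal_cdf(x, mu, sigma):
    """P(RT <= x) for a log-normal RT belief (mu, sigma are on the log scale)."""
    z = (math.log(x) - mu) / sigma
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def best_estimate(mu, sigma, exponent=0.5, lo=0.1, hi=4.0, step=0.001):
    """Grid-search the report scale for the estimate maximizing expected points.

    Expected points = points offered * probability of finishing in time.
    Returns the maximizing estimate and the quantile of the RT belief it sits at.
    """
    best_e, best_value = lo, -1.0
    n_steps = int(round((hi - lo) / step))
    for i in range(n_steps + 1):
        e = lo + i * step
        value = points_for(e, exponent) * lognormal_cdf(e, mu, sigma)
        if value > best_value:
            best_e, best_value = e, value
    return best_e, lognormal_cdf(best_e, mu, sigma)

# Hypothetical belief: median RT of 0.8 s, log-scale SD of 1.0 (assumed values).
estimate, quantile = best_estimate(mu=math.log(0.8), sigma=1.0)
print(f"best estimate: {estimate:.2f} s, quantile of believed RTs: {quantile:.2f}")
```

Because the same rule is applied to every search, any risk preference expressed through it shifts all estimates in the same direction, which is what makes estimates comparable across searches.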

After one practice trial (estimating search time for finding one T among three randomly positioned Ls), we turned to our stimuli of interest. In Experiment 1, participants estimated how long it would take them to find a red (#FF5733) square among green (#16A085) squares (color condition), red circles (shape condition), and a mix of green squares, red circles, and green circles (shape-color conjunction condition), for set sizes 1, 5, 15, and 30. Together, participants estimated the expected search time of 12 different search tasks (see Figure 1, upper right panel). In Experiment 2, participants rated how long it would take them to find a red tilted bar (20° off vertical) among green tilted bars (color condition), red vertical bars (orientation condition), and a mix of green tilted and red vertical bars (orientation-color conjunction condition) for set sizes two, four, and eight. Together, participants estimated the expected search time of nine different search tasks (see Figure 1, lower right panel). In both experiments, the order of estimation trials was randomized between participants.


Visual Search

Participants performed three consecutive search tasks for each of the 12 (Experiment 1) or 9 (Experiment 2) search types. The order of presentation was randomized between participants. No feedback was delivered about speed. To motivate accurate responses, error trials were followed by a 5-s pause.


Results

Accuracy in the visual search task was reasonably high in both Experiments (Experiment 1: M = 0.93, 95% CI [0.90, 0.96]; Experiment 2: M = 0.82, [0.77, 0.87]). Error trials and visual search trials that took shorter than 200 ms or longer than 5 s were excluded from all further analyses. Participants were excluded if more than 30% of their trials were excluded based on the aforementioned criteria, leaving 89 and 74 participants for the main analysis of Experiments 1 and 2, respectively.
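The two-level exclusion rule can be sketched as follows. This is an illustrative sketch rather than the authors' pipeline: the trial record layout (`rt` in seconds, `correct` as a boolean), the function names, and the example data are all assumptions.

```python
def keep_trial(trial, min_rt=0.2, max_rt=5.0):
    """Keep a trial only if it was correct and its RT (seconds) is within bounds."""
    return trial["correct"] and min_rt <= trial["rt"] <= max_rt

def apply_exclusions(trials_by_participant, max_excluded=0.30):
    """Return {participant: kept trials}, dropping participants who lose
    more than `max_excluded` of their trials to the trial-level criteria."""
    retained = {}
    for pid, trials in trials_by_participant.items():
        kept = [t for t in trials if keep_trial(t)]
        n_excluded = len(trials) - len(kept)
        if n_excluded <= max_excluded * len(trials):
            retained[pid] = kept
    return retained

# Hypothetical data: p1 loses 1 of 10 trials, p2 loses 5 of 10.
example = {
    "p1": [{"rt": 0.65, "correct": True}] * 9 + [{"rt": 6.20, "correct": True}],
    "p2": [{"rt": 0.15, "correct": True}] * 4
          + [{"rt": 0.90, "correct": False}]
          + [{"rt": 0.80, "correct": True}] * 5,
}
clean = apply_exclusions(example)
print(sorted(clean))  # participants retained for analysis
```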


Search Times

For each participant and distractor type, we extracted the slope of the function relating RT to distractor set size. As expected, search slopes for color search were not significantly different from zero in Experiment 1 (−0.40 ms/item; t[88] = −0.45, p = .652, BF01 = 7.74) and Experiment 2 (0.51 ms/item; t[73] = 0.07, p = .946, BF01 = 7.80). This is consistent with color being a basic feature that is not dependent on serial attention for its extraction by the visual system (Treisman, 1986; Treisman & Sato, 1990). The slope for shape search was close to, but significantly higher than, zero (5.66 ms/item; t[88] = 4.35, p < .001), and the slope for orientation was numerically higher than zero (11.05 ms/item) but not significantly so (t[73] = 1.5

p = .139, BF01 = 2.
0). In both Experiments, conjunction search gave rise to search slopes significantly higher than zero (Experiment 1: 14.80 ms/item, t[88] = 9.16, p < .001; Experiment 2: 72.14 ms/item, t[73] = 7.50, p < .001; see Figure 2).
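The slope analysis can be illustrated with a short sketch: a least-squares slope of RT on set size per participant, followed by a one-sample t statistic on the resulting slopes. The function names and the RT values are hypothetical, the set sizes are those of Experiment 1, and the authors' actual scripts (linked above) may differ in detail.

```python
import math

def ls_slope(set_sizes, rts_ms):
    """Least-squares slope of RT (ms) on set size, in ms/item."""
    n = len(set_sizes)
    mean_x = sum(set_sizes) / n
    mean_y = sum(rts_ms) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(set_sizes, rts_ms))
    sxx = sum((x - mean_x) ** 2 for x in set_sizes)
    return sxy / sxx

def one_sample_t(values, null_mean=0.0):
    """t statistic and degrees of freedom for a one-sample t-test."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)  # sample variance
    return (mean - null_mean) / math.sqrt(var / n), n - 1

set_sizes = [1, 5, 15, 30]  # set sizes used in Experiment 1
# Hypothetical mean RTs (ms) per set size for three participants.
rts_by_participant = [
    [620, 655, 790, 1010],
    [580, 640, 720, 930],
    [700, 690, 860, 1100],
]
slopes = [ls_slope(set_sizes, rts) for rts in rts_by_participant]
t_stat, df = one_sample_t(slopes)
print([round(s, 2) for s in slopes], round(t_stat, 2), df)
```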


Estimation Accuracy

We next turned to analyze participants' prospective search time estimates, and their alignment with actual search times. In both Experiments, participants generally overestimated their search times. This was the case for all search types across the two Experiments (see Figure 2, left panels). This is expected, based on our bonus scheme that incentivized conservative estimates (see Appendix). Despite this bias, estimates were correlated with true search times, supporting a metacognitive insight into visual search behavior (see Figure 2, left panels; within-subject Spearman correlations, Experiment 1: M = 0.28, 95% CI [0.21, 0.35], t[88] = 7.77, p < .001; Experiment 2: M = 0.16, [0.07, 0.26], t[73] = 3.48, p = .001).
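A within-subject Spearman correlation of this kind can be computed by ranking each participant's estimates and search times and correlating the ranks. The sketch below is a generic implementation with tie-aware ranks; the function names and the example values are hypothetical and are not taken from the study's data.

```python
def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical participant: estimates (s) and median search times (s) per search type.
estimates = [1.2, 1.5, 2.0, 1.1, 2.4, 3.0]
search_times = [0.61, 0.64, 0.70, 0.66, 0.95, 1.30]
print(round(spearman(estimates, search_times), 2))
```

Computing one coefficient per participant and then testing the coefficients against zero at the group level keeps between-participant differences in overall speed or caution from inflating the correlation.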


INTERNAL MODELS OF VISUAL SEARCH

To test participants' internal models of visual search, we analyzed their estimates as if they were search times, and extracted estimation slopes relating estimates to the number of distractors in the display (see Figure 2, right panels). Estimation slopes (expected ms/item) were steeper than search slopes for all search types. In particular, although search time for a deviant color was unaffected by the number of distractors, participants estimated that color searches with more distractors should take longer (mean estimated slope in Experiment 1: 17.76 ms/item; t[88] = 6.35, p < .001; in Experiment 2: 29.43 ms/item; t[73] = 2.63, p = .010). In other words, at the group level, participants showed no metacognitive insight into the parallel nature of color search.

Although they were significantly different from zero, in both Experiments estimation slopes for color search were significantly shallower than for conjunction search (Experiment 1: t[88] = 4.08, p < .001, Experiment 2: t[73] = 3.87, p < .001). In contrast, although true search slopes were shallower for shape and orientation than for conjunction (p's < .001), the difference in estimation slopes was not significant (difference between shape and conjunction slopes: t[88] = 1.65, p = .103; difference between orientation and conjunction slopes: t[73] = 1.18, p = .244).


Experiments 3 and 4: Complex, Unfamiliar Stimuli

In Experiments 1 and 2, an internal model of visual search allowed participants to accurately estimate how long it would take them to find a target stimulus in arrays of distractor stimuli. Participants had insight into the set-size effect and into the fact that conjunction searches are more difficult than color searches. Positive color search slopes that are nevertheless significantly shallower than conjunction search slopes further suggested a graded representation of search efficiency, but no awareness of the possibility of parallel processing of preattentive basic features. An alternative interpretation is that a gradient of positive search slopes emerges due to a group averaging effect of individual dichotomous representations. If some participants represent color search as parallel, and others as equally difficult as conjunction search, the mean slope for color search would be higher than zero and significantly lower than for conjunction search.

MAZOR, SIEGEL, AND TENENBAUM
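The group-averaging account can be made concrete with a toy calculation. In the sketch below, every number is hypothetical: half of the simulated participants believe color search is parallel (slope 0) and the other half believe it is as hard as conjunction search. The group mean is then intermediate even though no individual holds an intermediate belief.

```python
def mean_slope(slopes):
    """Group-level mean of individual estimation slopes (ms/item)."""
    return sum(slopes) / len(slopes)

conjunction_slope = 60.0  # hypothetical believed conjunction slope, ms/item
share_parallel = 0.5      # hypothetical share who believe color search is parallel
n_participants = 100

n_parallel = int(n_participants * share_parallel)
# Dichotomous beliefs: each participant holds either 0 or the conjunction slope.
color_beliefs = [0.0] * n_parallel + [conjunction_slope] * (n_participants - n_parallel)
conjunction_beliefs = [conjunction_slope] * n_participants

group_color = mean_slope(color_beliefs)
n_intermediate = sum(0.0 < s < conjunction_slope for s in color_beliefs)

print(group_color)                       # 30.0: positive, yet below the conjunction mean
print(mean_slope(conjunction_beliefs))   # 60.0
print(n_intermediate)                    # 0: nobody holds a graded belief
```

Distinguishing the two accounts therefore requires looking at the distribution of individual slopes, or at stimuli for which a parallel/serial dichotomy is not informative, rather than at the group mean alone.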

In Experiments 3 and 4, we addressed this possibility and further asked how rich this model is, by using displays of complex stimuli with which participants are unlikely to have any prior experience (letters from a medieval Alphabet and from the Futurama TV series, hand drawn by Mechanical Turk workers). Here, insight into the set size effect and its absence in feature searches would not be useful for generating accurate search time estimates. Instead, participants' internal model of visual search must be capable of extracting relevant features from rich stimuli, and using these features to generate graded stimulus-specific predictions. Using these more complex stimuli further allowed us to ask if search-time estimates rely on person-specific knowledge, as subjects are expected to vary more in their search behavior in more complex displays. Experiment 4 followed Experiment 3 and was pre-registered (pre-registration document: https://doi.org/10.17605/OSF.IO/DPRTK). Raw data, experiment demos, and full analysis scripts are available at https://github.com/matanmazor/metaVisualSearch.


Participants

For Experiment 3, 100 participants were recruited from the Prolific crowdsourcing web service. The experiment took about 15 min to complete. Participants were paid £1.5. The highest performing 30% of participants received an additional bonus of £1. For Experiment 4, 200 participants were recruited from the Prolific crowdsourcing web service. We recruited more participants for Experiment 4 in order to have sufficient statistical power for our inter-subject correlation analysis. The experiment took about 8 min to complete. Participants were paid $1.27. The highest performing 30% of participants received an additional bonus of $0.75.


Procedure

The procedure for Experiments 3 and 4 was similar to that of Experiment 1 with several changes.

Stimuli were letters drawn by Mechanical Turk workers (Lake et al., 2011), instead of geometrical shapes (see Figure 3). In Experiment 3, we used letters from the Alphabet of the Magi. In Experiment 4, we used letters from the Futurama television series as well as Latin letters. We explained to participants that they would search for a specific letter (the target letter) among copies of another letter (the distractor letter). In Experiment 3, both target and distractor were letters from the Alphabet of the Magi, and distractors were drawn by different Mechanical Turk workers. In Experiment 4, on half of the trials, the target was a Latin letter and distractors were Futurama letters, and on the other half the target was a Futurama letter and distractors were Latin letters. In these experiments, distractors were copies of the same letter drawn by the same Mechanical Turk worker. This was important for our visual search asymmetry analysis (see below).

In the familiarization part, we used as target and distractors two letters from the Alphabet of the Magi (Experiment 3) and two letters from the Futurama alphabet (Experiment 4). Importantly, these letters were only used for training and did not appear in the Estimation or Visual search parts. In the estimation part, participants gave search time estimates for eight search tasks, all involving 10 distractors, and in the visual search part, they performed these search tasks. To minimize random variation in spatial configurations (which was important for the search asymmetry analysis), in Experiment 4, letters appeared on an invisible clock face. Finally, the report scale ranged from 0.1 to 4 s in Experiment 3 and to 2 s in Experiment 4.


Results

Accuracy in the visual search task was high in both experiments (Experiment 3: M = 0.89, 95% CI [0.86, 0.92]; Experiment 4: M = 0.97, [0.96, 0.98]). Error trials and visual search trials that took shorter than 200 ms or longer than 5 s were excluded from all further analyses. Participants were excluded if more than 30% of their trials were excluded based on the aforementioned criteria, leaving 88 and 200 participants for the main analysis of Experiments 3 and 4, respectively.

Importantly, in both experiments all searches involved exactly 10 distractors, so a positive correlation could not be driven by the effect of distractor set size. Furthermore, since participants had no prior experience with our stimuli, their estimates could not have been informed by explicit knowledge about specific letters ("The third letter in the Alphabet of the Magi pops out to attention when presented between instances of the fourth letter" or "the fifth letter in the Futurama Alphabet is difficult to find when presen

"). These positive correlations reveal a more intricate knowledge of visual search. Our next two analyses were designed to test whether estimates were based on person-specific knowledge, and whether their generation involved a simulation of the search process.


Estimation Accuracy


Cross-Participant Correlations

We chose unfamiliar letters as stimuli for Experiments 3 and 4 in order to make heuristic-based estimation more difficult, and to encourage an introspective estimation process. If participants were using idiosyncratic knowledge about their own attention, we would expect to find higher correlations between their search time estimates and their own search times (self-self alignment) than between their estimates and the search times of a random surrogate participant (self-other alignment). To test this, we ran a non-parametric permutation test, comparing self-self and self-other alignment in prospective search time estimates. In Experiment 3, a numerical difference between self-self (mean Spearman correlation M_r = 0.44) and self-other alignment (M_r = 0.41) was marginally significant (p_perm = 0.05). In Experiment 4, we pre-registered this analysis and found a significant advantage for self-self alignment compared with self-other alignment (see Figure 6; mean Spearman correlations for self-self M_r = 0.10 and self-other M_r = 0.04, p_perm = 0.01). This result can be interpreted as indicating that at least some of the participants' internal model of visual search builds on idiosyncratic knowledge about their own attention. Alternatively, it may reflect inter-individual differences in the perception of complexity and similarity of targets and distractors. We unpack some implications of these two competing accounts in the General Discussion section.
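The logic of comparing self-self with self-other alignment can be sketched as a permutation test. The code below is one possible implementation under stated assumptions, not the authors' procedure: surrogate partners are drawn by shuffling participant labels (so a participant may occasionally be paired with themselves), the p-value is one-sided, and the Spearman formula used assumes no tied values. All names and the demo data are hypothetical.

```python
import random

def ranks(values):
    """Ranks 1..n; assumes no ties, to keep the sketch short."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        out[idx] = rank
    return out

def spearman_no_ties(x, y):
    """Spearman's rho via the rank-difference formula (valid without ties)."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

def mean_alignment(estimates, search_times, partner):
    """Mean correlation of participant i's estimates with partner[i]'s search times."""
    rhos = [spearman_no_ties(estimates[i], search_times[partner[i]])
            for i in range(len(estimates))]
    return sum(rhos) / len(rhos)

def permutation_test(estimates, search_times, n_perm=1000, seed=0):
    """Compare self-self alignment with a null of shuffled (self-other) pairings."""
    rng = random.Random(seed)
    n = len(estimates)
    observed = mean_alignment(estimates, search_times, list(range(n)))
    null = []
    for _ in range(n_perm):
        partner = list(range(n))
        rng.shuffle(partner)  # assign each participant a surrogate partner
        null.append(mean_alignment(estimates, search_times, partner))
    # One-sided p-value with the usual +1 correction.
    p = (1 + sum(v >= observed for v in null)) / (n_perm + 1)
    return observed, p

# Hypothetical demo: 12 participants, 8 searches, estimates perfectly track own RTs.
demo_rng = random.Random(1)
demo_times = [demo_rng.sample(range(300, 2000), 8) for _ in range(12)]
demo_estimates = [[t + 500 for t in row] for row in demo_times]
observed, p_perm = permutation_test(demo_estimates, demo_times, n_perm=200)
print(observed, p_perm)
```

The null distribution plays the role of the self-other alignment shown in Figure 6: it captures how well estimates track search difficulty in general, so that any excess in the observed value reflects participant-specific alignment.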


Estimation Time

We next looked at the time taken to produce search time estimates in the estimation part. We reasoned that if participants had to mentally simulate searching for the target in order to generate their search time estimates, they would take longer to estimate that a search task will terminate after, for example, 1,500 compared to 1,000 ms. This is similar to how a linear alignment between the degree of rotation and response time in a mental rotation task was taken as support for an internal simulation that evolves over time (Shepard & Metzler, 1971). We find no evidence for a within-subject correlation between estimates and the time taken to deliver them, neither in Experiment 3 (t[86] = 0.40, p = .692) nor in Experiment 4 (t[191] = 0.74, p = .458). However, given that estimation times were three times longer than search time estimates (median time to estimate = 5 s in Experiment 3 and 3 s in Experiment 4), a simulation-driven correlation may have been masked by other factors that contributed to estimation times, such as motor control over the report slider.


Visual Search Asymmetry

To keep things simple, internal models of visual search may make the simplifying assumption that target and distractor stimuli contribute to search difficulty in similar ways. For example, models can specify that search time generally inversely scales with the perceived similarity between the target and distractor stimuli, without taking into account the different roles target and distractor play in determining search difficulty. Alternatively, internal models of visual search may represent the asymmetric nature of visual search tasks (finding an A among Bs is not the same as finding a B among As) at the expense of additional model complexity.

To test whether internal models of visual search were sensitive to the assignment of stimuli to target and distractor roles, we leveraged a well-established phenomenon in visual search: subjects are generally faster detecting an unfamiliar stimulus in an array of familiar distractors compared to when the target is familiar and the distractors are not (Malinowski & Hübner, 2001; Shen & Reingold, 2001; Zhang & Onyper, 2020). This asymmetry cannot be captured by a model of visual search that is blind to the assignment of stimuli to target and distractor roles. In Experiment 4, participants were presented with pairs of familiar and unfamiliar letters, and estimated their search time for finding the familiar letter among unfamiliar distractors and vice versa. This allowed us to test for visual search asymmetries in search times and in search time estimates.

As expected, searching for a familiar target among unfamiliar distractors was more difficult on average, with a difference of 41 ms in search time (t[199] = 4.41, p < .001). To test if subjects were sensitive to the assignment of stimuli to target and distractor roles, we extracted individual subjects' Spearman correlations between search times and their reciprocal estimates (i.e., the estimate for the same search with the target and distractor roles inverted). For example, instead of comparing search times for finding the letter v among 10 square spiral letters (stimulus pair 1) with estimates for the same search, we compared it with estimates for finding one square spiral letter among 10 v's (stimulus pair 5). If estimates were affected by the assignment of stimuli to target and distractor roles, this inversion should attenuate the correlation, but if visual search estimates reflected a symmetric notion of similarity the correlation should not be affected.
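The re-pairing step can be sketched as a simple re-indexing of a participant's estimates. The mapping used here, in which search k and search k + 4 share the same two letters with roles swapped, is an assumption extrapolated from the single example in the text (stimulus pairs 1 and 5); the function names and the estimate values are hypothetical.

```python
def reciprocal_index(search_id, n_searches=8):
    """Map a 1-based search index to its role-reversed counterpart,
    assuming searches k and k + n/2 use the same letters with roles swapped."""
    half = n_searches // 2
    return search_id + half if search_id <= half else search_id - half

def reciprocal_estimates(estimates_by_search):
    """For each search, look up the estimate given for the role-reversed search."""
    n = len(estimates_by_search)
    return {k: estimates_by_search[reciprocal_index(k, n)]
            for k in estimates_by_search}

# Hypothetical estimates (s) from one participant for the eight searches.
example_estimates = {1: 0.9, 2: 1.4, 3: 0.7, 4: 1.1, 5: 1.3, 6: 0.8, 7: 1.0, 8: 1.2}
swapped = reciprocal_estimates(example_estimates)
print(swapped[1], swapped[5])  # estimates borrowed from searches 5 and 1
```

Correlating each participant's search times with `swapped` rather than with their original estimates gives the role-inverted correlation; a symmetric similarity-based model predicts that the two correlations should be about equal.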

Inverting the target/distractor assignment dropped the correlation between estimates and search time to zero (M = −0.01, 95% CI [−0.06, 0.04]), significantly lower than the original correlation (M_D = 0.10, [0.03, 0.18], t[191] = 2.63, p = .009; see Figure 5B). This is in contrast to what is expected if search time estimates reflected symmetric similarity judgments, and in line with an interpretation of our findings as evidence for an internal model of visual search that is sensitive to the assignment of stimuli to target or distractor roles.

Interestingly, however, a difference in mean estimated search time between familiar and unfamiliar targets did not reach statistical significance (M = 9.88, 95% CI [−5.90, 25.66], t[199] = 1.23, p = .218). A drop in subject-specific Spearman correlations without a significant difference in mean search times indicates that subjects' sensitivity to the assignment of stimuli to target and distractor roles was not fully captured by the metacognitive insight that familiar targets are more difficult to find. Subjects may have been sensitive to other visual properties that contributed to search asymmetries. In Experiment 5, we further explore sensitivity to three such features that produce robust asymmetries in visual search behavior: orientation, open edges, and addition of line strokes.


Figure 6

Self-Self Versus Self-Other Alignment

Note. True correlation between estimates and search times (self-self alignment, vertical lines) plotted against a null distribution of correlations, when matching the estimates of each participant with the search time of a random surrogate participant (self-other alignment).


Experiment 5: Three Search Asymmetries

In Experiment 4, search time estimates were sensitive to the assignment of stimuli to target and distractor roles, but not to the visual search asymmetry for familiar and unfamiliar stimuli. In Experiment 5, we examined three additional search asymmetries (line orientation, open edges, and line addition), and asked whether they are accurately specified in participants' internal models of visual search. Experiment 5 was pre-registered (pre-registration document: https://doi.org/10.17605/OSF.IO/VJQ2F). Raw data, experiment demos, and full analysis scripts are available at https://github.com/matanmazor/metaVisualSearch.


Participants

For Experiment 5, 203 participants were recruited from the Prolific crowdsourcing web service. The experiment took about 10 min to complete. Participants were paid $1.59. The highest performing 30% of participants received an additional bonus of $1.59.


Procedure

The procedure for Experiment 5 was similar to that of Experiment 1 with several changes.

Participants estimated their prospective search times for three stimulus pairs. Within each pair, participants provided estimates for two versions of the search: one where the first stimulus served as the target and the second as the distractor, and one where the roles were reversed. For each search, subjects provided estimates for set sizes of 6 and 18. The three stimulus pairs were (a) a vertical line and a tilted (20° off vertical) line, (b) a circle and a circle intersected by a line, and (c) a circle and a circle with an open gap (see Figure 7, left panel). For brevity, we refer to these last stimuli as O, Q, and C. All three stimulus pairs have been shown to produce asymmetries in visual search time, such that the assignment of stimuli to target or distractor roles affects search time (Treisman & Gormican, 1988; Treisman & Souther, 1985).

The estimation scale ranged from 0 to […] s. In Experiment 5, we adapted the estimate-to-points conversion rule to be $10/\text{estimate}^{3/4}$ rather than $10/\text{estimate}^{1/2}$. Making the number of offered points decline faster ensured that the optimal strategy is to report the median of the posterior distribution over reaction times, making it possible to directly compare median search times and prospective estimates.
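
The effect of a steeper conversion rule on the reward-maximizing estimate can be illustrated with a minimal sketch. It assumes that points are earned only when the actual search time falls below the estimate (an assumption made here for illustration; the payoff contingency is not restated in this section) and uses a hypothetical log-normal distribution of search times. The function names, the grid, and all parameter values are illustrative, and the sketch shows only the direction of the shift, not the exact payoff scheme of the experiment.

```python
import random

def expected_points(estimate, search_times, exponent):
    """Expected points for one estimate under the rule 10 / estimate**exponent.

    Assumption (illustrative): points are earned only if the actual search
    time is shorter than the estimate.
    """
    p_success = sum(rt < estimate for rt in search_times) / len(search_times)
    return p_success * 10 / estimate ** exponent

def best_estimate(search_times, exponent, grid):
    """Grid value that maximizes expected points."""
    return max(grid, key=lambda e: expected_points(e, search_times, exponent))

rng = random.Random(0)
# Hypothetical search times in seconds (log-normal with a median near 1 s).
search_times = [rng.lognormvariate(0.0, 0.4) for _ in range(5000)]
grid = [0.05 * k for k in range(2, 81)]  # candidate estimates, 0.10 s to 4.00 s

best_half = best_estimate(search_times, 0.5, grid)
best_three_quarters = best_estimate(search_times, 0.75, grid)
print(best_half, best_three_quarters)
```

Under these assumptions, the estimate that maximizes expected points with the exponent 3/4 is never larger than the one obtained with the exponent 1/2, so a faster decline in offered points pushes the rewarded estimate toward shorter times.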

In the visual search part […] consecutive instances of each search. In order to prevent subjects from relying on their iconic memory to identify the position of the target after making an initial response, stimuli were masked by a random black and white image for a duration of 50 ms following spacebar responses.


Results

Accuracy in the visual search task was high (M = 0.96, 95% CI [0.95, 0.97]). Error trials and visual search trials that took shorter than 200 ms or longer than 5 s were excluded from all further analyses. Participants were excluded if more than 30% of their trials were excluded based on the aforementioned criteria, leaving 200 participants for the main analysis of Experiment 5.
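
The exclusion criteria above can be sketched as follows. This is a minimal illustration, not the authors' analysis script: the trial format (dictionaries with `rt` in seconds and a `correct` flag) and the function names are assumptions made for the example.

```python
def is_valid(trial, min_rt=0.2, max_rt=5.0):
    """A trial is kept if the response was correct and the search time (s)
    was neither shorter than min_rt nor longer than max_rt."""
    return trial["correct"] and min_rt <= trial["rt"] <= max_rt

def keep_participant(trials, max_excluded_percent=30):
    """Keep a participant unless more than max_excluded_percent of their
    trials were excluded (integer arithmetic avoids rounding issues)."""
    n_excluded = sum(not is_valid(t) for t in trials)
    return n_excluded * 100 <= max_excluded_percent * len(trials)

# Hypothetical participant: one response that was too fast, one error,
# one response that was too slow, and seven valid trials.
trials = [
    {"rt": 0.15, "correct": True},
    {"rt": 0.80, "correct": False},
    {"rt": 6.20, "correct": True},
] + [{"rt": 1.00, "correct": True}] * 7

valid = [t for t in trials if is_valid(t)]
print(len(valid), keep_participant(trials))  # 7 True
```

Here exactly 30% of trials are excluded, which does not exceed the threshold, so the hypothetical participant is retained.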


Visual Search Asymmetries

Search slopes were significantly different for the six searches (F[3.57, 704.01 […] Within the three stimulus pairs, we observed the expected search asymmetries. The mean search slope for finding one vertical target among multiple tilted distractors (18.12 ms/item) was significantly steeper than the slope for the inverse search (5.16 ms/item; M = 12.97, 95% CI [8.27, 17.68 […]


Estimation Accuracy

The mean Spearman correlation between search slopes and their estimates was 0.32 and significantly higher than zero (M = 0.32, 95% CI [0.28, 0.36], t[197] = 15.76, p < .001). Contrary to our findings from Experiments 1 to 4, the average search estimation slope (17.48 ms/item) was significantly shallower than the average search slope (29.00 ms/item; M_D = −11.39, [−14.76, −8.02], t[197] = −6.66, p < .001). This difference may be driven by the change to our estimate-to-points conversion rule, which now incentivized riskier estimates.
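
For readers unfamiliar with the ms/item unit, the following minimal sketch shows how a slope can be derived from times at the two set sizes used here (6 and 18). The function name and the numbers are hypothetical and are not taken from the data.

```python
def slope_ms_per_item(time_small_ms, time_large_ms, set_small=6, set_large=18):
    """Change in time (ms) per additional item between two set sizes."""
    return (time_large_ms - time_small_ms) / (set_large - set_small)

# Hypothetical participant: actual search times and prospective estimates (ms).
search_slope = slope_ms_per_item(700, 1060)
estimation_slope = slope_ms_per_item(900, 1080)
print(search_slope, estimation_slope)  # 30.0 15.0
```

In this made-up case the estimation slope is shallower than the search slope, which is the direction of the group-level difference reported above.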

Overall, participants integrated information about the assignment of stimuli to target or distractor roles in providing their estimates. The mean Spearman correlation between search times and the estimates of their reciprocal searches (i.e., searches with the same stimuli and set sizes, but an opposite target/distractor assignment) was 0.22, significantly lower than the correlation between search times and their corresponding (non-reciprocal) estimates (0.32; M = 0.10, 95% CI [0.05, 0.15], t[197] = 4.14, p < .001).
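
The comparison between corresponding and reciprocal estimates can be sketched for a single participant as follows. This is an illustration under stated assumptions rather than the authors' analysis code: the data values are hypothetical, only two stimulus pairs are included for brevity, and the helper names are made up for the example.

```python
def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical data, keyed by (target, distractor, set size), in seconds.
search_times = {
    ("O", "Q", 6): 0.90, ("O", "Q", 18): 1.60,
    ("Q", "O", 6): 0.60, ("Q", "O", 18): 0.70,
    ("O", "C", 6): 1.00, ("O", "C", 18): 1.80,
    ("C", "O", 6): 0.65, ("C", "O", 18): 0.80,
}
estimates = {
    ("O", "Q", 6): 1.20, ("O", "Q", 18): 2.00,
    ("Q", "O", 6): 0.70, ("Q", "O", 18): 0.90,
    ("O", "C", 6): 1.30, ("O", "C", 18): 2.40,
    ("C", "O", 6): 0.80, ("C", "O", 18): 1.00,
}

keys = sorted(search_times)
times = [search_times[k] for k in keys]
corresponding = spearman(times, [estimates[k] for k in keys])
# Reciprocal search: same stimuli and set size, target and distractor swapped.
reciprocal = spearman(times, [estimates[(d, t, n)] for (t, d, n) in keys])
print(round(corresponding, 2), round(reciprocal, 2))
```

A participant whose estimates are sensitive to role assignment, as in this made-up example, shows a higher correlation for corresponding than for reciprocal estimates; in the experiment this difference was then tested across participants.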

However, when examining the effect on search slope within specific stimulus pairs, we found little to no support for asymmetries in prospective search time estimates. Estimation slopes were not sensitive […] p = .07; open edges: mean difference of 21 ms, t[198] = 1.74, p = .084; line addition: mean difference of 40 ms, t[199] = 3.66, p < .001). However, here too, these effects are much smaller than the true effects in actual search behavior (mean differences of 250, 387, and 556 ms for the three stimulus pairs, respectively).


Experiment 6: Semantic Versus Visual Similarity

Search difficulty is a function, among other things, of the similarity between the target and the distractors. If this fact is represented in internal models of visual search, the question remains what kinds of similarity affect people's intuitions about search difficulty, and whether they are the same ones that affect visual search in practice. To test this, in Experiment 6 we manipulated semantic and visual similarity between targets and distractors, and measured their independent effects on search times and search time estimates. Experiment 6 was pre-registered (pre-registration document: https://doi.org/10.17605/OSF.IO/AH9NR). Raw data, experiment demos, and full analysis scripts are available at https://github.com/matanmazor/metaVisualSearch.


Participants

For Experiment 6, 150 participants were recruited from the Prolific crowdsourcing web service. The experiment took about 10 min to complete. Participants were paid $2. The highest performing 30% of participants received an additional bonus of $2.


Procedure

The procedure for Experiment 6 was similar to that of Experiment 5. Participants estimated their prospective search times for six different searches: three searches with the letter E and three with the number 3 serving as targets. Each target was used in three conditions, involving distractors that were semantically and visually similar to the target (baseline condition: the letters H and A for the target E or the numbers 8 and 2 for the target 3), semantically dissimilar but visually similar (the numbers 8 and 2 for the letter E and the letters A and H for the number 3), or semantically similar but visually dissimilar (same as the baseline condition, but appearing in a different font, italic vs. non-italic, relative to the target letter; see Figure 8, left panel). For each target and condition, subjects provided estimates for set sizes of 6 and 18.


Results

Accuracy in the visual search task was high (M = 0.95, 95% CI [0.94, 0.97]). Error trials and visual search trials that took shorter than 200 ms or longer than 5 s were excluded from all further analyses. Participants were excluded if more than 30% of their trials were excluded based on the aforementioned criteria, leaving 146 participants for the main analysis of Experiment 6.

Search slopes were significantly different for the six searches (F[3.84, 544.92] = 63.72, MSE = 0.00, p < .001, $\hat{\eta}^2_G$ = .239; see […] [4.11, 583.14] = 3.92, MSE = 0.00, p = .003, $\hat{\eta}^2_G$ = .012). However, unlike true search times, here we find no significant differences as a function of semantic dissimilarity (25 ms/item for both the semantic dissimilarity and baseline conditions; t[142] = 0.42, p = .673) or visual similarity (24 ms/item and 25 ms/item in the visual dissimilarity and baseline conditions; t[143] = −1.11, p = .269). Subjects were, however, sensitive to the main effect of distractor type on search time, producing steeper estimation slopes for numbers (26 ms/item) than for letters (22 ms/item; t[142] = −2.96, p = .004; see Figure A1). Together, target-distractor similarity along the manipulated dimensions had significant effects on search difficulty, but we found no trace of these effects in search time estimates.


Discussion

Over more than four decades of research on spatial attention, experiments where participants report the presence or absence of a target in a display revealed basic principles such as the set-size effect (Treisman, 1986; Treisman & Sato, 1990; Wolfe, 1998), the advantage for feature search over more complicated conjunction and spatial configuration searches (Treisman, 1986; Treisman & Sato, 1990), and asymmetries in the representations of visual features (Malinowski & Hübner, 2001; Shen & Reingold, 2001; Treisman & Gormican, 1988; Treisman & Souther, 1985). It is revealing that we find some of these findings intuitive and others more surprising: even without training in psychology, people have a set of expectations and beliefs about their own perception and attention, and about visual search more specifically.

Here we measured these expectations and their alignment with actual visual search behavior. In six experiments, we show that naïve participants provide search time estimates that are consistent with partial metacognitive knowledge of their own attention. In line with previous reports, prospective search time estimates reflected accurate knowledge of the set-size effect and differences in efficiency between feature and conjunction searches (Levin & Angelone, 2008; Miller & Bigi, 1977). Participants represented search efficiency along a continuum, were able to provide reasonably accurate search time estimates for complex stimuli and displays with which they had no prior experience, and were sensitive to the assignment of stimuli to target and distractor roles. At the same time, they showed little to no insight into visual search asymmetries and the effects of semantic target-distractor similarity on search efficiency, and as a group, their estimates revealed no awareness of the pop-out effect for color search. In the following paragraphs, we unpack our central findings in more detail.


Do Subjects Know That Color Pops Out?

Searching for a deviant color is relatively easy, and people know that. Psychology students correctly estimated that searching for a green vertical line is harder when some distractors are green compared to when all are red, and that increasing the number of distractors would make the search harder in the former, but not in the latter all-red case (Levin & Angelone, 2008). The understanding that adding more distractors does not affect search time in color search reflects metacognitive insight into the parallel nature of color search. Similarly, when asked to order visual search displays according to difficulty, 81% of third graders used color, but only 48% used shape, to inform their orderings (Miller & Bigi, 1977). Knowledge about the special status of color is also evident in the way we communicate with others about what we see. People consistently prefer object descriptions that include information about color, even when color information is fully redundant (Jara-Ettinger & Rubio-Fernandez, 2022). To account for this fact, Jara-Ettinger and Rubio-Fernandez (2022) suggested that speakers hold mental models of the visual search behavior of their listeners, and choose their words to maximize search efficiency according to these models. In their hypothetical implementation of this model, color information allows listeners to restrict their search to objects of the target's color, making the search highly efficient. Thus, knowing that listeners can easily orient their attention by color, speakers prefer longer descriptions that reduce the effective set size and thereby improve search efficiency.

In Experiments 1 and 2, we similarly found evidence for metacognitive knowledge that color search is easy. Estimation slopes for color search were consistently shallower than for orientation-color and shape-color conjunction searches (for comparison, estimation slopes for shape and orientation searches showed no such difference). Still, although shallower than conjunction estimation slopes, color estimation slopes were significantly positive at the group level, reflecting a belief that color search is serial in nature. This seems to be in conflict with the results of Levin and Angelone (2008), where only 32.5% of subjects thought that adding more distractors to a color search would make it slower (compared to 87.9% for color-orientation conjunction search). However, two differences between our studies are worth pointing out. First, Levin and Angelone's sample consisted of students, who may have learned or heard about visual search, and updated their internal models accordingly. Second, the fact that the mean estimation slopes in our experiments were overall positive does not preclude the possibility that for a subset of participants, it was in fact zero. Using the proportion of positive estimation slopes for color search in Experiments 1 and 2 (0.71), and the fact that this proportion should equal 0.5 among subjects who believe that set size has no effect on color search but whose estimates are affected by random noise, we can extract a lower bound for the proportion p of subjects who believed color search had a positive slope by solving the equation p + 0.5(1 − p) = 0.71, resulting in an estimate of p = 0.42. Note that this analysis conservatively assumes that estimation slopes should be positive for all subjects who believed color search was serial. In other words, among our random sample of online participants, more than 40% of subjects provided estimates that are consistent with color search being serial.
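
The lower-bound calculation above amounts to solving a one-line mixture equation. The following minimal sketch makes the arithmetic explicit; the function name is hypothetical.

```python
def lower_bound_serial_believers(prop_positive, chance=0.5):
    """Solve p + chance * (1 - p) = prop_positive for p.

    p is the proportion of subjects who believe the slope is positive (all of
    whom are assumed to give positive slopes); the remaining subjects give
    positive slopes at the chance rate because of random noise.
    """
    return (prop_positive - chance) / (1 - chance)

p = lower_bound_serial_believers(0.71)
print(round(p, 2))  # 0.42
```

Rearranging p + 0.5(1 − p) = 0.71 gives 0.5p = 0.21, hence p = 0.42, in agreement with the value reported in the text.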

This blindness to pop-out effects indicates a missing component in internal models of visual search (or at least, in the models of some participants): unlike Feature Integration Theory and Guided Search models, they have no pre-attentive components. This means that items are randomly selected in no particular order, and the only thing that changes between easier and harder searches is the speed with which this serial process can take place. Without a bottom-up activation of "feature maps," or effortless processing of guiding signals, this model echoes early theories of vision as a sense that operates more like touch than like hearing, by sending out sensors to explore the environment (for a review, see Dedes, 2005). The immediate pop-out of color cannot take place in a model that requires subjects to voluntarily attend to individual items in order to perceive them.

Beliefs about the relative efficiency of different search tasks can also be probed by measuring the time participants take to conclude that a target is absent from a display. Unlike target-present trials that are terminated upon detecting the target, in target-absent trials decisions are made based on the belief that a hypothetical target would have been found. For example, if subjects know that color search is parallel, they may immediately conclude that a target is absent from an array if the target color does not immediately pop out to their attention. In contrast, subjects who hold the erroneous belief that finding a color requires a serial search will take longer to conclude that a target is missing. Using this indirect approach, and focusing on the first trials of the experiment, before subjects have the opportunity to adapt their search termination strategies, Mazor and Fleming (2022) found that subjects immediately terminate a search when the target color is absent from the search array. This provides indirect evidence that the implicit metacognitive knowledge that is involved in guiding search termination is dissociable from the kind of explicit metacognitive knowledge that we measure here. In support of this dissociation between search termination behavior and explicit metacognitive ratings of search difficulty, Mazor and Fleming found that search termination slopes were shallower for feature searches than for conjunction searches even among participants whose explicit metacognitive ratings reflected a belief that feature searches are harder.


What Is Person-Specific About Internal Models of Visual Search?

In Experiments 3 and 4, we show that internal models of visual search are at least partly person-specific: participants' predictions better fitted their own search times compared to the search times of other participants. Still, in both experiments, the correlation between participants' estimates and the search times of other participants was considerably above zero (see Figure 6). We note that above-zero self-other correlations are expected even if internal models of visual search are fully person-specific, as long as search behavior is relatively conserved across different individuals. In contrast, a significant difference between self-self and self-other correlations is expected only if some of the knowledge that is expressed in search time estimates relies on idiosyncratic knowledge. We consider two possible sources of inter-subject variation that may contribute to idiosyncratic beliefs about visual search: judgments about similarity or complexity of visual objects, and person-specific knowledge about attention.
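
The logic of comparing self-self with self-other alignment can be sketched with a surrogate-participant procedure. This is a minimal illustration under stated assumptions, not the authors' analysis script: the data are simulated, the number of participants, searches, and surrogate draws are arbitrary, and the helper names are hypothetical.

```python
import random

def ranks(values):
    """Ranks starting at 1 (assumes no tied values, as in the data below)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

def mean_alignment(estimates, search_times, partner):
    """Mean correlation between each participant's estimates and the search
    times of the participant indexed by partner[i]."""
    n = len(estimates)
    return sum(spearman(estimates[i], search_times[partner[i]]) for i in range(n)) / n

def random_surrogates(n, rng):
    """For each participant, draw a random other participant."""
    return [rng.choice([j for j in range(n) if j != i]) for i in range(n)]

rng = random.Random(1)
n_participants, n_searches = 20, 8
# Simulated data: search times combine a difficulty component shared by all
# participants with a participant-specific one; estimates track own times.
shared = [rng.gauss(1.5, 0.3) for _ in range(n_searches)]
search_times = [[s + rng.gauss(0, 0.5) for s in shared] for _ in range(n_participants)]
estimates = [[t + rng.gauss(0, 0.05) for t in times] for times in search_times]

self_self = mean_alignment(estimates, search_times, list(range(n_participants)))
null = [mean_alignment(estimates, search_times, random_surrogates(n_participants, rng))
        for _ in range(1000)]
p_value = sum(value >= self_self for value in null) / len(null)
print(round(self_self, 2), round(sum(null) / len(null), 2), p_value)
```

Because the simulated search times share a common difficulty component, the self-other correlations in the null distribution are themselves above zero, while the participant-specific component is what separates self-self alignment from that distribution.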

First, subjects may vary in how they perceive different visual objects to be simple or complex, similar or different. If perceptions of complexity and similarity contribute to search behavior, and if subjects' internal models correctly specify these effects of complexity and similarity on search behavior, generic internal models of visual search may produce person-specific search time estimates. Indeed, we found an advantage for self-self correlations only in Experiments 3 and 4, where stimuli were complex enough to produce meaningful variability in how they are perceived by different subjects. However, as we show in Experiments 4 and 5, any person-invariant specification of how similarity contributes to search time would need to be sensitive to asymmetries in the perception of similarity (Tversky, 1977) in order to fully account for our findings of a drop in the correlation between estimated and true search times when swapping the target and distractor roles. For example, internal models may specify that what matters most to search time is whether the target is similar to the distractors, but not so much whether the distractors are similar to the target.

Second, beliefs about attention itself may be learned or calibrated based on first-person experience. Humans accumulate observations not only of external events and objects, but also of their own cognitive and perceptual states. Specifically, subjects have been shown to notice when their attention is captured by a distractor (Adams & Gaspelin, 2021) even in the absence of an overt eye movement (Adams & Gaspelin, 2020). These observations can then be integrated into an internal model or an intuitive theory: which items are more or less likely to capture attention, under what circumstances, etc. Future research into the development of this simplified model and its expansion based on new evidence (e.g., by measuring intuitions before and after exposure to some evidence, Bonawitz et al., 2019) is needed to understand the relationship between metacognitive monitoring of attention and metacognitive knowledge of attentional processes.

This relates to recent theoretical and empirical advances underscoring the utility of keeping a mental self-model, or a self-schema for attention control (Wilterson et al., 2020), social cognition (Graziano, 2013), phenomenal experience (Metzinger, 2003), and inference about absence (Mazor, 2021; Mazor & Fleming, 2022). For example, knowing that a red berry would be easy to find among green leaves, a forager can quickly decide that a certain bush bears no ripe fruit. Alternatively, knowing that a snake would be difficult to spot in the sand, they might allocate more attentional resources to scanning the ground. Reasonably accurate search time estimates in Experiments 3 and 4 suggest that internal models of spatial attention can be applied to unseen stimuli in novel displays, and are at least partly tailored to one's own perceptual and cognitive machinery.

What Is the Role of Target-Distractor Similarity?

Visual search is harder when distractors are similar to the target. Having insight into this simple fact, and the fact that searches become harder with the addition of more distractors, should be sufficient to produce search time estimates that are aligned with actual search times. This way, subjects can rely on a rough overlap between items that are similar to the target and ones that have the potential to be distracting, and produce relatively accurate search time estimates based on their similarity judgments alone. Alternatively, as described above, subjects may be using an approximate probabilistic model of their own attention, producing search time estimates that correlate with, but are not causally dependent on, perceptions of similarity.

We set up Experiment 6 to directly test the effect of target-distractor similarity on perceived search difficulty. To our surprise, search time estimates were not at all sensitive to sizeable effects of semantic and visual target-distractor similarity on search time. This metacognitive blindness was specific to the interaction between target and distractor identities: subjects were sensitive to the fact that number distractors are more distracting overall, an effect that we, based on our knowledge of scientific models of visual search, could not predict in advance.

Importantly, this finding is consistent with search time estimates being causally dependent on perceptions of similarity, just not the kind in which a number is similar to a number more than to a letter, or a tilted character more similar to tilted than to upright characters. Instead, in this hypothetical similarity space, number distractors in Experiment 6 were more similar to both 3s and Es than were letter distractors. More broadly, if search ti