Peripheral vision is mainly for looking rather than seeing

Vision includes looking and seeing. Looking, mainly via gaze shifts, selects a fraction of visual input information for passage through the brain's information bottleneck. The selected input is placed within the attentional spotlight, typically in the central visual field. Seeing decodes, i.e., recognizes and discriminates, the selected inputs. Hence, peripheral vision should be mainly devoted to looking, in particular, deciding where to shift the gaze. Looking is often guided exogenously by a saliency map created by the primary visual cortex (V1), and can be effective with no seeing and limited awareness. In seeing, peripheral vision not only suffers from poor spatial resolution, but is also subject to crowding and is more vulnerable to illusions from misleading, ambiguous, and impoverished visual inputs. Central vision, mainly for seeing, enjoys the top-down feedback that aids seeing in light of the bottleneck, which is hypothesized to start from V1 towards higher areas. This feedback queries lower visual cortical areas such as V1 for additional information for ongoing recognition. According to the Central-peripheral Dichotomy (CPD) theory, peripheral vision is deficient in this feedback. The saccades engendered by peripheral vision allow looking to combine with seeing, giving human observers the impression of seeing the whole scene clearly despite inattentional blindness.


Introduction: vision in light of an attentional bottleneck
This paper gives a perspective on peripheral vision's role in vision and awareness. As an important background, the brain cannot process all the sensory information in the environment. An information bottleneck, also called the attentional bottleneck, admits only a tiny fraction of sensory information to deeper processing. This is manifest in inattentional blindness, which makes us blind to visual input outside our attentional spotlight (Simons and Chabris, 1999). To process the most important information for survival, primate vision has two main processing stages: selection and decoding (see Glossary) (Zhaoping, 2014, 2023b); see Fig. 1C. Selection chooses the fraction of visual inputs that passes the processing bottleneck. Decoding recognizes, i.e., infers or discriminates, objects from the selected information. In natural behavior, selection is typically overt, by shifting gaze (looking) to place the selected input inside the attentional spotlight, typically centered at the fovea, for decoding (seeing). Hence, to meet the bottleneck challenge, peripheral vision works in concert with central vision by specializing in looking rather than seeing, while central vision specializes in seeing rather than looking.
Accordingly, the rest of the paper argues, using the experimental and computational literature, that peripheral vision is suitably powerful for looking and naturally limited for seeing. First, we illustrate that visual information loss along the visual pathway occurs not only in the retina during visual sampling, but also in the central brain. Then, we show examples in which looking by peripheral vision can occur before seeing by central or peripheral vision (Zhaoping and Guyader, 2007), and can be guided by visual signals that even central vision cannot see (Zhaoping, 2008b, 2012). This is because looking is often guided by a saliency map created from external visual inputs by the primary visual cortex (V1) (Li, 2002), whose visual responses often arise before perception or without perceptual outcomes. In contrast, peripheral seeing is vulnerable to visual crowding and misleading visual inputs. This is largely due to its deficiency in a brain resource devoted to central vision: the top-down feedback that aids seeing in light of the bottleneck (Zhaoping, 2017, 2019).

Information loss within and beyond the retina
Peripheral vision is known for its lower visual acuity, demonstrated by Fig. 1A. At larger eccentricities from the fovea, images of objects need to be enlarged in order to be recognized. This poorer acuity is caused not only by sparser photoreceptors in the retina (causing information loss during input sampling) (Zhaoping, 2014), but also by a loss of information beyond the retina via an information bottleneck, often referred to as the attentional bottleneck. This is manifest in visual crowding (Whitney and Levi, 2011), as in Fig. 1B. When the two 'T's are equally distant from the fovea, they enjoy equal retinal sampling resolution by cone receptors. This resolution is apparently sufficient to recognize the left T but not the right T. The latter is crowded by contextual letters that are spatially non-overlapping with this T.
However, one might wonder whether neural processing in the retina downstream from the cones causes the contextual letters to degrade or suppress information about the T. The demonstration in the bottom panel of Fig. 1B excludes this possibility. The two 3 × 3 arrays are left-right mirror-symmetric relative to the left cross, except for their identical central bars. The central bar is much more legible in the right array. The two arrays differ critically in the orientation contrast between the central and surrounding bars. However, human retinal neurons have little sensitivity to orientation. Further, adding more contextual bars, as in the far-right array, reduces rather than increases crowding (Manassi et al., 2012).
Apparently, the information bottleneck restricting the amount of information feeding forward lies beyond the retina. Imagine that the small amount of information admitted by the bottleneck only suffices to convey two sufficiently different orientation values about the visual inputs surrounding and including the central item (for simplicity, ignore other feature dimensions such as color and spatial location). This is only sufficient to recognize the T when the surrounding letters are absent.
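This thought experiment can be sketched in a few lines of code. The pass/fail rule and the capacity of two distinct orientation values are illustrative assumptions taken from the example above, not a model proposed in the paper:

```python
def passes_bottleneck(region_orientations, capacity=2):
    """Toy bottleneck: the distinct orientation values in an attended
    region fit through only if there are at most `capacity` of them.
    The capacity of 2 follows the thought experiment in the text."""
    return len(set(region_orientations)) <= capacity

# An isolated T (one horizontal and one vertical stroke) fits the budget:
isolated_t = [0, 90]
# The same T crowded by letters contributing extra stroke orientations:
crowded_t = [0, 90, 0, 90, 45, 135]
print(passes_bottleneck(isolated_t))  # True: the T is recognizable
print(passes_bottleneck(crowded_t))   # False: the T's identity is lost
```

The same budget that suffices for the isolated T must now cover the whole crowded region, so the T's identity no longer gets through.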
By directly fixating on the crowded T, central vision can recognize it. This is not just because central vision has a higher retinal sampling resolution. It is also not because the bottleneck does not apply to central vision. As will be explained later, central vision has feedback mechanisms to retrieve additional information through the bottleneck.

Looking by peripheral vision versus seeing by central vision
In natural behavior, it is not typical to refrain from looking towards an object, e.g., the T in Fig. 1B, when one tries to see and scrutinize it.

Looking by peripheral vision before seeing
Logically, looking, especially the first look, must often occur before seeing, since the processing bottleneck makes it impossible to see all sensory inputs in real time before selection. This is evident in the visual search in Fig. 2A (Zhaoping and Guyader, 2007). Here, the observer's task was to search for a uniquely oriented bar, here tilted counterclockwise from vertical. When the search started, the gaze was at the image center, and so the target was illegible due to crowding. Nevertheless, the first saccade was correctly aimed at the target, demonstrating that peripheral vision is good at looking, i.e., determining an appropriate destination for the gaze shift. The target bar is very salient because its orientation is unique; this saliency, rather than a recognition of the target, guided looking. For untrained observers, with targets at 15° from the center of such a busy display, gaze arrived at the search target within one second in about half of the trials (Zhaoping and Guyader, 2007).
There are further reasons to think that looking occurred before seeing or recognizing the target in the case of Fig. 2A. For instance, the visual input was designed such that the uniquely tilted target bar intersected a vertical bar to make an 'X' shape. This X is identical to all the other X's in the image, except that it is rotated or reflected. Human object recognition is largely invariant to rotation and reflection. Thus, even though the observer's task was to find the uniquely tilted bar rather than to recognize any X shape, we can expect that if the X shape containing the target bar had been recognized before the saccade, the object-shape equivalence (between the X containing the target bar and the other X's) would have led to confusion. In this example, the confusion did happen, but only after the gaze had reached the target. This made the observer abandon the target to search elsewhere. If the X had been recognized before the saccade to the target, this saccade might have been abandoned before it started.
Further evidence that looking can occur before seeing in the search in Fig. 2A comes from masking the visual inputs. In some trials, all search items were masked (by replacing each X by a star shape) at the moment when the gaze reached the target (Zhaoping and Guyader, 2007). Apparently, this masking prevented or interrupted seeing the target by central vision, and thus prevented the confusion. After the mask onset, observers could make rather accurate forced choices to report whether the target was in the left or right half of the image, even though they were unaware that the mask onset coincided with the gaze arrival. However, if the mask onset was postponed by several hundred milliseconds, performance on these reports decreased (Zhaoping and Guyader, 2007), suggesting that the target X shape was recognized by central vision after the gaze arrival and before the masking.

Fig. 1. Peripheral vision, visual crowding, and the attentional bottleneck. A: each letter is equally legible when one fixates (i.e., directs gaze at) the central dot. Visual acuity decreases with visual eccentricity from the fovea (Anstis, 1974). B: in each frame, the two T's (top) or the three central bars (bottom) are equally distant from the nearest '+'. The left, but not the right, T is legible when one fixates on the '+'. This suggests that the image sampling quality of the retina is sufficient at this distance from the fovea, and that information is lost beyond the retina. The bottom frame suggests that this loss is not due to neural processing in the retina. The two 3 × 3 arrays of bars are left-right mirror images of each other except for their identical central bars. The distance between the T and its surrounding letters is greater than that between the neighboring bars in each array. Note that the human retina has no orientation-tuned neurons. C: vision contains encoding, selection, and decoding stages (Zhaoping, 2014). In B, a gaze shift from the central '+' to the crowded T is selection; recognizing this T is decoding. Peripheral, but not central, vision is deficient in the top-down feedback that queries for more information to aid seeing or decoding.
In some trials, the inputs were masked before the gaze arrival. In these trials, observers typically continued making a few saccades on the display before their forced-choice reports. We term such behavior after-search (Zhaoping, 2008a). In about 13% of the after-search trials, the gaze reached the location of the extinguished target after one or more after-search saccades. These after-search gaze arrivals were not accidental: among the after-search trials, the target location was reported correctly in about 84% of the trials with gaze arrival, but was reported randomly in trials without gaze arrival. This was true regardless of whether one or more after-search saccades were needed for the arrival (Zhaoping, 2008a). In about half of these after-search gaze-arrival trials, gaze reached the extinguished target after at least two after-search saccades, almost certainly guided by some memory of the saliency, rather than a recognition, of the target.
This separation between looking and seeing, together with viewpoint invariance in shape recognition, also explains the following visual search asymmetry. It is more difficult to find and report a letter 'N' among many of its mirror images than to find the mirror image among many N's (Frith, 1974). To find the target, finding a uniquely oriented (oblique) bar is sufficient, and recognizing the N shape is unnecessary. However, once the target's shape is recognized or seen, viewpoint invariance in shape recognition can trigger confusion, just as in Fig. 2A. When the search image is crowded by many search items, the target is largely illegible unless the gaze is nearby. In this case, there is almost no asymmetry in the time it takes the gaze to reach the target in the first place. In other words, the asymmetry is not in looking but in seeing, i.e., in the latency between the gaze's first arrival at the target and the observer's report of the target's location (Zhaoping and Frith, 2011). The N in its more familiar form is recognized more quickly once the gaze arrives, prompting a faster onset and stronger object confusion, and thus a longer latency for the report.
Human peripheral vision is mainly engaged in looking in real-world scenes, too. For example, when observers searched for an apple in an image of a living room, the time needed for looking, i.e., for their gaze to reach the target, was almost unaffected when their central visual field was artificially blinded by blanking the image pixels there (Nuthmann, 2014).
When a scene is less cluttered, such as that of an empty football field, visual crowding is reduced. Consequently, seeing by peripheral vision, albeit less well, can substantially impact looking, i.e., deciding on the destination of the next saccade. Indeed, in the search for an N among its mirror images or the reverse, when the search display was sparser, the search asymmetry appeared also in the reaction time for the gaze to reach the target (Zhaoping and Frith, 2011). This is because the target is recognized by peripheral vision before the gaze reaches it, so that the confusion (between the target and non-targets) caused by seeing the target can cancel a saccadic plan towards the target.
Seeing becomes more difficult farther into the peripheral field, such that human recognition of letters is impossible beyond 40° eccentricity (Strasburger and Rentschler, 1996). Meanwhile, visual inputs well beyond 40° eccentricity can evoke gaze shifts (Goldring et al., 1996). Primary visual cortex (V1) covers the whole visual field (Gattass et al., 1987), extending beyond 90° in eccentricity (Rönne, 1915). However, higher cortical areas are increasingly focused on just the central visual field. V3 and V4 cover up to 40° in eccentricity (Gattass et al., 1988); the receptive fields of most IT neurons include the fovea (Gross et al., 1969; Kay et al., 2015); and few neurons in the frontal eye fields or intraparietal cortex have receptive fields mapped beyond 35° (Mohler et al., 1973; Mayo et al., 2015) or 60° (Blatt et al., 1990; Ben Hamed et al., 2001) from the fovea. When pre-existing knowledge is insufficient to guide gaze, looking towards the far periphery needs to happen before the saccadic target can be seen properly or at all. This looking is guided by early visual processes such as the saliency map created in V1 (Li, 2002). Extending this guidance by saliency from the far to the nearer peripheral fields, looking before seeing in Fig. 2A is not extraordinary (Zhaoping and Guyader, 2007).

Sound from in or beyond the visual periphery can also attract gaze. Auditory cortical signals project to the V1 regions for the peripheral, but not the central, visual field (Falchier et al., 2002), likely influencing the saliency map in V1. Hence, at least in primates, peripheral vision can be considered one component of the multisensory processing in the periphery specialized for orienting the central sense: typically the gaze (Zhaoping, 2023b).

Fig. 2. A: after the first saccade in this example, central vision sees the X shape made of the target bar and an intersecting vertical bar. This X is a rotated version of all the other X's in the image. Rotational invariance in object recognition confused the observer, leading to the decision to veto the target before returning to it (the returning gaze trajectory is not shown for clarity). Data from (Zhaoping and Guyader, 2007), adapted from Figure 1.4 of (Zhaoping, 2014); a drift correction to the measured gaze positions is added here, using the initial gaze positions at the image onset. B: looking can occur without seeing, more so in more peripheral vision. Observers search as quickly as possible for a uniquely oriented bar (which differs from the uniformly oriented non-target bars by 50° in orientation). All bars except one are shown to the left eye only; the remaining non-target bar, the ocular singleton, is shown to the right eye. The eye-of-origin of visual inputs is task-irrelevant and perceptually invisible, but it is visible to V1, the only visual cortical area with many monocular neurons. The orientation target and the ocular singleton have the same eccentricity from the center of the perceived image. When this eccentricity was 12° or 7.3°, respectively, 75% or 50% of the first saccades during the search were directed to the ocular singleton (typically within 300 milliseconds of the visual input appearing) (Zhaoping, 2008b, 2012).

Looking can be guided by highly salient visual signals that are invisible to seeing or awareness
Fig. 2B illustrates that a saccadic destination can contain a visual signal that attracts gaze yet is invisible to seeing even in central vision. In a search (Zhaoping, 2008b, 2012) for a very salient bar tilted 50° from uniformly oriented non-target bars, the first saccade during search was distracted by a particular non-target bar in three out of four trials (as in Fig. 2B). This distracting bar is not at all perceptually distinctive. Instead, it is unique in the eye in which it is shown (the right eye in this example). We therefore call it an ocular singleton. Normal human observers cannot tell whether an input comes from the left or the right eye (once artifacts are removed) (Ono and Barbeito, 1985), because most of the neurons in visual cortical areas beyond V1 are insensitive to the eye-of-origin of an input (Zhaoping, 2014). Nevertheless, that this ocular singleton can strongly attract the gaze means, by definition, that its location is very salient. The search target and the ocular singleton distractor were designed to be on opposite sides of, but equally distant from, the starting gaze position at the center of the search display. When the target and ocular singleton were closer to the center, the distraction was weaker (see the caption of Fig. 2).
In Fig. 2B, could the observers be somewhat aware of this ocular singleton by means of other cues, even when they could not tell its eye-of-origin? For example, could they perceive a relative motion (due to vergence movements) or a relative luminance difference or luster (Malkoc and Kingdom, 2012; Georgeson et al., 2016; Wendt and Faul, 2022) between the ocular singleton bar and the other bars? To rule these cues out, we assigned a random luminance to each bar and made the dichoptic input very brief (200 milliseconds), so that no saccades could be evoked before the dichoptic input was masked binocularly. Indeed, observers were completely at chance in guessing whether such a display contained an ocular singleton (against a condition in which all bars had the same eye-of-origin). This was when all the bars were uniformly oriented and there was no need to find an orientation singleton. However, when one of the bars in this brief display was uniquely tilted, and the observers were instead asked to find this orientation singleton as their target, they did better when the ocular singleton was on the target, i.e., when the target was unique not only in orientation but also in eye-of-origin (Zhaoping, 2008b). Specifically, we compared three situations: the ocular singleton was on the target, was laterally opposite to the target (as in Fig. 2B), or was absent. Observers could more accurately report whether the target was tilted clockwise or counterclockwise from the other bars in the first situation than in the other two. Hence, even when the ocular singleton evoked no awareness, it could attract attention covertly to cue either for or against another task.
In another study on a display like Fig. 2B, observers first needed to find the orientation singleton. Once their gaze had reached this orientation target, the dichoptic display was masked, and they had to report whether an ocular singleton had been present (Zhaoping, 2012). The ocular singleton, when present, was always laterally opposite to the orientation target and possibly distracting, as in Fig. 2B. When the bars had highly heterogeneous luminance values, observers' reports on the ocular singleton were pure guesses when this singleton was actually present, although they were more certain of its absence when it was indeed absent (see Fig. 7E of Zhaoping, 2012). When the bars were less heterogeneous in luminance, the ocular singleton evoked more awareness and was more likely to distract gaze; however, its evoked awareness was not significantly correlated with whether the gaze was distracted (Fig. 8AB of Zhaoping, 2012). This latter finding was unsurprising; observers are often unaware of gaze distractions even by a highly distinctive distractor (e.g., a color singleton) (Belopolsky et al., 2008; Adams and Gaspelin, 2021). Furthermore, observers typically have the impression that they saccade much less frequently than the normal rate of three saccades each second.

The primary visual cortex guides looking without seeing via superior colliculus
V1 has monocular neurons that distinguish the eye-of-origin. Thus, the capture of gaze and attention by an ocular singleton supports the V1 Saliency Hypothesis (V1SH), which states that V1 creates a saliency map to guide looking exogenously (Li, 2002). The main mechanism in V1 that underpins V1SH is iso-feature suppression (Zhaoping, 2014): nearby neurons tuned to the same feature (orientation, color, motion direction, eye-of-origin, or other V1 features) suppress each other's responses (Allman et al., 1985; Knierim and Van Essen, 1992; Li and Li, 1994; DeAngelis et al., 1994). For example, in Fig. 2A, many bars are horizontal and so activate V1 neurons preferring this orientation. These neurons suppress each other via iso-orientation suppression. Similarly, the neurons most strongly activated by the other non-target bars are also under iso-orientation suppression. The target bar is uniquely tilted counterclockwise from vertical. It activates a V1 neuron tuned to its orientation. This neuron's response is higher than the other responses because it escapes iso-orientation suppression, making the target's location salient. Similarly, in Fig. 2B, the target bar is salient by escaping iso-orientation suppression, and the distracting ocular singleton is salient by escaping iso-eye-of-origin suppression. Indeed, this ocular singleton is apparently even more salient than the target bar in Fig. 2B.
The saliency map is the retinotopic map of the highest V1 responses, one for each visual location. Accordingly, assuming that seeing requires visual cortical areas beyond V1, the saliency reported by V1 neurons guides overt or covert looking, or attentional selection, without, or before, seeing. For instance, V1 does not recognize the shape X or the letter N. Hence, in Fig. 2A, V1's guidance of gaze is indeed looking without seeing.
V1 projects monosynaptically to the superior colliculus (SC), which reads out the saliency map (see an alternative view in White et al. (2017) and discussions in Liang et al. (2023)). When top-down gaze guidance via the SC is weak, the SC executes a gaze shift to the location of the receptive field of the most activated V1 neuron. Indeed, in monkeys, stronger V1 responses to an orientation-singleton target correlate with faster saccades to this singleton (Yan et al., 2018). In the after-search described above, the SC likely holds memories of salient locations (Sereno et al., 2022; Mays and Sparks, 1980) to guide saccades to the locations of masked visual inputs.
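As a minimal sketch (not a model from the cited studies), iso-feature suppression and a collicular-style readout of the saliency map can be illustrated in a few lines; the suppression rule, its strength, and the display values are all assumptions for illustration:

```python
def toy_v1_responses(features, suppression=0.8):
    """Toy iso-feature suppression (illustrative, not the paper's model):
    an item's V1 response is reduced according to the fraction of other
    items sharing its feature value."""
    n = len(features)
    responses = []
    for f in features:
        share = (features.count(f) - 1) / n       # iso-feature neighbors
        responses.append((1.0 - suppression) ** share)
    return responses

# Fig. 2A-like display: mostly one orientation, one unique tilt (degrees).
orientations = [0, 0, 0, 0, 50, 0, 0, 0]
r = toy_v1_responses(orientations)
saccade_target = r.index(max(r))  # SC-style readout: saccade to the peak
print(saccade_target)             # 4: the orientation singleton
```

The singleton escapes suppression (no neighbors share its orientation), so its response is the map's peak and the readout sends the gaze there; an ocular singleton would stand out the same way under iso-eye-of-origin suppression.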

Peripheral vision is deficient in the top-down feedback that aids seeing
The idea that peripheral and central vision are dedicated to, and specialized for, looking and seeing, respectively, is the essential part of the Central-peripheral Dichotomy (CPD) theory (Zhaoping, 2017, 2019, 2023b). Based on this, the CPD theory additionally hypothesizes that the top-down feedback (from higher to lower visual brain regions) that aids seeing should be weaker or absent in the peripheral visual field. The rationale is that this feedback serves seeing, and it would waste resources to devote it also to peripheral vision, which is mainly engaged in looking. The hypothesis is supported by emerging experimental evidence (Sims et al., 2021; Wang et al., 2022; Morales-Gregorio et al., 2023; Zhaoping, 2023a).
This feedback is necessary because of the bottleneck limiting the amount of sensory information available for seeing. Specifically, the feedback asks for additional information from upstream areas along the visual pathway, particularly from areas before the start of the bottleneck. The additional information helps disambiguate between multiple possible perceptual interpretations of the visual inputs, an operation that is particularly useful when inputs are noisy and ambiguous. Since peripheral vision comparatively lacks feedback of this sort, the CPD theory predicts that it is more vulnerable to visual illusions (Zhaoping, 2019). The scintillating grid illusion in Fig. 3A is just such an example.

Seeing with feedforward visual signals impoverished through the bottleneck predicts peripheral visual illusions
Two new peripheral illusions are predicted by the CPD theory and have been confirmed experimentally. In the flip tilt illusion (Fig. 3B), the orientation of a hetero-pair of dots, one black and one white, is seen as orthogonal to the orientation of the physical separation between the dots (Zhaoping, 2020). The reversed depth illusion (Fig. 3C) occurs in anticorrelated random-dot stereograms, in which a depth surface is depicted by hetero-pairs, i.e., a black dot in one eye corresponds to a white dot in the other eye (Zhaoping and Ackermann, 2018). In this illusion, the apparent depth order between this surface and a zero-disparity background surface is opposite to that defined by the binocular disparities. Both of these illusions are predicted to be visible only in peripheral vision, at least in typical cases, based on the assumption that the attentional bottleneck starts at the output from V1 to downstream visual areas (Zhaoping, 2019). In turn, this assumption is motivated by V1SH's suggestion that attentional selection already starts at V1. Indeed, at least the information about the eye-of-origin of visual inputs is lost by V2, whose neurons are binocular and thus insensitive to eye-of-origin (Zhaoping, 2014).
Fig. 4 explains where these predictions come from. A V1 neuron tuned to horizontal in Fig. 4A can be excited by a vertical hetero-pair of dots when the black and white dots fall in its off- and on-subfields, respectively (Fig. 4B). This neuron also responds to a horizontal homo-pair of white dots or black dots, or to a horizontal Gabor shape. But the vertical hetero-pair cannot activate a V1 neuron preferring the vertical orientation (Fig. 4C). Consider the case in which the retina is shown the vertical hetero-pair and the bottleneck is so extreme that only this particular V1 neuron (called 'A'), activated by this hetero-pair, is allowed to send its response forward to higher visual areas. The responses of other V1 neurons ('B', 'C', 'D', etc.) are not sent forward. Higher visual areas, with knowledge of neuron A's receptive field properties and its response r_A, would entertain multiple hypotheses about the possible retinal inputs, including H_1 (a vertical hetero-pair), H_2 (a white horizontal homo-pair), and H_3 (a black horizontal homo-pair). This implies that perception would be ambiguous regarding (at least) whether the retinal input is horizontal or vertical. The perceived orientation, by a majority vote from equally weighted feedforward hypotheses, is most likely horizontal, giving the flip tilt illusion.
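The majority-vote argument can be made concrete with a toy tally. The hypothesis labels follow the text; the equal-weight voting rule is the illustrative assumption stated above:

```python
from collections import Counter

# Hypotheses consistent with the lone feedforward response r_A, and the
# orientation each implies for the retinal input.
hypotheses = {
    "H1": "vertical",    # vertical hetero-pair (the true input)
    "H2": "horizontal",  # horizontal homo-pair of white dots
    "H3": "horizontal",  # horizontal homo-pair of black dots
}
weights = {h: 1.0 for h in hypotheses}   # no feedback query: equal weights

votes = Counter()
for h, orientation in hypotheses.items():
    votes[orientation] += weights[h]

percept = votes.most_common(1)[0][0]
print(percept)  # "horizontal": the flip tilt illusion
```

Two of the three equally plausible hypotheses imply a horizontal input, so the vote flips the perceived orientation away from the true vertical separation.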

Feedback queries for additional visual input information to disambiguate percepts
Without the bottleneck, other V1 neurons could also feed their responses forward, enabling higher visual areas to distinguish between hypotheses H_1 and H_2, etc. Consider the case that the responses r_B, r_C, and r_D from neurons B, C, and D rule out the hypotheses H_2, H_3, and H_4, whereas the response r_X from neuron X is consistent with all the hypotheses. Given the bottleneck, higher visual areas could query for the additional information, for instance r_B, r_C, and r_D, that is most relevant for disambiguating between the multiple hypotheses. The equal weights of the hypotheses can then be revised to favor hypothesis H_1, leading to a veridical percept.

Fig. 3. Example illusions in peripheral vision. A: flashing dark dots are seen only away from the center of gaze. B: each image contains homo- and hetero-pairs of dots (two dots of the same and opposite contrast from the background, respectively). Dot-pairs in the background are randomly oriented. In each image, a ring, consisting of homo-pairs oriented tangential to the ring, is centered on the cross, with a radius about 0.4 times the side length of each image. The two images differ only in that the hetero-pairs on the ring are tangential to the ring in the left image but perpendicular to the ring in the right image. When one fixates on the cross, the ring is more easily seen in the right than the left image, because peripheral vision suffers the illusion of seeing the orientation of the hetero-pairs as orthogonal to their actual orientation. Gazing at the ring elements directly, central vision identifies the smooth contour segments of the ring more easily in the left image. C: the top stereogram depicts a disk in front of a surrounding ring. The bottom stereogram depicts the same scene (using the same binocular disparity value for the disk relative to the zero-disparity ring) except that, for the disk, a black dot in one eye corresponds to a white dot in the other eye. Looking at each stereogram directly with binocular (free) fusion, the depth order can be easily seen in the top but not the bottom stereogram. If you direct and hold your gaze at the top stereogram with binocular (free) fusion, the bottom disk, now in the peripheral visual field, can be at least vaguely seen as being behind the ring. In other words, even though the bottom and top disks have the same binocular disparity, they have opposite perceived depth orders relative to the ring.
According to the CPD theory, the top-down feedback that aids seeing makes such a query. This query is specific to the ambiguity between the original hypotheses (e.g., H_1, H_2, H_3) suggested by the initial feedforward V1 signal (e.g., r_A). The higher areas have an internal model, or knowledge, of the visual world, e.g., in terms of how neurons B, C, D, etc. would respond under each of the alternative hypotheses. This knowledge could be about the likelihood p(r_B | H_i) of getting response r_B from neuron B given an input conforming to hypothesis H_i. The higher areas can feed these would-be V1 signals (e.g., the would-be r_B, r_C, r_D) back to V1 for comparison with the actual V1 signals (e.g., the actual r_B, r_C, r_D) to verify H_i. The queried information is the degree of match between the would-be and actual signals. The weight of each hypothesis can then be increased or decreased according to whether this match is good or poor. This perceptual process is termed Feedforward-Feedback-Verify-(re)Weight (FFVW) (Zhaoping, 2017), paraphrasing analysis-by-synthesis (MacKay, 1956; Li, 1990; Carpenter and Grossberg, 1987; Kawato et al., 1993; Dayan et al., 1995; Yuille and Kersten, 2006). Prefrontal cortical areas are likely sources of the top-down feedback (Kar and DiCarlo, 2021).
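A minimal sketch of the verify-and-(re)weight step, assuming a Gaussian match score and made-up response values (the paper does not commit to this particular form): higher areas synthesize would-be responses for neurons B, C, and D under each hypothesis, compare them with the actual responses, and reweight the hypotheses by match quality.

```python
import math

# Would-be V1 responses (r_B, r_C, r_D) predicted under each hypothesis,
# and the actual responses obtained when queried. All numbers are made up
# for illustration.
would_be = {
    "H1": [1.0, 0.1, 0.1],   # vertical hetero-pair (the true input)
    "H2": [0.1, 1.0, 0.1],   # white horizontal homo-pair
    "H3": [0.1, 0.1, 1.0],   # black horizontal homo-pair
}
actual = [0.9, 0.2, 0.1]     # the actual r_B, r_C, r_D: closest to H1

def match(pred, obs, sigma=0.3):
    """Gaussian measure of how well the would-be signals fit the actual ones."""
    sq_err = sum((p - o) ** 2 for p, o in zip(pred, obs))
    return math.exp(-sq_err / (2 * sigma ** 2))

weights = {h: 1.0 for h in would_be}             # equal feedforward weights
weights = {h: w * match(would_be[h], actual)     # verify and (re)weight
           for h, w in weights.items()}
percept = max(weights, key=weights.get)
print(percept)  # "H1": the veridical hypothesis wins
```

Without the reweighting step (i.e., with the equal weights peripheral vision is stuck with), the horizontal hypotheses would outvote H_1, as in the flip tilt illusion.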
The CPD theory suggests that peripheral vision is deficient in this feedback verification process. This makes it vulnerable to misleading feedforward V1 signals, such as those causing the flip tilt illusion. Rather analogous to the flip-tilt responses in Fig. 4A, V1 neurons flip the sign of their preferred binocular disparity in response to anticorrelated (or contrast-reversed) random-dot stereograms (Cumming and Parker, 1997). For the central disk in the bottom of Fig. 3C, which uses such a stereogram, the binocular spatial disparity between the corresponding dots comes from the 3D spatial relationship of a disk in front of the zero-disparity ring. However, the V1 neurons activated by this hetero-pair-dotted disk prefer a depth behind the ring. These reversed-depth V1 signals are fed forward to cause the depth illusion, but typically only in peripheral vision (Zhaoping and Ackermann, 2018). In the same spirit, when tilt in space is extended to tilt in space-time, the flip tilt illusion has an analogue: the illusion of reversed motion direction (Zhaoping, 2020), known as reversed phi motion (Anstis, 1970).

Visual information pooling, crowding, saliency, and gist
I propose that visual crowding arises from the deficiency of feedback verification in peripheral vision. The information in the feedforward V1 signals responding to crowding displays is too impoverished. Indeed, crowding is associated with suppressed fMRI signals as early as V1 (Millin et al., 2014). Crowding is reduced for the salient bar in Fig. 1B, whose V1 response escapes the iso-orientation suppression from the very differently oriented contextual bars. The resulting larger feedforward V1 response increases the feedforward weight for the perceptual hypothesis of its veridical orientation.
In general, the properties of iso-feature suppression in V1 imply that peripheral visual crowding is reduced when the crowded target enjoys a large feature contrast, e.g., in color, from the contextual inputs (Kooi et al., 1994; Kennedy and Whitaker, 2010), making it salient. A stronger feedforward weight for a salient item should make downstream neural selectivities for the corresponding input feature less corrupted by contextual inputs. This is indeed the case in V4 neurons (Kim and Pasupathy, 2023).
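How iso-orientation suppression can make a feature-contrast target salient can be illustrated with a toy sketch. The response rule, the Gaussian similarity function, and all parameters below are hypothetical simplifications, not a model of the actual V1 circuit: each bar's response is its feedforward drive minus suppression from neighbors with similar orientation, so the uniquely oriented bar escapes the suppression and yields the peak of the saliency map:

```python
import numpy as np

# Toy sketch of iso-orientation suppression (hypothetical parameters).
orientations = np.full(15, 0.0)     # a row of vertical bars (0 degrees)
orientations[7] = 90.0              # one bar with large orientation contrast

def v1_responses(oris, drive=1.0, w=0.06, tuning=20.0):
    """Response of each bar = feedforward drive minus suppression from
    every other bar, scaled by how similar their orientations are."""
    resp = np.full(len(oris), drive)
    for i in range(len(oris)):
        for j in range(len(oris)):
            if i != j:
                similarity = np.exp(-((oris[i] - oris[j]) / tuning) ** 2)
                resp[i] -= w * similarity   # iso-oriented neighbors suppress
    return np.clip(resp, 0.0, None)

resp = v1_responses(orientations)
salient_index = int(np.argmax(resp))   # the unique bar escapes suppression
```

In this toy, reducing the target's orientation contrast raises the similarity term and erodes its response advantage, in line with the feature-contrast dependence of crowding described above.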
One thus predicts that crowding can be reduced when the target exhibits an ocularity contrast from contextual items, e.g., by having a different eye-of-origin (or being biased toward a different eye-of-origin with binocularly unbalanced inputs). This was confirmed when the target was a 'C' or its 180° rotated version and the contextual inputs were 'O's; observers were better at reporting the orientation of the C given a suitable ocularity contrast (Zhang et al., 2012), even though ocularity was perceptually indistinctive. However, when a target T was surrounded by four contextual T's (each T could be rotated by 0°, 90°, 180°, or 270°), crowding was not substantially reduced by the ocularity contrast (Kooi et al., 1994; Zhang et al., 2012). This might be because a target bar and a contextual bar (both vertical or both horizontal) could interocularly fuse or interact before the target T is segmented from the context.
The actual bottleneck in our brain should be much more complex than our toy model, in which we admitted the response of only one V1 neuron. Likely mechanisms include attentional gain control of feedforward signals (Carrasco, 2011; Wang et al., 2015) and pooling of V1 signals onto target neurons in higher visual areas. For example, pooling the responses of two monocular V1 neurons preferring vertical orientation, one driven by left-eye inputs and the other by right-eye inputs, gives rise to a higher cortical binocular neuron that prefers vertical orientation but is blind to the eye-of-origin. More complex pooling models have been proposed for peripheral vision, such that the information in a local image region is reduced to summary statistics of simple V1-like features and higher-level features such as faces (Balas et al., 2009; Manassi and Whitney, 2018; Rosenholtz et al., 2019). If V1 responses are pooled, the results must respect V1's intra-cortical interactions, including iso-feature suppression and collinear facilitation. For example, a more salient bar should feature more in the pooled outcome according to its relatively stronger V1 response. Hence, crowding of the salient bar in Fig. 2B is reduced when additional contextual bars engender increased iso-orientation suppression among the contextual bars. Therefore, image features should be weighted somewhat by their saliency values before contributing to the summary statistics in peripheral vision; the saliency values in turn depend on global or contextual image properties. This partly explains why crowding is often affected by whether the target is grouped with the context (Zhaoping, 2003; Manassi et al., 2012).
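Both pooling ideas above can be sketched in a few lines of toy code. The summation pool, the example responses, and the saliency values are all hypothetical choices for illustration: part (1) shows how pooling two monocular neurons yields a binocular neuron blind to the eye-of-origin, and part (2) shows how saliency-weighting lets a salient feature dominate a local summary statistic:

```python
import numpy as np

# (1) Toy pooling of two monocular V1 neurons, both tuned to vertical but
# driven by different eyes, into a binocular neuron. The pooled response is
# identical whichever eye carries the input, i.e., blind to eye-of-origin.
def binocular_response(r_left_eye, r_right_eye):
    return r_left_eye + r_right_eye   # simple summation pool

assert binocular_response(1.0, 0.0) == binocular_response(0.0, 1.0)

# (2) Toy saliency-weighted summary statistic for a local peripheral region:
# features with stronger (more salient) V1 responses contribute more.
features = np.array([10.0, 10.0, 10.0, 90.0])   # bar orientations in a patch
saliency = np.array([0.2, 0.2, 0.2, 1.0])       # V1 responses; last bar salient
weighted_mean = np.sum(saliency * features) / np.sum(saliency)
plain_mean = features.mean()
# weighted_mean is pulled toward the salient bar's orientation, whereas
# plain_mean treats all bars equally.
```

A richer model would compute many such statistics over V1-like feature maps, but the weighting principle is the same: saliency, set by contextual interactions, gates what survives the pooling.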
Peripheral vision is not unique in suffering from a loss of information. Central vision is also blind to the eye-of-origin. Instead, peripheral vision is special in lacking substantial feedback verification that aids seeing. Since the feedback verifies the would-be inputs for candidate percepts against the actual inputs, it can be interfered with by replacing the actual inputs with a mask during the verification: this is backward masking. For example, when central vision views the anticorrelated random-dot stereograms in Fig. 3C, one way to realize this backward masking is to replace the random dots with another set of randomly generated dots every 10 milliseconds (keeping the same scene and design). In this case, the usual veto of V1's reversed-depth signals by the feedback verification is compromised, and so, extraordinarily, central vision sees the reversed-depth illusion (Zhaoping, 2021). By contrast, since it lacks the verification in the first place, peripheral vision should be less vulnerable to backward masking. This was confirmed in one study (Zhaoping and Liu, 2022). Peripheral vision should also be less affected by any top-down prior biases conveyed by the feedback verification (Zhaoping, 2017). Peripheral vision is good at tasks for which the feedback verification is not critical, e.g., perceiving the gist of a scene, simultaneously tracking multiple objects, and monitoring the environment during walking and even driving (Sereno and Huang, 2014; Vater et al., 2022). Nevertheless, doing such tasks also serves peripheral vision's role in looking, by informing the decision on when and where to direct the next saccade.

Looking, seeing, and understanding
In English, and analogously in many other languages, "I see" often means "I understand". To the extent that it is less engaged in seeing, peripheral vision also lacks understanding. An essential element of this understanding is FFVW's feedback verification using the brain's internal knowledge. Contemporary artificial neural networks for visual recognition also lack this feedback verification, and thus understanding, and are easily misled by adversarial attacks (Akhtar and Mian, 2018).
Less understanding likely causes lower confidence or trust in percepts in the peripheral field, even when inputs are adjusted such that peripheral and central vision are equally accurate at reporting visual inputs (Toscani et al., 2021), and even when the peripheral percept is veridical while the central percept is based on interpolation (Gloriani and Schütz, 2019). Non-verifiable peripheral percepts could be devalued in favor of extrapolation from verifiable percepts in the central visual field. This likely causes the uniformity illusion (Otten et al., 2017), in which observers see a uniform texture extending from the central part of images whose central and peripheral parts differ. By contrast, sufficiently strong illusory percepts from peripheral locations should overwrite extrapolations from central vision or even veridical memories of these locations from previous fixations. This explains the scintillating grid illusion, extinction illusions (Ninio and Stevens, 2000), and honeycomb illusions (Bertamini et al., 2016, 2023), in which one sees something additional or missing in the peripheral field of a uniform texture.
We notoriously have the impression of seeing the whole visual field clearly, despite our woeful inattentional blindness. This impression may relate to cases of subjective inflation of confidence about peripheral percepts (Li et al., 2018; Knotts et al., 2019), albeit contrasting with the lower confidence or trust in peripheral percepts in other studies (Gloriani and Schütz, 2019; Toscani et al., 2021), in which the gaze positions were restricted and monitored. In natural behavior, visual details at any peripheral location are only one look away from being seen clearly with central vision; this could cause the false impression of peripheral clarity. This is not unlike the false sense that a light inside a refrigerator is always on simply because it turns on each time one opens the fridge door (Thomas, 1999; O'Regan, 2011; Lau, 2022). Worse, human observers have limited awareness of their gaze shifts (as mentioned in section 3.2); they may not realize that the details at a peripheral location have been briefly in their central visual field during a glance. Before and after a saccade to an object, the peripheral and foveal views can be integrated by the brain into a unified percept (Stewart et al., 2020; Kroell and Rolfs, 2022). This may explain the existence of top-down feedback to the foveal region of the lower visual cortical areas even for peripherally viewed objects (Williams et al., 2008; Fan et al., 2016). Such feedback would allow the feedback verification process to verify the perceptual hypotheses arising from the peripherally viewed inputs before the saccade against the centrally viewed inputs after the saccade, thereby achieving a form of transsaccadic visual understanding (Zhaoping, 2019).

Fig. 2. Demonstrations that looking and seeing are separate visual processes, served by peripheral and central vision, respectively. A: a visual input image, in black and white, is superposed with a trajectory of gaze positions, in red, magenta, and cyan, from the beginning to the later moments of an observer searching for a uniquely oriented bar. The gaze started at the center of the image when the search image appeared, and the first saccade (in red) landed the gaze on the target, a bar tilted counterclockwise from vertical. The gaze then stayed around the target for about 0.5 seconds before saccading away (in magenta and then cyan) to search elsewhere. Visual crowding makes the target bar's unique orientation illegible before the first saccade (looking). Untrained observers can bring their gaze to the target within one second in ~50% of the trials (the search display spanned ~34° × 46° in visual angle, containing 2 × 660 bars) (Zhaoping and Guyader, 2007). After the first saccade in this example, central vision sees the X shape made of the target bar and an intersecting vertical bar. This X is a rotated version of all the other X's in the image. Rotational invariance in object recognition confused the observer, leading to the decision to veto the target before returning to it (the returning gaze trajectory is not shown, for clarity). Data from (Zhaoping and Guyader, 2007), adapted from Figure 1.4 of (Zhaoping, 2014); drift correction to the measured gaze positions is added here, using the initial gaze positions at the image onset. B: looking can occur without seeing, more so in more peripheral vision. Observers searched as quickly as possible for a uniquely oriented bar (which differs from the uniformly oriented non-target bars by 50° in orientation). All bars except one are shown to the left eye only, and one non-target bar, the ocular singleton, is shown to the right eye. The eye-of-origin of visual inputs is task-irrelevant and perceptually invisible, but it is visible to V1, the only visual cortical area with many monocular neurons. The orientation target and the ocular singleton have the same eccentricity from the center of the perceived image. When this eccentricity was 12° or 7.3°, respectively, 75% or 50% of the first saccades during the search were directed to the ocular singleton (typically within 300 milliseconds after the visual input appeared) (Zhaoping, 2008b, 2012).

Fig. 4. The explanation of the reversed feature illusions based on V1's neural properties and the information bottleneck limiting feedforward signals from V1. Here we use the flip tilt illusion as an example. A: Schematic of the Gabor-like, horizontally oriented, spatial receptive field (RF) of a V1 neuron, with on- and off-subfields. B: Activation of the neuron in (A) by a vertical hetero-pair, horizontal homo-pairs, and a horizontal Gabor. C: the vertical hetero-pair in a vertical RF does not excite this neuron. D: The retina is shown a vertical hetero-pair of dots. Imagine that the V1 neuron most activated by this hetero-pair is the only such neuron sending its response through the information bottleneck to higher cortical areas. According to this neuron's response and neural properties, higher areas can entertain multiple hypotheses, H1, H2, H3, and H4 (as in B), about the probable retinal inputs. The majority vote from these hypotheses gives a perceptual inference (seeing) of the flip tilt illusion, i.e., that the perceived orientation of the retinal inputs is horizontal. If, by analysis-by-synthesis, higher areas can use feedback to query for more information, in terms of the responses from other V1 neurons, to distinguish between the multiple hypotheses, then the additional information can increase the weight for one hypothesis (e.g., H1 for the hetero-pair) relative to those for the other hypotheses, making the perceptual outcome more veridical and less ambiguous.