Brains studying brains: look before you think in vision

Using our own brains to study our brains is extraordinary. For example, in vision this makes us naturally blind to our own blindness, since our impression of seeing our world clearly is consistent with our ignorance of what we do not see. Our brain employs its ‘conscious’ part to reason and make logical deductions using familiar rules and past experience. However, human vision employs many ‘subconscious’ brain parts that follow rules alien to our intuition. Our blindness to our unknown unknowns and our presumptive intuitions easily lead us astray in asking and formulating theoretical questions, as witnessed in many unexpected and counter-intuitive difficulties and failures encountered by generations of scientists. We should therefore pay a more than usual amount of attention and respect to experimental data when studying our brain. I show that this can be productive by reviewing two vision theories that have provided testable predictions and surprising insights.

1. Introduction: special difficulties in understanding our brain functions

'不识庐山真目面, 只缘身在此山中'
'One cannot see the true face of Mount Lu because one is right inside this mountain'
(from an ancient Chinese poem)

It has often been claimed that we are in principle not clever enough to understand how the brain works. Rather than getting stuck in a hermeneutic circle about systems understanding themselves, I would like to point to a very different sort of difficulty against which we must constantly battle in order to make progress. For those who have not come across them before, this is exemplified by figures 1, 2, and 3, in which vision seems oddly good, oddly bad, and just odd, respectively.
Recognizing the apple in figure 1 seems trivial: a pre-school child could of course see it easily, despite lacking advanced maths or programming skills. Indeed, about half a century ago, an MIT professor set up a summer project for students to write a computer program that could see or interpret objects in photographs [37,44]. Why not? After all, seeing must be some manipulation of image data that can be implemented in an algorithm. Nevertheless, decades have passed, we still have not fully reached the aim of that summer project, and a worldwide computer vision community has been born along the way. As a hint of the problem's nature, one of the most difficult issues turns out to be a chicken-and-egg problem: to see the apple it helps to first pick out the image pixels for this apple, and to pick out these pixels it helps to see the apple first.
A more recent surprising discovery is our blindness to almost everything in front of us [50]. Consider how much time it takes you to tell the (substantial) difference between the two images in figure 2. This turns out to be more than several seconds for most people. But why so long? Our brain gives us the impression of seeing everything clearly, and hence we apparently have no data in favor of the (actually true) proposition that this impression is false. This makes us blind to our own blindness. Another counter-intuitive finding, discovered only several years ago, is that our attention or gaze can be attracted by a visual feature to which we are blind. In our experience, only objects that appear highly distinctive from their surroundings attract our gaze automatically. For example, a lone red flower in a field of green leaves does so, unless we are color-blind. In figure 3, a viewer perceives an image which is a superposition of two images, one shown to each of the two eyes using the equivalent of spectacles for watching three-dimensional (3D) movies. To the viewer, it is as if the perceived image (containing only the bars but not the arrows) were shown simultaneously to both eyes. The uniquely tilted bar, the orientation singleton, appears most distinctive from the background. In contrast, the bar shown uniquely to the left eye, the ocular singleton, appears identical to all the other background bars, i.e. we are blind to its distinctiveness. Nevertheless, the ocular singleton often attracts attention more strongly than the orientation singleton (so that the first gaze shift is more frequently directed to the ocular rather than the orientation singleton), even when the viewer is told to find the latter as soon as possible and ignore all distractions [60]. It is as if the ocular singleton were uniquely colored and distracting like the lone red flower in a green field, except that we are 'color-blind' to it.
Even many vision scientists find this hard to believe without experiencing it themselves. In fact, many observers are not even aware that their gaze shifted to the ocular singleton before shifting to the orientation singleton.
These three figures illustrate not only that we do not understand the rich and sophisticated subconscious processes that underpin our ability to see, but also that our intuitions and presumptuous conceptions about the underlying problems, or even their orders of magnitude of difficulty, can be actively unhelpful. The philosopher Thomas Nagel is famous for an article entitled 'What is it like to be a bat?' [40], in which he argued that we would find it very difficult to understand the subjective experiences of a bat, given that it enjoys a very different sensory modality from us. Exactly the opposite argument applies to a scientific inquiry into the inner workings of vision: it is because we have an excellent subjective sense of conscious vision that our scientific investigations are in danger of being misdirected. Imagine a scenario in which our brain is composed of two parts. The first one receives raw sensory signals X, such as light to our eyes and sounds to our ears, and transforms X to Y, such as a sequence of symbols. The second part manipulates Y using rules, let us call them familiar rules, which we learned from experiencing and investigating the world around us. These manipulations of Y are commonly known as, for instance, reasoning, calculation, deduction, copying, and deleting, using powers of analogy, intuition, generalization, and other means, and they could conceivably be carried out by current-day computers. Furthermore, without special aids or external guidance, the second part of the brain has no direct access to X, nor to the rules, let us call them the unfamiliar rules, by which the first part of the brain transforms X to Y. Although this is an oversimplification, these two brain parts correspond roughly to our subconscious and conscious brain processes, respectively.
Using the familiar rules in our conscious brain, our investigation of the unfamiliar rules in our subconscious brain can easily suffer from being too presumptive and, at the same time, clueless.
Being aware of the special difficulties in understanding the subconscious by the conscious is the first step toward overcoming them. Accordingly, compared to approaches in other fields of science, extra attention should be paid to experimental observations to avoid being led astray by presumptuous conceptions.
In the following, I briefly describe two theories. One conforms to our knowledge of data coding to explain and predict observations on how initial visual processing stages transform visual inputs; the other challenges our intuitions on how our attention should be guided and predicts the surprising finding in figure 3. They illustrate where our research progress can be difficult and yet can still be made.

2. Two data-driven theories of biological vision
The purpose of vision is mainly to compute where and what visual objects are in the 3D world from the pair of two-dimensional visual images captured by the retinas. It is fruitful to decompose it into the following three stages: encoding, selection, and decoding [61]. The encoding stage transforms the retinal input light to some suitable form represented by neural activity patterns, often measured as trains of action potentials or spikes from neurons, see figure 4(B). The selection stage selects only a tiny fraction of visual inputs for further processing, because the brain has only finite resources for processing. The resulting blindness to non-selected visual inputs is the basis of the demonstration in figure 2. Much of the selection depends on directing our gaze to the selected region of the visual field (gaze shifts, or saccades, occur about three times a second). Figure 4(C) shows the benefits of this: fixating on the '+' makes it hard to see 'T' clearly even when 'T' is alone, not to mention when 'T' is surrounded by other letters; however, looking at 'T' directly resolves it perfectly in normal vision. The decoding stage infers or recognizes from the selected visual inputs where and what the attended object is (e.g. the apple or the letter 'T'). In normal visual behavior, selecting and decoding correspond roughly to looking and seeing, respectively.

Figure 1. The chicken-and-egg problem: to see the apple it helps to first pick out the image pixels for this apple, and to pick out these pixels it helps to see the apple first. Adapted from figure 5.51A of [61].
Among the main regions of the brain for mammalian vision, V1 and the brain regions above V1 in figure 4(A) are in the neocortex, with V1 at the back of the brain and FEF nearer the front. A primate retina has ~10^6 receptor cones and ~10^8 rods, sending signals to the brain via the output axons (cables) of ~10^6 retinal ganglion cells. Each ganglion cell is typically activated (by changing its spiking rate) by a small light or dark spot surrounded by a dark or light annulus, respectively, within a small spatial region (about 0.1 degree of visual angle in diameter in central vision) called its receptive field (figure 5); the receptive fields of all the ganglion cells collectively tile the image space [24]. The LGN is not yet well understood beyond its role of relaying retinal signals to the cortex, notably to V1.
In a monkey, about half of the total area of the cortex is exclusively or predominantly involved in vision, and about a quarter of this is devoted to V1, which is the largest visual cortical area in the brain [56]. V1 contains about 100 times as many neurons as there are retinal ganglion cells [13]. Most V1 neurons respond to an edge- or bar-shaped pattern within their receptive fields [21]; see figure 5(B) for examples of receptive fields, each of which is typically smaller than the image area of a meaningful object such as an apple held in one's hand. These neurons are laid out in a retinotopic manner, such that neurons that are nearby in V1 respond to inputs that are nearby in the image.
Further downstream from V1 along the visual pathway, neurons have progressively larger receptive fields. It is harder to find the visual patterns that activate them strongly, although many neurons respond  to complex shapes, e.g. a star-like or even a face-like shape. Some visual areas carry more information about 'what' an object is while other areas carry more information regarding 'where' an object is.
Progressing along the visual pathway towards the front of the brain, responses of neurons in areas LIP [9,53] and FEF [54] are insensitive to shape or other properties of visual inputs within their receptive fields, but are affected by whether these inputs are relevant, or whether the animal pays attention to, or is about to saccade to, these inputs [9,53]. SC, below the neocortex, controls gaze movements using signals particularly from the retina, V1, LIP, and FEF. See [61] for more details.
2.1. The efficient coding principle: early visual receptive fields and their adaptation

Even in the retina, properties of receptive fields differ across animal species, e.g. cats, monkeys, and fish, and they even change between environments within a single animal. Can a single principle account for these different receptive fields? More than half a century ago, it was proposed [5,6] that initial visual processing serves to encode visual inputs efficiently, so that the neurons transmit as much input information as possible to the brain while limiting neural cost in terms of (for example) the channel capacity needed to transmit the neural spikes, or the metabolic energy and dynamic range for the spiking rate. For example, a larger dynamic range for the retinal ganglion responses would require a thicker optic nerve, the bundle of axons which sends retinal signals (via the LGN) to V1 at the back of the brain. Before elaborating this hypothesis, I briefly outline the essential data.
In the retina, most neurons respond approximately linearly to retinal input; in V1, the simple cells (one of the main types of V1 cells; they provide inputs to the other type, the complex cells) respond approximately linearly to visual inputs apart from a rectification transform [61], so that two simple cells that rectify each other's negative inputs together constitute, effectively, an approximately linear cell. Henceforth, these retinal and V1 cells are approximated as linear in this section. Let S(x) be the image pixel value at location x; then a neuron's response is O = Σ_x K(x) S(x), based on its (neuron-specific) receptive field K(x). When the input S depends also on time t, type of cone receptor c (c = r, g, b for a red, green, or blue cone), and eye of origin e (e = L, R for the left or right eye), a neuron's response at time t is O(t) = Σ_{x,c,e} ∫ dt' K(x, c, e; t − t') S(x, c, e; t'). We may focus on only one or two input dimensions, e.g. x alone, or x and t. For convenience, we sometimes simply use x to denote any dimension or combination of dimensions; see figure 5(B) for some examples.
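As a toy numerical illustration of this linear approximation (the numbers below are made up for the sketch, not fitted to any neuron), a model cell's response is simply the inner product between its receptive field and the image:

```python
import numpy as np

# A toy 1D "image" S(x): a bright spot in the center.
S = np.array([0.0, 1.0, 3.0, 1.0, 0.0])

# A center-surround receptive field K(x): excited by a bright center,
# inhibited by its surround (illustrative weights).
K = np.array([-0.5, -1.0, 3.0, -1.0, -0.5])

# Linear response O = sum_x K(x) S(x).
O = np.dot(K, S)
print(O)  # the bright central spot strongly excites this model neuron
```

The same dot product with a uniformly bright image would give a much weaker response, since center and surround then cancel: this is why such a cell signals local contrast rather than absolute light level.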
The spatial example K(x) there is called a center-surround receptive field, describing a neuron that is most excited by a pattern of a bright center with a dark surround; it can be modeled by, for example, a difference of two concentric Gaussians of opposite signs, with the narrower, positive Gaussian forming the center. The K(x)s of neighboring retinal ganglion cells often have the same shape but are centered at different locations, thus tiling the image space: for cell i, we write K_i(x) = K(x − x_i), where x_i is the center location and the common function K(·) captures the shape. The two example center-surround receptive fields on the upper right of figure 5(B) model two neurons, one excited by a red center and inhibited by a green surround, the other excited by a blue center and inhibited by a yellow surround. The lower left of figure 5(B) shows a V1 K(x) preferring a bright vertical bar and another K(x) preferring a segment of a 45° tilted luminance edge. The lower right of figure 5(B) shows a stereo-space K(x) for a V1 cell that is more sensitive to left-eye than to right-eye input and prefers different spatial patterns (a bar and an edge, respectively) from the two eyes.
One formulation of the efficient coding theory is as follows. Let O = (O_1, O_2, ..., O_n), with O_i as the responses of different neurons (or even responses at different times); then

O = K(S + N) + N_o,

where K is the matrix of the receptive-field transform, with K_ab as the effective neural connection (see figure 5(A)) from S_b to O_a; N is the input noise vector, and N_o is the encoding noise introduced during the transform, so that the total noise in O is N_total = KN + N_o. Efficient coding requires us to find the transform K that minimizes

E(K) = neural cost − λ I(O; S),

where λ is a Lagrange multiplier setting the balance between the information transmitted and the cost. As the neural cost (e.g. neural dynamic range) and I(O; S) both depend on K (e.g. both increase with the magnitude of K), the efficient K achieves an optimal trade-off between maximizing information and minimizing neural cost. Hence, the efficient K depends on the visual environment and the animal species, characterized by P(S) and P(N). Often P(S) is not precisely known, particularly for large n, while knowledge [1,52] about P(N) and P(N_o) is even sketchier, so we simply assume that different components of N or N_o have independent and identical distributions.
For analytical convenience, P(S) is approximated as Gaussian, P(S) ∝ exp(−S^T R_S^{-1} S / 2), where R_S is an n×n correlation matrix with elements (R_S)_{ab} = ⟨S_a S_b⟩; the correlation matrix of the input noise is ⟨N_a N_b⟩ = δ_ab ⟨N²⟩, and that of the total output noise N_total is K⟨N N^T⟩K^T + ⟨N_o N_o^T⟩. The cost per neural spike increases with the spiking rate [52]; hence a starting point is to model the neural cost as the total output power Σ_a ⟨O_a²⟩. It can then be shown [61] that the K minimizing E(K) takes the form

K = U g K_o.

Here, K_o is a principal component transform that decorrelates the signal, rotating S into independent components s_k whose signal powers ⟨s_k²⟩ are the eigenvalues of R_S; g is a diagonal matrix of gains g_k, each determined by the signal-to-noise ratio ⟨s_k²⟩/⟨N²⟩ of its component; and U is a unitary transform. Since E(K) (and each of its two terms) is invariant to any unitary transform K → UK, the solution K is degenerate under this U symmetry, which can be broken by additional requirements. A desirable requirement of 'minimal distortion' [61], which reduces neural wiring, fixes a particular U and will be used in our applications. Hence, efficient coding under negligible noise performs redundancy reduction, making O_a independent of O_b when a ≠ b, so as not to waste neural cost by redundantly transmitting the same information in multiple output channels. At high noise levels, however, efficient coding performs smoothing: it integrates the more correlated components of S, making the output signals correlated, ⟨O_a O_b⟩ ≠ 0 for a ≠ b. This redundancy helps with the recovery of information about the signal S from noisy neural responses.
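The information-cost trade-off just described can be made concrete with a toy one-channel computation (my own sketch: a single Gaussian channel, output power as the cost, and arbitrary values for the noise powers and λ, none taken from [61]):

```python
import math

def efficient_gain(S2, N2=0.1, No2=0.01, lam=1.0):
    """Grid-search the gain g minimizing E(g) = cost - lam * I(O; s)
    for one Gaussian channel O = g*(s + n) + n_o, with signal power S2,
    input noise power N2, and encoding noise power No2 (toy numbers)."""
    best_g, best_E = 0.0, float("inf")
    for i in range(2001):
        g = i * 0.01
        cost = g * g * (S2 + N2) + No2                         # output power
        info = 0.5 * math.log(1.0 + g * g * S2 / (g * g * N2 + No2))
        E = cost - lam * info
        if E < best_E:
            best_E, best_g = E, g
    return best_g

# Gains for components of decreasing signal power: strong components are
# attenuated (whitening), intermediate ones get the largest gain, and
# components far below the noise are cut off entirely.
gains = [efficient_gain(S2) for S2 in (100.0, 1.0, 0.01, 0.0001)]
```

Running this reproduces the qualitative behavior in the text: the gain is a non-monotonic function of signal power, largest where signal and noise powers are comparable and zero for hopelessly noisy components.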
Different domains (space, time, color, and stereo), different animal species, and different environments have different dimensions for S and different statistics P(S), giving a diversity of efficient K, as observed experimentally. The brain need not implement K by the three separate steps K_o, g, and U, but by a single cascade of transforms. In the spatial domain, the translation invariance of natural image statistics makes R_S a Toeplitz matrix, so that K_o is the Fourier transform: its element in the kth row and x_a-th column is proportional to e^{−ikx_a}, s_k is the Fourier coefficient of the kth Fourier mode of S(x) (for convenience, abusing notation, k additionally denotes the wavenumber of this mode), and ⟨|s_k|²⟩ as a function of k is the power spectrum of the images S(x) in the image ensemble. Let U = K_o^{-1} be the inverse Fourier transform. The resulting receptive fields arise from band-pass filtering the input image S(x) by a spatial filter Σ_k g_k e^{ikx} with frequency sensitivities g_k. The receptive fields of all neurons have this same filter shape but different, regularly spaced, center locations, e.g. center location x_b for the neuron with response O_{x_b}. In natural images, ⟨|s_k|²⟩ ∝ 1/|k|². Hence the gain g_k peaks around a frequency k_p at which the powers of signal and noise become comparable. The sensitivity g_k is isotropic in k, giving a center-surround shape to the receptive field at a spatial scale ~1/k_p. This scale is predicted to adapt to the S/N of the environment: e.g. moving to a darker environment enlarges this scale since the S/N decreases, as observed physiologically [7]. Figure 6 reveals the redundancy reduction at small |k| and the smoothing at large |k| by the efficient coding. For small |k| < k_p, where S/N ≫ 1, the encoding filter relatively amplifies higher-frequency signals to remove the spatial redundancy in natural images, whitening the luminance power across these frequencies; for large |k| > k_p, where S/N ≪ 1, the filter dampens or cuts off high-frequency inputs to avoid amplifying too much noise. In this limit, the spatially-uncorrelated noise is spatially smoothed or averaged away while spatially-correlated image pixels are integrated and preserved.
When noise is more severe (in darker environments), smoothing occurs at a coarser scale by a filter with a larger center and a weaker surround, thus lowering visual spatial acuity.
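The center-surround shape can be recovered numerically from such a band-pass gain profile. In this 1D caricature, the specific profile g_k ∝ |k| exp(−(k/k_p)²) is my own illustrative choice (rising with |k| roughly as whitening of a 1/k² spectrum demands, then cut off by noise above k_p), not a fit to any data:

```python
import numpy as np

n = 128
k = np.fft.fftfreq(n) * n                    # integer wavenumbers -64..63
k_p = 8.0                                    # peak frequency, set by the S/N
g = np.abs(k) * np.exp(-(np.abs(k) / k_p) ** 2)   # band-pass gain profile

# The receptive field is the inverse Fourier transform of the gain profile.
K = np.real(np.fft.ifft(g))
K = np.fft.fftshift(K)                       # put the filter center at n // 2
center = n // 2

# Excitatory center flanked by an inhibitory surround a few pixels away.
print(K[center] > 0, K[center - 6] < 0, K[center + 6] < 0)
```

Shrinking k_p (lower S/N, e.g. a darker environment) widens the central lobe and weakens the surround, which is exactly the coarser-scale smoothing described above.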
Replacing space x by time t and proceeding in an analogous manner leads to the efficient temporal receptive field, or temporal filter, of neurons [17,26,57,61]. In this case, a suitable U in K = U g K_o makes the filter temporally causal. Again, when the input S/N is higher, the temporal filter is higher-pass, making the neural impulse response more transient, so that temporal redundancy is reduced and the neuron is more sensitive to input temporal changes; such a coding is called predictive coding. When the S/N is lower, the receptive field becomes a lower-pass temporal filter that smooths out temporal noise. Consequently, a neuron's spatiotemporal filter (in both x and t) trades off spatial and temporal resolution: the temporal filter is higher-pass or lower-pass, respectively, for spatial inputs at a coarser (higher S/N) or finer (lower S/N) spatial scale, as observed in data.
In color vision, S = (S_r, S_g, S_b) for the signals to the red, green, and blue cones when other input dimensions are ignored [3,11]. The 3×3 correlation matrix R_S in natural scenes is such that the transform K_o gives three independent components in decreasing order of signal power [3,11,61]: the luminance signal LUM is a linearly-weighted sum of S_r, S_g, and S_b; RG is a red-green opponency (roughly S_r − S_g); and BY is a yellow-blue opponency (roughly a linearly-weighted difference between S_r + S_g and S_b) [3]. Adding together (by multiplexing through U) a spatial band-pass filter for LUM (which has a large S/N) and a spatial smoothing filter for RG (which has a smaller S/N) gives the red-center-green-surround receptive field of figure 5(B) in the space-color transform [61]. Different species of animals inhabit different environments (consider land versus sea) and may have different cone types, giving different color statistics R_S and thus exhibiting different color coding transforms [3].
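This decomposition can be reproduced with a toy cone correlation matrix. The numbers below are invented to mimic the qualitative structure (all cone signals strongly correlated, red and green most similar to each other); they are not measured natural-scene statistics:

```python
import numpy as np

# Hypothetical cone correlation matrix R_S, in (r, g, b) order.
R_S = np.array([[1.00, 0.70, 0.85],
                [0.70, 1.00, 0.85],
                [0.85, 0.85, 1.00]])

# The rows of K_o are the eigenvectors of R_S: the decorrelated channels.
eigvals, eigvecs = np.linalg.eigh(R_S)
order = np.argsort(eigvals)[::-1]        # decreasing signal power
lum, rg, by = eigvecs[:, order].T

# LUM: all three weights share one sign (a weighted sum of r, g, b).
# RG:  r and g weights have opposite signs (red-green opponency).
# BY:  the b weight opposes the r and g weights (yellow-blue opponency).
```

With these particular numbers the eigenvalues come out in the order LUM > RG > BY, matching the power ordering quoted in the text; a different R_S (a different habitat or cone complement) reorders and reshapes the channels, as the theory predicts.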
For stereo coding [33] in V1, where inputs from the two eyes first converge, S = (S_L, S_R), and the eigenvectors of R_S are the binocular summation and difference signals S_± ∝ S_L ± S_R, with signal powers ⟨S_±²⟩ ∝ 1 ± r, where r is the correlation coefficient between the two eyes' inputs. With the gains g_± for S_± as the diagonal elements of g, the efficient code K = U g K_o mixes the two channels in each neuron. Hence, unless g_+ ≈ g_−, each V1 neuron is likely to prefer inputs from one eye or the other, as illustrated in the example in figure 5(B).
This theory of efficient stereo coding led to novel predictions [61] that could be tested with new experiments. One example starts from the observation that adapting to an environment with an altered ocular correlation will change r, and hence the relative gains g_+ and g_−. To test this, we created the non-ecological visual input shown in figure 7, which could credibly be interpreted in two different ways, one associated with S_+ and the other with S_−. Specifically, observers' left and right eyes were shown two different patterns, S_L = sin(x) sin(t) and S_R = cos(x) cos(t), each of which is a spatial grating in x that oscillates in time t. This makes S_± ∝ cos(x ∓ t), two gratings drifting in opposite directions. Observers reported which of the two directions they saw. Their probability of reporting the direction associated with S_+ should increase or decrease, respectively, when g_+/g_− becomes larger or smaller. We changed r by having the observers look at ocularly anti-correlated photographs for about a minute. These had r ≈ −1, with the inputs to the two eyes being photo-negatives of each other. We found that their probability of reporting the S_+ direction duly increased [38]. Hence brief sensory adaptation can alter even the adult cerebral cortex according to the prescription of efficient coding. Structurally different ocular correlations r also arise in animal species with different distances between their two eyes, and indeed their stereo encodings differ as expected from the theory [61].
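The construction of this stimulus is easy to verify numerically; the following sketch just checks the trigonometric identities behind the two oppositely drifting gratings:

```python
import numpy as np

# The two eyes' inputs: S_L = sin(x)sin(t), S_R = cos(x)cos(t).
x = np.linspace(0.0, 2.0 * np.pi, 200)
for t in (0.0, 0.5, 1.0):
    S_L = np.sin(x) * np.sin(t)
    S_R = np.cos(x) * np.cos(t)
    # Binocular summation and difference channels:
    S_plus = S_L + S_R            # equals cos(x - t): drifts one way as t grows
    S_minus = S_L - S_R           # equals -cos(x + t): drifts the opposite way
    assert np.allclose(S_plus, np.cos(x - t))
    assert np.allclose(S_minus, -np.cos(x + t))
```

Which drift direction dominates perception then depends only on the relative gains g_± that the visual system assigns to the two channels, which is what the adaptation experiment manipulates.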
It is remarkable that both of the two main originators [5,6] of the efficient coding principle were distinguished experimentalists (primarily on animal neurophysiology and human perception, respectively) interested in theoretical principles and familiar with information theory [48]. Later, more theoretically-inclined researchers [3, 4, 10, 11, 14, 17, 25-27, 34, 35, 42, 51, 57] developed this principle further, in particular to provide richer mathematical formulations and extensions to situations with noisy sensory inputs. This amplified the predictive power of the theory, and provided insights across a range of experimental data [61].

2.2. The V1 saliency hypothesis: visual selection can occur before visual recognition
'It will be all too easy for our somewhat artificial prosperity to collapse overnight when it is realized that the use of a few exciting words like information, entropy, redundancy, do not solve all our problems' warned Shannon [49], the founder of information theory. Could redundancy reduction be carried out in multiple stages along the visual pathway, ultimately revealing neural signals representing the independent visual objects that putatively underlie retinal inputs [6]? Apart from the domain of stereo, for which V1 is the first point of convergence for inputs from the two eyes, efforts [8,27,43] to understand neural encoding in this visual area by appealing to efficient coding principles have turned out to be somewhat unrewarding.
Most V1 neurons prefer a spatial pattern of a particular orientation (e.g. vertical, or 45° from vertical, in the examples in figure 5(B)) and a particular scale. Although such K(x)s could arise from an efficient coding transform whose unitary U is distinct from the one that captures retinal coding [34], the rationale for this U cannot come from efficiency, since U has a null effect on coding efficiency, at least for Gaussian signals. Yet more puzzling is the hundred-fold expansion in the number of V1 neurons compared with the number of retinal ganglion cells, creating an overcomplete representation that seems to contradict redundancy reduction.
Since retinal encoding seems merely to reduce redundancy in the first- and second-order input correlations captured by our Gaussian approximation of P(S), could V1 be reducing redundancy in the higher-order correlations? In natural images, higher-order redundancy accounts for only a few percent of the total redundancy (measured by entropy) [34,45,61], making this unlikely. Furthermore, meaningful information about visual objects (e.g. long smooth object contours) is captured by the higher- but not the lower-order correlations. Indeed, figure 6(C) preserves the objects despite losing much (up to frequency k_p) of the lower-order, but not the higher-order, correlations of figure 6(A) by efficient coding. In contrast, another image, obtained by inverse Fourier transforming the coefficients |s_k| e^{iφ(k)}, where s_k are the Fourier coefficients of the original image in figure 6(A) and φ(k) = −φ(−k) are random phases, would preserve all the lower-order, but none of the higher-order, correlations of figure 6(A), yet shows only cloud-like nonsense (see figure 4.1 of [61] for a demonstration). These observations prompted the proposal [34] that the higher-order redundancy should be preserved rather than removed in early visual processing, to be analyzed further downstream. This is supported by some recent observations [20].
It is important to remember that information theory concerns the quantity of information (in bits) rather than the meaning of the information, and applies mainly to information transmission. Once input information reaches the cortex, there is no obvious bottleneck like the optic nerve to restrict transmission bandwidth. Figure 2 suggests another attentional bottleneck, presumably arising from limited processing resources in the brain. This implies that only a fraction of the visual inputs that impinge on the eye are processed further and enter perception. Roughly, our eyes receive dozens of megabytes of raw data each second (about 30 frames of images at about 10 6 pixels per image); these data are compressed to about one megabyte per second at the output of the retina, but the attentional bottleneck has a capacity of only about 40 bits per second [61]. Indeed, one megabyte is enough for all the text in a thick book; but humans can only read about two sentences of text per second.
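The compression factors implied by these rough numbers (taking the text's order-of-magnitude figures and assuming one byte per pixel for simplicity) are easy to lay out:

```python
# Order-of-magnitude bandwidths along the visual pathway (figures from the
# text; one byte per pixel is an assumption made here for simplicity).
pixels_per_frame = 10**6
frames_per_second = 30
eye_input = pixels_per_frame * frames_per_second     # ~3e7 bytes/s at the eye
retinal_output = 10**6                               # ~1 megabyte/s
attentional_bottleneck = 40 / 8                      # 40 bits/s -> 5 bytes/s

compression_retina = eye_input / retinal_output                  # ~30-fold
compression_attention = retinal_output / attentional_bottleneck  # ~200,000-fold
```

The striking point is the asymmetry: the optic nerve discards a factor of tens, while the attentional bottleneck discards a factor of hundreds of thousands, which is why the selection problem dominates.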
This suggests a new, critical question: how to select this tiny fraction, especially before the brain can possibly know the contents to be selected or deleted. We might also wonder which brain areas perform the selection. Decades of psychological studies have investigated selection. Shifting our gaze to a visual location (which, in natural vision, is mandatorily linked with shifting attention there) is the main way to select this fraction. Such shifts can be guided by top-down, goal-directed, or voluntary factors, such as when moving one's gaze along this text while reading this article, or by bottom-up, input-driven, or involuntary factors, such as when gaze is distracted from reading to a fly that suddenly appears in the visual periphery. Since we are more aware of our voluntary selection, most theories or research frameworks on selection have focused on top-down selection [16,55], which involves areas at or near the front of the brain, e.g. FEF in figure 4, more closely associated with our conscious thoughts. However, bottom-up selection is faster (though more transient) and often more potent [39,41]. Since selection is so important (failing to be distracted by an approaching predator while reading a book could cost one's life), could V1, the largest and most upstream cortical area for vision, be guiding the bottom-up selection?
Let us define saliency as the strength with which a visual location attracts attention in a bottom-up manner. In psychology experiments, saliency at a location is often measured by the speed with which observers find an item at that location [59]-i.e. by the shortness of their reaction time in a visual search task. Figure 8(A) shows that items with a unique feature, e.g. color or orientation, are salient in this sense. The bottom-up characteristic is evident since even if our task is to find non-vertical bars, the red vertical bar still automatically captures our attention. That is, we are not blind to it in the way that we are to the difference between the two images in figure 2. These salient items are also said to pop out perceptually.
A location can also be salient by being unique in motion direction, such as an item moving left among rightward-moving items. In figure 8(B), however, an item is not salient if it is unique only by virtue of its particular conjunction of two features, red color and vertical orientation, each of which is separately present in the background items. Salience is subtle-a cross among bars is more salient than a bar among crosses, see figures 8(C) and (D).
It had traditionally been presumed that bottom-up attentional guidance depends on a saliency map of the visual space that is built up from external inputs [23]. However, for many years, the brain area responsible for this putative map was not specified; if at all, it was assumed to be located in frontal or parietal brain areas (FEF and LIP in figure 4(A)), where neurons are not specifically tuned to visual features such as color or orientation. This presumption was partly motivated by the observation that visual inputs of almost any feature in any feature dimension could be salient given the right context. Hence, saliency is often said to be 'feature blind'. Thus, it was surmised that the saliency map was constructed by combining inputs (from lower visual areas like V1) across different feature values and feature dimensions, so that neurons in such a saliency map should not be tuned to any specific feature.
However, V1 provides the largest cortical input to the visual layers (the non-motor, superficial layers) of the brain region SC [12,36], which drives shifts in gaze (figure 4(A)). Further, cooling V1 (in cats and monkeys) makes the SC neurons for motor outputs unresponsive to visual inputs [47]. This suggests that V1, rather than the retina, might be involved in directly mediating gaze shifts.
What computation might V1 subserve in directing attention in this way? We described V1 neurons as having receptive fields for the visual input. These are sometimes called their classical receptive fields (CRFs). CRFs typically involve tuning to one or two feature dimensions, such as orientation, color, or motion direction. Thus, for example, one V1 neuron can be tuned to orientation, preferring (i.e. responding more to) vertical orientation, while being untuned to color, so that its response is unaffected by the input color; another neuron can prefer red but be unaffected by orientation.
However, neighboring V1 neurons, which receive input from neighboring retinal locations, interact with each other, so the CRF is only an approximation [2]. These interactions are not random; rather, neighboring V1 neurons preferring the same or similar features, e.g. vertical orientation, suppress each other's activities. This is called iso-feature suppression [29] and includes iso-orientation, iso-color, and iso-motion-direction suppressions. Figure 8(E) illustrates that iso-feature suppression makes a feature singleton, e.g. an orientation singleton, evoke a higher V1 response than the responses evoked by background input items which are identical to each other in the visual feature. This is because neurons responding to the background items suppress each other, while the neuron responding to the unique feature in the singleton escapes iso-feature suppression.
Consider the possibility that the SC, which, like V1, is retinotopic, reads the population responses of V1 neurons and identifies the maximum V1 response for a particular retinal location, regardless of the preferred features and feature tunings of the V1 neurons concerned. A motor command from the SC to shift gaze or attention to the receptive field of this maximally responding V1 neuron would precisely match the phenomenon of the bottom-up attraction of the feature singleton. Accordingly, the map of maximum V1 responses, one for each visual location, can be exactly the saliency map, despite the feature tuning of V1 neurons. This saliency map is operationally simpler than the one envisioned traditionally.
Exactly along these lines, it was proposed [29,32] that V1 creates the saliency map, such that the saliency of each location is dictated by the maximum V1 response to this location relative to the maximum V1 responses to the other locations. By this theory, V1 neural responses are a universal currency to bid for bottom-up attentional selection, regardless of the preferred features and feature-tuning properties of the neurons concerned. More explicitly, let $(O_1, O_2, \ldots, O_n)$ denote the responses of the $n$ V1 neurons. Given a location $x$, there are many V1 neurons whose overlapping CRFs cover this location; let $i \approx x$ mean that the CRF of the $i$th neuron (with response $O_i$) covers location $x$.
Let
$$b(x) \equiv \max_{i \approx x} O_i$$
denote the highest V1 response to the visual input at location $x$. That $b(x)$ is the bid for attention in the bottom-up manner at visual location $x$ is the V1 saliency hypothesis.
In reality, $b(x)$ is sampled at a spatial resolution that is perhaps comparable to the size of the spatial errors in saccades. The saliency map $s(x)$ is determined completely by the bidding map $b(x)$, such that $s(x)$ increases with $b(x)$ relative to $b(y)$ at $y \neq x$; in particular, $s(x)$ increases with $b(x)$ when $b(y)$ at all $y \neq x$ are fixed.
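The max rule behind the bidding map can be sketched in a few lines (a minimal illustration with a hypothetical one-dimensional geometry and invented responses; `bid_map` is an illustrative name, not anything from the literature):

```python
# Sketch of the V1 saliency read-out: each neuron has a CRF centre, a CRF
# radius and a firing rate; the bid b(x) at location x is the maximum
# response among neurons whose CRFs cover x, regardless of their
# preferred features.

def bid_map(neurons, locations):
    """neurons: list of (centre, radius, response); returns {x: b(x)}."""
    b = {}
    for x in locations:
        covering = [resp for centre, radius, resp in neurons
                    if abs(centre - x) <= radius]
        b[x] = max(covering) if covering else 0.0
    return b

# Two feature-tuned neurons cover location 3; only the larger response bids:
neurons = [(3, 1, 2.5),   # e.g. an orientation-tuned cell, suppressed
           (3, 1, 9.0),   # e.g. a colour-tuned cell, escaping suppression
           (7, 1, 2.5)]
b = bid_map(neurons, locations=range(10))
print(b[3], b[7])  # → 9.0 2.5: location 3 wins the bid for attention
```

Note that the read-out discards the neurons' feature preferences entirely: a feature-blind saliency signal emerges from feature-tuned responses.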
One can verify that iso-feature suppression enables the proposed saliency map to account for prototypical examples of salient and non-salient visual inputs such as those in figures 8(A)-(D). The orientation and color singletons in figure 8(A) are salient because the V1 neurons responding to the unique feature escape the iso-feature suppression experienced by the neurons responding to the background blue vertical bars. The orientation singleton that tilts less than 20° from vertical is less salient since its escape from iso-orientation suppression is only partial. The unique red vertical bar in figure 8(B) is not salient because neither the V1 neuron tuned to red nor the V1 neuron tuned to vertical escapes iso-feature suppression in its responses to it. In figures 8(C) and (D), the unique cross among the bars is more salient than the unique bar among the crosses, because the former possesses a unique horizontal bar whose evoked V1 response escapes iso-feature suppression, while the unique bar among the crosses lacks any unique (orientation) feature for this purpose. A non-linear dynamic neural circuit model of V1, calibrated to the known data on V1ʼs neural interactions, including iso-feature suppression and other interactions, successfully accounted for many other examples, including some even more complex and subtle ones [29-31].
One of the most convincing confirmations of the V1 saliency hypothesis comes from its novel prediction that the ocular singleton in figure 3 should be salient. Iso-feature suppression also applies to the eye-of-origin feature [15]. Hence, like the orientation singleton, the ocular singleton in figure 3 should also evoke a high V1 response, giving two peaks in the saliency map competing for attention, one for each singleton. The salient ocular singleton is a hallmark of the saliency map in V1 because V1 is the only visual cortical area with a substantial number of neurons tuned to eye of origin; this is also why we cannot recognize the eye of origin of visual inputs, since recognition occurs downstream from V1. The salient ocular singleton also demonstrates that selection can occur before recognition, i.e. looking can occur before seeing.
Similarly, the unique cross among bars in figure 8(D) attracts attention not because the cross is recognized but because the horizontal bar in the cross is salient.
The V1 saliency hypothesis also provides a zero-parameter quantitative prediction, which has also been confirmed experimentally [62]. Applying the hypothesis to a toy V1, containing only cells tuned to color (C) and cells tuned to orientation (O), yields equation (7), which predicts the reaction time $RT_{CO}$ to a singleton unique in both color and orientation from the reaction times to the corresponding single-feature singletons. This predicted $RT_{CO}$ is statistically longer than the observed $RT_{CO}$ from human observers (figure 9(D)). The reason for this is that the real V1 also has a class of cells, called CO cells, that are tuned simultaneously to C and O. If the responses of the CO cells are in general higher to the CO singleton than to the C and O singletons (by iso-feature suppression), and are sometimes higher than the responses of the C and O cells, the CO singleton can indeed be more salient than expected from the simple, toy prediction, making the observed $RT_{CO}$ shorter than predicted by equation (7).
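The logic of the toy prediction can be sketched with a Monte Carlo simulation (an assumption-laden illustration: it takes equation (7) to have the standard race-model form, in which the RT to the CO singleton is the faster of two racers, one driven by the C cells and one by the O cells; the RT distributions below are invented, not fitted to data):

```python
# Monte Carlo sketch of a race-model prediction in the spirit of
# equation (7): the toy-V1 RT to the CO singleton is the minimum of the
# (independently sampled) RTs to the C singleton and the O singleton.
import random

random.seed(1)

def sample_rt(mean_decision, base=200.0):
    # hypothetical single-feature RT: exponential decision time + base time
    return base + random.expovariate(1.0 / mean_decision)

n = 100_000
rt_c = [sample_rt(150.0) for _ in range(n)]  # RTs to the color singleton
rt_o = [sample_rt(150.0) for _ in range(n)]  # RTs to the orientation singleton
rt_co_pred = [min(c, o) for c, o in zip(rt_c, rt_o)]  # toy-V1 race prediction

mean_pred = sum(rt_co_pred) / n
print(round(mean_pred))  # the race winner is faster than either racer alone
```

In the real V1, the CO cells join this race with generally higher responses to the CO singleton, which is why the observed $RT_{CO}$ is shorter still than this two-racer prediction.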
It turns out that the real V1 lacks CMO neurons that are simultaneously tuned to color, orientation, and motion direction (M). Using the same argument as for equation (7), we can derive another zero-parameter prediction, equation (8) [62], in which each $RT_a$ is the reaction time to a singleton of type $a$ = C, M, O, CM, CO, MO, or CMO, i.e. a singleton having a unique feature in the one, two, or three feature dimensions named by the abbreviation $a$. Hence, the distribution of $RT_{CMO}$, the reaction time for a singleton unique in C, M, and O simultaneously, can be predicted from the distributions of the other six types of reaction times via equation (8). It is this $RT_{CMO}$ that we actually predicted, and then found to be statistically indistinguishable from the data (figure 9(E)) [62]. Furthermore, because visual cortical areas downstream from V1 do seem to have CMO neurons (see arguments from data in [61,62]), the confirmation of this prediction suggests that these higher areas do not contribute to saliency.
3. Discussion: look before we think to understand our brain Things are always clearer in retrospect. Whereas the efficient coding hypothesis [5,6] was suggested soon after the initial experimental data on visual receptive fields were reported, most of the data motivating the V1 saliency hypothesis had been around for decades before the hypothesis was proposed. The massive, direct, anatomical projections from V1 to the SC for controlling saccades have been known since 1970 [47,58]; fish and birds, which lack the neocortex, rely on the pathway from the retina to the SC (called the optic tectum in these lower animals) for orienting; and the prefrontal cortical region that contains FEF (figure 4) is a late-developing region of the neocortex in phylogeny as in ontogeny [18]. Together, these data suggest that some orienting-guidance functions of the SC and the retina might have transferred to V1 through evolution. However, the research field had collectively managed to cling to the belief that brain areas towards the front of the monkey brain, rather than areas at the back like V1, should control even the involuntary guidance of attention.
Similarly, reports had emerged since the 1960s that V1 responses can be changed, by up to several fold, by stimuli lying outside the classical receptive fields, and that neural connections between V1 cells are likely responsible [19,46]. This led to a 1985 review article in the prestigious Annual Review of Neuroscience [2] that seriously undermined the concept of the classical receptive field. Suggestions [2,22] that the contextual suppression may partly cause the psychophysical pop-out effect were made with great hesitation and self-censorship, as exemplified by this passage from a well-known 1992 article [22] on the orientation contrast effect arising from the contextual influences in V1: 'However, the link between these physiological response properties and visual perception must remain tentative ... One thing that should be examined is whether the cells that project to the attentional control system display the orientation contrast effect. This will not be an easy task ...'.
The resistance to letting data guide our progressive understanding of V1 partly arises from the following conscious pre-conception, or intuition, about the abilities of our subconscious brain: V1, which does not project directly to the frontal, 'smarter', brain areas, could at best contribute remote signals to attentional control and other sophisticated tasks. Accordingly, V1 was expected to play a lesser role, such as in redundancy reduction, which is not associated with any 'smarter' tasks. This may also explain why efforts to extend redundancy reduction (or its close relative, sparse coding) to V1 first emerged and then continued nearly unabated over recent decades, despite our knowing since 1950 that V1 has 100 times as many neurons as there are retinal ganglion cells [13], making redundancy reduction unlikely. Additionally, we did not think outside the box to realize that a seemingly complex 'feature-blind' saliency map could simply be represented by the responses of feature-tuned cells in V1. Eventually, V1ʼs role in saliency was fortuitously discovered in an investigation of whether V1ʼs intra-cortical interactions might help highlight neural responses to object contours made of co-aligned bar segments [28]. Even then, it still took me several more years to overcome my intuitions and derive the counter-intuitive prediction of the salient ocular singleton.
In neuroscience where we use our own brains to study our brains, understanding vision is unlikely to be the only case in which we are blinded by our misleading pre-conceptions. When we succeed in letting data overwhelm our fallacious intuitions, we will be better able to ask the right theoretical questions and thus collect even more revealing data. For example, if V1 is indeed guiding bottom-up visual selection, what could the downstream visual cortical areas be doing, in light of this selection [61]?