Adaptive precision pooling of model neuron activities predicts

When performing a perceptual task, precision pooling occurs when an organism’s decisions are based on the activities of a small set of highly informative neurons. The Adaptive Precision Pooling Hypothesis links perceptual learning and decision making by stating that improvements in performance occur when an organism starts to base its decisions on the responses of neurons that are more informative for a task than the responses that the organism had previously used. We trained human subjects on a visual slant discrimination task and found their performances to be suboptimal relative to an ideal probabilistic observer. Why were subjects suboptimal learners? Our computer simulation results suggest a possible explanation, namely that there are few neurons providing highly reliable information for the perceptual task, and that learning involves searching for these rare, informative neurons during the course of training. This explanation can account for several characteristics of human visual learning, including the fact that people often show large differences in their learning performances with some individuals showing no performance improvements, other individuals showing gradual improvements during the course of training, and still others showing abrupt improvements. The approach described here potentially provides a unifying framework for several theories of perceptual learning including theories stating that learning is due to adaptations of the weightings of read-out connections from early visual representations, external noise filtering or internal noise reduction, increases in the efficiency with which learners encode task-relevant information, and attentional selection of specific neural populations which should undergo adaptation.


Introduction
Are human visual judgments based on the activities of all neurons sensitive to a stimulus or just the activities of a specialized subgroup? If the latter, can human visual learning be characterized as an improvement in the selection of neurons to include in this subgroup?
A global pooling hypothesis states that the activities of all active neurons (or perhaps all active neurons in a particular brain region) contribute to an organism's percept and, thus, to an organism's perceptual decision (e.g., Shadlen, Britten, Newsome, & Movshon, 1996). A precision pooling hypothesis states that an organism's decision is determined solely by those neurons whose activities are believed to be highly informative about the correct answer (e.g., Purushothaman & Bradley, 2005). Consider, for example, a monkey discriminating two directions of visual motion in the presence of noise (Britten, Newsome, Shadlen, Celebrini, & Movshon, 1996). According to a global pooling hypothesis, the monkey reaches its decision by indiscriminately pooling the responses of a large population of neurons in, for instance, the middle temporal (MT) cortical area, an area specialized for visual motion analysis. Importantly, this population includes neurons with widely varying properties, such as different motion direction sensitivities (Nadler & DeAngelis, 2005). In contrast, a precision pooling hypothesis states that the monkey reaches its decision by pooling the responses of just those neurons whose responses carry highly reliable information about the motion direction. In many cases, this population is a small subset of MT neurons (Jazayeri & Movshon, 2007; Nadler & DeAngelis, 2005; Purushothaman & Bradley, 2005).
Although a matter of current debate, there are now data consistent with a precision pooling hypothesis, at least under some circumstances. For example, single neurons have been found to be reliable detectors of perceptual stimuli (Barlow, 1995; Bradley, Skottun, Ohzawa, Sclar, & Freeman, 1987; Ohzawa, DeAngelis, & Freeman, 1990), the activities of individual neurons have been related to an organism's perceptual decisions (Parker & Newsome, 1998), and an organism's decisions have been found to be correlated with the activities of higher precision neurons but not with those of lower precision neurons (Purushothaman & Bradley, 2005). If the precision pooling hypothesis is at least partially correct, then this raises the possibility that there is a link between perceptual learning and the pooling of neural responses. Based on behavioral experiments, it has previously been hypothesized that improvements in performance occur when an organism starts to use perceptual features that are more informative for a task than the features that the organism had previously used (Gibson & Gibson, 1955).
To link perceptual learning and the pooling of neural responses, it is useful to restate this behavioral hypothesis at a neural level. According to this restatement, referred to as the Adaptive Precision Pooling (APP) Hypothesis, improvements in performance occur when an organism starts to base its decisions on the responses of neurons that are more informative for a task than the responses that the organism had previously used.
The APP Hypothesis has its roots in theories stating that visual learning effects are due to reweightings of the "read-out" connections from early visual representations (Dosher & Lu, 1998, 1999; Law & Gold, 2008; Petrov, Dosher, & Lu, 2005; Saarinen & Levi, 1995). A contribution of this article is its examination of an extreme form of this idea in which the weightings are biased to be sparse, meaning that only a relatively small number of weights are nonzero, and thus, only a small subset of neurons contribute to decision making (see also Liu & Weinshall, 2000).
We trained human subjects on a visual slant discrimination task and measured their learning performances. We also computed the performances of a statistical model, referred to as an "ideal probabilistic observer", on the same task (Barlow, 1957; Geisler, 1989; Green & Swets, 1966). The results show that subjects did not learn as well as the observer, meaning that they did not learn as much during training as they theoretically could have. Why were subjects suboptimal learners? Our simulation results suggest that the APP Hypothesis provides a possible explanation, namely that there are few neurons providing highly reliable information for the perceptual task, and learning involves searching for these rare, informative neurons during the course of training.
Although the APP Hypothesis potentially accounts for many characteristics of human visual learning, a key motivation for the work presented here is the need to account for differences in individuals' learning performances. A common observation (or frustration) among scientists studying visual learning is that participants in experiments often show different dynamics in their performances, with some individuals showing no performance improvements, other individuals showing gradual improvements during training, and still others showing abrupt improvements (Rubin, Nakayama, & Shapley, 2002). As illustrated below, the APP Hypothesis accounts for all three types of learning dynamics.

Experiment

Subjects
Eight naive subjects with normal or corrected-to-normal vision participated in the experiment. Subjects gave written informed consent before participation.

Stimuli
Stimuli depicted planar surfaces visually defined by "noisy" grids of horizontal and vertical lines (see Figure 1). A noisy grid was created in two stages. In the first stage, a noise-free grid was placed on the surface when the surface was at a frontoparallel orientation. In the second stage, the spatial location of each of the points where the horizontal and vertical lines intersect was perturbed by a random value sampled from an isotropic two-dimensional Normal distribution. Surfaces were then rotated in depth about their vertical axes. The noisy grid created a visual texture pattern whose gradients of element density, area, and compression were (ambiguous) cues to a surface's orientation in depth (Blake, Bülthoff, & Sheinberg, 1993; Cumming, Johnston, & Parker, 1993; Cutting & Millard, 1984; Knill, 1998). Lastly, surfaces were placed behind a circular aperture so that their outer edges were not visible, thereby removing a contour cue to surfaces' orientations. Displays were viewed at a distance of approximately 65 cm. At this distance, a display subtended 14.5° of visual angle in the horizontal and vertical dimensions.

Procedure
On each trial, a subject (monocularly) viewed a stimulus display for 500 ms, and then judged whether the display depicted a surface whose left side or right side was closer. The subject responded by hitting an appropriate key on a keyboard. On practice and training trials, a response was followed by an auditory sound indicating whether the response was correct or incorrect. On test trials, a response was followed by an auditory "click" sound indicating that the response was recorded by the computer.
Subjects participated in 3 experimental sessions, each lasting approximately 50 minutes. On Day 1, subjects performed 40 practice trials in which surface slants were set to either −40° or +40°. This was followed by 240 test trials, 16 trials at each of 15 surface slants (0°, ±5°, ±10°, ±15°, ±20°, ±25°, ±30°, ±40°). Using maximum likelihood estimation, a psychometric function (a logistic function) was fit to a subject's test data (the absolute value of a surface slant was the independent variable, and the probability that the subject responded correctly was the dependent variable). Based on this function, the slant value corresponding to a subject's 70%-correct threshold, denoted C, was calculated. The subject then performed 4 blocks of training trials in which each block consisted of 240 trials. Surface slants on training trials were either −C or +C. A subject performed 5 training blocks on each of Days 2 and 3.
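The threshold calculation described above can be sketched in a few lines of Python. This is an illustration only: the two-parameter form of the logistic (intercept `a`, slope `b`), the Newton-Raphson fitting routine, and the response counts are assumptions made for the example and are not taken from the experiment.

```python
import math

def fit_logistic(slants, n_correct, n_trials, iters=50):
    """Maximum-likelihood fit of p(correct) = 1 / (1 + exp(-(a + b * slant))).

    The parameterization is an assumption; the article only states that
    a logistic function was fit by maximum likelihood.
    """
    a, b = 0.0, 0.0
    for _ in range(iters):
        # Gradient (g_*) and negative Hessian (h_*) of the binomial log-likelihood.
        g_a = g_b = h_aa = h_ab = h_bb = 0.0
        for s, k, n in zip(slants, n_correct, n_trials):
            p = 1.0 / (1.0 + math.exp(-(a + b * s)))
            g_a += k - n * p
            g_b += (k - n * p) * s
            w = n * p * (1.0 - p)
            h_aa += w
            h_ab += w * s
            h_bb += w * s * s
        det = h_aa * h_bb - h_ab * h_ab
        if det < 1e-12:
            break
        # Newton step: solve the 2x2 linear system for the parameter update.
        a += (h_bb * g_a - h_ab * g_b) / det
        b += (h_aa * g_b - h_ab * g_a) / det
    return a, b

def threshold(a, b, p_target=0.70):
    """Slant at which the fitted function reaches p_target correct."""
    return (math.log(p_target / (1.0 - p_target)) - a) / b

# Hypothetical test-block data: absolute slant (deg), correct responses, trials.
# 16 trials at 0 deg and 32 at each nonzero absolute slant (both signs pooled).
abs_slants = [0, 5, 10, 15, 20, 25, 30, 40]
n_correct = [8, 18, 21, 24, 27, 29, 30, 31]
n_trials = [16, 32, 32, 32, 32, 32, 32, 32]

a_hat, b_hat = fit_logistic(abs_slants, n_correct, n_trials)
C = threshold(a_hat, b_hat)
print(f"70%-correct threshold C = {C:.1f} deg")
```

Training slants for this hypothetical subject would then be set to −C and +C.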

Results
Subjects' learning curves are shown in Figure 2 (blue lines). Importantly for our purposes, performances sometimes gradually improved and sometimes abruptly improved during the course of training. In addition, final performances were less than 100%. These results will be interpreted in the context of ideal probabilistic observers defined below.

Computational simulations
Within the vision sciences, there is debate over whether visual learning effects are best characterized as resulting from changes in early sensory representations (Karni & Sagi, 1991; Schoups, Vogels, Qian, & Orban, 2001) or as resulting from learned reweighting of the read-out connections from early visual representations with no changes in the representations themselves (Dosher & Lu, 1998, 1999; Law & Gold, 2008; Petrov et al., 2005; Saarinen & Levi, 1995). In this work, our approach to developing probabilistic observers for visual learning has been to develop models with a fixed visual representation but with adaptable read-out parameters governing models' judgments about visual stimuli. The input to our observers was the output of a fixed representational front end. The observers were adaptable Gaussian mixture models.

Representational front end
The representational front end was a model neural circuit intended to mimic important aspects of primate visual cortex. We used the same front end (with the same parameter values) as Petrov et al. (2005; see their article for additional details). This front end can be regarded as an instance of a "back pocket model" (Chubb & Landy, 1991), a type of model commonly used by vision scientists to account for aspects of visual texture perception (Chubb & Talevich, 2002 list 23 articles in the scientific literature using a back pocket model). These models typically start with a stage of processing in which a battery of spatially local, linear filters is applied to the visual input. The receptive fields of these filters resemble simple cell receptive fields of various spatial frequencies and orientations (e.g., Gabor filters). The models often also include nonlinear processing in the form of a pointwise nonlinearity (such as a rectifier) and in the form of lateral interactions between neural units (such as cross-orientation or cross-frequency normalization; Heeger, 1992), as well as pooling of local signals to create signals with more global properties.

Figure 2. [...] In each of the first 8 graphs, the blue line plots the performance of an individual subject, and the red and green lines plot the average performances of the Anisotropic (Gaussian Mixture Model with diagonal covariance matrix) and Isotropic (Gaussian Mixture Model with diagonal covariance matrix with identical values along the diagonal) Observers, respectively, when trained under the same conditions as the subject. The bottom right graph shows the averages of the first 8 graphs.

The input to the front end was a pixel-based representation of images from our psychophysical task (i.e., each image depicted a planar surface with a noisy grid slanted in depth). An image was filtered by convolving it with 140 receptive fields, where each field was a Gabor function with orientation $\theta \in \{0, \pm 15, \pm 30, \pm 45\}$, spatial frequency $f \in \{1, 1.4, 2, 2.8, 4\}$, and phase $\phi \in \{0, 90, 180, 270\}$. The filtered images were then rectified using the half-squaring operator. The resulting values formed a set of phase-sensitive maps (one for each orientation, frequency, and phase), which can be interpreted as activation patterns across a retinotopic population of simple cells in area V1 (Heeger, 1992). The retinotopic maps of phase-sensitive units were then combined into phase-invariant maps (analogous to complex cells) by pooling phase-sensitive units that form quadrature pairs, as is done in energy models of visual motion perception (Adelson & Bergen, 1985). Each phase-invariant map was normalized by dividing it by a frequency-dependent normalization term, thereby producing a map whose total activation was approximately constant for above-threshold stimulus contrasts (Heeger, 1992). Lastly, each phase-invariant map was pooled across space using a Gaussian weighting kernel.
In our simulations, the displays used in the experiment were downsampled to create 128 × 128 images. Each image was divided into 9 partially overlapping 64 × 64 regions (regions were defined by dividing the original image into 3 [partially overlapping] rows and 3 [partially overlapping] columns). Spatial averaging took place within each region. The final output of the representational front end was the response values of 315 output units (9 regions × 5 frequencies × 7 orientations). These response values were the inputs to the probabilistic observers.
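As a quick check on the bookkeeping, the following Python sketch enumerates the filter bank and the spatial regions and confirms the counts of 140 receptive fields and 315 output units. The region layout (a 32-pixel stride between 64 × 64 regions) is an assumption chosen so that three regions span a 128-pixel side; the article does not state the amount of overlap.

```python
from itertools import product

orientations = [0, 15, -15, 30, -30, 45, -45]   # 7 orientations
frequencies = [1, 1.4, 2, 2.8, 4]               # 5 spatial frequencies
phases = [0, 90, 180, 270]                      # 4 phases

# Phase-sensitive Gabor receptive fields: one per (orientation, frequency, phase).
gabor_bank = list(product(orientations, frequencies, phases))

# Pooling over phase leaves one phase-invariant map per (orientation, frequency).
phase_invariant_maps = list(product(orientations, frequencies))

# Assumed layout: top-left corners of nine 64x64 regions in a 128x128 image.
region_size = 64
region_origins = [(row, col) for row in (0, 32, 64) for col in (0, 32, 64)]

n_output_units = len(region_origins) * len(phase_invariant_maps)
print(len(gabor_bank), n_output_units)
```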

Gaussian mixture models
Each ideal probabilistic observer was a Gaussian mixture model with two Gaussian distributions, one for images depicting left-closer surfaces and the other for images depicting right-closer surfaces. On each trial of the slant discrimination task, an observer decided whether an image depicted a left-closer or right-closer surface. This decision was made as follows. Let $\vec{x}_t$ denote the values of the output units of the representational front end on trial $t$ (because the front-end outputs are the inputs to a model, $\vec{x}_t$ also denotes these inputs). In addition, let $y_t^*$ be a binary variable indicating whether the image depicted a left-closer ($y_t^* = 1$) or right-closer ($y_t^* = 0$) surface on trial $t$. Finally, let $\vec{w}_t$ denote the observer's parameter values (mean vectors and covariance matrices for the left-closer and right-closer distributions) on trial $t$. The observer computed the posterior probability that the image depicted a left-closer surface via Bayes' rule:

$$P(y_t^* = 1 \mid \vec{x}_t, \vec{w}_t) = \frac{p(\vec{x}_t \mid y_t^* = 1, \vec{w}_t)\, P(y_t^* = 1)}{\sum_{k \in \{0, 1\}} p(\vec{x}_t \mid y_t^* = k, \vec{w}_t)\, P(y_t^* = k)},$$

where the prior probabilities $P(y_t^* = 1)$ and $P(y_t^* = 0)$ were each set to 1/2. If $P(y_t^* = 1 \mid \vec{x}_t, \vec{w}_t) \geq 0.5$, then the observer decided "left-closer"; otherwise, it decided "right-closer."
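A minimal Python sketch of this decision rule follows. It assumes a diagonal covariance shared by the two distributions, which matches the restricted observers simulated in this article, and the function names are illustrative rather than taken from the original simulations.

```python
import math

def log_gaussian_diag(x, mean, var):
    """Log density of a Gaussian with a diagonal covariance matrix."""
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

def posterior_left(x, mean_left, mean_right, var, prior_left=0.5):
    """Posterior probability of 'left-closer' given front-end outputs x."""
    log_odds = (log_gaussian_diag(x, mean_left, var) + math.log(prior_left)
                - log_gaussian_diag(x, mean_right, var) - math.log(1.0 - prior_left))
    # Numerically stable logistic function of the log posterior odds.
    if log_odds >= 0:
        return 1.0 / (1.0 + math.exp(-log_odds))
    e = math.exp(log_odds)
    return e / (1.0 + e)

def decide(x, mean_left, mean_right, var):
    """Decide 'left-closer' when the posterior is at least 0.5."""
    p = posterior_left(x, mean_left, mean_right, var)
    return "left-closer" if p >= 0.5 else "right-closer"

# Toy two-dimensional example with hypothetical parameter values.
print(decide([1.0, 0.0], [1.0, 0.0], [-1.0, 0.0], [1.0, 1.0]))
```

Working with log densities avoids numerical underflow when the input has hundreds of dimensions, as it does for the 315 front-end output units.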
The probabilistic observer adapted its parameter values on each trial. These parameter values always equaled the best possible values, and thus, our probabilistic observers can also be regarded as ideal learners (Eckstein, Abbey, Pham, & Shimozaki, 2004; Michel & Jacobs, 2008). On trial $t$, the observer had viewed the images on all trials up to and including trial $t$, along with the corresponding correct judgments of these images. The observer should use all of this information to optimally set its parameters. Let $X_{1,\ldots,t}$ denote the values of the front-end output units on trials 1 through $t$ (i.e., $\vec{x}_1, \ldots, \vec{x}_t$), and let $Y^*_{1,\ldots,t}$ denote the corresponding values of the binary target variable (i.e., $y_1^*, \ldots, y_t^*$). The optimal setting of the parameter values (in a maximum a posteriori sense) is the setting that maximizes the posterior distribution of the parameter values given the data $X_{1,\ldots,t}$ and $Y^*_{1,\ldots,t}$. Using Bayes' rule, this distribution can be written as

$$p(\vec{w}_t \mid X_{1,\ldots,t}, Y^*_{1,\ldots,t}) \propto p(X_{1,\ldots,t}, Y^*_{1,\ldots,t} \mid \vec{w}_t)\, p(\vec{w}_t).$$

The second term on the right-hand side is the prior distribution of parameter values. We assume that this is a uniform distribution. The first term is the likelihood function. Assuming that the data on each trial are independent, the likelihood function can be rewritten as

$$p(X_{1,\ldots,t}, Y^*_{1,\ldots,t} \mid \vec{w}_t) = \prod_{i=1}^{t} p(\vec{x}_i, y_i^* \mid \vec{w}_t).$$

An ideal probabilistic observer maximized this likelihood function as follows. Let $I_L$ be the set of trial indices for trials in which a left-closer surface was depicted, and let $I_R$ be the set of trial indices for trials in which a right-closer surface was depicted. $I_L$ and $I_R$ are disjoint sets whose union is the set $\{1, \ldots, t\}$. The mean vector of an observer's left-closer Gaussian distribution was set to the mean of the data $\{\vec{x}_i\}_{i \in I_L}$, whereas the mean vector of the right-closer distribution was set to the mean of the data $\{\vec{x}_i\}_{i \in I_R}$.
In the simulations described below, the covariance matrices of the left-closer and right-closer distributions were restricted so as to reduce the number of parameters whose values needed to be estimated, to make the estimation of the parameter values easy and robust, and to make the simulation results easy to analyze. Let the matrices $\Sigma_L$ and $\Sigma_R$ denote the covariances of the data $\{\vec{x}_i\}_{i \in I_L}$ and $\{\vec{x}_i\}_{i \in I_R}$, respectively. In some simulations (those of the Anisotropic Observer described below), $\Sigma_L$ and $\Sigma_R$ were restricted to be diagonal matrices, meaning they only represented the variances of the inputs, and covariances among pairs of inputs were assumed to be zero. In other simulations (those of the Isotropic Observer described below), each of these matrices was further restricted to have the same value along the diagonal, equal to the average variance of the inputs. Lastly, our simulations assumed that the left-closer and right-closer distributions had equal covariance matrices. This common matrix, denoted $\Sigma$, was set to the pooled covariance matrix:

$$\Sigma = \frac{n_L \Sigma_L + n_R \Sigma_R}{n_L + n_R},$$

where $n_L$ and $n_R$ are the number of elements in $I_L$ and $I_R$, respectively. In the field of statistics, this pooled covariance matrix is also referred to as the average class-conditional covariance matrix.
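The parameter estimates described above (class means plus a shared, restricted covariance) can be sketched as follows. The pooling weights (class counts, with variances computed by dividing by the class size) are an assumption consistent with maximum likelihood estimation; the function name and the toy data are hypothetical.

```python
def fit_observer(xs, ys, isotropic=False):
    """Estimate class means and a shared diagonal covariance.

    xs: list of input vectors (one per trial).
    ys: list of labels, 1 = left-closer, 0 = right-closer.
    Returns (mean_left, mean_right, shared_variances).
    """
    left = [x for x, y in zip(xs, ys) if y == 1]
    right = [x for x, y in zip(xs, ys) if y == 0]
    dim = len(xs[0])

    def mean(rows):
        return [sum(r[d] for r in rows) / len(rows) for d in range(dim)]

    def variance(rows, mu):
        # Maximum-likelihood (divide by n) variance of each input dimension.
        return [sum((r[d] - mu[d]) ** 2 for r in rows) / len(rows)
                for d in range(dim)]

    mu_l, mu_r = mean(left), mean(right)
    var_l, var_r = variance(left, mu_l), variance(right, mu_r)
    n_l, n_r = len(left), len(right)

    # Pooled (average class-conditional) variance for each input dimension.
    pooled = [(n_l * a + n_r * b) / (n_l + n_r) for a, b in zip(var_l, var_r)]
    if isotropic:
        # Isotropic restriction: one variance, the average over all inputs.
        avg = sum(pooled) / dim
        pooled = [avg] * dim
    return mu_l, mu_r, pooled

# Toy data: four trials, two input dimensions (hypothetical values).
xs = [[0.0, 0.0], [2.0, 4.0], [10.0, 0.0], [12.0, 8.0]]
ys = [1, 1, 0, 0]
print(fit_observer(xs, ys))
print(fit_observer(xs, ys, isotropic=True))
```

Calling the function with all trials seen so far, after every trial, would reproduce the trial-by-trial adaptation of the parameters.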

Simulation results
We simulated two sets of probabilistic observers. One set is consistent with a global pooling hypothesis in the sense that its members received the responses of all front-end output units. The second set is consistent with a precision pooling hypothesis in the sense that its members received the responses of a subset of front-end output units.
The first set consisted of two observers referred to as the Anisotropic and Isotropic Observers. For the Anisotropic Observer, its two Gaussian distributions shared a common covariance matrix, and this matrix was restricted to a diagonal matrix. The term "anisotropic" refers to the fact that each front-end output unit (or, equivalently, each input to the observer) was modeled with its own variance parameter, meaning that the Anisotropic Observer scaled its input space differently in different input dimensions. This observer had 945 parameters: 315 mean parameters for each distribution and 315 variance parameters shared by both distributions.
The Isotropic Observer was identical to the Anisotropic Observer except that its covariance matrix was further restricted such that its diagonal entries were equal to each other. Because the same variance parameter was used for all inputs, the Isotropic Observer used a uniform scaling in all input dimensions. This observer had 631 parameters: 315 mean parameters for each distribution and 1 shared variance parameter.
For each subject in the psychophysical experiment, 25 simulations of the Anisotropic and Isotropic Observers were performed. Each simulation used the same sequence of slants as the subject. That is, if the subject viewed a surface with slant C on training trial t, then an observer was exposed to an image depicting a surface with slant C on its tth training trial. The replications of each observer differed in the noise added to the grids defining each surface (see Figure 1).
We computed the Anisotropic and Isotropic Observers' average learning curves (see Figure 2). A notable feature of these curves is that they are relatively flat, i.e., the observers' learning algorithms converged quickly and their performances did not significantly improve beyond the first training block. In this sense, the human subjects' learning curves and the observers' learning curves have different shapes, suggesting that their underlying learning mechanisms have different dynamics. This important point will be discussed below in the Adaptive precision pooling hypothesis section.
The Anisotropic Observer performed nearly perfectly starting from the first training block (red lines in Figure 2). This result indicates that, in the 315-dimensional space defined by the responses of the front-end output units, the left-closer and right-closer surfaces are far apart and, thus, easy for the Anisotropic Observer to distinguish. In contrast, the Isotropic Observer showed relatively poor performance (green lines in Figure 2). A comparison of the Anisotropic and Isotropic Observers highlights the importance of individually scaling each front-end output unit's response. When each unit's response is individually scaled, as in the Anisotropic Observer, the visual slant discrimination task is easy. When the same scaling is applied to all units' responses, as in the Isotropic Observer, the task is difficult.
The results also indicate that our human subjects were inefficient learners (inefficiency of human visual learning has previously been reported by Eckstein et al., 2004 and Michel & Jacobs, 2008). Whereas the Anisotropic Observer achieved near-perfect performance starting from the first training block, none of our subjects achieved near-perfect performance even after fourteen blocks of training. An observer's discrimination performance can be quantified in units of $d'$, a measure based on signal detection theory (Green & Swets, 1966). If we let $d'_{\mathrm{subj}}(b)$ denote a subject's performance on training block $b$, and let $d'_{\mathrm{AO}}(b)$ denote the Anisotropic Observer's performance on training block $b$, then $(d'_{\mathrm{subj}}(b) / d'_{\mathrm{AO}}(b))^2$ is the subject's discrimination efficiency at block $b$ (this quantity ranges between 0 and 1; a value of 1 means that the subject performed optimally in the sense that the subject performed as well as the ideal probabilistic observer, whereas a value less than 1 indicates that the subject performed suboptimally; Tanner & Birdsall, 1958). On average, subjects' discrimination efficiencies were 0.09 on block 1 and increased to 0.31 on block 14. These data indicate that subjects did not learn as much as they theoretically could have (based on the assumptions underlying the Anisotropic Observer).
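To make the efficiency measure concrete, the short Python sketch below computes it from a pair of sensitivity values and inverts it to recover the implied sensitivity ratio. The sensitivity values passed to the function are hypothetical; only the efficiencies 0.09 and 0.31 come from the results above.

```python
import math

def discrimination_efficiency(d_subject, d_observer):
    """Squared ratio of the subject's sensitivity to the ideal observer's."""
    return (d_subject / d_observer) ** 2

def sensitivity_ratio(efficiency):
    """Ratio of subject to ideal-observer sensitivity implied by an efficiency."""
    return math.sqrt(efficiency)

# Hypothetical sensitivities that would yield an efficiency of 0.09.
print(discrimination_efficiency(1.5, 5.0))

# Sensitivity ratios implied by the reported average efficiencies.
print(sensitivity_ratio(0.09), sensitivity_ratio(0.31))
```

Under this definition, the reported efficiencies of 0.09 and 0.31 correspond to subjects' sensitivities being roughly 30% and 56% of the Anisotropic Observer's, respectively.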
Above, we stated that the Anisotropic Observer is consistent with a global pooling hypothesis because it receives the responses of all front-end output units. Although it receives the responses of all units, this does not necessarily mean that it makes significant use of all responses. It is possible that this observer uses the responses of a subset of the units and ignores all other units. We evaluated this possibility as follows. The Anisotropic Observer is an instance of a Gaussian Mixture Model with two components, and thus, it is possible to convert it to an equivalent logistic regressor (Bishop, 2006). This regressor maps the activities of the front-end output units to the posterior probability that an image depicts a surface whose left side is closer to the observer (one minus this value is the probability that a surface's right side is closer). Let $w_i(t)$ denote the regressor's weight on the response of unit $i$ at trial $t$. To quantify the degree to which the regressor makes use of unit $i$ at trial $t$, we define the relative weight for unit $i$ as $r_i(t) = |w_i(t)| / \sum_j |w_j(t)|$.
Based on their average relative weights at the end of training, we rank ordered the front-end output units. The results are that 10%, 20%, and 30% of the units account for 44%, 67%, and 82% of the regressor's total relative weight, respectively. To account for 95% of the regressor's total relative weight requires 46% of the units. In summary, the logistic regressor derived from the Anisotropic Observer makes nearly no use of about half of the front-end output units and makes at least mild use of the remaining units. We conclude that, although the Anisotropic Observer has the potential to follow the decision making strategy postulated by a global pooling hypothesis, it makes nonnegligible use of only half of its inputs, which means that the observer does not closely follow this strategy.
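The conversion and the relative-weight analysis can be sketched in Python as follows. For two Gaussians with equal priors and a shared diagonal covariance, the log posterior odds are linear in the inputs, with the weight on unit i equal to the difference between the class means divided by the shared variance. This is the standard result for shared-covariance Gaussian classifiers; the function names and the toy parameter values are assumptions made for the example.

```python
def logistic_weights(mean_left, mean_right, var):
    """Weights and bias of the logistic regressor equivalent to the observer."""
    weights = [(l - r) / v for l, r, v in zip(mean_left, mean_right, var)]
    bias = -0.5 * sum((l * l - r * r) / v
                      for l, r, v in zip(mean_left, mean_right, var))
    return weights, bias

def relative_weights(weights):
    """Absolute weight of each unit as a fraction of the total absolute weight."""
    total = sum(abs(w) for w in weights)
    return [abs(w) / total for w in weights]

def share_of_top_units(rel, fraction):
    """Total relative weight carried by the top `fraction` of units."""
    k = max(1, round(fraction * len(rel)))
    return sum(sorted(rel, reverse=True)[:k])

# Toy observer with four inputs (hypothetical parameter values).
w, b = logistic_weights([1.0, 0.0, 2.0, 0.0], [-1.0, 0.0, 0.0, 1.0],
                        [1.0, 1.0, 2.0, 4.0])
rel = relative_weights(w)
print(w, b)
print(share_of_top_units(rel, 0.5))
```

Applied to the 315 weights of a trained observer, `share_of_top_units` evaluated at fractions such as 0.1, 0.2, and 0.3 gives the kind of cumulative shares reported above.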
The second set of ideal probabilistic observers consists of observers that are consistent with a precision pooling hypothesis. Each observer was identical to the Anisotropic Observer with the exception that it received the responses of only a subset of front-end output units. The size of the subset was fixed, but the units comprising the subset were chosen at random. For each subject in the psychophysical experiment, 25 simulations of a probabilistic observer were performed. The replications of an observer differed in the noise added to the grids defining a surface shown on a trial and in the subset of front-end output units whose activities were the inputs to the observer.

Figure 3. Learning curves for the subjects and the ideal probabilistic observers consistent with a precision pooling hypothesis. The horizontal axis of each graph indicates the training block number, and the vertical axis indicates the percent correct. In each of the first 8 graphs, the blue line plots the performance of an individual subject, and the red, green, and cyan lines plot the average performances of the observers that used 1, 8, or 16 front-end output units chosen at random, respectively, when trained under the same conditions as the subject. The bottom right graph shows the averages of the first 8 graphs.

The data are shown in Figure 3. As before, the observers' learning curves are flat, indicating that their learning dynamics were different from the dynamics characterizing learning in the human subjects. Again, this point will be addressed below.
An observer that used only a single front-end output unit performed poorly on average on the visual slant discrimination task (red line in Figure 3). An observer that used 8 units (2.5% of the total number of front-end output units) performed well (green line), and an observer that used 16 units (5%) performed extremely well (cyan line). Subjects' performances (blue line) at the end of training were nearly equal to those of the observer that used 8 units. These data reveal that there is a "point of diminishing returns" when considering how many units to use during decision making. For example, the Anisotropic Observer used many more units (315) than the observer that used 16 units, but its performance was only mildly better.

Adaptive precision pooling hypothesis
Intuitively, it seems that a decision network computing on the basis of a global pooling strategy would be inefficient because this network would include in its calculations neural responses that are irrelevant for the current task, responses that are highly noisy, or both. Thus, it seems sensible that a neural decision network might be biased toward the use of only a limited subset of inputs. Speaking broadly, this intuition has been formalized previously in different ways by both computational scientists and neurobiologists.
First, a decision network consistent with global pooling could perform its calculations on the basis of the activities of hundreds of thousands of neurons. However, it has been found that computational techniques that work well with low-dimensional input spaces often perform poorly with high-dimensional input spaces, a problem known as the "curse of dimensionality" (Bellman, 1957). Limiting the inputs to a network, as in precision pooling, can potentially be a form of dimensionality reduction, which ameliorates this problem. Second, a decision network receiving the inputs of many neurons has many parameters it could adapt (e.g., synaptic strengths). In the computational literature, it has been found that methods with many parameters are prone to "over-fitting" their training data, thereby leading to poor generalization. A bias toward the use of only a limited subset of inputs can be regarded as a type of regularization, which may lead to better task performance with novel stimuli (MacKay, 2003). Third, computational neuroscientists have argued that sparse representations in which few neurons are concurrently active have important representational advantages (Földiák, 1990; Olshausen & Field, 1996; Simoncelli & Olshausen, 2001). For example, neurons in networks biased toward the use of only a limited set of inputs become sensitive to additive components, often referred to as independent components, underlying non-Gaussian source signals. Lastly, visual neuroscientists have argued that the high metabolic energy costs of spikes motivate both the use of representational codes that rely on few active neurons and the need for mechanisms of selective attention (Lennie, 2003).
Neurophysiological data reviewed above and elsewhere (Purushothaman & Bradley, 2005) suggest that biological organisms may make perceptual decisions using precision pooling under some circumstances. Our computer simulation results reported above suggest that precision pooling can be an effective decision making strategy showing near-optimal performance with relatively few computational resources. The arguments presented in the preceding paragraph further motivate the study of precision pooling. Despite the potential benefits of precision pooling, it also has significant shortcomings. Importantly for our purposes, it does not provide an adequate account of the learning dynamics and performances of the human subjects in our psychophysical experiment. This failure stems from the fact that, by itself, precision pooling does not provide a strong link between perceptual learning and decision making. In the remainder of this article, we examine the Adaptive Precision Pooling Hypothesis, which links learning and decision making.
The inputs to our ideal probabilistic observers were the activities of the model neurons, which serve as the representational front-end outputs. Therefore, the observers made their perceptual judgments on the basis of these neurons' activities. We examined the information carried by each of these model neurons as follows. As above, let $\bar{x}_i^L(t)$ and $\bar{x}_i^R(t)$ denote the means of the $i$th neuron's activities over trials $1, \ldots, t$ when images depicted left-closer and right-closer surfaces, respectively, and let $\sigma_i(t)$ denote the neuron's average class-conditional standard deviation. The information carried by model neuron $i$ at trial $t$ can be quantified as

$$d'_i(t) = \frac{|\bar{x}_i^L(t) - \bar{x}_i^R(t)|}{\sigma_i(t)}$$

(Green & Swets, 1966). If neuron $i$'s activities are very different when images depict left-closer versus right-closer surfaces, then $d'_i$ will be a large number. Otherwise, it will be near zero.
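A small Python sketch of this per-neuron measure and the subsequent rank ordering is given below. It takes the average class-conditional standard deviation to be the square root of the count-weighted average of the two class variances, which is an assumption about how the averaging was done; the function names and the toy responses are hypothetical.

```python
import math

def neuron_dprime(left_responses, right_responses):
    """Sensitivity of one model neuron from its responses in the two classes."""
    n_l, n_r = len(left_responses), len(right_responses)
    mean_l = sum(left_responses) / n_l
    mean_r = sum(right_responses) / n_r
    var_l = sum((x - mean_l) ** 2 for x in left_responses) / n_l
    var_r = sum((x - mean_r) ** 2 for x in right_responses) / n_r
    # Assumed definition: root of the count-weighted average class variance.
    sigma = math.sqrt((n_l * var_l + n_r * var_r) / (n_l + n_r))
    return abs(mean_l - mean_r) / sigma

def rank_neurons(left_by_neuron, right_by_neuron):
    """Return neuron indices sorted from most to least informative."""
    scores = [neuron_dprime(l, r)
              for l, r in zip(left_by_neuron, right_by_neuron)]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order, scores

# Toy responses of two neurons on left-closer and right-closer trials.
left = [[1.0, 3.0], [0.0, 4.0]]
right = [[5.0, 7.0], [1.0, 5.0]]
print(rank_neurons(left, right))
```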
For each subject in the psychophysical experiment, we performed 25 simulations of the Anisotropic Observer.We then rank ordered the model neurons based on their average dV values at the end of training.The results are shown in Figure 4.It is consistently the case that a small number of neurons are highly informative for the visual slant discrimination task, whereas the vast majority of neurons are only moderately or mildly informative. 3 Motivated by these data, we begin to evaluate the APP Hypothesis by examining an observer that represents an extreme instance of a decision maker consistent with this hypothesis.As with previous observers, this observer was a Gaussian Mixture Model with two Gaussian distributions.Importantly, this observer performed the visual slant discrimination task on the basis of the responses of a single neuron in which the selection of this neuron was performed as follows.For each experimental subject, we rank ordered the model neurons according to their average dV (t) values at the end of training.We then simulated observers based on neurons ranked 1, 5, 10, and 20.The results are shown in Figure 5.When averaged across subjects, observers based on neurons ranked 1, 5, 10, and 20 achieved performances of 99%, 95%, 87%, and 80% correct at the end of training, respectively.Surprisingly, to achieve approximately the same level of performance as the human subjects, an observer needed to use the 10th most informative neuron (out of the population of 315 neurons).
The fact that an observer needed to use a model neuron in the top 3% of most informative neurons to match human performance suggests an explanation as to why human subjects in our experiment were suboptimal learners. It may be that, during the course of training, subjects "searched" for the neurons in visual cortex which were most informative for the experimental task. When a subject found a neuron that was more informative than the neurons it had previously identified, the subject's performance improved. Because highly informative neurons are rare, a subject's performance was frequently imperfect.
To evaluate this explanation, and to further evaluate the APP Hypothesis, we implemented an observer that is an adaptive decision maker consistent with this hypothesis. This observer used a competitive learning scheme to perform a greedy hill-climbing search for a good model neuron. Consequently, this scheme closely resembles the model using competitive learning and constrained computational capacity previously proposed by Liu and Weinshall (2000) to account for aspects of generalization during human perceptual learning. At the start of a simulation, two model neurons were selected at random. One of these was randomly selected as the "winner," whereas the other served as the "competitor." During a training block, the observer's judgments were based on the winning neuron. However, judgments based on the competitor were also calculated. At the end of a training block, the neuron with the better performance was designated as the new winner. The other neuron was discarded, and a new model neuron was randomly sampled to serve as the competitor. This learning process continued for the duration of training.
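The winner-and-competitor search described above can be sketched as follows. This is a simplified illustration under several assumptions that are not taken from the article: each neuron is summarized by a fixed sensitivity value, its response on a trial is a unit-variance Gaussian sample centered at plus or minus half that sensitivity, noise is independent across the two neurons, and each neuron is read out with a fixed criterion at zero rather than with an online Gaussian Mixture Model. The function name `simulate_app_observer` and the example population are hypothetical.

```python
import random


def simulate_app_observer(d_primes, n_blocks, trials_per_block, rng):
    """Greedy hill-climbing over single neurons; returns (learning curve, final winner)."""
    n = len(d_primes)
    # Start with two distinct neurons chosen at random: a winner and a competitor.
    winner, competitor = rng.sample(range(n), 2)
    curve = []
    for _ in range(n_blocks):
        correct = {winner: 0, competitor: 0}
        for _ in range(trials_per_block):
            label = rng.choice((-1, 1))  # -1: right-closer, +1: left-closer
            for idx in (winner, competitor):
                # Assumed response model: mean = label * d'/2, standard deviation = 1.
                response = rng.gauss(label * d_primes[idx] / 2.0, 1.0)
                if (response > 0) == (label > 0):
                    correct[idx] += 1
        # Behavior during the block is determined by the winner only.
        curve.append(correct[winner] / trials_per_block)
        # Keep whichever neuron performed better; ties keep the current winner.
        if correct[competitor] > correct[winner]:
            winner = competitor
        # Discard the loser and draw a fresh competitor at random
        # (assumption: any neuron other than the current winner may be drawn).
        competitor = rng.choice([i for i in range(n) if i != winner])
    return curve, winner


# Made-up population: a few highly informative neurons among 315.
population = [4.0, 3.0, 2.5] + [0.3] * 312
curve, final_winner = simulate_app_observer(population, 40, 100, random.Random(0))
```

Because a new winner is adopted only when a randomly drawn competitor happens to outperform the current one, different random seeds in this sketch can produce flat, gradual, or step-like curves.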
The results are shown in Figure 6, where each graph plots an observer's learning curve. In some simulations, performance improved gradually with training time. In other simulations, performance rose abruptly, sometimes to a near-optimal level. In still other simulations, performance improvements were small or nonexistent. In terms of their shape and scale, these learning curves qualitatively resemble those of the human subjects. That is, the learning process studied here produces a range of learning performance dynamics that resembles the range of dynamics exhibited by human subjects.

Figure 5. For each human subject, an ideal probabilistic observer was simulated 25 times using the same training conditions as the subject. This observer performed the visual slant discrimination task on the basis of the responses of a single neuron. The model neurons that served as input to the observers were rank ordered based on their average $d'$ values at the end of training. The red, green, cyan, and magenta lines show the performances of the observers based on neurons ranked 1, 5, 10, and 20, respectively. The blue lines show the performances of the human subjects. The first 8 graphs plot the data associated with the 8 subjects. The bottom right graph is the average of the first 8 graphs.

Figure 6. An observer implementing an adaptive precision pooling strategy was simulated 25 times using the same training conditions as a typical subject (Subject 7). Within a training block, this observer performed the visual slant discrimination task on the basis of the responses of a single neuron. Over training blocks, the observer performed a greedy hill-climbing search to identify a neuron whose responses were highly informative for the experimental task.
To highlight the learning dynamics of the observer consistent with the APP Hypothesis, we contrast them with those of an observer that implemented an adaptive global pooling strategy. This observer performed the slant discrimination task using the responses of all front-end output neurons. On each trial, it updated its mean vectors and diagonal covariance matrix using conventional online error-correcting rules whose general forms are commonplace in the computational modeling literature (e.g., Petrov et al., 2005; Rescorla & Wagner, 1972; Widrow & Hoff, 1960). The update rules for the mean vectors and for the covariance matrix are given in Equations 5 and 6, respectively, where the diag operator turns a vector into a diagonal matrix, and where the square of a vector is taken on an element-by-element basis. The only free parameter is the learning rate $\epsilon$, which was set to a value ($\epsilon = 0.001$) that produced levels of performance at the end of training approximately equal to those of the human subjects. The results are shown in Figure 7. In all simulations, the learning curves have a stereotyped pattern in which performance improved in a slow, steady manner. Although these learning curves resemble those obtained by averaging the curves of many subjects, they do not resemble those of individual subjects. That is, the learning process of the adaptive global pooling observer does not produce a range of learning performance dynamics matching the range of dynamics exhibited by individual human subjects.
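A single step of the online error-correcting rules described above can be sketched as follows. On each trial, the mean vector of the displayed class moves a fraction of the way toward the current response vector, and the diagonal of the covariance matrix moves the same fraction of the way toward the element-wise squared deviation. The function name `update_observer` and the dictionary layout are hypothetical, and computing the squared deviation from the mean as it stood before the step is an assumption.

```python
def update_observer(means, var, x, label, lr=0.001):
    """One online update of the global pooling observer.

    means: dict with keys "L" and "R", each a list of per-neuron means.
    var:   list holding the diagonal of the shared covariance matrix.
    x:     list of front-end neuron responses on this trial.
    label: "L" or "R", the class of the stimulus shown on this trial.
    lr:    learning rate.
    """
    mu = means[label]
    # Covariance diagonal: move toward the element-wise squared deviation.
    # Assumption: the deviation is taken from the mean before this step's update.
    new_var = [v + lr * ((xi - m) ** 2 - v) for v, xi, m in zip(var, x, mu)]
    # Mean of the displayed class: move toward the current response vector.
    new_mu = [m + lr * (xi - m) for m, xi in zip(mu, x)]
    new_means = dict(means)
    new_means[label] = new_mu
    return new_means, new_var
```

With a small learning rate such as 0.001, each trial changes every parameter only slightly, which is consistent with the slow, steady learning curves reported for this observer.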
The adaptive precision pooling model differs from other models in several respects. It differs from linear models (or quasi-linear models such as logistic regression), such as that of Petrov et al. (2005) as well as the adaptive global pooling observer, because it includes a nonlinear adaptation mechanism. This mechanism produces a potential discontinuity in the model's behavior when it adds or deletes neurons from the limited set of neurons influencing decision making. This nonlinearity allows the model to exhibit a wide range of learning dynamics. In addition, the form of this nonlinearity distinguishes this model from other nonlinear models. For example, the model differs from multilayer neural networks, which learn new nonlinear features by adapting the weights of internal units (Rumelhart, Hinton, & Williams, 1986). While both forms of nonlinear adaptation are likely to play a role in human visual learning, this article has focused on the form exhibited by the adaptive precision pooling model. This focus is motivated, at least in part, by new neurophysiological evidence favoring precision pooling and by existing evidence regarding neural competition. Indeed, neural competition is thought to play a role in many forms of learning and development (Purves & Lichtman, 1985), and our simulations suggest that it also plays a role in the type of perceptual learning studied here.

Figure 7. An observer implementing an adaptive global pooling strategy was simulated 25 times using the same training conditions as a typical subject (Subject 7). This observer performed the visual slant discrimination task on the basis of all front-end neuron responses. On each training trial, it used an online error-correcting rule to update its mean vectors and covariance matrix.

Discussion
We have described an account of human visual learning on a visual slant discrimination task, referred to as the Adaptive Precision Pooling (APP) Hypothesis. Relative to other possible accounts, we believe that this account has several appealing features.
First, the APP Hypothesis links behavior with its neural underpinnings. At a behavioral level, the hypothesis states that improvements in performance occur when a learner starts to use perceptual features that are more informative for a task than the features that the learner had previously used (Gibson & Gibson, 1955). At a neural level, the hypothesis states that an organism makes perceptual decisions by pooling the responses of a small number of highly informative neurons (Purushothaman & Bradley, 2005). In addition, a learner can adapt the set of neurons it uses during decision making to include more informative neurons. The behavioral consequence of this neural adaptation is that the learner's performance improves because it starts to use more informative features. In this article, we have considered the APP Hypothesis primarily from psychophysical and computational perspectives. We have not attempted to address its neurophysiological implementation. One possibility is that the selection of neurons to use during decision making can be performed through a competitive learning scheme based on Hebbian and anti-Hebbian mechanisms. Future work will need to explore this possibility. Future work will also need to explore more realistic models of neural function, including models containing correlated noise among neural responses and models containing noise at the level of pooling and decision making.
Second, the APP Hypothesis potentially links theories of visual learning, attention, and decision making. We have examined a simple implementation of the hypothesis in this article. In particular, the simulations described above used a random hill-climbing search for potentially highly informative neurons. Our approach, however, provides the opportunity for new theories linking learning and attention based on mechanisms for selective attention which guide the search for informative neurons. We see this as an important area for future research. Early steps in this direction have already been taken (e.g., Dayan, Kakade, & Montague, 2000; Fahle, 2004; Hochstein & Ahissar, 2002).
Third, the vision sciences literature contains many theories of visual learning, including theories stating that learning is due to adaptations of low-level sensory representations (Karni & Sagi, 1991; Schoups et al., 2001), adaptations of the weightings of read-out connections from early visual representations (Dosher & Lu, 1998, 1999; Law & Gold, 2008; Petrov et al., 2005), external noise filtering or internal noise reduction (Dosher & Lu, 1998), increases in the efficiency with which learners encode task-relevant information (Gold, Bennett, & Sekuler, 1999), and attentional selection of specific neural populations which should undergo adaptation (Hochstein & Ahissar, 2002). The APP Hypothesis contains several of the best characteristics of many of these theories. The hypothesis clearly involves adaptations of the weightings of read-out connections from early visual representations. Because it also includes the addition of more informative neurons in the decision making process and the deletion of less informative neurons, a learner consistent with the hypothesis can increase the efficiency with which it encodes task-relevant information and can also perform external noise filtering and internal noise reduction. Furthermore, as mentioned above, the search for more informative neurons can be characterized as an attentional process guiding visual performance and learning. The key point here is that other theories in the vision sciences literature may be able to account for our human subjects' learning performances, but these theories are not necessarily competitors of the APP Hypothesis. Rather, the APP Hypothesis may provide a single mechanism for implementing these other theories. If so, it would provide a unifying framework for many current theories of perceptual learning.
Fourth, the APP Hypothesis provides new explanations for behavioral phenomena that are currently poorly understood. It is often the case, for instance, that some human subjects in a visual learning experiment show gradual improvements in performance, whereas others show abrupt improvements (Rubin et al., 2002). The APP Hypothesis provides a single mechanism that can show both patterns of performance improvements (see Figure 6). As a second example, researchers have reported that perceptual learning is often highly specific. Subjects trained on a task at one retinal position, or with one eye, or with one stimulus orientation may show improvements with practice, but their performances often return to baseline when tested under novel conditions (Fahle, Edelman, & Poggio, 1995; Fiorentini & Berardi, 1980; Karni & Sagi, 1991). According to the hypothesis, a learner will fail to generalize acquired perceptual knowledge when the neural responses selected by the learner's decision network are informative under training but not test conditions. Further experimentation is needed to evaluate these explanations.
Fifth, the APP Hypothesis has a solid statistical foundation. The implementation described here relies on statistical techniques which are well defined and well understood, such as the use of Gaussian Mixture Models for density estimation and the use of Bayes' rule for decision making (Bishop, 2006). In addition, the use of a greedy hill-climbing strategy to search for informative neurons resembles feature selection techniques that are commonly used in statistical applications when many features are available but only a small subset are likely to be relevant for a given task (Guyon & Elisseeff, 2003).
Sixth, the APP Hypothesis is simple. The implementation described here relies on low-order summary statistics such as means and variances. There are good reasons to believe that neurons could calculate these quantities. Indeed, computational neuroscientists have shown how neural circuits can encode and manipulate probability distributions (Pouget, Dayan, & Zemel, 2003).
Seventh, the APP Hypothesis suggests possible new research directions, including new experiments and predictions about the outcomes of those experiments. Consequently, the hypothesis is falsifiable. At a behavioral level, the hypothesis states that improvements in task performance are due to the use of new perceptual features. Vision scientists have recently begun to use behavioral techniques to determine the stimulus components, template, or "classification image" that an observer uses to perform a binary discrimination task (Abbey & Eckstein, 2002; Ahumada, 1967, 2002; Levi & Klein, 2002; Lu & Liu, 2006). In principle, and sometimes in practice, these techniques can be used to study how classification images change with learning (Beard & Ahumada, 1999; Gold, Sekuler, & Bennett, 2004; Michel & Jacobs, 2008). Consider an experiment in which stimuli are created from a large number of perceptual features, but only a small subset of these features are diagnostic of a stimulus category. One possibility, consistent with a global pooling hypothesis, is that people's templates contain the full set of potentially important features at all stages of training. The APP Hypothesis predicts, in contrast, that people's templates will be characterized by small subsets of features whose members vary over time. Next, consider an experiment in which many features are equally diagnostic of a stimulus category. A global pooling hypothesis would claim that people's templates will contain all of these features. In contrast, the APP Hypothesis predicts that these templates will contain only a small subset of features. If attention and learning are linked, then the subset will contain those features that are most salient on the basis of endogenous and exogenous selective attention.
At a neurophysiological level, recent data by Law and Gold (2008) are consistent with basic predictions of the APP Hypothesis. These authors trained monkeys to perform a visual motion-direction discrimination task. They found that performance improvements were accompanied by changes in neural responses in cortical area LIP, an area associated with decision making, but not in neural responses in area MT, an area associated with sensory representation. Moreover, during the course of training, the responses of the most sensitive neurons in MT became more strongly correlated with a monkey's decisions. These data support the APP Hypothesis because they are consistent with the idea that performance improvements are due to a more selective read-out of the most sensitive sensory signals.
The APP Hypothesis motivates additional experiments evaluating finer details of the hypothesis, such as experiments in which image regions or features to which an observer attends are manipulated. Consider, for example, a monkey discriminating two very close directions of visual motion, such as 0° and 3°. In addition, consider the subset of MT neurons whose activities are informative for this task, such as neurons that are reliably more active with 0° stimuli than with 3° stimuli. In one experimental condition, these neurons could be electrically stimulated when the monkey views 0° stimuli, thereby causing these neurons to become highly active. Would this manipulation make it easier for the monkey to identify these highly informative neurons? If the APP Hypothesis uses competitive learning implemented via a Hebbian mechanism, then the hypothesis predicts that the answer to this question is yes, and that the manipulation will cause the monkey's decision network to use the activities of these neurons in future decisions, thereby leading to better performance and faster learning. In another condition, area MT neurons selective for relatively uninformative motion directions could be stimulated. Would the monkey's decision network then use the activities of these neurons? Again, the hypothesis predicts that the answer is yes. In this case, the manipulation would lead to worse performance and slower learning.
Taken as a whole, the APP Hypothesis is an early step toward a neural theory of perceptual learning. In its current state, it leaves open many questions regarding, for example, its neural implementation, the relationships between learning, attention, and decision making, and its computational properties. Perhaps its greatest strength is that it provides a single, coherent framework for addressing many issues in the perceptual sciences. We hope that psychophysicists, neurophysiologists, and computational neuroscientists will all contribute to its future development.

1 There exist in the literature at least two explanations at the level of neural processing for visual learning, namely that learning results from changes in the tunings of neurons' sensitivity functions (tuning functions might shift, broaden, or sharpen) and that learning results from changes in the weights assigned to the responses of neurons contributing to a psychological response. These two explanations are compatible because reweightings of neural responses will necessarily result in changes to the tuning functions of all mechanisms (including decision mechanisms) subsequent to the reweighting.
2 Our reasons for using the term "scaling" can be illustrated as follows. As discussed below, a Gaussian mixture model can be converted to an equivalent logistic regressor that maps a model's inputs to the probability that the inputs represent the image of a left-closer surface. The regressor corresponding to the Anisotropic Observer multiplies input $i$ by $1/\sigma_i^2$, where $\sigma_i^2$ is input $i$'s average class-conditional variance. That is, the regressor scales each input dimension by the inverse of the variance along that dimension. The regressor corresponding to the Isotropic Observer multiplies all inputs by the same value, meaning that this regressor uses a uniform scaling in all input dimensions.
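The conversion mentioned in this note can be illustrated with a short textbook derivation. It is a sketch under stated assumptions rather than a reproduction of the article's own equations: the two Gaussians share a diagonal covariance matrix with variances $\sigma_i^2$, the class means are $\bar{x}_i^L$ and $\bar{x}_i^R$, and the two classes are assumed to have equal prior probabilities. The symbols $w_i$ and $b$ are introduced here for illustration.

```latex
\begin{align*}
\log\frac{p(L \mid \vec{x})}{p(R \mid \vec{x})}
  &= \sum_i \frac{(x_i - \bar{x}_i^R)^2 - (x_i - \bar{x}_i^L)^2}{2\sigma_i^2} \\
  &= \sum_i \underbrace{\frac{\bar{x}_i^L - \bar{x}_i^R}{\sigma_i^2}}_{w_i}\, x_i
     \;\underbrace{-\sum_i \frac{(\bar{x}_i^L)^2 - (\bar{x}_i^R)^2}{2\sigma_i^2}}_{b},
\\[4pt]
p(L \mid \vec{x}) &= \frac{1}{1 + \exp\!\left(-\left(\sum_i w_i x_i + b\right)\right)}.
\end{align*}
```

Under these assumptions, each input's coefficient is the difference between its class means divided by its variance, so inputs with small variance are scaled up. When all variances are equal, as for the Isotropic Observer, every input is scaled by the same factor.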

3 Recall that a subject was trained to distinguish two surface slants (C and −C). In addition, recall that the model neurons are tuned to specific spatial locations, frequencies, and orientations. Depending on the slant values used when training a subject, the receptive field properties of a neuron may make its responses highly diagnostic of whether a left-closer or right-closer surface was displayed on a trial, or may make it only moderately or mildly diagnostic.

Figure 1. Illustrations of planar surfaces defined by "noisy" grids. The surfaces were viewed through a circular aperture so that their outer edges were not visible. The left surface has a slant of 40° (left side closer), whereas the right surface has a slant of −30° (right side closer).

Figure 2. Learning curves for the subjects and the ideal probabilistic observers consistent with a global pooling hypothesis. The horizontal axis of each graph indicates the training block number, and the vertical axis indicates the percent correct. In each of the first 8 graphs, the blue line plots the performance of an individual subject, and the red and green lines plot the average performances of the Anisotropic (Gaussian Mixture Model with diagonal covariance matrix) and Isotropic (Gaussian Mixture Model with diagonal covariance matrix with identical values along the diagonal) Observers, respectively, when trained under the same conditions as the subject. The bottom right graph shows the averages of the first 8 graphs.
Let $\bar{x}_i^L(t)$ denote the mean of the $i$th unit's activities over trials $1, \ldots, t$ when images depicted left-closer surfaces, and let $\bar{x}_i^R(t)$ denote the corresponding value when images depicted right-closer surfaces. Let $\sigma_i^2(t)$ denote the unit's average class-conditional variance. The regression coefficient corresponding to the $i$th front-end output unit at training trial $t$ is w

Figure 4. For each human subject, the Anisotropic Observer was simulated 25 times using the same training conditions as the subject. The model neurons that served as input to the observer were rank ordered based on their average $d'$ values at the end of training. The horizontal axis of each graph plots a neuron's rank order, and the vertical axis plots a neuron's average $d'$. The first 8 graphs correspond to the simulations associated with the 8 subjects. The bottom right graph is the average of the first 8 graphs.
$$\vec{\mu}_{\{L,R\}}(t+1) = \vec{\mu}_{\{L,R\}}(t) + \epsilon\left[\vec{x}(t) - \vec{\mu}_{\{L,R\}}(t)\right], \qquad (5)$$

where $t$ indexes the trial number, $\epsilon$ is a learning rate parameter, $\vec{x}(t)$ is a vector of front-end output neuron responses on trial $t$, and $\vec{\mu}_{\{L,R\}}(t)$ is the mean vector of either the left-closer or right-closer Gaussian distribution, depending on which type of visual stimulus was displayed on trial $t$. The update rule for the observer's covariance matrix was

$$\Sigma(t+1) = \Sigma(t) + \epsilon\left[\mathrm{diag}\!\left[\left(\vec{x}(t) - \vec{\mu}_{\{L,R\}}(t)\right)^2\right] - \Sigma(t)\right]. \qquad (6)$$

Commercial relationships: none.
Corresponding author: Robert A. Jacobs.
Email: robbie@bcs.rochester.edu.
Address: Center for Visual Science, Department of Brain and Cognitive Sciences, University of Rochester, Rochester, NY 14627, USA.