Luminance, Colour, Viewpoint and Border Enhanced Disparity Energy Model

The visual cortex is able to extract disparity information through the use of binocular cells. This process is reflected by the Disparity Energy Model, which describes the role and functioning of simple and complex binocular neuron populations, and how they are able to extract disparity. This model uses explicit cell parameters, such as spatial frequencies, orientations, binocular phases and receptive field positions, to mathematically determine preferred cell disparities. However, the brain cannot access such explicit cell parameters; it must rely on cell responses. In this article, we implemented a trained binocular neuronal population, which encodes disparity information implicitly. This allows the population to learn how to decode disparities, in a similar way to how our visual system could have developed this ability during evolution. At the same time, responses of monocular simple and complex cells can also encode line and edge information, which is useful for refining disparities at object borders. The brain should then be able, starting from a low-level disparity draft, to integrate all information, including colour and viewpoint perspective, in order to propagate better estimates to higher cortical areas.


Introduction
Disparity plays an important role in our perception of the environment, giving us precious information for survival. Our brain extracts it from the information that reaches the hypercolumns of V1 via the Lateral Geniculate Nucleus (LGN), which relays information from the left and right retinae. At this early stage, disparity is already key for broad and precise motor control (e.g., walking/running while avoiding obstacles, eye-hand coordination while picking up a pencil), low- and high-level Focus-of-Attention (FoA), object and background segregation, as well as recognition, even with partial occlusions [1].
Computer vision research has significantly advanced the state-of-the-art in disparity estimation models, with many different approaches and applications [2]. However, there is a significant lack of biologically motivated models that computationally implement the Disparity Energy Model (DEM), which integrates key biological evidence from research on the cat's visual cortex and pathways by [3], and more recently from the rhesus monkey's visual cortex [4]. Alternative models also exist for building and combining disparity energy neurons [5]. The DEM made it possible to explain how neurons tuned to horizontal disparities can have the implicit ability to discriminate vertical disparities [6]. This ability is an emergent property of a neuronal system tuned to horizontal disparities, which decodes vertical ones as a deviation from the expected neuronal responses. It also illustrates how the neuronal system can encode much richer information than would be expected and, at the same time, concentrate neuronal resources on the most common cases while keeping the possibility of encoding rare ones.
Most DEM computational implementations found in the literature were unable to give good results on real-world images. Therefore, we first focused on building upon a state-of-the-art theoretical DEM implementation by [6] until we could reliably extract disparity estimations from real-world data. This was documented in [7]. It is still the only DEM-based method ranked on the Middlebury Stereo Evaluation Website [8], against 153 other disparity methods.
Some authors have proposed alternative biological models which are not based on the DEM, e.g., [9] combining geometric information and local edge features, [10] using multiscale lines and edges to retrieve a disparity wireframe model of the scene (the Line and Edge Disparity Model, LEDM, which is further explored in this paper in §5.1), and also du Buf et al. [11], employing the phase differences of simple cell responses to the left and right views. The latter model is often applied to real-world problems, although it has been shown to be very imprecise in terms of localisation of depth transitions.
Most DEM research has considered theoretical or synthetic data, while biological models applied to real-world scenes appeared only recently [7,9,10,12]. This is mainly due to the fact that computational DEM implementations are usually focused on evaluating theoretical results using very specific stimuli, like bar/grating patterns or random-dot stereograms [6], or in psychophysical experiments [4].
In this paper, we propose a disparity map composed of different cell maps built on top of each other, each refining the previously extracted disparity. We also propose that the first, rough disparity (disparity gist) is provided by the DEM model [7], after which refinements based on colour, perspective correction (viewpoint) and border information are integrated to achieve the final disparity map. Although the model is still feed-forward or bottom-up, in the future it can be supplemented by feedback loops from higher visual areas V2 and V4 in order to further improve results [1].
In our improved DEM implementation we use two neuronal populations for obtaining disparities:
1. An encoding population which uses a set of binocular neurons with a diverse range of cell parameters, e.g., horizontal disparities, spatial frequencies and orientations. This population is trained on random-dot stereograms in order to learn activity codes for many different disparities. The method is similar to that of [6], which is based on the DEM model of [3], with proper normalisation to yield local correlations with neighbourhood weighting [13-15]. Finally, the population is applied to real stereograms in order to obtain local activity codes. This is further explained in §3.1.
2. A higher-level decoding population which compares a local activity code, at each image position, with all learned (trained) activity codes, for estimating local disparity. This is further explained in §3.2. Basically, this second population implements a template-matching process similar to those of [16] and Read [6].
This initial DEM model (disparity gist) is then integrated with colour and different viewpoints (§4), and finally with object border information retrieved from the multi-scale line and edge disparity model (LEDM) [10] and low-level processes from object salience research [17] (§5).
Our main contributions in this paper are: (a) Improving previous DEM results in real-world images. (b) The integration of the DEM model with luminance, colour information and viewpoint perspective correction. (c) The integration of two disparity models, DEM and LEDM, to improve the object boundary precision of the DEM. (d) The integration of different layers of disparity cell maps, with each layer improving the results of the previous one. (e) The quantitative evaluation of results with real-world scenes, showing that the model can compete with state-of-the-art computer vision algorithms.

Disparity-sensitive cells
The primary visual cortex (V1) is composed mostly of simple, complex and end-stopped (hypercomplex) cells arranged into ocular dominance hypercolumns. Computationally, the receptive fields (RFs) of monocular simple cells can be modelled by Gabor wavelets [7,18,19], with parameters to specify orientation θ, spatial frequency f (or the wavelength λ = 1/f), receptive field size σ and spatial phase ϕ, which will be discussed below. We can then model binocular simple cells using pairs of monocular simple cells with either a position- or phase-shift between RFs (or a combination of both), signalling disparity when both RFs of the binocular cell are fully excited. However, binocular simple cells are also sensitive to stimulus contrast and pattern position within their RFs [3,18], which makes them unsuitable as disparity detectors.
In contrast, binocular complex cells can solve these problems, as there are no separate excitatory and inhibitory subregions within their RFs, making them only sensitive to position, orientation and stimulus size [20]. They also show other desirable properties, like sensitivity to fine disparities and immunity to anti-correlated stimuli [3], and they respond accurately to dynamic random-dot stereograms [21]. Two binocular simple cells S_1 and S_2 can be combined into a phase-independent binocular complex cell, provided that their phase difference |ϕ_S1 − ϕ_S2| equals π/2. Therefore, the response of a binocular complex cell can be obtained by summing the responses of two binocular simple cells with phases in quadrature.
Mathematically, two monocular RFs can be used to model a binocular simple cell, with the same size, orientation and spatial frequency, but with different phases ϕ and/or RF positions on the retina (Δx, Δy) [22]. The left (ρ_L) and right (ρ_R) RFs of binocular simple cells are then defined by
\[ \rho_{L,R}(x, y; \theta, \sigma, f, \phi) = \exp\!\left(-\frac{\dot{x}^2 + \dot{y}^2}{2\sigma^2}\right) \cos\!\left(2\pi f \dot{x} + \phi\right). \tag{1} \]
Since we use phases in quadrature, ϕ ∈ {0, −π/2}, both ρ_L and ρ_R actually consist of two RFs: the sine and cosine components. In Eq. 1, ẋ and ẏ are the coordinates relative to the binocular cell's centre, which is (0, 0) at the fovea, and rotated according to the cell's preferred orientation θ:
\[ \dot{x} = x_{L,R}\cos\theta + y_{L,R}\sin\theta, \qquad \dot{y} = -x_{L,R}\sin\theta + y_{L,R}\cos\theta. \tag{2} \]
The left disparity viewpoint is used as reference, requiring the use of binocular cells with left predominance. The main reason for using the left view is that it is often used for defining the ground-truth of real scenes, thus allowing for a quantitative analysis of experimental results. Mathematically, the offset coordinates Δx and Δy, which correspond to the cell's preferred horizontal and vertical disparities, are defined as follows: when the activity code is trained (learned) with random-dot stereograms, the left RF is centred at (0, Δy) and the right one at (−Δx, Δy). When the cells are applied at all input stereogram positions, then (x_L, y_L) = (x, y + Δy) and (x_R, y_R) = (x − Δx, y + Δy). We note that Δy = 0 is taken for all cells, as vertical disparity in the fovea is zero [22]. For a detailed mathematical transformation from monocular to binocular simple cells see [18].
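As an illustration of the RF layout described above, the following minimal sketch builds a quadrature pair of position-shifted binocular RFs. It assumes the common Gabor form (an isotropic Gaussian envelope times a cosine carrier along the rotated x-axis); the function names, the grid size and the numeric parameter values are hypothetical and chosen only for the example.

```python
import numpy as np

def gabor_rf(size, theta, sigma, freq, phase, centre=(0.0, 0.0)):
    """Sample a Gabor RF on a size x size grid (size assumed odd).

    Assumed form: isotropic Gaussian envelope times a cosine carrier
    along the rotated x-axis; `centre` is the RF centre in pixels.
    """
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    x = x - centre[0]
    y = y - centre[1]
    # Rotate the coordinates to the preferred orientation theta.
    x_rot = x * np.cos(theta) + y * np.sin(theta)
    y_rot = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(x_rot**2 + y_rot**2) / (2.0 * sigma**2))
    return envelope * np.cos(2.0 * np.pi * freq * x_rot + phase)

def binocular_rf_pair(size, theta, sigma, freq, phase, dx, dy=0.0):
    """Training layout from the text: left RF at (0, dy), right RF at (-dx, dy)."""
    left = gabor_rf(size, theta, sigma, freq, phase, centre=(0.0, dy))
    right = gabor_rf(size, theta, sigma, freq, phase, centre=(-dx, dy))
    return left, right

# Quadrature pair (phases 0 and -pi/2) for one position-shift cell.
cos_left, cos_right = binocular_rf_pair(21, 0.0, 2.0, 0.125, 0.0, dx=3)
sin_left, sin_right = binocular_rf_pair(21, 0.0, 2.0, 0.125, -np.pi / 2.0, dx=3)
```

With θ = 0 the right RF is simply the left RF displaced by Δx pixels, which is the position-shift mechanism the model relies on.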

Luminance Disparity-Energy Model
In this section, we describe the Luminance DEM (L-DEM), and show how disparity maps can be extracted by exploiting binocular cell responses and comparing them with previously learned stimuli, using cells sensitive only to luminance variations. The L-DEM was first presented in Martins et al. [7] and is partly adapted for this section, serving here to provide a performance baseline. Understanding this model is also fundamental for understanding all further improvements described in this paper. For the L-DEM implementation, we use two neuronal populations: (1) an encoding population and (2) a higher-level decoding population. As explained above, for presenting our stereo results we use by default the reference viewpoint (image) of the left eye.

Disparity encoding population
For the encoding population's binocular simple cells defined in Eq. (1), we selected RF parameters based on [6]:
a. Orientations θ_i = (i × π)/N_θ, with the number of orientations N_θ = 8. Our empirical tests showed that using more orientations yielded slightly better disparity estimates, but increases the total cell population. Using eight orientations is a good compromise.
b. Receptive field sizes (scales) σ ∈ {2√2, 2, √2}. These are scaled by a factor of √2, as is the spatial frequency. Empirical results showed that bigger sizes increase the blur at objects' border regions and smaller sizes lead to errors in disparity estimates.
f. RF phase disparity Δϕ = 0, implying no extra phase difference between the left and right RFs of each simple cell (equal phases ϕ for both). It is to be expected that in naturally occurring images, the maximum response of a phase-shift disparity neuron is elicited when there is a different pattern of the same stimulus in the left and right RFs, something that never occurs in the real world [4,5]. Our empirical tests also showed that the use of phase differences (odd-symmetric disparity tuning curves) did not add significant information and sometimes even degraded the quality of disparity estimates. Other alternative roles for neurons tuned to phase disparities are explained further in [23].
In total, the above selection yields a population of 8 (θ) × 3 (σ, f) × 2 (ϕ) × 60 (Δx) × 1 (Δϕ) = 2880 binocular simple cells as inputs for 1440 binocular complex cells; see below. The values were chosen to replicate physiological parameters of real cells and to yield precise disparity estimates in real-world images. The disparity encoding population is then built and trained as follows, based on Read [6].
Stereo energy coding. Responses of the left and right RFs of binocular simple cells (v_L and v_R) are obtained by convolving (∗) the RFs with the corresponding left and right grayscale images I_{L,R}(x, y):
\[ v_{L,R}(x, y) = I_{L,R}(x, y) * \rho_{L,R}(x, y). \tag{3} \]
To simplify notation, below we skip (x, y). I_{L,R} are obtained by sampling an RGB colour stereogram using physiologically perceived weights from the luminance Y channel of the CIE XYZ colour space, which closely resembles human colour perception. At each image position, the response S of a binocular simple cell combines the squared responses of the left and right RF components [3,18]:
\[ S = (v_L + v_R)^2 = v_L^2 + v_R^2 + 2 v_L v_R. \]
S can be split into the monocular term M = v_L² + v_R² and the binocular term B = 2 v_L v_R. Biologically, this can be realised by combining the outputs of two energy neurons with phase disparities π apart. If such neurons are identical except for their phase disparities, then the first one computes (M + B) and the second (M − B). Both M and B are then available from the sum and difference of the two responses, i.e., 2M and 2B [6].
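The split of S into M and B, and their recovery from two neurons whose phase disparities are π apart, can be checked numerically. The sketch below uses random numbers as stand-ins for the RF responses v_L and v_R (they are not model output), and also reproduces the population count quoted in the text; the function name is hypothetical.

```python
import numpy as np

def simple_cell_terms(v_left, v_right):
    """Return S, M and B for monocular RF responses v_left and v_right."""
    monocular = v_left**2 + v_right**2      # M
    binocular = 2.0 * v_left * v_right      # B
    return monocular + binocular, monocular, binocular

rng = np.random.default_rng(0)
v_l = rng.normal(size=6)   # illustrative responses, not model output
v_r = rng.normal(size=6)

s_plus, m, b = simple_cell_terms(v_l, v_r)
# A phase disparity of pi inverts one RF, so its response changes sign.
s_minus, _, _ = simple_cell_terms(v_l, -v_r)

m_recovered = 0.5 * (s_plus + s_minus)   # (M + B) + (M - B) = 2M
b_recovered = 0.5 * (s_plus - s_minus)   # (M + B) - (M - B) = 2B

# Population size quoted in the text.
n_simple = 8 * 3 * 2 * 60 * 1
n_complex = n_simple // 2                # two quadrature phases per complex cell
```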
For obtaining the local stereo energy E of a binocular complex cell which is invariant to the phases of local patterns in the input, one can either sum the responses of (a) many binocular simple cells with scattered phases ϕ in [0, 2π], or (b) only two cells with phases in quadrature. We therefore apply the second case with ϕ ∈ {0, −π/2}: E = Σ_{ϕ ∈ {0, −π/2}} S_ϕ. This stereo energy E, for each frequency, orientation and disparity, can be related to the cross-correlation between filtered and windowed images [15]. However, the local stereo energy E cannot be used directly to estimate disparities, as it also reflects monocular energy (stimulus contrast inside each RF) along with binocular energy (stimulus disparity between RFs). This shortcoming is addressed below by using spatial pooling and effective binocular correlation.
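The phase invariance obtained from a quadrature pair can be illustrated with a one-dimensional toy example. The sketch below presents the same grating to both eyes at zero disparity and varies the grating phase; the RF parameters, the grating stimulus and all names are assumptions made only for this demonstration.

```python
import numpy as np

x = np.arange(-20, 21, dtype=float)
SIGMA, FREQ = 4.0, 0.125            # illustrative values
envelope = np.exp(-x**2 / (2.0 * SIGMA**2))
quadrature_rfs = [envelope * np.cos(2.0 * np.pi * FREQ * x + phi)
                  for phi in (0.0, -np.pi / 2.0)]

def simple_responses(left_patch, right_patch):
    """S = (v_L + v_R)^2 for the two quadrature phases."""
    return np.array([(rf @ left_patch + rf @ right_patch) ** 2
                     for rf in quadrature_rfs])

stimulus_phases = np.linspace(0.0, 2.0 * np.pi, 16, endpoint=False)
simple_cos, energy = [], []
for alpha in stimulus_phases:
    patch = np.cos(2.0 * np.pi * FREQ * x + alpha)   # same grating in both eyes
    s = simple_responses(patch, patch)
    simple_cos.append(s[0])                          # single simple cell
    energy.append(s.sum())                           # E = S_0 + S_{-pi/2}
simple_cos = np.array(simple_cos)
energy = np.array(energy)
```

In this toy setting the single simple cell response swings between almost zero and its maximum as the grating phase changes, whereas the summed energy stays essentially constant.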
Spatial pooling. Complex cells are normally modelled by taking the square root of the sum of the squared responses of the sine and cosine components of the simple cells. This implies that the RF size of such complex cells is equal to that of the simple cells: the same Gaussian. However, RFs of real binocular complex cells are larger than those of simple cells [18]. Therefore we apply this property by averaging M and B, using grouping cells with a Gaussian RF: G_sp(x, y) = k exp(−(x² + y²)/(2σ²)). The normalisation factor is k = 1/(2πσ²), and σ equals the RF size of the corresponding simple cells: σ ∈ {2√2, 2, √2}. This yields, for the two phases,
\[ M_{sp} = G_{sp} * \sum_{\phi \in \{0, -\pi/2\}} M_{\phi}, \qquad B_{sp} = G_{sp} * \sum_{\phi \in \{0, -\pi/2\}} B_{\phi}. \]
This pooling operation involves using simple grouping cells with a dendritic field size defined by σ and it is crucial to stabilise results in case of real-world images with noise and non-uniform disparity ranges.
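A minimal sketch of the Gaussian pooling step follows. It exploits the separability of G_sp (the product of two 1D Gaussians, each normalised by 1/√(2πσ²), gives k = 1/(2πσ²)); the truncation radius of 4σ and the function names are assumptions, and the input map must be larger than the kernel.

```python
import numpy as np

def gaussian_kernel_1d(sigma, radius=None):
    """1D Gaussian normalised by 1/sqrt(2*pi*sigma^2)."""
    if radius is None:
        radius = int(np.ceil(4.0 * sigma))   # truncation radius: an assumption
    x = np.arange(-radius, radius + 1, dtype=float)
    return np.exp(-x**2 / (2.0 * sigma**2)) / np.sqrt(2.0 * np.pi * sigma**2)

def spatial_pool(term, sigma):
    """Pool a 2D map (e.g. M or B) with the separable Gaussian G_sp."""
    g = gaussian_kernel_1d(sigma)
    rows = np.apply_along_axis(np.convolve, 1, term, g, mode="same")
    return np.apply_along_axis(np.convolve, 0, rows, g, mode="same")

impulse = np.zeros((64, 64))
impulse[32, 32] = 1.0
pooled_impulse = spatial_pool(impulse, sigma=2.0)        # reproduces G_sp itself
pooled_flat = spatial_pool(np.ones((64, 64)), sigma=2.0)  # stays close to 1 inside
```

Pooling an impulse returns the kernel, whose peak equals k, and a constant map is left nearly unchanged away from the image border, which is what the normalisation factor is for.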
Effective binocular correlation. In order to differentiate monocular energy from binocular energy, it is necessary to use normalised binocular correlation detectors [6,13-15]. These detectors respond maximally (+1) when the left and right RF views are identical, and minimally (−1) when one RF view is an inverted-contrast version of the other. They are implemented by dividing the pooled binocular term by the pooled monocular term, after which the result is pooled once more for increasing robustness:
\[ \psi_{sp} = G_{sp} * \frac{B_{sp}}{M_{sp}}. \]
The value of ψ_sp relates to the correlation between local, filtered regions of the left and right views [23]. The population of binocular correlation detectors ψ_sp is used for encoding disparity in the model. Disparities estimated by using the effective binocular correlation instead of the local stereo energy E are immune to the detrimental effect of monocular contrast, allowing the extraction of disparity from peaks in the population's activity code. ψ_sp also has the useful property that it exactly equals 1 when the actual disparity matches a cell's preferred disparity [6]. Please recall that ψ_sp is the short notation for ψ^sp_{f,θ,Δx}(x, y), i.e., there are three scales, eight orientations and 60 horizontal position disparities, hence 1440 binocular correlation cells which are later applied at all image positions.
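The behaviour of the normalised detector can be checked with a small sketch: pooled B divided by pooled M gives +1 for identical views, −1 for contrast-inverted views, and is unaffected by stimulus contrast. The responses below are random stand-ins, the pooling is one-dimensional, and the second pooling stage is omitted; these simplifications and the function name are assumptions.

```python
import numpy as np

def effective_correlation(v_left, v_right, pool_weights):
    """Pooled B divided by pooled M.

    v_left, v_right: arrays of shape (n_phases, n_positions) holding the
    monocular RF responses; pool_weights: non-negative pooling weights of
    shape (n_positions,). The second pooling stage of the text is omitted.
    """
    m = np.sum(v_left**2 + v_right**2, axis=0)    # monocular term per position
    b = np.sum(2.0 * v_left * v_right, axis=0)    # binocular term per position
    return (pool_weights @ b) / (pool_weights @ m)

rng = np.random.default_rng(1)
v = rng.normal(size=(2, 9))                       # illustrative responses
offsets = np.arange(-4, 5, dtype=float)
weights = np.exp(-offsets**2 / (2.0 * 2.0**2))    # Gaussian pooling, sigma = 2

psi_same = effective_correlation(v, v, weights)                 # identical views
psi_inverted = effective_correlation(v, -v, weights)            # inverted contrast
psi_scaled = effective_correlation(5.0 * v, 5.0 * v, weights)   # higher contrast
psi_random = effective_correlation(v, rng.normal(size=(2, 9)), weights)
```

Because 2·v_L·v_R never exceeds v_L² + v_R² in magnitude, the ratio is bounded to [−1, 1] for any pair of responses.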
Learning the population code. We trained the energy model to discriminate horizontal stimulus disparities (Δx_stim) ranging from 0 to 59 pixels with a step size of 1 pixel. Population activity codes were gathered from cell responses to stimuli with known disparities: random-dot stereograms with a uniform disparity, with pixel values sampled randomly from a Gaussian distribution with zero mean and unit standard deviation, and with a horizontal offset Δx_stim between the left and right images. Offset gaps were also filled with randomly sampled pixels; see Fig 1. For each Δx_stim step we generated 1000 random-dot pairs. Hence, training involved 60,000 stereograms; for details see Martins et al. [7]. For each stereogram, with I_{L,R} the left and right views, we applied Eq (3) and Eq (4), but only at the centre of the left and right images of each stereogram. The values of ψ were computed without spatial pooling, because the results are pooled over 1000 random-dot stereograms for each disparity.
During training, and later when applying the population to real images, the effective binocular correlations ψ and ψ_sp are encoded as a mean spike count,
\[ C = u\,(1 + \psi), \qquad C_{sp} = u\,(1 + \psi_{sp}), \]
where u = 8 is the average number of spikes elicited by a binocularly uncorrelated stimulus within the temporal discrimination window. We used parameters similar to those of [6], with typical values of u around 8 spikes, assuming a firing rate for the optimal disparity of 100 Hz and a temporal window of 160 ms. This yields values of C in the range [0, 2u], where 2u represents the mean number of spikes that neurons tuned to a specific disparity will fire in the presence of a perfect binocular stimulus of that disparity (maximum correlation). Finally, C was averaged, A(·), over the 1000 different stereograms for each Δx_stim, which serves to eliminate random stimulus-dependent noise. This yields an activity code for each trained horizontal disparity Δx_stim:
\[ W^{\Delta x_{stim}}_{f,\theta,\Delta x} = A\!\left(C_{f,\theta,\Delta x}\right). \]
In summary, W represents the number of spikes produced by neurons tuned to frequencies f, orientations θ and horizontal disparities Δx, averaged over all 1000 stimuli with the same uniform disparity Δx_stim. The population code thus consists of 1440 binocular correlation cell responses (3 scales, 8 orientations and 60 horizontal position disparities) for each of the 60 different horizontal stimulus disparities Δx_stim of the random-dot stereograms. The adaptation and learning of the encoding cell population to discriminate disparities can be thought of as akin to visual learning in early childhood, assuming that basic neural circuitry is the result of evolution, or, at least, needs adequate training to reach its full potential.
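The training procedure can be sketched in a strongly reduced, one-dimensional form. The code below is a toy version under stated assumptions: 1D rows instead of images, 8 disparities instead of 60, 50 stereograms per disparity instead of 1000, a single orientation, an assumed coupling between RF size and frequency, and the linear spike-count mapping C = u(1 + ψ), which is consistent with the stated range [0, 2u]. All function and constant names are hypothetical.

```python
import numpy as np

U_SPIKES = 8.0      # mean spikes for an uncorrelated stimulus (from the text)
N_DISP = 8          # toy disparity range; the paper uses 60
HALF = 9            # RF half-width in pixels (assumption)
SIGMAS = (np.sqrt(2.0), 2.0, 2.0 * np.sqrt(2.0))

def rf_bank():
    """One quadrature pair of 1D Gabor RFs per scale."""
    k = np.arange(-HALF, HALF + 1, dtype=float)
    bank = []
    for sigma in SIGMAS:
        freq = 1.0 / (4.0 * sigma)          # assumed size/frequency coupling
        env = np.exp(-k**2 / (2.0 * sigma**2))
        bank.append(np.stack([env * np.cos(2.0 * np.pi * freq * k + phi)
                              for phi in (0.0, -np.pi / 2.0)]))
    return bank

def random_dot_pair(d_stim, n, rng):
    """Gaussian-noise row pair with uniform disparity d_stim."""
    base = rng.normal(size=n + d_stim)
    return base[:n], base[d_stim:]          # right[j] == left[j + d_stim]

def activity_code(left, right, centre, bank):
    """Spike counts C = u(1 + psi) for every (scale, preferred disparity)."""
    code = np.empty((len(bank), N_DISP))
    for s, rfs in enumerate(bank):
        v_l = rfs @ left[centre - HALF:centre + HALF + 1]
        for dx in range(N_DISP):
            c_r = centre - dx               # right RF centred at x - dx
            v_r = rfs @ right[c_r - HALF:c_r + HALF + 1]
            psi = np.sum(2.0 * v_l * v_r) / np.sum(v_l**2 + v_r**2)
            code[s, dx] = U_SPIKES * (1.0 + psi)
    return code.ravel()

def train(n_pairs=50, n=64, seed=0):
    """Average the codes over n_pairs stereograms per stimulus disparity."""
    rng = np.random.default_rng(seed)
    bank = rf_bank()
    trained = np.zeros((N_DISP, len(bank) * N_DISP))
    for d_stim in range(N_DISP):
        for _ in range(n_pairs):
            left, right = random_dot_pair(d_stim, n, rng)
            trained[d_stim] += activity_code(left, right, n // 2, bank)
    return trained / n_pairs

W = train()   # shape (N_DISP, n_scales * N_DISP): one learned code per disparity
```

In this toy setting, cells whose preferred disparity equals the stimulus disparity fire 2u = 16 spikes on average, while the remaining cells settle at lower mean counts, giving each stimulus disparity a distinct code.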

Disparity decoding population
As mentioned before, learning is done only once and in the centre of the random-dot stereograms. After training, the encoding population can then be applied at all pixel positions (neighbourhoods) of real-world input stereograms, excluding the border region. The disparity at each position is estimated by comparing the activity code there with all learned codes. This is done by a second, higher-level decoding population. The disparity assigned to each pixel position is the disparity of the best-matching code. Local disparity estimation is a simple matching process [16]: the input code of 1440 responses is matched or correlated with the 60 sets of 1440 trained codes. The final output is selected by the decoding population by a winner-takes-all strategy. Biologically, this probably involves associative memory, which can also be based on a training process [24]. The matching process uses 60 correlation cells ("Corr") which compare C^sp_{f,θ,Δx}(x, y) with W^{Δx_stim}_{f,θ,Δx}, i.e., the 1440 spike counts at each image position with all previously learned 60 sets of 1440 spike counts:
\[ r_{\Delta x_{stim}}(x, y) = \left[ \mathrm{Corr}\!\left( C^{sp}_{f,\theta,\Delta x}(x, y),\; W^{\Delta x_{stim}}_{f,\theta,\Delta x} \right) \right]_{+}, \]
where [·]_+ is half-wave rectification. This avoids the problem of disparity in anti-correlated stereograms by setting any negative correlations to zero [25]. Note that r_{Δx_stim} is a vector of 60 correlation values, each related to a specific Δx_stim disparity that the population was trained to recognise, from 0 to 59. The maximum correlation yields the luminance-disparity map
\[ D_{L}(x, y) = \arg\max_{\Delta x_{stim}} \; r_{\Delta x_{stim}}(x, y). \]
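The template-matching step can be sketched as follows. The example assumes that "Corr" denotes the Pearson correlation coefficient between the local code and each learned code, and it uses random numbers as a stand-in for the trained codes; the function name and the noise level are hypothetical.

```python
import numpy as np

def decode_disparity(code, trained_codes):
    """Winner-takes-all decoding of one local activity code.

    code: (n_cells,) spike counts at one image position.
    trained_codes: (n_disparities, n_cells) learned codes W.
    Returns the index of the best-matching code and the rectified correlations.
    """
    c = code - code.mean()
    w = trained_codes - trained_codes.mean(axis=1, keepdims=True)
    norms = np.linalg.norm(c) * np.linalg.norm(w, axis=1)
    r = (w @ c) / np.where(norms > 0.0, norms, 1.0)   # Pearson correlation
    r = np.maximum(r, 0.0)                            # half-wave rectification
    return int(np.argmax(r)), r

rng = np.random.default_rng(2)
trained = rng.uniform(0.0, 16.0, size=(60, 1440))     # stand-in for learned codes
observed = trained[37] + rng.normal(scale=1.0, size=1440)
best, r = decode_disparity(observed, trained)
```

Applying `decode_disparity` at every pixel position and storing the winning index would give a disparity map in the sense described above.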

Experimental results
The obtained results for this method were first published in Martins et al. [7], where we tested the Luminance Disparity Energy Model (L-DEM) on various reference stereograms from the Middlebury stereo evaluation set. These are: tsukuba, venus, teddy and cones [26,27], aloe and cloth3 of the 2006 dataset, and dolls, moebius and reindeer of the 2005 dataset [28]. For reference, Fig 2(a)-2(e) shows the L-DEM results for the tsukuba stereo pair [7,27]. This algorithm was able to obtain good results for the Middlebury evaluation test (ranked there as "BioDEM") [7], which are detailed in §6. We will compare further disparity improvements using these results as baseline.

Luminance, Colour and Viewpoint DEM
This section addresses an improved disparity model, the Luminance, Colour and Viewpoint Disparity Energy Model (LCV-DEM), which integrates colour and viewpoint (perspective) information to increase the accuracy of the L-DEM.
Research involving the chromatic representation in area V1 has shown that cone responses from the retina turn into three relatively independent spatio-chromatic colour channels after the LGN [29], which are then transformed in several neural pathways, mixing colour responses with those of other cells [30]. The majority of neurons in V1 seem to respond to pure isoluminant stimuli (i. e., they are colour sensitive even in the absence of luminance changes), and around 50% of all neurons are sensitive to both luminance and isoluminant stimuli. They are classified as either "colour-luminance" or "luminance-preferring" cells with a varying degree of cone opponency [31]. There is also evidence that chromatic features are useful for binocular correspondence in complex images, suggesting the possibility of independent contributions from both luminance and colour channels [32,33]. In addition, it has been reported that there exist V2 neurons of macaques that are sensitive to both colour and disparity, supporting the notion that the primate visual system combines disparity and colour as early as in area V2 [34].
For the LCV-DEM implementation we initially chose the LMS colour space, which mimics the trichromatic neuronal encoding of cone responses after the LGN [30]. However, the results obtained with the LMS colour space were not significantly better than those with a simple variation of RGB (each channel codes both luminance and colour). This is not surprising. Since neuronal cells have so many different combinations of luminance or colour predominance, the system is able to be independent of the colour method used, as long as there is enough variety of weight predominance between the different colour channels. We did, however, get better results when using physiologically perceived colour weights for encoding luminance (the Y channel of the XYZ colour space), suggesting that not only disparity is heavily luminance based, but also that it depends on luminance being perceptually representative of the scene being observed.

Disparity encoding population
The extended model uses the same population parameters as L-DEM, defined in §3.1, with, in addition to points (a) to (f), one further point: the RF dominance μ ∈ {Left, Centre, Right}. We can improve disparity estimates by using two more RF dominances. As previously mentioned, the binocular simple cell RFs are defined by ρ_{L,R} in Eq (1), where (ẋ, ẏ) are offset coordinates relative to the centre (0, 0) and rotated to the cell's preferred orientation according to Eq (2). For μ = Left we use (x_{L,R}, y_{L,R}) as shown in §2, representing both RFs centred around −Δx/2. For μ = Centre the RFs are equidistant from (0, 0) and their coordinates are (x_L, y_L) = (x + Δx/2, y + Δy/2) and (x_R, y_R) = (x − Δx/2, y − Δy/2). For μ = Right the RFs are shifted to the right and centred at Δx/2, resulting in coordinates (x_L, y_L) = (x + Δx, y + Δy) and (x_R, y_R) = (x, y).
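The three RF dominances differ only in where the pair of RFs is anchored; the separation between the left and right RF is Δx in every case. A small worked example, with hypothetical function and variable names, makes this explicit:

```python
def rf_centres(x, y, dx, dy, dominance):
    """Return ((x_L, y_L), (x_R, y_R)) for mu in {'Left', 'Centre', 'Right'}."""
    if dominance == "Left":
        return (x, y + dy), (x - dx, y + dy)
    if dominance == "Centre":
        return (x + dx / 2.0, y + dy / 2.0), (x - dx / 2.0, y - dy / 2.0)
    if dominance == "Right":
        return (x + dx, y + dy), (x, y)
    raise ValueError(f"unknown dominance: {dominance}")

# Example: position (100, 50), preferred disparity dx = 6, dy = 0.
centres = {mu: rf_centres(100.0, 50.0, 6.0, 0.0, mu)
           for mu in ("Left", "Centre", "Right")}
# Horizontal midpoint of each RF pair: x - dx/2, x, and x + dx/2.
midpoints = {mu: (left[0] + right[0]) / 2.0
             for mu, (left, right) in centres.items()}
```

The midpoints come out at x − Δx/2, x and x + Δx/2 respectively, matching the description of the Left, Centre and Right dominances.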

Disparity decoding population
The implementation uses the same decoding method as L-DEM, as specified in §3.2. However, we process each of the four colour channels c independently; this allows us to show the benefits of colour without having to train the population again. For each (x, y), the correlation (Corr) coefficient is now calculated between C^sp_{μ,c} and W^{Δx_stim}_{f,θ,Δx}. The correlation vector r_{Δx_stim,μ,c} now holds 60 × 3 × 4 cell responses, 60 for each μ and c combination. At this step, three viewpoint-based disparity maps D^LC_μ are built independently (examples are shown in Fig 2f-2h), and the disparities assigned to each position (x, y) are obtained by combining them; the resulting map can be seen in Fig 2(i).
Combining viewpoints effectively increases the accuracy of disparity estimates at the left and right borders of objects, which are usually inaccurate due to viewpoint occlusion (i.e., each eye will see some information that the other does not). This leads to a correspondence problem, which is greater when the distance between the left and right images of the pair is larger. For illustration purposes, Fig 3 shows a better example of the benefits of combining viewpoints, for the cones stereo pair. Here, the left and right images are more separate, with a maximum disparity of 59 pixels vs. only 15 pixels for tsukuba. Cones' disparity maps highlight the greater differences between viewpoints. The fusion of all three maps is shown in image (i), where black pixels represent uncertain disparity regions, which we address below.
Background and occlusion correction layer. The map D̂_LCV(x, y) needs to be corrected in order to eliminate uncertain/unknown disparities due to incorrect disparity assignments in background regions or from occluded regions where disparities were shifted. To remove these, we use a two-step approach: First, we determine which disparity is the probable background and assign it as the farthest disparity in D̂_LCV(x, y). Computationally, this process is done in four steps:

Experimental results
We tested the LCV-DEM on the same Middlebury stereograms used in L-DEM [26-28]. D̂_LCV shows the fused viewpoint map before correction (Fig 2i), and D_LCV shows the final LCV-DEM map after the background and occlusion correction layer (Fig 2j).
The quantitative results from the Middlebury stereo evaluation are discussed in §6, comparing L-DEM with LCV-DEM. We can visually verify (see Fig 2c and 2j) that there are several improvements from D_L to D_LCV; nevertheless, the edges and regions around objects still lack a precise boundary definition. In the next section we will explain a complementary stereo model to assign disparity to line and edge features, and show how the integration of both disparity maps can be achieved.

Boundary enhanced LCVB-DEM
Another role for monocular simple and complex cells in V1 is the ability to extract multiscale lines and edges that are significant for object categorisation and recognition [19]. If lines and edges are extracted in V1, where left and right retinal projections are close together, one might even assume that depth is attributed to them. In other words, a "3D wire-frame representation" could be built in V1 for handling 3D objects and scenes. Although this idea is speculative, many V1 cells have been found to be tuned to different combinations of frequency (scale), orientation, colour and disparity. If not coded explicitly, disparity could be coded implicitly. This allows us to develop an alternative disparity model, where we assume that lines, edges and disparity are coded explicitly: the Line and Edge Disparity Model (LEDM).
Since disparity along object borders is the biggest problem for the presented DEM models, we also integrate at this step a low-level object salience model [17] that complements line and edge information from LEDM. This allows us to combine edge conspicuity with line/edge disparity information readily available in V1/V2. Using both on top of the LCV-DEM allows us to correct disparity values astride object borders. This yields our final model, the Luminance, Colour, Viewpoint and Boundary enhanced Disparity Energy Model (LCVB-DEM).

Line and Edge Disparity Model
Line and edge detection is based on responses of even and odd monocular simple cells, corresponding to the real and imaginary parts of a Gabor filter [19]. These responses are denoted by R^{s,i}_E(x, y) and R^{s,i}_O(x, y), with scale s given by λ and orientation i according to θ. We used the same 8 orientations as for the binocular cells in the previous models, and scales s corresponding to 4 ≤ λ ≤ 24 with a step size Δλ = 2. Positive/negative lines are detected where R_E has a local maximum/minimum and R_O has a zero crossing. For edges, the even and odd responses are swapped. In total, there are four possibilities for positive and negative Line/Edge features (L/E). An improved scheme [19] consists of combining responses of monocular simple and complex cells, i.e., simple cells serve to detect positions and L/E types, whereas complex cells are used to increase confidence. Monocular complex cell responses are modelled by the modulus
\[ C^{s,i}(x, y) = \left[ \left(R^{s,i}_E(x, y)\right)^2 + \left(R^{s,i}_O(x, y)\right)^2 \right]^{1/2}. \]
Keypoint maps are also exploited in the LEDM model, as these code line and edge crossings, singularities and points with large curvature. They are built from two types of end-stopped cells, single and double, which are modelled by the first and second derivatives of C^{s,i}. End-stopped responses are refined by tangential and radial inhibition to obtain precise keypoint cell maps KP^s(x, y) [35]. Fig 4(c) shows the tsukuba keypoint map at a coarse scale (λ = 24).
The disparity assigned to each L/E is based on a left-right correspondence over scales: 1. First, we suppress L/Es which may be due to noise: at each scale s of the left and right maps LE^s_{L,R}(x, y), we compute the maximum response of the monocular complex cells C^{s,i} where L/Es have been detected. Any L/Es with a small amplitude (C^{s,i} below 5% of the maximum response) are inhibited, yielding the suppressed maps L̂E^s_{L,R}(x, y). The 5% threshold is necessary to eliminate detected L/Es at small gradients that do not represent region transitions. This value depends on the noise sensitivity of the Gabor responses and it was empirically determined. We found 5% to be consistently stable across many cases.
The weights for each factor were empirically determined after several trials. Finally, the horizontal disparity Δx belonging to the maximum Ĉ value is stored in the depth map D_LE(x, y). For more implementation details see Rodrigues et al. [10]. LEDM was applied to the Middlebury stereo pairs, exemplified with tsukuba in Fig 4. The results were very good, with disparities correctly assigned to object borders in image (d). The disparity error image (e) displays the incorrect values as black pixels, showing that almost all lines and edges have a correctly assigned disparity (80.7% at a 0.5 max error and 90.6% at a 1.0 max error).
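The noise-suppression step of the LEDM (inhibiting L/Es whose complex-cell amplitude is below 5% of the maximum amplitude at detected positions) can be sketched as below. The integer encoding of the L/E map (0 for no feature, non-zero for a feature type), the tiny example arrays and the function names are assumptions for illustration only.

```python
import numpy as np

def complex_modulus(r_even, r_odd):
    """Complex cell response as the modulus of the even/odd simple cell pair."""
    return np.sqrt(r_even**2 + r_odd**2)

def suppress_weak_features(le_map, complex_resp, fraction=0.05):
    """Inhibit L/E detections with a small complex-cell amplitude.

    le_map: integer map, 0 = no feature, non-zero = L/E type (assumed encoding).
    complex_resp: complex cell amplitudes at the same scale.
    """
    detected = le_map != 0
    if not detected.any():
        return le_map.copy()
    # The maximum is taken only over positions where an L/E was detected.
    threshold = fraction * complex_resp[detected].max()
    return np.where(detected & (complex_resp >= threshold), le_map, 0)

r_even = np.array([[0.0, 3.0, 0.1], [0.0, 0.0, 0.0]])
r_odd = np.array([[0.0, 4.0, 0.1], [0.5, 0.0, 0.0]])
le_map = np.array([[0, 1, 2], [3, 0, 0]])       # three detected features

amplitude = complex_modulus(r_even, r_odd)      # strongest feature has amplitude 5
cleaned = suppress_weak_features(le_map, amplitude)
```

Here the threshold is 0.25, so the feature with amplitude of about 0.14 is inhibited while the ones with amplitudes 5 and 0.5 survive.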

Line and Edge region enhancement
To enhance disparity accuracy in line and edge regions and to remove small gaps we combine LCV-DEM with LEDM into an intermediate representation D̂_LCVE, similar to Rodrigues et al. [10]. For each L/E pixel in the D_LE map we define a small cluster at the L/E position plus its N_4 neighbourhood (left, right, top and bottom neighbours) and compare its median to the median of a similar cluster in D_LCV, at the same position. If the clusters have similar median values (a difference of less than a threshold t), the D_LCV cluster response at the L/E position is propagated into D̂_LCVE as detailed below. Mathematically, ∀(x, y),
\[ \left| \mathrm{med}\!\left( N_4\!\left[ D_{LE}(x, y) \right] \right) - \mathrm{med}\!\left( N_4\!\left[ D_{LCV}(x, y) \right] \right) \right| < t, \tag{14} \]
where t ∈ {1, ..., 5} is an integer value that represents the maximum allowed difference and med(·) the median. If Eq (14) is false, then the D_LCV cluster response is assumed to be wrong, and its region is filled in D̂_LCVE using the value of D_LE(x, y). This way, we correct the LCV-DEM results using the LEDM responses. This process starts with t = 1 and it is applied in several cell layers, recursively, on top of the newly created D̂_LCVE map, i.e., if it is not possible to fill it any more, but there are still gaps, we increment t by 1 and repeat the same procedure. In our experiments 5 was the maximum value. Biologically, this could correspond to 5 layers of D̂_LCVE that activate neighbouring "idle" cells. The result can be seen in Fig 2(k), where many small regions have been corrected.
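A single pass of the cluster comparison can be sketched as follows. This is a simplified illustration, not the full recursive procedure: it assumes that zero marks "no L/E" or "no disparity", that medians are taken over valid (positive) values only, and that a disagreeing cluster is overwritten with the D_LE value; the incremental loop over t = 1, ..., 5 is left out. All names are hypothetical.

```python
import numpy as np

N4_CLUSTER = ((0, 0), (0, -1), (0, 1), (-1, 0), (1, 0))   # pixel plus N_4 neighbours

def cluster_values(disp, y, x):
    """Valid (positive) disparities in the cluster around (y, x)."""
    h, w = disp.shape
    vals = [disp[y + dy, x + dx] for dy, dx in N4_CLUSTER
            if 0 <= y + dy < h and 0 <= x + dx < w]
    return [v for v in vals if v > 0]

def merge_pass(d_lcv, d_le, t):
    """One pass: keep D_LCV where cluster medians agree, else fill with D_LE."""
    out = d_lcv.copy()
    h, w = d_lcv.shape
    for y, x in zip(*np.nonzero(d_le > 0)):
        le_vals = cluster_values(d_le, y, x)
        lcv_vals = cluster_values(d_lcv, y, x)
        if lcv_vals and abs(np.median(le_vals) - np.median(lcv_vals)) < t:
            continue                       # clusters agree: keep the D_LCV value
        for dy, dx in N4_CLUSTER:          # clusters disagree: fill with D_LE
            yy, xx = y + dy, x + dx
            if 0 <= yy < h and 0 <= xx < w:
                out[yy, xx] = d_le[y, x]
    return out

d_lcv = np.full((6, 6), 10.0)
d_lcv[0:3, 3:6] = 30.0                     # wrongly estimated region
d_le = np.zeros((6, 6))
d_le[1, 4] = 10.0                          # L/E pixel inside the wrong region
d_le[4, 1] = 10.0                          # L/E pixel in a correct region
merged = merge_pass(d_lcv, d_le, t=1)
```

In the example, the cluster around the first L/E pixel is replaced by the LEDM disparity, while the second L/E pixel agrees with D_LCV and leaves the map unchanged.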

Object Boundary enhancement
Despite the above process to correct ambiguous regions, some boundaries can still be improved. In real scenes, disparity borders are mostly found at the contours of real objects, so we use a disparity sharpening process based on local contrast of disparity values, conspicuity information and line/edge boundaries to reach the final stage of this whole process, yielding $D_{LCVB}$, the Luminance, Colour, Viewpoint and Boundary enhanced Disparity Energy Model (LCVB-DEM). This process requires three steps:
Edge conspicuity. In general, object borders are perceptually salient in a scene. In order to detect them, we first define edge conspicuity $\widetilde{Co}(x, y)$ as a low-level V1 process. Mathematically, it is the maximum difference between colours in $I^c_L(x, y)$, with $c \in \{l, r, g, b\}$, at four pairs of symmetric positions with pixel distance $\|d_i\| = 1$ from point $(x, y)$, i.e., on horizontal, vertical and two diagonal lines [17]. Conspicuity $\widetilde{Co}(x, y)$ is the maximum Euclidean distance of all four colour pairs. In order to remove low responses due to small colour gradients that do not represent edges, responses lower than 10% of $\max(\widetilde{Co})$ are inhibited. A value of 10% for this threshold was found to be a consistently good choice for many cases. This value is linked to the perceptual nature of differentiating colours and is an empirically determined constant. We can think of this inhibition process as following the Weber-Fechner law (just-noticeable differences) in psychophysics, with this threshold being Weber's constant. The remaining active cells are selected by Non-Maximum Suppression (NMS), which yields conspicuity edge positions $\widehat{Co}$. Fig 4(f) shows the tsukuba $\widehat{Co}$ map after NMS.
Border Detection. We use a specific binary border-detection cell layer $B_d$ that combines cell responses from $\widehat{Co}$, $D_{LE}$ and $D_{LCV}$. $B_d(x, y)$ cells are only active when the following condition is true: $\forall (x, y) : \widehat{Co}(x, y) > 0 \lor \left[ D_{LE}(x, y) > 0 \land |D_{LE}(x, y) - D_{LCV}(x, y)| > 0 \right]$, i.e., at conspicuous borders and at lines/edges when they correspond to object borders and not to homogeneous disparity regions. Then, we devise two approaches to detect and correct bad disparity estimations by analysing regions that are far from or near to active $B_d$ cells:
• The far case covers regions where there are no active $B_d$ cells nearby, i.e., regions that should have a homogeneous disparity value. Here we analyse relationships between small disparity peaks or bumps and their surrounding areas. For peaks, if the inside median disparity $M_{in}$ of a small cell cluster (10px radius) is different from that of its border (outside perimeter) $M_{out}$, and if $M_{in} > M_{out}$, then the cell cluster disparity is reduced to the value $M_{out}$, eliminating the disparity peak. For bumps (local disparity depressions), if $M_{in} < m_{out}$, with $m_{out}$ the minimum value of the border region (perimeter), then the cell cluster disparity is increased to $m_{out}$, slightly raising the disparity depression to a coherent background value for the region (using $M_{out}$ here could lead to wrong results near regions with objects, as bumps could wrongly be raised to their disparity instead).
• For regions near active $B_d$ cells, i.e., near object borders, every active border in $B_d$ activates a filling-in process. We assume that the entire disparity map $D_{LCV}$ is covered by overlapping $F$ cells with RFs of 3 × 3 pixels and one pixel distance between their centres, which compute the median disparity in their RF. On each side of, and orthogonal to, a $B_d$ edge, a cluster of three orthogonal neighbouring $F$ cells starts close to the edge and moves away from it up to a maximum distance of 25 pixels. If the three neighbouring cells are denoted by $F_1$ (closest to the border), $F_2$ (middle) and $F_3$ (farthest from the border), then disparity $F_2$ is propagated to the border at the first position where $|F_2 - F_3| \geq 2$ and $F_2 = F_1$. Hence, a stable disparity value (before the first significant disparity transition) is propagated up to a $B_d$ edge. In this process we apply median disparities in order to skip disparity changes which are unlikely to correspond to true object borders.
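The edge conspicuity measure and the border-detection condition described above can be sketched as follows. This is an illustrative reading rather than the authors' code: the function names are hypothetical, the image is assumed to be an H × W × C float array, diagonal neighbours are treated as one-pixel steps, image borders are handled by edge replication, and the NMS stage is left out, so the thresholded conspicuity map stands in for the thinned edge map.

```python
import numpy as np


def edge_conspicuity(img, inhibit_fraction=0.1):
    """Maximum Euclidean colour distance over four symmetric neighbour pairs.

    img : H x W x C float array (assumed layout). Responses below
    inhibit_fraction * max are set to zero, as in the 10% inhibition step.
    """
    h, w = img.shape[:2]
    p = np.pad(img.astype(float), ((1, 1), (1, 1), (0, 0)), mode="edge")
    co = np.zeros((h, w))
    # Horizontal, vertical and the two diagonal directions.
    for dy, dx in ((0, 1), (1, 0), (1, 1), (1, -1)):
        a = p[1 + dy:1 + dy + h, 1 + dx:1 + dx + w]  # colour at (x, y) + d_i
        b = p[1 - dy:1 - dy + h, 1 - dx:1 - dx + w]  # colour at (x, y) - d_i
        co = np.maximum(co, np.linalg.norm(a - b, axis=2))
    co[co < inhibit_fraction * co.max()] = 0.0
    return co


def border_cells(co_hat, d_le, d_lcv):
    """Binary B_d layer: conspicuous edge, or an L/E whose disparity differs from D_LCV."""
    return (co_hat > 0) | ((d_le > 0) & (np.abs(d_le - d_lcv) > 0))
```

In this sketch `d_le` is assumed to be zero wherever no line or edge was detected, which is what makes the test `d_le > 0` meaningful.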
The completion of both approaches returns an enhanced disparity map $D_{LCVE}$; see Fig 2(l).
Median smoothing. Finally, the last step serves to correct all locally inconsistent disparities by assigning to each $(x, y)$ position the most probable disparity within a small RF. This process is similar to a median smoothing filter and is achieved by applying circular cell clusters to $D_{LCVE}(x, y)$ (6px radius; slightly bigger or smaller sizes do not affect the global ranking in the Middlebury test, despite slightly improving/degrading individual images). This yields the final disparity map LCVB-DEM denoted by $D_{LCVB}$, shown in Fig 2(m).
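A minimal sketch of the final smoothing step is given below. It uses a plain median over a circular window as a stand-in for "the most probable disparity within a small RF"; the function name and the edge-replication border handling are assumptions, and the loop is written for clarity rather than speed.

```python
import numpy as np


def circular_median(d, radius=6):
    """Median filter with a circular (disk-shaped) window of the given radius."""
    h, w = d.shape
    yy, xx = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    disk = yy ** 2 + xx ** 2 <= radius ** 2  # boolean mask of the circular RF
    p = np.pad(d.astype(float), radius, mode="edge")
    out = np.empty((h, w))
    for y in range(h):
        for x in range(w):
            window = p[y:y + 2 * radius + 1, x:x + 2 * radius + 1]
            out[y, x] = np.median(window[disk])
    return out
```

With `radius=6` this corresponds to the cluster size quoted in the text; isolated outliers smaller than half the disk area are replaced by the surrounding disparity.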

LCVB-DEM Experimental Results
Fig 2 shows the $D_{LCVE}$ map in (l) and the final disparity map $D_{LCVB}$ in (m). By subjecting the last result to the Middlebury evaluation test we obtain the "Bad pixels absolute disparity error ≥ 0.5" and "Signed disparity error" of $D_{LCVB}$, shown in (n) and (o), respectively. Comparing (m) with the results obtained from L-DEM in (c), we can observe significant improvements. Nevertheless, the number of pixels with wrong disparity estimates, although reduced, is still significant (see Fig 2n, at regions near depth discontinuities), and the biggest errors are located at the border of the desk lamp and its support (Fig 2o). The method from [7] (detailed in §3 as the L-DEM) is highlighted in grey. In the next section we will show results for other images and discuss the different disparity models qualitatively and quantitatively.

Results
As mentioned in §3.3, we also tested the model at the different implementation steps on various stereograms, including tsukuba, venus, teddy, cones, aloe, cloth3, dolls, moebius and reindeer [26][27][28]. Our model performs best in non-occluded regions but is less accurate near depth discontinuities. This was expected, because L-DEM and LCV-DEM struggle at border transitions, which is why the LEDM model is used to improve the LCV-DEM. The improvement is not yet outstanding, but the error for regions near depth discontinuities decreases by more than a factor of two in the venus case. The "all regions" columns refer to entire images, including regions which are half-occluded. "Avg % bad pixels" gives a general indication of how well the methods perform, as it shows the average percentage of bad pixels (wrong estimates) over all twelve columns. In all cases, the bad pixels were counted by applying the smallest error criterion possible: a disparity difference with the ground truth greater than 0.5; for details see Scharstein and Szeliski [8].
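The bad-pixel criterion used above can be written down as a short sketch. The function name and the optional `mask` argument (for restricting the count to, e.g., non-occluded pixels) are illustrative assumptions; the official Middlebury evaluation uses its own region masks and tooling.

```python
import numpy as np


def bad_pixel_percentage(disp, gt, mask=None, threshold=0.5):
    """Percentage of evaluated pixels whose absolute disparity error exceeds threshold."""
    err = np.abs(np.asarray(disp, dtype=float) - np.asarray(gt, dtype=float))
    if mask is None:
        mask = np.ones(err.shape, dtype=bool)  # evaluate every pixel
    return 100.0 * np.count_nonzero(err[mask] > threshold) / np.count_nonzero(mask)
```

Averaging this quantity over the evaluated regions of each image gives a figure comparable in spirit to the "Avg % bad pixels" column.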
Overall, the best results were obtained for images without many small details. This is related to the size of the RFs in the cell population; smaller RFs are required to resolve the smallest details, but unfortunately they also increase binocular correspondence errors. Fig 8 shows our result compared to the ranked results of other methods, which can include more sophisticated post-processing and top-down methodologies, like image segmentation, to yield much improved pixel-to-pixel correspondences. This table was replicated from the Middlebury online evaluation webpage, applying the smallest available error threshold (0.5) to emphasise that a biologically-inspired algorithm can achieve competitive results.
We can also see that the LCVB-DEM method improves the results achieved with the L-DEM (BioDEM) method. Overall, we achieved a good position in the average ranking table: rank 95.6 between 5.4 (best) and 159.5 (worst), out of a total of 162 evaluated methods. With LCVB-DEM we rise 31 positions relative to BioDEM, from rank 126.6 to 95.6 (table retrieved on 13th January 2015). If we average the rankings of individual results in the columns devoted to non-occluded regions only, our method would rise a further 20.6 positions, to rank 75. This indicates that the biggest remaining improvement can be achieved by more accurate estimates near depth discontinuities. Finally, to the best of our knowledge, our method is ranked highest when compared with other biologically inspired methods.

Discussion
We presented a hierarchical model of four disparity estimation methods, based on the biological DEM. It can achieve good results when compared with computer vision methods [8] and it advances the state-of-the-art of biologically inspired methods [12,36]. The advantage of the proposed DEM approach is that it does not rely on extrinsic knowledge of cell parameters to estimate disparities, requiring only trained cell populations. All DEM-like models used here rely on two neuronal populations: (1) an encoding population that learns to discriminate disparities from repeated presentations of random and binocularly uniform stimuli, resulting in a population activity code (i.e., a mean spike count) for each stimulus disparity; and (2) a decoding population that associates each code to a specific disparity value, using synaptic weights that store the mean activity of the population [1]. After foveal training, the populations are ready to evaluate disparities at any retinotopic (image) position, each local activity code being decoded into a single disparity value. Although not explored here, we also expect the decoding population to have some degree of neural plasticity and context-awareness, dynamically adapting itself to correlate the decoding weights to local image content.
All proposed models use a large number of cells: the L-DEM model starts with 2880 binocular simple cells which are combined into 1440 complex cells, at each retinotopic (image) position; LCVB-DEM increases that number to 17,280 complex cells. Nevertheless, these are small numbers when compared to the total size of V1, estimated at about 190 million cells [37]; that number could well be closer to 243 million (average V1 volume of 5,405 mm³ × 45,000 cells/mm³).
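For reference, the alternative V1 estimate quoted above follows directly from the stated volume and cell density; the worked product below uses only those two figures:

```latex
\[
  5405\ \mathrm{mm}^{3} \times 45\,000\ \frac{\text{cells}}{\mathrm{mm}^{3}}
  = 243\,225\,000\ \text{cells}
  \approx 2.43 \times 10^{8}\ \text{cells}
\]
```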
The role of colour in biological disparity models is still rather speculative [32], with little research into biological disparity models that employ colour, even in view of existing evidence that disparity-sensitive neurons can also be isoluminant-sensitive [33,34]. Meanwhile, our empirical evidence suggests that mixing colour weights may play a significant role in improving the luminance discrimination of cells, which can significantly improve disparity estimations. Empirically, using different Y-channel luminance formulas in the XYZ colour space significantly affected the accuracy of the disparity maps, suggesting that the brain's luminance pathway (where L- and M-cone responses are combined) plays a key role in the stereo matching process by maximising the differences between regions of a scene. This is expected evolutionarily, since the brain needed to develop a robust disparity system that worked well for various survival-related tasks, especially in the dark, when colour perception is unreliable under scotopic conditions. Nevertheless, colour can still play an important role in defining disparity transitions by highlighting conspicuous object borders [17].
The role of perspective correction, which shifts the viewpoint of disparity maps in order to yield better estimates, is also biologically plausible: even V1 cells display the ability to shift their RFs [38]. Basically, this process increases the robustness of binocular correspondence (i.e., stereo matching) by combining the responses of three binocular RF perspectives, instead of just one, at each image position. This is especially useful for scenes with many occlusions or periodic textures. The method chosen for perspective shifting, shown in Eq (12), could also be particularly useful for combining many different perspectives in multi-view stereo. In this paper we considered the left view, but only for practical reasons. In biological vision models this should be the central view, in order to mimic cyclopean vision and minimise object border occlusions between left/right perspectives.
A big advantage of the models is that they exploit cell types that are already available in the cortex: monocular simple cells can be paired to construct binocular cells. They are also useful for coding lines and edges, as in the LEDM, or even for object segregation or brightness perception [19]. Also, as shown by Pugeault et al. [9], different spatial structures can be linked both in 2D and 3D by using constraints like good continuity. These structures can be complemented with other features, like optical flow, colour and texture, to help in object recognition. The LEDM exploits the structural organisation of V1 hypercolumns, with very close left and right retinal projections, associating depth to detected lines and edges at a low level, i.e., a sort of "wireframe" representation [1]. This is useful for post-processing of DEM estimates in occluded regions, where some detail is visible in one projection but not in the other. This allows the LCVB-DEM to use LEDM and conspicuity edges to steer and correct disparity estimations on both edge sides, while smoothing disparities in regions without edges. The role of phase tuning in sharpening edge disparities is also yet to be explored [11].
Finally, we propose and illustrate that the classical DEM (L-DEM) and LEDM can be used to create a disparity "gist" map, i.e., they are robust enough to quickly draft the environment, either from binocular energy complex cells or from object contours (the bottom layer of Fig 5). Such maps are sufficient for human or robot navigation, as they are based on quickly extracted visual features in a very low-level layer. In a second layer, the DEM is combined with colour and perspective correction, giving a more accurate disparity map, but still lacking well-defined borders around objects. In the third layer, information about edges is integrated into the LCVE-DEM disparity map. The fourth and final layer sharpens object borders using saliency data on top of LCVE-DEM, yielding LCVB-DEM. In summary, we have two disparity gist-like maps, one with localised edge information (LEDM) and one with spatially inaccurate, but precise region information (L-DEM), which are later combined with colour and viewpoint to form a more robust map (LCVB-DEM).
For further research, it makes sense to explore some alternative and promising combinations of binocular cells that proved to yield more biologically accurate disparity tuning curves in rhesus monkeys [4,5]. The role of phase-tuned cells is also an interesting topic [19,23], as their use can be seamlessly integrated into our model, signalling false disparity matches that can be immediately corrected at a low level.