Confidence-based iterative efficient large-scale stereo matching

In this study, we integrate confidence into efficient large-scale stereo (ELAS) matching to produce a more accurate approach to binocular stereo for high-resolution image matching. ELAS ensures good performance in the presence of poorly textured and slanted surfaces, but one of its deficiencies is its unsatisfactory ability to capture disparity discontinuities. Our formulation explicitly models the effects of confidence as a likelihood term in a principled manner using the Bayes rule. Because it is an iterative method, we associate each point with a variable confidence value and update this value based on a given confidence updating rule. Meanwhile, complementary support points are selected from stable points whose confidence value exceeds a predefined threshold, which differs from ELAS, whose support points are matched in advance and kept unchanged in the subsequent process. Confidence also plays a vital role in avoiding expensive computation, and the adjustment of support points makes disparity estimation more flexible. Quantitative evaluation demonstrates the effectiveness and efficiency of the proposed formulation in improving the accuracy of disparity estimation.


PUBLIC INTEREST STATEMENT
As an important technique of three-dimensional reconstruction, binocular stereo matching is undergoing rapid development. ELAS is an efficient stereo matching method for large-scale images. To improve the performance of ELAS, CI_ELAS is proposed in this paper by integrating the concept of confidence. High confidence means high reliability. In an evaluation on data-sets from the Middlebury and KITTI benchmarks, the proposed method outputs a more reliable disparity map, which can be further used to generate three-dimensional information and reconstruct the scene.

Introduction
The estimation of disparity maps from binocular imagery has played a fundamental role in computer vision and multimedia processing for decades. Many researchers are dedicated to solving stereo matching with high accuracy and efficiency. Related algorithms mainly include local algorithms and global algorithms (Facciolo, de Franchis, & Meinhardt, 2015; Scharstein & Szeliski, 2002; Tombari, Mattoccia, & Di Stefano, 2007). Local algorithms have an advantage over global algorithms in terms of inexpensive computation. Disparity maps are estimated in a pixel-wise fashion by comparing features over a support region (a concatenation of the features of pixels in a window centered on the pixel) of the reference and target images. However, local algorithms are susceptible to image noise and are largely ambiguous in poorly textured or repetitive regions. Global algorithms integrate prior constraints into a correlation-based stereo model to decrease the matching ambiguities (Boykov, Veksler, & Zabih, 1998; Cheng & Caelli, 2007; Felzenszwalb & Huttenlocher, 2006; Kolmogorov & Zabih, 2001; Woodford, Torr, Reid, & Fitzgibbon, 2009). A compatibility function that expresses the compatibility between neighboring disparities is typically introduced to aggregate support (Tappen & Freeman, 2003), giving rise to the Markov random field (MRF) based model. Optimization of the resulting MRF-based energy function is generally considered to be NP-hard. Many techniques such as graph cut (Boykov et al., 1998; Kolmogorov & Zabih, 2001; Taniai, Matsushita, & Naemura, 2014) or belief propagation (Felzenszwalb & Huttenlocher, 2006) have been proposed to solve it effectively, but these algorithms are all too slow for images of a reasonable size (Szeliski et al., 2008).
To deal efficiently with large images, efficient large-scale stereo (ELAS) is proposed as a generative probabilistic model for stereo matching that allows for dense matching with small aggregation windows by reducing ambiguity in the correspondences (Geiger, Roser, & Urtasun, 2011). A piecewise linear prior is constructed over the disparity space by forming a triangulation on a set of robustly matched correspondences known as support points. This method can achieve state-of-the-art performance and enables real-time stereo matching at resolutions greater than 1 megapixel. The proposed prior of ELAS assures good performance in the presence of poorly textured and slanted surfaces, but its deficiencies include unsatisfactory performance in capturing disparity discontinuities (Geiger et al., 2011).
To address this problem, we incorporate into ELAS the concept of confidence (Hu & Mordohai, 2010), which can describe the correctness of the estimated disparity assigned by the matching strategy. We named the resulting algorithm confidence-based iterative ELAS (CI_ELAS). Our formulation explicitly models the effect of confidence as a likelihood term in a principled manner using the Bayes rule. We associate each point with a variable confidence value and update this value in an iterative manner based on a predefined confidence updating rule. In the first iteration, the disparity estimation process is the same as that of ELAS; during the next iteration, the support points, which are used to construct a new triangulation, are those selectively obtained in the preliminary and previous steps with a high level of confidence. Other stable points with high confidence levels are kept unchanged during subsequent iterations. The entire process is halted once the maximum number of iterations is achieved or the number of unstable points shows no obvious change, which means that the disparities of most image pixels are stable and reliable. Depth estimation can be improved by introducing the notion of confidence. As demonstrated in our experiments, the proposed method can achieve satisfactory behavior and outperforms ELAS with a tolerable sacrifice in efficiency because of its iteration scheme. This paper is organized as follows. Section 2 briefly describes ELAS. Section 3 introduces the proposed method in detail. Section 4 gives the qualitative and quantitative results of the proposed algorithms and makes comparisons with other well-known stereo methods. Section 5 includes conclusions and directions for future work.

ELAS
In this section, we briefly describe the general process of ELAS before introducing our method. ELAS (Geiger et al., 2011) is an efficient approach to binocular stereo that was inspired by the observation that many stereo correspondences are highly ambiguous, whereas some, the support points, can be robustly matched due to their texture and uniqueness. Specifically, ELAS makes use of a generative model to estimate the disparity maps. The input images are assumed to be rectified such that the correspondences are restricted to the same horizontal line in both images. The support points are first obtained to construct a triangulation, and every triangle of the mesh corresponds to a plane defined by the disparity values and image-plane locations of its vertices. The mean disparity value of each pixel located in a triangle can be computed using the plane model. The search range for possible disparities of each pixel is then either an interval centered on its mean disparity value or the set of disparities of its surrounding support points in a small 20 × 20-pixel neighborhood.
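The support points are matched on a regular grid using the $\ell_1$ distance between vectors formed by concatenating the horizontal and vertical Sobel filter responses of 9 × 9-pixel windows. The Sobel masks are 3 × 3 pixels, and the grid has a fixed step-size of 5 pixels. The support points must satisfy the following conditions: (1) they are matched consistently from left to right and from right to left; and (2) the ratio between their best and second-best matches is less than a threshold.
Following Geiger et al. (2011), let $S$ denote the set of support points and let $o_n^{(l)}$ and $o_n^{(r)}$ denote an observation in the left image and in the right image, respectively. The joint distribution factorizes as
$$p(d_n, o_n^{(l)}, o_n^{(r)}, S) \propto p(d_n \mid S, o_n^{(l)})\, p(o_n^{(r)} \mid o_n^{(l)}, d_n),$$
where $d_n$ is a disparity value and the conditional probability of $d_n$ given $o_n^{(l)}$ and $S$ is formulated as
$$p(d_n \mid S, o_n^{(l)}) \propto \begin{cases} \gamma + \exp\left(-\dfrac{(d_n - \mu(S, o_n^{(l)}))^2}{2\sigma^2}\right) & \text{if } |d_n - \mu| < 3\sigma \text{ or } d_n \in N_S, \\ 0 & \text{otherwise,} \end{cases}$$
where $\mu(S, o_n^{(l)})$ is a mean function that links the support points and the observations and $N_S$ is the set of the disparities of its surrounding support points, which are contained in a small 20 × 20 neighborhood of $o_n^{(l)}$. Specifically, for the $i$th triangle, the mean disparity value of an observation $o_n^{(l)}$ located at $(u_n, v_n)$ in this triangle can be calculated with the following formula, in which $(a_i, b_i, c_i)$ are the plane parameters of the triangle:
$$\mu_i(o_n^{(l)}) = a_i u_n + b_i v_n + c_i.$$
The conditional probability of $o_n^{(r)}$ given $o_n^{(l)}$ and $d_n$ is formulated as
$$p(o_n^{(r)} \mid o_n^{(l)}, d_n) \propto \begin{cases} \exp\left(-\beta \lVert f_n^{(l)} - f_n^{(r)} \rVert_1\right) & \text{if } (u_n^{(r)}, v_n^{(r)}) = (u_n^{(l)} - d_n, v_n^{(l)}), \\ 0 & \text{otherwise,} \end{cases}$$
where $f_n^{(l)}$ and $f_n^{(r)}$ are the feature vectors of $o_n^{(l)}$ and $o_n^{(r)}$, respectively, and $\beta$ is a constant. Finally, taking the negative logarithm yields an energy function that can be easily minimized and needs to be evaluated only if $|d - \mu| < 3\sigma$ or if $d$ is an element of the neighboring support point disparities:
$$E(d) = \beta \lVert f^{(l)} - f^{(r)}(d) \rVert_1 - \log\left[\gamma + \exp\left(-\dfrac{[d - \mu(S, o^{(l)})]^2}{2\sigma^2}\right)\right].$$
The algorithm is pixel-wise and can normally be run in parallel because the disparity value of any pixel depends only on the support points around it and on its corresponding triangle, which is constructed from the locations and disparity values of the support points and remains unchanged once the support points are given.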
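As a rough illustration of the plane prior and the energy described in this section, the following Python sketch fits the plane of one triangle from its three support points, evaluates the mean disparity of a pixel, and selects the disparity with the lowest energy. The energy form (an l1 feature term weighted by β minus the logarithm of γ plus a Gaussian around the mean) follows our reading of Geiger et al. (2011). All function names and the `feat_right_at` callback are hypothetical, and the toy feature vectors merely stand in for the Sobel descriptors.

```python
import math

def fit_plane(p0, p1, p2):
    """Return (a, b, c) with d = a*u + b*v + c through three (u, v, d) support points."""
    (u0, v0, d0), (u1, v1, d1), (u2, v2, d2) = p0, p1, p2
    det = (u1 - u0) * (v2 - v0) - (u2 - u0) * (v1 - v0)
    if det == 0:
        raise ValueError("degenerate triangle: vertices are collinear")
    a = ((d1 - d0) * (v2 - v0) - (d2 - d0) * (v1 - v0)) / det
    b = ((u1 - u0) * (d2 - d0) - (u2 - u0) * (d1 - d0)) / det
    c = d0 - a * u0 - b * v0
    return a, b, c

def mean_disparity(plane, u, v):
    """Mean disparity of pixel (u, v) under the plane of its triangle."""
    a, b, c = plane
    return a * u + b * v + c

def elas_energy(d, mu, feat_left, feat_right_at, neighbour_disps,
                beta=0.03, sigma=3.0, gamma=15.0):
    """Energy of hypothesis d; infinite outside the restricted search range."""
    if abs(d - mu) >= 3 * sigma and d not in neighbour_disps:
        return math.inf
    # l1 distance between the left feature and the right feature at disparity d
    l1 = sum(abs(x - y) for x, y in zip(feat_left, feat_right_at(d)))
    prior = gamma + math.exp(-((d - mu) ** 2) / (2 * sigma ** 2))
    return beta * l1 - math.log(prior)

def best_disparity(candidates, mu, feat_left, feat_right_at, neighbour_disps):
    """Pick the candidate disparity with the lowest energy."""
    return min(candidates,
               key=lambda d: elas_energy(d, mu, feat_left, feat_right_at,
                                         neighbour_disps))

# Toy usage: three support points and a fake descriptor that matches at d = 15.
plane = fit_plane((0, 0, 10), (10, 0, 20), (0, 10, 10))
mu = mean_disparity(plane, 5, 5)
toy_right = lambda d: [1, 2, 3] if d == 15 else [5, 5, 5]
best = best_disparity(range(0, 30), mu, [1, 2, 3], toy_right, {3})
```

Returning an infinite energy outside the range is only one way to express the restricted search; an implementation aiming for speed would simply not enumerate those hypotheses.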
ELAS can reduce the disparity search space by considering the already-constructed triangular mesh and its neighboring disparities, compute accurate disparity maps of high-resolution images at frame rates close to real time, and decrease the stereo matching ambiguities by introducing a prior distribution estimated from robust support points. A high level of performance is obtained in the presence of poorly textured and slanted surfaces, but its ability to capture disparity discontinuities is not satisfactory and can be further improved. To address this problem, in our paper, each pixel whose disparity is to be estimated is associated with a confidence value, which determines whether and how its disparity is updated. In addition, the set of support points is adjusted to produce a better result. Without the adjustment of support points, the disparity value assigned by ELAS cannot be changed. A detailed description of our improved method is given in Section 3.

Confidence-based iterative ELAS
This section describes the proposed stereo matching method. Given the reference and target images, disparity maps are estimated with CI_ELAS. Without loss of generality, in the following section, we consider the left image as the reference image.

Confidence calculation and updating
To begin, we must define the confidence of each pixel. Confidence is defined based on the ratio between the second-best match cost and the best match cost obtained by an arbitrary disparity estimation method, using the Peak Ratio (PKR) rule (Egnal, Mintz, & Wildes, 2004; Hu & Mordohai, 2010):
$$C_{\mathrm{PKR}}(x, y) = \frac{\min_{d \neq D(x, y)} \mathrm{Cost}(x, y, d)}{\mathrm{Cost}(x, y, D(x, y))},$$
where $\mathrm{Cost}(x, y, d)$ is the cost value obtained by assigning a disparity hypothesis $d$ to a pixel $(x, y)$, and $D(x, y)$ indicates the disparity value at $(x, y)$ with the minimum matching cost. To limit the confidence value to the range from 0 to 1, a bounded formula, Equation (8), is adopted in this paper instead. Inspired by the process of obtaining the support points of ELAS, we can define the confidence value using the same match criterion, and the primary confidence value for each pixel can then be calculated using Equation (8). The confidence updating rule is described in Section 3.2.
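The peak-ratio idea can be sketched in a few lines of Python. The first function computes the ratio of the second-best to the best cost as described above. The second function maps it into the range from 0 to 1 as one minus the inverse ratio; this particular mapping is an assumption made for illustration only and is not claimed to be the paper's Equation (8). Function names are hypothetical, and costs are assumed to be non-negative.

```python
import math

def peak_ratio(costs):
    """Return (best disparity index, second-best cost / best cost).

    costs[d] is the matching cost of disparity hypothesis d (non-negative).
    """
    if len(costs) < 2:
        raise ValueError("need at least two disparity hypotheses")
    best_d = min(range(len(costs)), key=lambda d: costs[d])
    c1 = costs[best_d]
    c2 = min(c for d, c in enumerate(costs) if d != best_d)
    if c1 == 0:
        # A perfect match: unbounded ratio unless the runner-up is perfect too.
        return best_d, (math.inf if c2 > 0 else 1.0)
    return best_d, c2 / c1

def bounded_confidence(costs):
    """ASSUMED mapping of the peak ratio into [0, 1]: 1 - c1 / c2."""
    best_d, pkr = peak_ratio(costs)
    return best_d, 1.0 - 1.0 / pkr

best_d, conf = bounded_confidence([9.0, 2.0, 8.0, 4.0])
```

Under this assumed mapping, two equally good hypotheses give a confidence of 0, and a best cost far below the runner-up gives a confidence close to 1.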

Formulation of the proposed method
For clarity, we use the same notation as ELAS; in addition, $d_n^{(i)}$, $cf_n^{(i)}$, and $S^{(i)}$ denote the disparity values, confidence values, and the set of support points in the $i$th iteration, respectively. Building on ELAS and introducing the confidence values, we obtain a new posterior of $d_n^{(i+1)}$, Equation (9), whose maximum is equivalent to that of the product of three probabilistic components. The fourth line of Equation (9) is derived from an independence assumption, and the third component of its last line is a likelihood term for updating the disparity, which we model with a penalty function Ψ. Compared with ELAS, because confidence values are introduced, disparities are calculated as in ELAS and are additionally updated under the constraint of the penalty function Ψ in each iteration.
For the construction of function Ψ, we use the robust penalty function derived from the total variation model (Rudin, Osher, & Fatemi, 1992; Wang & Yang, 2011). In that function, η and $\gamma_d$ are two parameters of the total variation model that control the sharpness and upper bound of Ψ, respectively, for our likelihood term. To make it more reasonable, we substitute $\gamma_d$ with $cf_n^{(i)}$; Ψ is then defined by Equation (12). To gain some intuition for Equation (12), we set η to zero. The resulting expression penalizes a new disparity value of a pixel that deviates from the previous one according to the pixel's confidence value. If the confidence value is sufficiently high, the difference between $d_n^{(i)}$ and $d_n^{(i+1)}$ should not be excessive; otherwise, the penalty will also be large, which indicates that the update is not moving in the correct direction. If the confidence value is low, a disparity assignment that diverges from the current one is acceptable because the current one is not sufficiently reliable, and a larger change is needed.
Overall, the new disparity value is obtained by minimizing the energy function in Equation (14). Parameter α is added to control the strength of the confidence-based term, which is the third term of Equation (14). The minimization process is iterative; within each iteration it proceeds as in ELAS, but with this modified energy function. Note that the set of support points and the confidence value of each unstable point must be updated in an iterative manner. Support points $S^{(i)}$ ($i \geq 1$), which are used to construct a new triangulation, are those selectively obtained in the preliminary and previous steps with a high confidence level. $S^{(0)}$, obtained at the beginning of the first iteration, is the same as with ELAS and is applied to obtain the primary disparity $d^{(0)}$. Other stable points with high confidence levels are kept unchanged during the succeeding iterations.
When one iteration is finished, the confidence level of every point, with the exception of the stable ones, must be updated according to the confidence updating formula in Equation (15), which depends not only on the point's previous and current disparities but also on the current disparity of its corresponding point (Shi, Wang, Yin, & Pei, 2015); the corresponding point lies in the right image when taking the left image as the reference image. Specifically, the confidence updating process obeys three rules: (1) the updated confidence level should be based on the current level; (2) updating will increase the confidence level if the newly updated disparity is close to the previous disparity in the last iteration, and reduce the confidence level otherwise; and (3) … are set to 0.25, 2, and 2, respectively, in our experiments. The parameter settings are discussed specifically in Section 3.3.

Updating of support points
Updating the support points is the core of CI_ELAS. Building on the description in Sections 3.1 and 3.2, the pipeline for updating support points is illustrated in Figure 1.

Initialization
First, $S^{(0)}$ is obtained in the same manner as in ELAS. Among the support points in $S^{(0)}$, the value of a pixel with high confidence is set to 1.1 in $CF^{(0)}$, and the values of all other pixels are set to 0 in $CF^{(0)}$. Here, the value 1.1 functions as an indicator of a stable point regardless of the range of confidence. In a similar manner, $D_L^{(0)}$ and $D_R^{(0)}$ are initialized to a negative value everywhere except at the stable points in $S^{(0)}$. Once $\{S^{(i)}, d_n^{(i)}, cf_n^{(i)}\}$ is calculated, the disparity map $D_L^{(i+1)}$ can be derived from Equation (14), and $D_R^{(i+1)}$ is calculated in the same way. For clarity, $CF^{(i)}$ is the set of confidence maps of the reference image and the target image in the $i$th iteration.
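The initialization can be pictured with the following minimal Python sketch for one image. It assumes that support points are given as (u, v, d) tuples, that a caller-supplied predicate `is_confident` decides which of them count as highly confident, and that the "negative value" is -1; these names and choices are illustrative assumptions rather than details taken from the paper.

```python
STABLE_FLAG = 1.1   # marker for stable points, outside the [0, 1] confidence range
UNSET_DISP = -1.0   # ASSUMED negative placeholder for pixels without a disparity

def init_maps(width, height, support_points, is_confident):
    """Build the initial confidence map and disparity map for one image.

    support_points: iterable of (u, v, d) tuples (column, row, disparity).
    is_confident:   hypothetical predicate (u, v) -> bool.
    """
    cf = [[0.0] * width for _ in range(height)]
    disp = [[UNSET_DISP] * width for _ in range(height)]
    for u, v, d in support_points:
        if is_confident(u, v):
            cf[v][u] = STABLE_FLAG   # flagged as stable from the start
            disp[v][u] = float(d)    # only stable points keep their disparity
    return cf, disp

# Toy usage: two support points, only the first one is treated as confident.
cf0, d0 = init_maps(4, 2, [(1, 0, 7), (3, 1, 9)], lambda u, v: (u, v) == (1, 0))
```

The same routine would be called once for the reference image and once for the target image.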

Stable points set
A stable point meets three conditions: (1) its confidence value is greater than a predefined threshold (set to 0.95 in our experiments); (2) the confidence value of its corresponding point also exceeds this threshold; and (3) the left-right consistency check is satisfied within a threshold of 2.
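The three conditions translate directly into a small predicate. The sketch below is written for a pixel of the left (reference) image and assumes that maps are lists of rows, that a negative disparity means "not yet estimated", that the corresponding column in the right image is u - d rounded to the nearest integer, and that the consistency test is a non-strict inequality; all of these are assumptions for illustration.

```python
def is_stable(u, v, disp_left, disp_right, cf_left, cf_right,
              cf_thresh=0.95, lr_thresh=2.0):
    """Check the three stability conditions for pixel (u, v) of the left image."""
    d = disp_left[v][u]
    if d < 0:                          # no disparity estimated yet
        return False
    ur = int(round(u - d))             # column of the corresponding right pixel
    if ur < 0 or ur >= len(disp_right[v]):
        return False
    d_r = disp_right[v][ur]
    if d_r < 0:
        return False
    return (cf_left[v][u] > cf_thresh             # (1) own confidence
            and cf_right[v][ur] > cf_thresh       # (2) correspondence confidence
            and abs(d - d_r) <= lr_thresh)        # (3) left-right consistency

# Toy usage on a single image row.
dl = [[-1, -1, -1, -1, 2, -1]]
dr = [[-1, -1, 2, -1, -1, -1]]
cl = [[0, 0, 0, 0, 0.97, 0]]
cr = [[0, 0, 0.96, 0, 0, 0]]
stable = is_stable(4, 0, dl, dr, cl, cr)
```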

Remove redundant points
Not every stable point must be a support point; an excessive density of support points will result in the unsatisfactory construction of a triangular mesh. It is thus necessary to remove redundant points when selecting the support points from stable points. The principle behind this step is that a stable point is viewed as a redundant point once more than one support point lies in a window centered at this pixel, where the size of the window is the step-size mentioned in Section 2. The entire process is halted once the maximum number of iterations is reached or the number of unstable points shows no obvious change.
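One way to read the redundancy rule is as a greedy thinning pass: a stable point is dropped if a window of the step-size centered on it already contains a kept support point, so that the window would otherwise hold more than one. The Python sketch below implements this interpretation; the scan order, the function name, and the quadratic search are illustrative assumptions, not details given in the text.

```python
def thin_support_points(stable_points, step=5):
    """Greedily select support points so that no two fall in the same window.

    stable_points: iterable of (u, v, d) tuples.
    step:          window size in pixels (the ELAS grid step-size).
    """
    half = step // 2
    kept = []
    # Scan row by row, left to right, so the result is deterministic.
    for u, v, d in sorted(stable_points, key=lambda p: (p[1], p[0])):
        crowded = any(abs(u - ku) <= half and abs(v - kv) <= half
                      for ku, kv, _ in kept)
        if not crowded:
            kept.append((u, v, d))
    return kept

# Toy usage: the second and third points crowd the first one and are dropped.
pts = [(0, 0, 5), (1, 0, 5), (2, 1, 5), (5, 0, 6), (10, 10, 7)]
support = thin_support_points(pts)
```

For full-resolution images, a grid of buckets keyed by window index would avoid the quadratic scan over already-kept points.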

Parameter settings
The parameters of CI_ELAS contained in the energy function fall into two groups:
(1) Parameters inherited from ELAS: β = 0.03, σ = 3, γ = 15, and τ = 0.9 are suggested as the best values for good empirical performance (Geiger et al., 2011), and we keep them unchanged for the evaluation of all image pairs; τ is used to find support points in the preliminary step mentioned in Section 2; and
(2) Parameters from the confidence-based term in Equation (14) and the confidence update formula: η, α, λ, and Conf. To find an appropriate setting for the three parameters η, α, and λ, we evaluated the proposed algorithm on the image pair Cones for all possible combinations of their values: η ranges from 0.005 to 0.05 in steps of 0.005, with an additional value of 0.001; the reciprocal of α ranges from 0.02 to 1 at intervals of 0.02, with an additional value of 0.01; and λ ranges from 1 to 10 at intervals of 0.5. Conf, defined in the confidence update formula, is set to a constant of 0.25, and a pixel becomes stable when its confidence value exceeds 0.95. The three parameters η, α, and λ are analyzed in the next paragraph.
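To make the size of this parameter search concrete, the sketch below enumerates the grid described above in Python. The helper `frange` and the variable names are our own; the ranges are those stated in the text, and the evaluation of each combination (running CI_ELAS on Cones and measuring the error) is left out.

```python
from itertools import product

def frange(start, stop, step, ndigits=3):
    """Inclusive float range with rounding to avoid accumulation error."""
    n = int(round((stop - start) / step)) + 1
    return [round(start + k * step, ndigits) for k in range(n)]

etas = [0.001] + frange(0.005, 0.05, 0.005)      # 11 values
inv_alphas = [0.01] + frange(0.02, 1.0, 0.02)    # 51 values of 1 / alpha
lambdas = frange(1.0, 10.0, 0.5, ndigits=1)      # 19 values

# Each grid entry is (eta, alpha, lambda); alpha is recovered from its reciprocal.
grid = [(eta, 1.0 / inv_a, lam)
        for eta, inv_a, lam in product(etas, inv_alphas, lambdas)]
```

The grid has 11 × 51 × 19 = 10,659 combinations, and it contains the setting η = 0.03, α = 2.5 (that is, 1/α = 0.4), λ = 1.5.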
We first set η to two widely different values, 0.03 and 0.001, and obtain Figure 2(a) and (b). Visual comparison of Figure 2(a) and (b) shows that the proposed method's performance is slightly influenced by η.
By setting λ to 1.5 and 8, we reach a similar conclusion on λ. In Figure 2(e) and (f), two close values of α (2.5 and 1) lead to some change. The parameter α, which is a sensitive one for the proposed algorithm, actually controls the strength of the confidence-based term in Equation (14), which plays a vital role in CI_ELAS; it is not surprising that a small fluctuation in α results in an obvious variation in the error rate caused by the strength. For Cones, the best behavior of CI_ELAS is found when η = 0.03, α = 2.5, and λ = 1.5 among all kinds of parameter settings. Therefore, throughout our experiments, we use the same parameter settings, which simultaneously perform well for most image pairs compared with ELAS.
The percentage of non-converged pixels and the running time at each iteration are shown in Figure 3, and the corresponding experimental image is taken from the KITTI data-set with a resolution of 1,226 × 370. Figure 3(a) shows that disparities of plenty of pixels become stable at the first three iterations, and as the iteration proceeds, the number of the converged pixel increases to a much lesser degree and tends to remain steady. After five iterations, the procedure is convergent. In Figure 3(b), the running time is spent to reconstruct the triangulation and to estimate the unstable pixels' disparity; thanks to the dramatic decrease in the number of unstable points, fewer and fewer pixels' disparities require estimation during the iteration, and additional time consumption is not obvious.
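The halting rule (maximum number of iterations reached, or no obvious change in the number of unstable points) can be sketched as follows. The text does not quantify "no obvious change", so the relative tolerance `rel_tol` is an assumed parameter, as is the function name.

```python
def should_stop(unstable_counts, max_iters=4, rel_tol=0.01):
    """Decide whether to halt after the iterations recorded so far.

    unstable_counts: number of unstable pixels after each finished iteration.
    rel_tol:         ASSUMED threshold on the relative change between iterations.
    """
    n = len(unstable_counts)
    if n >= max_iters:
        return True                      # iteration budget exhausted
    if n < 2:
        return False                     # nothing to compare against yet
    prev, cur = unstable_counts[-2], unstable_counts[-1]
    if prev == 0:
        return True                      # every pixel is already stable
    return abs(prev - cur) / prev < rel_tol

# Toy usage: a large drop first, then almost no change.
history = [1000, 400, 398]
stop_now = should_stop(history, max_iters=10)
```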
For efficiency, the number of iterations can be set to 4; the corresponding running time is then one time longer than that of ELAS, by 5 s, on our test machine (dual-core CPU E5200 running at 2.50 GHz; only one core is used in our experiments). This overhead could be reduced on a more advanced computer and is acceptable compared with the many stereo matching methods that spend more than 1 min when operating on large images.
In addition, as already validated by the authors of ELAS's paper, those compared algorithms are at least two times slower than ELAS, and most of them are unable to achieve a satisfactory balance between accuracy and computational efficiency; our method, the improved ELAS based on confidence, still runs faster than those stereo matching methods, which means that the proposed method accounts for both accuracy and efficiency.

Experimental results
In this section, we evaluate our method and obtain some quantitative results for a variety of stereo images from the Middlebury data-set, including Cones, Teddy, Art, Aloe, Dolls, Baby3, Cloth3, Lamp2, and Rock2 (Scharstein & Szeliski, 2002), the 2014 Middlebury data-set (Scharstein et al., 2014), and the KITTI data-set (Geiger, Lenz, & Urtasun, 2012). It can be concluded from the experimental results that the performance of our method is competitive with that of ELAS.

Figure 3. (a) Percentage of non-converged pixels and (b) running time in each iteration.
In Figure 4, the three rows from top to bottom show the results for three image pairs: Rock2, Cones, and Art. The first column depicts the left camera images, the second column is the disparity map estimated by ELAS, and the last column is that estimated by CI_ELAS. The red regions indicate erroneous areas. Table 1 gives information regarding the image size, and the percentages of erroneously estimated pixels among the non-occluded ones are shown in Tables 2 and 3. Note that non-occluded pixels are used to compute error rates, and occluded pixels are obtained using left-right consistency (Cochran & Medioni, 1992; Fua, 1993). For highly textured objects such as those in Rock2 and Cones, ELAS and CI_ELAS achieved markedly smaller errors than on the other image pairs, mainly due to the piecewise linear prior in ELAS. In addition, the disparities of pixels near object boundaries and in poorly textured regions become much smoother than with ELAS, as shown in the blue and green ellipses, respectively, in Figure 4. The largest improvement is almost 0.6% (Teddy), which is equivalent to 4,050 pixels for 0.675-megapixel images. For other image pairs, accuracy is also enhanced over that of ELAS, as shown in Tables 2 and 3.
For the Middlebury 2014 data-set, version 3 of the Middlebury stereo evaluation code uses 15 training image pairs to evaluate proposed stereo matching algorithms; our evaluation was therefore conducted with these images. In addition, half-size views at resolutions above 1 megapixel are adopted here. The evaluation metrics are taken from the KITTI vision benchmark: Out-Noc is the percentage of erroneous pixels in non-occluded areas, Out-All is the percentage of erroneous pixels in total, Avg-Noc is the average disparity error in non-occluded areas, and Avg-All is the average disparity error in total. Tables 4 and 5 show the evaluation results for ELAS and CI_ELAS, with error thresholds of 1 and 2, respectively.
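The four metrics can be computed as in the following Python sketch. It assumes that a pixel counts as erroneous when its absolute disparity error exceeds the threshold, that ground-truth values of zero or less mark pixels without ground truth, and that a boolean mask flags the non-occluded pixels; the function name and the data layout (lists of rows) are illustrative assumptions.

```python
def kitti_metrics(est, gt, noc_mask, tau=2.0):
    """Return Out-Noc, Out-All (percent) and Avg-Noc, Avg-All (pixels)."""
    bad = {"Noc": 0, "All": 0}
    count = {"Noc": 0, "All": 0}
    err_sum = {"Noc": 0.0, "All": 0.0}
    for row_e, row_g, row_m in zip(est, gt, noc_mask):
        for e, g, non_occluded in zip(row_e, row_g, row_m):
            if g <= 0:                   # no ground truth at this pixel
                continue
            err = abs(e - g)
            # Non-occluded pixels contribute to both groups of metrics.
            for key in (("Noc", "All") if non_occluded else ("All",)):
                count[key] += 1
                err_sum[key] += err
                bad[key] += err > tau
    out = {}
    for key in ("Noc", "All"):
        if count[key] == 0:
            raise ValueError("no valid ground-truth pixels for " + key)
        out["Out-" + key] = 100.0 * bad[key] / count[key]
        out["Avg-" + key] = err_sum[key] / count[key]
    return out

# Toy usage: one row, the third pixel is occluded, the fourth has no ground truth.
m = kitti_metrics([[10, 12, 20, 5]], [[10, 10, 15, 0]],
                  [[True, True, False, True]], tau=2.0)
```

As a check on the pixel count quoted for Teddy, 0.6% of a 0.675-megapixel image is 0.006 × 675,000 = 4,050 pixels.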
These quantitative results validate that our method outperforms ELAS on the Middlebury 2014 data-set. For comparisons with other state-of-the-art algorithms, newly proposed methods together with classical semi-global matching (SGM) (Hirschmuller, 2008) are considered here. Specifically, line-segment-based ELAS (LS ELAS) (Ait-Jellal, Lange, Wassermann, Schilling, & Zell, 2017), dense stereo matching using a local adaptive multi-cost approach (LAMC DSM) (Stentoumis, Grammatikopoulos, Kalisperakis, & Karras, 2014), SGM, and recursive edge-aware filters for stereo matching (REAF) (Cigla, 2015) are introduced. LS ELAS is an extension of the ELAS algorithm that extracts edges and samples candidate support points along them. For every two consecutive valid support points, a (straight) line segment is created. The triangulation is forced to include the set of line segments (constrained Delaunay) for better preservation of the disparity discontinuity at the edges. LAMC DSM is an adaptive local stereo method that is integrated into a hierarchical scheme exploiting adaptive windows. REAF presents new recursive edge-aware filters that integrate reverse directions and rate calculation, together with comprehensive analyses of computational complexity and filter characteristics. The quantitative data of the compared methods are taken from the online Middlebury evaluation system. Because Avg-Noc and Avg-All are not provided for them, we omit these two metrics for the compared methods. In the first columns of Tables 4 and 5, "F" represents the full-size images to be tested, "H" represents half size, and "Q" represents quarter size. The superscripts in the second and third columns of Tables 4 and 5 indicate the rank among all of the compared methods. Our method does not seem competitive with state-of-the-art methods such as LS ELAS and SGM. However, newly proposed methods may not outperform our method either.
In addition, more detailed information can be acquired from Figures 5 and 6. CI_ELAS was not the best performer, but neither was it the worst overall. On "Jadeplant," CI_ELAS had the lowest error percentage, and it outperformed SGM on "Recycle." Interestingly, CI_ELAS and LAMC DSM had similar trends on these 15 training images. Although the proposed method is less effective than the state-of-the-art algorithms, it provides a fresh perspective on stereo matching, as do other newly proposed methods such as LAMC DSM.
For the KITTI data-set, 30 randomly selected image pairs were evaluated in our experiments. This data-set contains 194 training image pairs and 195 test image pairs for the evaluation of stereo matching algorithms. Figure 7 provides visual comparisons of ELAS and CI_ELAS and indicates that the image of bad pixels for ELAS is much denser than that for CI_ELAS. The data in Table 6 show that CI_ELAS performed slightly better than ELAS on the KITTI data-set, and the values of the four evaluation metrics were much smaller than those on the Middlebury data-set, which suggests that ELAS and CI_ELAS behave better on the KITTI data-set. To compare with other methods, SGM, Deep-Raw (Chen, Xun, Liang, Yinan, & Chang, 2015), SymST-GP (Ralha et al., 2016), and HLSC mesh (Hadfield, Lebeda, & Bowden, 2017) are considered here. Deep-Raw, trained via a convolutional neural network on a large set of stereo images with ground truth disparities, is a new measure of pixel dissimilarity that outperforms the traditional matching cost. SymST-GP creates a pipeline capable of generating three-dimensional volumes in real time for high- and low-resolution images using dense stereo-based photosymmetry. HLSC mesh uses high-level cues to improve the stereo matching performance and allows standard stereo reconstruction to be unified with a wide range of classic top-down cues from urban scene understanding. Table 6 shows that CI_ELAS has the best performance among all of the methods compared except for SGM. The conclusions from the comparison on the KITTI data-set are thus similar to those on the Middlebury benchmark: CI_ELAS outperforms ELAS and some of the newly proposed methods, but it is still outperformed by the state-of-the-art methods.

Conclusions and future work
We propose an improved stereo matching method based on ELAS. ELAS exhibits excellent performance in the presence of poorly textured and slanted surfaces owing to the novel design of its piecewise linear prior, but its ability to capture disparity discontinuities is not satisfactory. CI_ELAS retains this advantage and mitigates this deficiency to some extent. By introducing the concept of confidence, we associate each point with a changeable confidence level and update the disparity and confidence levels in an iterative manner. In addition, support points are selectively obtained at each iteration from the stable points with sufficiently high confidence, which leads to a more accurate result for disparity estimation. The performance of the proposed algorithm can still be improved. In future studies, we plan to assign a weight to each support point to make an appropriate estimation of the pixels that surround it and thereby further improve the model's overall behavior.

Funding