An Improved Adaptive Window Stereo Matching Algorithm

Existing adaptive window stereo matching algorithms extract insufficient features in low-texture regions, which results in low matching accuracy. To address this problem, a gradient-based adaptive window stereo matching algorithm is proposed. Firstly, the Sobel operator is used to extract the gradient value of each pixel in the image. Then, each pixel is assigned to a high, medium, or low texture region according to its gradient value. Next, different arm length thresholds are assigned to pixels in different regions, and matching windows are generated dynamically according to the arm length and color thresholds. Finally, by generating the window several times, pixels closer to the center of the window are given higher weights. This solves the problem that existing stereo matching algorithms cannot select a matching window dynamically. Experimental results on the Middlebury dataset show that the proposed method improves the matching accuracy by 5.5% compared with the latest adaptive window stereo matching algorithm.


Introduction
Binocular stereo vision is one of the most important branches of computer vision, and binocular stereo vision technology is mainly composed of camera calibration, image correction, stereo matching and 3D reconstruction [1]. The task of the stereo matching algorithm is to find the corresponding pixels in two images and to generate the disparity map. Stereo matching, as a prerequisite for obtaining 3D information in binocular stereo vision, is the most critical step in the whole process. In a stereo matching algorithm, how to select a matching window dynamically is very important [2], and many researchers have worked on adaptive window stereo matching algorithms. Zhang et al. [3] proposed an adaptive window stereo matching algorithm based on the relationship between pixel color information and spatial distance, which provided a good solution to the window selection problem. However, this method uses a single weight for all pixels in the window, so the matching accuracy is poor. Veksler et al. [4] proposed a point-by-point adaptive selection method to obtain an appropriate matching window. However, this method requires a large amount of computation for each pixel, so the efficiency of the algorithm is very low. Qu et al. [5] used the color difference between surrounding pixels and the central pixel to construct an adaptive matching window. Fusiello et al. [6] proposed a matching algorithm that sets nine different window types for each pixel and retains the window disparity with the minimum matching cost. However, because the window types are limited, they cannot cover all window sizes and shapes, so the robustness of the algorithm is poor. Based on Zhang's method, Lv et al. [7] proposed a method that generates adaptive windows multiple times and introduced multiple weights.
However, this method does not consider the difference between low-texture and high-texture regions, so its matching accuracy in low-texture regions is poor.
To address the above problems, this paper proposes a gradient-based adaptive multi-weighted window stereo matching algorithm. Firstly, the gradient information of each pixel is extracted by the Sobel operator, and each pixel is assigned to a high, medium, or low texture region according to its gradient. Then the lateral arm length threshold of the window is determined according to the region, and the adaptive window is generated by combining it with the longitudinal arm length threshold and the color threshold. Finally, pixel weights are introduced into the algorithm by generating the adaptive window multiple times. The proposed algorithm is compared with the algorithms of Zhang and Lv on the Middlebury platform. Experiments show that the proposed algorithm has higher matching accuracy in object edge and low-texture regions.

Proposed Algorithm
The workflow of the proposed algorithm is shown in figure 1. First, the gradient of each pixel is calculated to initialize the lateral arm length threshold. Then the matching window is generated three times according to the thresholds. Finally, the matching cost is calculated by the matching algorithm, and the final disparity map is generated.

Initialize Parameters
To add regional texture information to the adaptive window algorithm, the Sobel operator is first used to extract the gradient information of each pixel. As shown in figure 2, the Sobel operator contains two 3 × 3 convolution kernels, which calculate the horizontal and vertical gradients of a pixel, respectively. The analysis shows that the gradient value of a high texture region is large, so a smaller window should be selected there. In contrast, the gradient change in a low texture region is very small, so a larger window is needed. Therefore, the lateral arm length threshold of a pixel can be determined according to its gradient. Let the lateral arm length threshold of the current pixel be Lh, let the three lateral arm length thresholds satisfy Lh1 > Lh2 > Lh3, and let the double threshold satisfy T1 > T2. The specific judgment formula for the lateral arm length threshold is shown in formula 1.
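The threshold initialization step can be sketched as follows. This is a minimal C++ sketch assuming a single-channel (grayscale) image; the function names and the concrete threshold values in the comments are illustrative, while the decision rule follows the text (high texture → small window, low texture → large window):

```cpp
#include <cmath>

// Sobel gradient magnitude at interior pixel (x, y) of a single-channel
// image stored row-major in `img` with row width w.
double sobelMagnitude(const unsigned char* img, int w, int x, int y) {
    auto I = [&](int dx, int dy) { return (double)img[(y + dy) * w + (x + dx)]; };
    double gx = -I(-1,-1) - 2*I(-1,0) - I(-1,1) + I(1,-1) + 2*I(1,0) + I(1,1);
    double gy = -I(-1,-1) - 2*I(0,-1) - I(1,-1) + I(-1,1) + 2*I(0,1) + I(1,1);
    return std::sqrt(gx * gx + gy * gy);
}

// Lateral arm length threshold selection as described around formula 1:
// gradient > T1  -> high texture  -> smallest arm length Lh3;
// gradient <= T2 -> low texture   -> largest arm length Lh1;
// otherwise      -> medium texture -> Lh2.  (Lh1 > Lh2 > Lh3, T1 > T2.)
int lateralArmThreshold(double grad, double T1, double T2,
                        int Lh1, int Lh2, int Lh3) {
    if (grad > T1) return Lh3;   // high texture: small window
    if (grad > T2) return Lh2;   // medium texture
    return Lh1;                  // low texture: large window
}
```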

Generate Matching Window
After determining the lateral arm length threshold of each pixel, as shown in figure 3, the matching point p in the image is selected, and the cross window skeleton is constructed with the p point as the center.
In the above formulas, pi is a pixel on the same row or column as point p. IC(pi) represents the value of this point in RGB space, and τ is the selected color threshold. In formula 3, Lh(pi) represents the horizontal distance from pi to p, and Lh is the lateral arm length threshold of the pixel calculated in the previous step. In formula 4, Lv(pi) represents the vertical distance, and Lv is the selected longitudinal arm length threshold. According to the above criteria, the window is expanded in four directions with point p as the center. When any of the above conditions is no longer met in a certain direction, the extension in that direction ends. Finally, a cross skeleton region centered on point p is formed; in figure 3, it is {Hp− ∪ Hp+ ∪ Vp− ∪ Vp+}. For every point q in the vertical direction, the expansion process is repeated in the horizontal direction to obtain the region {Hq− ∪ Hq+}. The final adaptive window area is denoted as A(p), as shown in equation 5.
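The arm expansion rule can be sketched as follows. This is a grayscale simplification: the paper evaluates the color difference in RGB space, whereas this sketch uses a scalar intensity difference, and the function name is an assumption. Expansion in one direction continues while the color test and the arm length limit both hold:

```cpp
#include <cstdlib>
#include <vector>

// Extend one arm of the cross skeleton from center (x0, y0), stepping by
// (dx, dy).  Expansion continues while (a) the intensity difference to the
// center stays below the color threshold tau and (b) the arm stays shorter
// than the arm length threshold L.  Returns the arm length reached.
int extendArm(const std::vector<int>& img, int w, int h,
              int x0, int y0, int dx, int dy, int L, int tau) {
    int center = img[y0 * w + x0];
    int len = 0;
    while (len < L) {
        int x = x0 + (len + 1) * dx, y = y0 + (len + 1) * dy;
        if (x < 0 || x >= w || y < 0 || y >= h) break;        // image border
        if (std::abs(img[y * w + x] - center) >= tau) break;  // color rule fails
        ++len;
    }
    return len;
}
```

The full window A(p) would then be the union, over every point q on the vertical arms, of q's horizontal arms, matching equation 5.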
According to the above steps, an adaptive window can be generated. In order to give higher weights to the pixels near the center of the window, the window is generated several times with different arm length thresholds, and the pixels added in each round are given different weights. Let the generation coefficient be β; taking three rounds of generation as an example, the coefficients are β1 = 1, β2 = 1.2 and β3 = 1.5. Taking point p as an example, if p lies in a high texture region, its lateral arm length threshold in the first round is Lh3 and its longitudinal arm length threshold is Lv. In the second round, the lateral arm length threshold is 1.2Lh3 and the longitudinal arm length threshold is 1.2Lv. In the third round, the lateral arm length threshold is 1.5Lh3 and the longitudinal arm length threshold is 1.5Lv. In addition, a weight α is assigned to each point in the region according to the round in which it is added: α1 = 3 in the first round, α2 = 2 in the second round and α3 = 1 in the third round. The final generated matching window is shown in figure 4. In figure 4, point p is the center: the darkest gray area is the adaptive window generated in the first round, which is closest to the center point and has a weight of 3; the medium gray area is the adaptive window generated in the second round, with a weight of 2; and the light gray area is the adaptive window generated in the third round, with a weight of 1.

Cost Matching
After generating the adaptive window, we need to search for similar window pixels within a specific range. In this paper, the Sum of Absolute Differences (SAD) algorithm is used to evaluate the similarity of two windows. As shown in figure 5, a point C in image A is selected as the center point to be matched, and an adaptive reference window is generated. Then matching windows are generated one by one in image B, starting from the same position and moving within the maximum search range S. The similarity measure function of the SAD algorithm [8] is shown in formula 6. After evaluating the S+1 matching windows in the maximum search range, the center pixel of the window with the smallest matching cost (i.e., the highest similarity) is selected as the best match of the point to be matched. Finally, the abscissa difference between the best matching points is recorded as the disparity of the two points, and the disparity map is generated from these disparities.
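The winner-takes-all SAD search can be sketched as follows. For brevity this sketch uses fixed square windows instead of the weighted adaptive windows described in the previous section (the search logic is the same), and the function names are assumptions:

```cpp
#include <cstdlib>
#include <climits>
#include <vector>

// SAD cost between two square windows of radius r, centered at (xl, y) in the
// left image L and (xr, y) in the right image R (both row-major, row width w).
int sadCost(const std::vector<int>& L, const std::vector<int>& R,
            int w, int xl, int xr, int y, int r) {
    int cost = 0;
    for (int dy = -r; dy <= r; ++dy)
        for (int dx = -r; dx <= r; ++dx)
            cost += std::abs(L[(y + dy) * w + xl + dx]
                           - R[(y + dy) * w + xr + dx]);
    return cost;
}

// Winner-takes-all disparity for left-image pixel (x, y): evaluate the S+1
// candidate windows d = 0..S along the same scanline and keep the disparity
// with the smallest SAD cost.
int bestDisparity(const std::vector<int>& L, const std::vector<int>& R,
                  int w, int x, int y, int r, int S) {
    int best = 0, bestCost = INT_MAX;
    for (int d = 0; d <= S && x - d - r >= 0; ++d) {
        int c = sadCost(L, R, w, x, x - d, y, r);
        if (c < bestCost) { bestCost = c; best = d; }
    }
    return best;
}
```

The abscissa difference d returned for each pixel is exactly the disparity value written into the disparity map.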

Experiment and Analysis
The dataset used in this paper is the Middlebury dataset, and three standard images provided by it, Baby, Bowling and Flowerpots, are used for testing. The code is written in C++ in Visual Studio 2019 on Windows 10, on an Intel Core i7-9750H @ 2.6GHz with 8GB RAM. As shown in figure 6, the matching accuracy of the proposed algorithm is verified on the Middlebury dataset by comparing its results with those of the improved algorithms proposed by Zhang and Lv. Figure 6 shows (a) the input left image, (b) the input right image, (c) the disparity map generated by Zhang's algorithm, (d) the disparity map generated by Lv's algorithm and (e) the disparity map generated by the proposed algorithm. From the three sets of disparity maps, it can be seen that, compared with the other two adaptive stereo matching algorithms, the proposed algorithm has higher matching accuracy in the object edge region and the low texture region. Table 1 shows the evaluation values of the three algorithms on the Middlebury website platform, i.e., the mismatch rate of each algorithm in each region of the three images. In table 1, "all" is the algorithm's mismatch rate on the entire image, "nonocc" is its mismatch rate in the low texture region, and "disc" is its mismatch rate in the object edge region. It can be seen from table 1 that, compared with the algorithm proposed by Zhang, on the three images the algorithm proposed in this paper reduces the average mismatch rate on the entire image by 10.93%, the average mismatch rate in the low-texture region by 5.8%, and the average mismatch rate in the object edge region by 3.8%. Compared with the algorithm proposed by Lv, the proposed algorithm reduces the average mismatch rate on the entire image by 5.5%, the average mismatch rate in the low-texture region by 3.6%, and the average mismatch rate in the object edge region by 1.6%.
In summary, the proposed algorithm has a higher matching accuracy than the current adaptive window stereo matching algorithm.