Binocular stereo matching algorithm based on MST cost aggregation

: For common binocular stereo matching algorithms in computer vision, it is not easy to obtain high precision and high matching speed at the same time. In this paper, an improved binocular stereo matching algorithm based on Minimum Spanning Tree (MST) cost aggregation is proposed. Firstly, the performance of the parallel algorithm can be improved by reducing the height of the tree. Then, an improved Root to Leaf (L2R) cost aggregation algorithm is proposed. By combining stereo matching technology with parallel computing technology, the above method can realize synchronous parallel computing at the algorithm level. Experimental results show that the improved algorithm has high accuracy and high matching speed for binocular stereo vision.


Introduction
Image sensors are widely used in automatic and sensing devices to realize object detection and recognition, thanks to the fast development of imaging and computer technologies. Stereo matching is of great importance for computer vision appositions, such as autonomous driving [1,2], target detection and recognition [3,4]. It also plays important role in the fields of robot navigation [5], and space detection [6,7]. Various stereo matching algorithms have been developed so far. While most algorithms are cable of meeting the matching quality, it tends to have the problem of high cost of computer time. Real time applications demand enhanced matching quality and reduced processing time. Therefore, applicable methods are needed to meet and balance these two seemly contradictory requirements for real time processing.
Stereo matching algorithms can be classified into global and local algorithms, normally. The local algorithms tend to have lower computational complexity and lower accuracy comparing with global algorithms. Recently developed algorithms target to find balance between quality and computation time. Among those, the non-global stereo matching method based on Minimum Spanning Tree (MST) is proved to be efficient and advantageous [8]. The method however still suffers the problem of high computation time when deals with real time applications. Parallel processors and graphics processor (GPU) are great help to increase the computational speed. Especially parallel processor, having the advantage of low power requirements, are of many research concerns. There are some promising results have been obtained in the development of parallel-sequential matching algorithms. In this paper, an advanced method is proposed to build the MST based stereo matching algorithm by Yang on parallel processors, which leads to reduced computation time at the same maintain the original algorithm's advantage of high accuracy.
If ′ is the MST of , the ′ satisfies 0 ⊆ for any 0 . When | 0 | = − 1, the equation (1) (1) where， () We represents the weight of an edge e . The MST of a connected graph can be calculated by Prim algorithm or Kruskal algorithm [9]. The difference between these two algorithms is that the Prim algorithm starts from an origin, and continuously expands its range to select the edge with the smallest weight; The Kruskal algorithm looks at the whole, and constantly selects the smallest weight edge from the whole.
In most local stereo matching algorithms, the matching cost of a pixel is calculated within a certain domain window of . In other words, the matching cost is determined by the pixels which are located inside the domain widow. The pixels which are outside of the domain widow have no influxes on the matching cost.
In order to make the calculation of matching cost global, that is, all pixels in the image can affect the matching cost of the pixel, image can be regarded as an undirected weighted connected graph = ( , ). The vertex set is all pixels of and edge set is the relationship between adjacent pixels of . Therefore, is a 4-connected grid graph. The weight of the edge ( , ) of is defined as the gradient of the pixel gray value as shown in equation (2).
According to the structure of MST, for the two points and in the image, if is far away from and the color differences between and is large, the number of path hops between and in MST is large and the influence on the calculation of matching cost of is weak. Therefore, the distance ( , ) between the two points and in the MST can represent not only the spatial difference of pixels, but also the intensity difference of pixel values between them. The definition of similarity ( , ) between and is shown in equation (3).
where,  is the adjustment parameter of similarity. Therefore, under MST, bilateral filtering function used in cost aggregation stage can be extended to equation (4). where, Cp represents the aggregation cost of pixels at parallax , Cq represents the matching cost of pixels q at parallax , () cprepresents the node connected with p . The similarity of two pixels in an image depends on the distance between the two nodes in the MST. Therefore, the MST can be organized into a tree structure. The leaf to root (L2R) cost aggregation process is used to calculate the cost of each pixel affected by the nodes in its subtree. The root to leaf (R2L) cost aggregation process is used to calculate the cost of each pixel affected by nodes other than those in its subtree in MST. After L2R and R2L processes, the cost of each pixel affected by all other pixels can be obtained. The final matching costs were aggregated over each influencing pixel and then over each pixel included.

An improved BFS algorithm
Breadth first search (BFS) is a search algorithm in graph structure [10]. The algorithm starts from a source node, traverses the child nodes of the node in turn, and then traverses the child nodes of these child nodes, and so on, until traversing the complete graph. Therefore, BFS algorithm needs to use an open closed table to record the traversed nodes. The open records the nodes to be traversed, and the closed records the traversed nodes. Similar to the BFS algorithm, Level Synchronous Parallel BFS (LSP-BFS) algorithm maintains three node sets: visited node set , current level node set , and next level node set . The algorithm takes out the elements in and places the adjacent nodes in in parallel. This process is iterated until is empty set. Before the next iteration process, let = ， = ∅.
Since the process of cost aggregation algorithm based on MST is carried out on the MST of the reference image, it is necessary to reduce algorithm to tree structure BFS algorithm. The BFS algorithm is mainly used for the general graphs which may have loops. The tree structure is an acyclic graph. The nodes in the tree structure have clear partial sequence relations. When a node is accessed, the parent node of the node must have been accessed, and the child node of the node must not have been accessed. Therefore, it doesn't need to check whether the adjacent points have already been accessed in the tree structure BFS algorithm. It can save the closed container.

3.1.
LSP-BFS algorithm for tree structure BFS algorithm on tree structure can omit closed container. Similarly, LSP-BFS algorithm on tree structure can omit the visited node set . In the extended traversal range, it only needs all the child nodes to join the next level node set . It is not necessary to determine whether the child node has been traversed. The pseudo code of the improved LSP-BFS algorithm is shown in algorithm 1. It can be seen that LSP-BFS algorithm has two characteristics: (1) Each layer requires additional synchronization. Synchronization is a time-consuming operation. If the MST height is too high, more synchronization operations are required.
(2) The number of parallel operations in a BFS layer depends on the number of nodes in that layer. The BFS layer with more nodes is more parallel, which can take better advantage to the performance of parallel devices. There are four images in Figure 1. The four images in are converted into MST. The vertex corresponding to the upper left pixel as the root node. A tree is constructed with the root node. The results of the number of nodes in the tree hierarchy are shown in Table 1. The node hierarchy distribution is shown in Figure 2. The statistical method is to sum the number of nodes per 50 layers. In the distribution map, the abscissa is the serial number of the hierarchy (numbering every 50 levels), and the ordinate is the number of nodes in the hierarchy (summing the nodes of every 50 layers). It can be seen from Figure 2 that the number of nodes in higher level and lower level is less, and the number of nodes in middle level is more. Therefore, in general, the node distribution of tree structure is sparse in the higher level and lower level, and the middle level is more denser. According to these two characteristics, in order to give full play to the performance of parallel devices, the parallel BFS algorithm needs to reduce the height of the tree, and increases the number of nodes in each layer of the tree. As shown in Figure 3, tree (a) and tree (b) originate from the same acyclic graph, but the root node of tree (a) is 1 , the root node of tree (b) is 2 . The height of tree (a) and (b) are 5 and 4, respectively. The node density in the second layer of tree (a) is significantly lower than that in the tree (b). Therefore, for the same spanning tree, when the root nodes are different, the lower the height of the tree, the higher the nodes density in the middle level.
When the number of nodes in the same layer is large enough, LSP-BFS can play the largest role. Therefore, in order to make LSP-BFS algorithm can be applied to solve the aggregation cost of MST, it is necessary to reduce the height of tree. If the middle node of the longest path of acyclic graph is taken as the root node of MST, the MST with the lowest height can be obtained. Therefore, it is necessary to find the longest path of acyclic graph. A simple method is to use BFS twice to find the longest path. The basic process of the algorithm is as follows: first, let any point 1 of the graph be starting point, use BFS to find the point 2 which is farthest from 1 . Then from the point 2 start, use BFS to find the point 3 which is farthest from 2 . The path of 2 to 3 is the longest path of the graph.
This algorithm can find the longest path, but it is need to traverse all of the longest paths to find . This process takes extra time. In order to solve this problem, this paper proposes a pruning method. The node can be found out in twice level traversal time by using the pruning method. In an acyclic graph, nodes with only one adjacent node are called branch nodes. The node can be found by deleting these branch nodes. For a given acyclic graph = ( , ) and a source node , the process of the algorithm is as follows: (1) Starting from the source node , all the branch nodes are searched by BFS, then all the branch nodes are put into the branch node set of the next layer; (2) Let the branch node set = ， = ∅. For each node in , if the adjacent node of node is not in the next level branch node set , then is added to and deleted from . The pseudo code of pruning method is shown in algorithm 2. In this method, the time complexity of finding is (2 ) = (3 ) = ( ) . where is the longest path length of . Because is traversed twice, the time complexity of pruning method is (2 ) = ( ). The time complexity of the algorithm for finding the longest path through BFS twice is also (2 ) = ( ) . Therefore, the performance of the pruning algorithm is equivalent to that of BFS twice method. It does not need to find on the longest path, so the pruning method can reduce one traversal time, and the algorithm design is more simpler.
For the test samples in Figure 1, after reducing the height of the tree by using pruning method, the statistics of nodes for each layer are shown in Figure 4 and table 2. In the Figure 4, the black dotted line represents the node distribution of each layer when the tree height is not reduced; the red solid line represents the node distribution of each layer when the tree height is reduced. It can be seen from Figure 4 that as the tree height decreases, the density of nodes for the middle layer also increases.
It can be seen from

Improved L2R cost aggregation algorithm
The acyclic graph and a source node are obtained for a certain image. The improved L2R cost aggregation algorithm can maintain a queue and a linked list , where is the storing linked list. The process of the algorithm is as follows: (1) Put in the team; (2) Use the variable to save the length of and insert a new node at the end of ; (3) Set an element from as , insert the into the tail of , and queue all the child nodes of ; (4) Perform step (3) times; (5) Continue to perform steps (2) to (4) until is empty; (6) Traverse from the tail to the head. When each node of is traversed, the aggregate cost of all nodes in the node is calculated. This step can be calculated in parallel.
Obviously, the nodes in the linked list L store the nodes of each layer of . After traversing , the length of is the height of the tree with as the root node. Variable stores the number of nodes in each layer. The pseudo code of the algorithm is shown in algorithm 3.

Experimental results and analysis
All the algorithms are implemented and tested on Windows 10 operating system computer by using Visual C + + and NVIDIA CUDA. The hardware conditions of the computer are Intel (R) core (TM) i5-6300HQ CPU @ 2.30 GHz, NVIDIA GeForce GTX 960m graphics card and 8.00 GB memory. Middlebury stereo vision test set is widely used test set in academia. it includes test sets for binocular vision, multi vision and other technologies. Cones, Teddy, Tsukuba and Venus binocular vision test sets are used in this paper, as shown in Figure 5. The first row images are the reference images of the test set (the image of the left imaging system); the second row images are the target images of the test set (the image of the right imaging system); the third row images are the real disparity images. The parameters of the test set are shown in Table 3.
The process of parallel cost aggregation based on MST is as follows: (1) median filtering; (2) calculating the weight of edges; (3) reordering by the weight; (4) generating MST by the Kruskal algorithm; (5) pruning and constructing tree; (6) L2R process; (7) R2L process.     Image median filtering is a nonlinear filtering method. The calculation formula for each pixel of image median filtering is shown in equation (5).
( ( 0 , 0 )) = where, is the size of filtering window, the value of is odd. When the window is too large, the image will become blurred. The parameter selection in this paper starts from the minimum window (k = 1) and gradually grows.
The test results using image median filter are shown in Figure 6 and table 4. In the Figure 6, the images in the first row are real disparity images, and the images in the rows 2,3,4 and 5 are the result of setting = 1, = 3, = 5, and = 7, respectively. It can be seen from Figure 6 and table 4 that the image median filtering has a great influence on the results. When = 3 , the effect of image median filtering is better. Equation (6) and equation (7) are selected to test the edge weight. The test results are shown in Figure 7 and table 5. In the Figure 7, the images in the rows 1,2 and 3 are the real disparity images, the result images of a function 1 ( 1 , 2 ) is selected and the result images of a function 2 ( 1 , 2 ) is selected, respectively. It can be seen from Figure 6 and Figure 7 that the effect is the better when the function 2 ( 1 , 2 ) is selected.
When calculating MST by using Kruskal algorithm, it is need to sort all the edges in descending order of weight. The time complexity of common sorting algorithms is ( ( )). However, the weight of the edge will not be greater than 255 in this paper, so histogram sorting can be used. The time complexity of this method is (1).
The time test results of the algorithm are shown in table 6. The numbers in brackets in the table 6 represent the running speed ratio. The running speed ratio of MST algorithm using GPU is compared with the corresponding algorithm using CPU. The running speed ratio of the algorithm proposed in this paper is compared with the corresponding MST algorithm using GPU. It can be seen from table 6 that the MST algorithm can be accelerated by about 17 times by using GPU, and the optimization algorithm proposed in this paper can increase the speed by about 20% on this basis. Therefore, the optimization algorithm proposed in this paper can speed up about 20 times compared with the algorithm using CPU, and it can meet the requirements of real-time processing.

Conclusion
An advanced method is developed in this paper to improve the computation efficiency for stereo matching. The proposed method is to reduce the height of the tree used for MST algorithm in parallel processing, hence dramatically speed up the computation. An aggregation of matching cost is also developed simultaneously in order to achieve the hierarchical synchronous parallel computing. The cost is based on Leaf-to-Root aggregation.
The simulation results show that proposed parallel processing stereo matching algorithm using reduced high of spanning tree and L2R aggregation matching cost enhance the computation speed. It is a matter of 20 times speed-up as proved. This method therefore provides a promising approach to real-time stereo processing, which can of great importance of Binocular stereo matching.