An end-to-end stereo matching algorithm based on improved convolutional neural network

Deep end-to-end learning based stereo matching methods have achieved great success, as witnessed by the leaderboards of different benchmark datasets. Depth information in a stereo vision system is obtained from a dense and accurate disparity map, which is computed by a robust stereo matching algorithm. However, previous works adopt network layers of the same size to train the feature parameters, yielding an unsatisfactory efficiency that cannot meet the demands of real scenarios. In this paper, we present an end-to-end stereo matching algorithm based on a “downsize” convolutional neural network (CNN) for autonomous driving scenarios. Firstly, the road images are fed into the designed CNN to obtain the depth information. Then the “downsize” fully-connected layers combined with subsequent network optimization are employed to improve the accuracy of the algorithm. Finally, an improved loss function is utilized to approximate the similarity of positive and negative samples under a more relaxed constraint and improve the matching quality of the output. The loss function errors of the proposed method on the KITTI 2012 and KITTI 2015 datasets are reduced to 2.62% and 3.26% respectively, and the runtime of the proposed algorithm is also reduced. Experimental results illustrate that the proposed end-to-end algorithm can obtain a dense disparity map, and the corresponding depth information can be used by the binocular vision system in autonomous driving scenarios. In addition, our method achieves better performance even when the size of the network is compressed compared with previous methods.


Introduction
Deep learning is a powerful tool for many applications, such as parameter compression [1], sentiment analysis [2], information security [3], person re-identification [4], compressive sensing [5,6], object tracking [7], image classification [8][9][10], etc. Nowadays, image sensors are widely used in fields such as robotics [11], autonomous driving [12], medical diagnosis [13], security monitoring [14], and Augmented Reality (AR) [15], where they are the core components of vision systems. Stereo matching aims at estimating the disparity map between a rectified image pair, which is of great importance to various applications such as obstacle avoidance for robot navigation [16,17], 3D scene reconstruction for augmented and virtual reality systems [18], and 3D visual object tracking and localization [19,20]. Depth information can be captured from image sensors with the following typical structures: Time of Flight (ToF) [21,22], structured light [23,24], and binocular vision [25,26]. ToF and structured light achieve high accuracy in particular scenes, but they require high implementation cost. In the field of autonomous driving [12], binocular stereo vision technology [27] is widely adopted due to its low cost, abundant information, and high robustness in object recognition; it can extract a dense disparity map and support the segmentation of road lane lines. The binocular vision system architecture used in the autonomous driving scenario is shown in Figure 1, where f denotes the focal length of the two cameras, and the scene point p projects to the points p_l and p_r in the left and right images. d represents the difference between p_l and p_r in the x-axis direction, i.e., the disparity. B denotes the baseline distance between the two camera centers. Z is the depth from point p to the camera centers O_l and O_r, which can be calculated by Eq (1.1):

Z = fB / d.    (1.1)

Stereo matching [28,29] can be divided into three categories: global matching [30][31][32], local matching [33,34], and semi-global matching (SGM) [35][36][37].
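The triangulation relation of Eq (1.1) can be sketched numerically. The focal length, baseline, and disparity values below are illustrative placeholders, not values from the paper:

```python
def depth_from_disparity(f_pixels, baseline_m, disparity_pixels):
    """Eq (1.1): depth Z = f * B / d for a rectified stereo pair.

    f_pixels         -- focal length expressed in pixels
    baseline_m       -- distance B between the two camera centers (meters)
    disparity_pixels -- horizontal offset d between p_l and p_r (pixels)
    """
    if disparity_pixels <= 0:
        raise ValueError("disparity must be positive for a finite depth")
    return f_pixels * baseline_m / disparity_pixels

# Illustrative values: a 721-pixel focal length and a 0.54 m baseline
# (typical of a KITTI-like rig) with a 10-pixel disparity.
z = depth_from_disparity(721.0, 0.54, 10.0)   # about 38.9 m
```

Note that depth is inversely proportional to disparity, so small disparity errors on distant objects translate into large depth errors.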
The global matching method attempts to find the globally optimal solution, but the computation is very laborious and the global optimum may not be found. The global method does not require a cost aggregation step. In addition, the choice of calculation methods and optimization strategies has a great influence on the global method. The local matching algorithm matches local features within a certain range around the matching points, so its quality depends on the rationality of the matching window. Moreover, local matching algorithms perform worse in weakly textured and occluded areas. Semi-global block matching (SGBM) combines the global and local methods: it adopts a pixel-wise matching cost and a dynamic programming algorithm to search for the optimal path under one-dimensional smoothness constraints. However, the performance of traditional stereo matching methods is severely limited by the handcrafted features adopted in their cost functions, which cannot meet the accuracy demands of complex scenes.
With the development of artificial intelligence, more and more researchers have begun to solve the stereo matching problem using deep learning methods. LeCun [38] trained a convolutional neural network to compute the stereo matching cost, which is then refined by cross-based cost aggregation and semi-global matching. Subsequently, Zbontar [39] presented a Siamese convolutional neural network architecture for learning a similarity measure on image patches and applied it to the problem of stereo matching. The Siamese architecture has two variants: one tuned for speed, the other for accuracy. The output of the convolutional neural network is used to initialize the stereo matching cost. However, the above network structures are too complex and have limited applicability in autonomous driving scenarios.
Aiming at improving the matching accuracy of network training for autonomous driving scenarios, an end-to-end stereo matching algorithm based on an improved convolutional neural network is proposed. The main contributions of our work are summarized as follows:
· An end-to-end training pipeline based on an improved convolutional neural network for autonomous driving applications is designed.
· The "downsize" fully-connected layers combined with subsequent network optimization are employed in the proposed network to improve the efficiency of the algorithm.
· A batch normalization layer is introduced after each convolutional layer to accelerate the convergence of the proposed network when training on autonomous driving scenarios with a large learning rate.
· An improved loss function is adopted to approximate the similarity of positive and negative samples under a more relaxed constraint and improve the matching effect of the outputs.
The KITTI 2012 and KITTI 2015 datasets are used to verify the effectiveness and robustness of the proposed algorithm. Experimental results show that the end-to-end algorithm performs better in stereo matching and obtains a dense disparity map for the binocular vision system in autonomous driving.

Related works
End-to-end deep stereo networks were not extensively studied until the first large-scale synthetic stereo dataset was disclosed by Mayer et al. [40]. Many traditional stereo matching methods have been proposed in recent years, such as GM, SGM, and SGBM. Specifically, SGM is a method based on block similarity while SGBM is a method based on disparity learning. Different from the traditional matching cost calculation methods, Zbontar [39] and LeCun [38] first employed a convolutional neural network, named MC-CNN, to learn the similarity measurement between two image patches and used it to initialize the matching cost. Luo et al. [41] proposed a Siamese network which can produce highly accurate results in less than a second of GPU computation. However, our method does not focus on low-level vision tasks such as optical flow, so such methods are not directly applicable. Shaked and Wolf [42] deepened the network for matching cost calculation using a highway network architecture with multi-level weighted residual shortcuts. It was demonstrated that this architecture outperformed several networks, such as the base network of MC-CNN.
Subsequently, Kendall et al. [43] proposed an end-to-end method for learning disparity based on three-dimensional (3D) convolutions. The method uses the geometric characteristics of the image to extract depth features, and the 3D convolution layers in the network improve disparity estimation, realizing sub-pixel-precision disparity learning without post-processing. However, the road matching effect of this method is not satisfactory in complex scenes. Guney et al. [44] designed the Displet network, which uses sparse disparity estimation and image semantic segmentation to regularize the disparity, effectively alleviating the mismatching problem of stereo matching. Mayer et al. [40] presented a novel approach named DispNet, an end-to-end CNN trained on synthetic stereo pairs. In parallel with DispNet, similar CNN architectures were applied to optical flow estimation, leading to FlowNet [45] and its successor FlowNet 2.0 [46]. GWCNet [47] introduces group-wise correlation to provide better similarity measures than previous work, and it can cooperate with the concatenation volume to further improve performance.
After that, Pang et al. [48] proposed a cascaded residual learning (CRL) network with richer input information, which consists of two parts: DisFullNet and DisResNet. Specifically, the input of DisResNet includes both the output disparity maps of the DisFullNet network and the left image synthesized by the warp operation. Such a network structure can not only improve training efficiency but also refine the initial disparity map; however, its matching accuracy is lower than that of FlowNet. Different from CRL, Liang et al. [49] propose to calculate the reconstruction error in feature space rather than color space and to share features between the disparity estimation network and the refinement network.
To improve the matching accuracy of network training for autonomous driving scenarios, an end-to-end stereo matching algorithm based on an improved convolutional neural network is proposed, which performs better in stereo matching. In addition, the proposed method obtains a denser disparity map for the binocular vision system in autonomous driving.

The pipeline of the proposed algorithm
The proposed algorithm extracts the depth information of autonomous driving scenarios utilizing the images captured by the left and right cameras mounted on the vehicle. The flow chart is shown in Figure 2, in which L and R represent the left and right images captured by the binocular camera respectively. Firstly, the image pairs and corresponding labels are taken as inputs of the proposed neural network in both the training and testing stages. Then the relevant parameters are set and the training model is obtained after several iterations. Finally, the trained feature model is loaded to obtain the depth information, which is represented by a dense disparity map.

Computing the matching cost
Generally, the first step of a stereo matching method is to compute the matching cost at each position for each candidate disparity. The purpose of this process is to find the corresponding pixel points between the matched images.
We use deep learning methods to compute the matching cost in the stereo matching procedure instead of directly searching for matching points between the left and right pairs. The images are divided into multiple blocks; the red boxes (a) and (b) in Figure 3 are blocks in the left and right images respectively, in which P and Q each denote one pixel in the corresponding block. The corresponding pixel points are found by comparing the similarities of the blocks, and the similarity can be calculated by the sum of absolute differences (SAD) over the image blocks. As shown in Eq (3.1), the smaller the similarity, the larger the matching cost.
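The SAD comparison can be sketched directly; the block contents below are illustrative, and the block layout (lists of rows of grayscale intensities) is an assumption:

```python
def sad(block_left, block_right):
    """Sum of absolute differences between two equally-sized image blocks.

    Each block is a list of rows of grayscale intensities.  In the sense
    of Eq (3.1), a smaller SAD means the blocks are more similar, which
    corresponds to a lower matching cost.
    """
    return sum(abs(p - q)
               for row_l, row_r in zip(block_left, block_right)
               for p, q in zip(row_l, row_r))

left  = [[10, 12], [11, 13]]
right = [[10, 14], [ 9, 13]]
# |10-10| + |12-14| + |11-9| + |13-13| = 4
cost = sad(left, right)
```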

Network architecture
The architecture of the proposed network is depicted in Figure 4. Convolutional layers are used to extract feature information from the left and right image blocks. A batch normalization layer is added after each convolutional layer to speed up the convergence of network training; batch normalization also keeps the training process smoother with a higher learning rate and less careful initialization. The ReLU function [50,51], with its stronger expressive ability, is adopted to keep the convergence speed of the model in a stable state. After four iterations, the left and right features are concatenated and fed into the fully-connected layers. There are 512, 384, 256, and 128 neurons respectively in the "downsize" fully-connected layers, from which the output similarity is computed, where s is the output of the similarity comparison network, t denotes the sample label, and t = 1 and t = 0 represent a positive and a negative input sample respectively.
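The "downsize" fully-connected head (512 → 384 → 256 → 128 neurons) can be sketched as a NumPy forward pass. The single-unit sigmoid output, the random weight initialization, and the ReLU placement are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# "Downsize" fully-connected head: each layer is smaller than the last,
# ending in a single similarity unit (an assumption for this sketch).
sizes = [512, 384, 256, 128, 1]
weights = [rng.normal(0.0, 0.05, (m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
biases  = [np.zeros(n) for n in sizes[1:]]

def similarity(features):
    """Forward pass: concatenated left/right features -> similarity s in (0, 1)."""
    x = features
    for W, b in zip(weights[:-1], biases[:-1]):
        x = relu(x @ W + b)
    return sigmoid(x @ weights[-1] + biases[-1])[0]

s = similarity(rng.normal(0.0, 1.0, 512))   # a score between 0 and 1
```

Shrinking the layer widths is what compresses the network size relative to equal-width fully-connected layers, which is the efficiency gain the "downsize" design targets.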

FC-layers
Because the depth information accuracy of the previous results could not meet the requirements of the experiment, the disparity was recalculated by fitting a conic through the costs at the neighboring disparities, from which the final disparity map was computed.
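The conic fit through a winning disparity and its two neighbors is a standard sub-pixel refinement; the sketch below assumes the usual closed-form parabola-vertex formula, which the paper does not spell out:

```python
def subpixel_disparity(d, c_minus, c_0, c_plus):
    """Refine an integer disparity d by fitting a parabola through the
    matching costs at d-1, d, d+1 and taking the vertex.

    c_0 is the (smallest) cost at d; c_minus and c_plus are the costs at
    the neighboring disparities d-1 and d+1.
    """
    denom = c_minus - 2.0 * c_0 + c_plus
    if denom == 0:          # flat cost curve: keep the integer disparity
        return float(d)
    return d + (c_minus - c_plus) / (2.0 * denom)

# Symmetric neighbors leave the disparity unchanged; an asymmetric cost
# curve shifts the estimate toward the cheaper side.
```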

Loss function design
The two parameters are set to 4 and 18 respectively, which relates to the stereo matching algorithm used later. When the matching cost of the correct match and of the approximately correct match is small, cross-based cost aggregation performs better. In the subsequent stage of network training, the benchmark targets for the positive and negative sample classification are [1, 0]: for each positive sample, the network expects its output to approach a similarity of 1, while the similarity of negative samples should approach 0.
In this paper, the positive and negative samples in the algorithm are constructed to satisfy the exact matching of image blocks. When the network is about to converge, the cross-entropy loss function is utilized to fit the network, which can be described by the cross-entropy loss of Eq (3.6), where s represents the output of the network, t = 1 denotes positive samples, and t = 0 denotes negative samples. In our experiment, Δ is set to 0.05. The algorithm performs better when the positive sample similarity is close to 1 and the negative sample similarity is close to 0.

Training KITTI datasets
Mini-batch gradient descent is adopted during the training process; after a certain number of training-validation iterations, the batch size is set to 150 and the momentum to 0.9. In neural network training, the learning rate is one of the most important factors affecting training speed and accuracy. If the learning rate is too small, convergence is easy to guarantee but slow; if it is too large, learning is fast but may cause vanishing [52] or exploding [53] gradients. The learning rate in our experiments is set to 0.02 at the start of training. To fit the correction range of the weights at different stages, the learning rate is reduced gradually in the later iterations. When the 18th epoch is completed (one epoch trains all samples through the network once), the loss function is close to convergence. Experimental comparison found that the test results are best when the block size is set to 9 × 9, so the subsequent experiments are based on 9 × 9 blocks. The experimental platform is an NVIDIA K80 under the TensorFlow environment.
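A gradually reduced learning rate can be sketched as a step-decay schedule. The base rate 0.02 follows the experiments above, but the decay start epoch and decay factor below are illustrative assumptions, since the paper does not specify them:

```python
def learning_rate(epoch, base_lr=0.02, decay_start=12, decay_factor=0.5):
    """Step-decay schedule: keep base_lr for the first `decay_start`
    epochs, then multiply by decay_factor every epoch so that late
    iterations make smaller weight corrections.

    base_lr = 0.02 follows the experiments; decay_start and decay_factor
    are assumptions for illustration.
    """
    if epoch < decay_start:
        return base_lr
    return base_lr * decay_factor ** (epoch - decay_start + 1)

schedule = [learning_rate(e) for e in range(18)]   # 18 epochs to convergence
```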

Loss function error comparison
As shown in Table 1, our method achieves the lowest error in the four experiments, which proves the validity of the proposed loss function. Table 2 gives the final disparity error obtained with different Δ values in the improved loss function, from which it can be seen that the error is smallest when Δ is 0.05.

Comparison with other methods
The proposed algorithm is compared with traditional stereo matching algorithms used in autonomous driving scenarios, including Efficient Large-Scale Stereo Matching (Elas) [54], SGM [55], Slanted Planar Smoothing (SPS) [56], Fast R-CNN Matching [57], and MC-CNN Fast and Slow [39]; objective error indicators are used for the comparison of experimental results. As shown in Table 3, the indicator is the ratio of pixels whose disparity differs from the reference disparity by more than m (m = 2, 3, 4, 5) pixels. It can be seen from Tables 3 and 4 that the error of the proposed algorithm is smaller than that of the other algorithms when the threshold is larger than 3 pixels, which indicates the effectiveness of the proposed algorithm.
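The >m-pixel error indicator of Tables 3 and 4 can be computed as follows; the disparity values are illustrative:

```python
def pixel_error_ratio(estimated, reference, m):
    """Fraction of pixels whose disparity differs from the reference by
    more than m pixels (the >m px error indicator of Tables 3 and 4).

    `estimated` and `reference` are flat sequences of disparities over
    the valid (labeled) pixels.
    """
    bad = sum(1 for e, r in zip(estimated, reference) if abs(e - r) > m)
    return bad / len(reference)

est = [10.0, 12.5, 30.0, 41.0]
ref = [10.2, 16.0, 30.5, 45.5]
# absolute errors: 0.2, 3.5, 0.5, 4.5 -> two of four pixels exceed 3 px
ratio = pixel_error_ratio(est, ref, 3)   # 0.5
```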
In addition, Tables 3 and 4 also report the runtime of the proposed method on KITTI 2012 and KITTI 2015 respectively, from which it can be observed that our method has a clear advantage in runtime compared with the other algorithms.

Performance of the proposed algorithm
In Figures 5-8, we show qualitative disparity results of our method on KITTI 2012 and KITTI 2015. Our method is often able to handle challenging scenarios, such as overexposed and backlit autonomous driving scenes. The error map refers to the pixel-wise difference between the computed disparity map and the reference disparity. It can be seen from the disparity maps that the algorithm obtains smooth and dense disparity maps, especially in the edge regions of the targets, where the edge information of the original target is clearly preserved.
In addition, our method is also effective in overexposed and backlit scenes. For example, in Figure 5a,b, although the car is in the shade, our method can still obtain a good dense disparity map and error map, as shown in Figure 5c,d. For the overexposed road in Figure 6a,b, our method obtains a clear dense disparity map and error map, as shown in Figure 6c,d.

Conclusions
In the field of autonomous driving, the accurate computation of depth information is crucial to driving safety. Recent studies using CNNs for stereo matching have achieved prominent performance. In this paper, an end-to-end stereo matching algorithm based on an improved convolutional neural network structure is proposed for autonomous driving scenarios. A "downsize" fully-connected layer and an improved loss function are introduced in the proposed network. Experiments are carried out on the KITTI 2012 and KITTI 2015 datasets. The results show that our algorithm has higher accuracy and efficiency than traditional algorithms and some deep learning-based methods: the loss function errors of the proposed method on KITTI 2012 and KITTI 2015 are reduced to 2.62% and 3.26% respectively, and the runtime is also reduced. The end-to-end stereo matching method proposed in this paper may provide algorithmic support for localization in autonomous driving. For future work, we plan to use segmentation information to optimize disparity maps after matching for more complicated driving tasks.