Unsupervised Monocular Depth Estimation Method Based on Uncertainty Analysis and Retinex Algorithm

Depth estimation from a single image is a classic problem in computer vision, and is important for 3D scene reconstruction, augmented reality, and object detection. At present, most researchers are focusing on unsupervised monocular depth estimation. This paper proposes two solutions to the current depth estimation problem. The first is a monocular depth estimation method based on uncertainty analysis, which addresses the fact that a neural network has strong expressive ability but cannot evaluate the reliability of its output. The second is a photometric loss function based on the Retinex algorithm, which solves the problem of pixels being pulled around object boundaries in the presence of moving objects. We objectively compare our method with current mainstream monocular depth estimation methods and obtain satisfactory results.


Introduction
In contemporary research, the methods of monocular depth estimation based on deep learning are divided into the following six types: supervised, unsupervised, semi-supervised, conditional random field (CRF), joint semantic segmentation, and information-assisted depth estimation. In practical applications, the six methods overlap each other, and there are no strict boundaries.
When monocular vision research emerged, many scholars trained neural networks in a supervised manner. In 2014, Eigen et al. [1] used deep neural networks for monocular depth estimation for the first time. They proposed the use of neural networks of two different scales to estimate the depth of a single image: the coarse-scale network predicted the global depth of an image, and the fine-scale network refined local details. In 2015, Eigen and Fergus [2] proposed a unified multi-scale network framework based on the aforementioned work, and used it for depth prediction, surface normal estimation, and semantic segmentation. Liu et al. [3] combined a deep convolutional neural network with a conditional random field to propose a deep convolutional neural field for estimating the depth of a single image. Based on the work of Trigueiros et al. [4], Liu et al. [5] proposed a comparative study of four classification algorithms for static hand gesture classification using two different hand feature data sets. Li et al. [6] proposed a multi-scale depth estimation method: first, a deep neural network was used to regress the depth at the super-pixel scale, and then multi-level conditional random field post-processing was used to refine the super-pixel-scale result. Laina et al. [7] proposed a fully convolutional network architecture based on residual learning for monocular depth estimation; the network is deeper and does not require post-processing. Cao et al. [8] treated the depth estimation problem as a pixel-level classification problem. Conditional random fields (CRFs) have always performed well in the field of image semantic segmentation. Considering the continuity of depth values, researchers have begun to apply CRFs to depth estimation problems and have achieved some results in recent years [9]. In addition, some researchers [10][11][12][13] combined semantic segmentation with depth estimation.
They used the similarities between depth and semantic information to make the two complement each other to achieve the goal of improving accuracy.
Due to the particularity of monocular depth estimation, the supervised training of neural networks is often limited by the scene. Thus, to overcome the need for ground truth data, unsupervised training of a network is a popular research topic. The basic idea is to use either left and right images or inter-frame images, in combination with epipolar geometry and automatic encoders to solve the depth. Many scholars have begun to study the monocular depth estimation of unsupervised learning. Zhou et al. [14] proposed a method that uses a sequence of images taken by a monocular camera as a training set and uses an unsupervised method to train a neural network for monocular depth estimation. Yin et al. [15] improved upon the aforementioned methods by adding a part to estimate the optical flow, extracting the geometric relationship in the prediction of each module, merging them for image reconstruction, and integrating depth, camera motion, and optical flow information for joint estimation. Reza et al. [16] proposed an unsupervised learning monocular image depth and motion estimation method using 3D geometric constraints. Clement et al. [17] used image reconstruction loss to train the network, and output the disparity map through the neural network. Zhang et al. [18] solved the problem of scale uncertainty in unsupervised learning using binocular data to jointly train depth estimation and visual odometer networks. Garg et al. [19] proposed the use of stereo image pairs to achieve unsupervised monocular depth estimation without the need for depth labels, similar to that of automatic encoders. Godard et al. [20] further improved upon the above method, using the consistency of the left and right images to achieve unsupervised depth prediction. Kuznietsov et al. [21] proposed a combination of supervised learning methods labeled with sparse depth maps and unsupervised learning methods, namely semi-supervised learning, to further improve performance.
Current unsupervised monocular depth estimation studies use similar pixel-value subtraction methods (some researchers also use the SSIM algorithm) for the photometric loss. The SSIM algorithm is shown in Equation (1):

SSIM(x, y) = [(2µ_x µ_y + C_1)(2σ_xy + C_2)] / [(µ_x² + µ_y² + C_1)(σ_x² + σ_y² + C_2)],   (1)

where x and y represent the two images to be compared, C_1 and C_2 represent constants, µ_x and µ_y are the mean gray levels, σ_x² and σ_y² are the variances, and σ_xy is the covariance between the two images. We believe that the photometric loss primarily affects the depth at the edges of objects in the image, which leads to an unclear depth map around object contours. In addition, the convolutional neural networks currently used for monocular depth estimation have strong expressive ability but cannot evaluate the reliability of their output. In this study, the variance of the trained neural network is used to construct an uncertainty loss function. Uncertainty estimation has a long history in neural networks as well, starting with Bayesian neural networks, in which different models are sampled from the weight distribution to estimate the mean and variance. This method is simple and effective, and many scholars have integrated uncertainty into neural networks [22,23].
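As a concrete reference, the SSIM index in Equation (1) can be sketched in a simplified whole-image form (no sliding window); the function name and the constant values are our own illustrative choices, not part of the original method:

```python
import numpy as np

def ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Simplified whole-image SSIM for images with values in [0, 1].

    Means, variances, and the covariance are computed over all pixels
    rather than over local windows, which is enough to illustrate the
    structure of Equation (1).
    """
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    return ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    )
```

Identical images give an SSIM of 1, and the score decreases as structural similarity drops.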
In this paper, our main contributions are two-fold:
1. An unsupervised depth estimation network based on uncertainty is proposed to improve the low prediction accuracy of monocular depth estimation. This uncertainty learning solves the problem that the convolutional neural networks currently used for monocular depth estimation have strong expressive ability but cannot evaluate the reliability of their output. By modeling the uncertainty, the confidence of the estimated depth can be predicted while the model prediction accuracy is improved and the uncertainty of the output result is quantified.
2. Retinex lighting theory is used to construct the photometric loss function to solve the interference problem caused by dynamic objects in the scene.

Materials and Methods
Given two consecutive frames I_t and I_{t−1} sampled from an unlabeled video, we first estimate their depth maps D_t and D_{t−1} using the depth network, and then predict the relative 6D camera pose P_{ab} between them using the PoseNet network. With the predicted depth map D_t and the relative camera pose P_{ab}, we synthesize I*_t by warping I_{t−1}, where differentiable bilinear interpolation [24] is used as in [14]. Similarly, we obtain the image I*_{t−1}. Finally, we input (I*_t, I*_{t−1}) into the DepthNet to obtain (D*_t, D*_{t−1}). We construct the loss function L_U between (D_t, D*_t) and (D_{t−1}, D*_{t−1}) using uncertainty analysis. The structure of the network is shown in Figure 1.
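The differentiable bilinear interpolation used in the warping step can be sketched as follows. This is a minimal NumPy version for a single-channel image; the function name is ours, and a real implementation would operate on batched tensors in an autograd framework:

```python
import numpy as np

def bilinear_sample(img, px, py):
    """Sample img (H x W) at continuous coordinates (px, py).

    Each output value is a weighted average of the four surrounding
    pixels; because the weights vary smoothly with (px, py), the
    operation is differentiable, which is what makes warping-based
    view synthesis trainable end to end.
    """
    h, w = img.shape
    x0 = np.clip(np.floor(px).astype(int), 0, w - 2)
    y0 = np.clip(np.floor(py).astype(int), 0, h - 2)
    wx = np.clip(px, 0, w - 1) - x0   # horizontal blend weight
    wy = np.clip(py, 0, h - 1) - y0   # vertical blend weight
    top = (1 - wx) * img[y0, x0] + wx * img[y0, x0 + 1]
    bot = (1 - wx) * img[y0 + 1, x0] + wx * img[y0 + 1, x0 + 1]
    return (1 - wy) * top + wy * bot
```

Sampling at integer coordinates returns the original pixel values; sampling halfway between two pixels returns their average.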
The total loss function of the target network is:

L = L_R + L_s + L_U,

where L_R represents the photometric loss, L_s represents the smoothness loss, and L_U represents the uncertainty loss of the neural network.
Sensors 2020, 20, x FOR PEER REVIEW 3 of 11

Photometric Loss
The basic theory of the Retinex algorithm is shown in Figure 2. The Retinex model decomposes an observed image I(x, y) into an incident (illumination) component L(x, y) and a reflected (reflectance) component R(x, y):

I(x, y) = R(x, y) · L(x, y).

The incident light directly determines the dynamic range that the pixels in the image can reach, and the reflected light represents the reflective nature of the objects in the scene.
A moving object mainly changes the incident component L(x, y), whereas the reflectance R(x, y) is an intrinsic property of the scene and is largely unaffected. Therefore, the network can be supervised from the R(x, y) direction as a loss function to avoid the interference problem of dynamic objects.
The single-scale Retinex algorithm is often used for image enhancement; here, we apply it to construct the monocular depth estimation loss function. Its main principle is to convolve the three channels of the image with a center-surround function, and to treat the convolved image as an estimate of the illumination component of the original image. The process of solving the incident component with a low-pass filter through a convolution operation can be expressed as:

L(x, y) = I(x, y) * G(x, y).

From a mathematical perspective, solving R(x, y) is an ill-posed problem that can only be approximated. Assuming that the illumination image is spatially smooth, the reflectance R(x, y) can be obtained according to the single-scale Retinex algorithm:

R_i(x, y) = log I_i(x, y) − log[G(x, y) * I_i(x, y)],

where i indexes the color channel, R_i(x, y) represents the pixel value of the reflectance image in channel i, I_i(x, y) represents the pixel value of the original image I(x, y) in channel i, * represents the convolution operation, and G(x, y) represents the Gaussian surround function:

G(x, y) = λ exp(−(x² + y²) / σ²),

where σ represents the standard deviation of the Gaussian function (called the scale parameter here) and λ is a normalization constant chosen so that G(x, y) integrates to 1. The choice of σ greatly affects the behavior of the Retinex algorithm.
In summary, the photometric loss function follows from the two expressions above:

L_R = (1/N) Σ_p |R_t(p) − R*_t(p)|,

where N represents the number of pixels in the image, and R_t and R*_t are the reflectance images of the target frame I_t and the synthesized frame I*_t, respectively.
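As an illustrative sketch of the single-scale Retinex computation and the resulting photometric comparison (the function names, the small kernel radius, and the default σ are our own choices, not values from the paper):

```python
import numpy as np

def gaussian_kernel(sigma, radius):
    """Gaussian surround function G(x, y), normalized to sum to 1."""
    ax = np.arange(-radius, radius + 1)
    xx, yy = np.meshgrid(ax, ax)
    g = np.exp(-(xx ** 2 + yy ** 2) / sigma ** 2)
    return g / g.sum()  # normalization plays the role of lambda

def single_scale_retinex(channel, sigma=2.0, radius=4, eps=1e-6):
    """R(x, y) = log I(x, y) - log[(G * I)(x, y)] for one channel."""
    k = gaussian_kernel(sigma, radius)
    h, w = channel.shape
    padded = np.pad(channel, radius, mode="edge")
    illum = np.empty_like(channel, dtype=float)
    for y in range(h):                  # direct 2D convolution
        for x in range(w):
            illum[y, x] = (padded[y:y + 2 * radius + 1,
                                  x:x + 2 * radius + 1] * k).sum()
    return np.log(channel + eps) - np.log(illum + eps)

def retinex_photometric_loss(img_a, img_b, sigma=2.0):
    """Mean absolute difference between reflectance images."""
    return np.mean(np.abs(single_scale_retinex(img_a, sigma)
                          - single_scale_retinex(img_b, sigma)))
```

For a spatially uniform image the reflectance is zero everywhere, so a global brightness change between two frames contributes almost nothing to this loss, which is the robustness property the section argues for.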

Smoothness Loss
To regularize the estimated depth map, as in existing work, a smoothness loss needs to be added. We adopt the edge-aware smoothness loss used in [24], which is formulated as:

L_s = Σ_p |∇D_t(p)| · exp(−|∇I_t(p)|),

where ∇ is the first derivative along the spatial directions, which ensures that smoothness is guided by the edges of the image.
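A minimal sketch of this edge-aware term for a single-channel image (the function name is ours):

```python
import numpy as np

def smoothness_loss(depth, image):
    """Edge-aware smoothness: depth gradients are penalized, but the
    penalty is down-weighted where the image itself has strong
    gradients, so depth discontinuities are allowed at image edges."""
    dx_d = np.abs(np.diff(depth, axis=1))
    dy_d = np.abs(np.diff(depth, axis=0))
    dx_i = np.abs(np.diff(image, axis=1))
    dy_i = np.abs(np.diff(image, axis=0))
    return (dx_d * np.exp(-dx_i)).mean() + (dy_d * np.exp(-dy_i)).mean()
```

A depth ramp aligned with a strong image edge is penalized far less than the same ramp over a textureless image region.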

Uncertainty Analysis
The uncertainty of neural networks is generally divided into two categories: model uncertainty and random uncertainty. Model uncertainty mainly refers to the uncertainty of model parameters. When there are multiple models with good results, the final model parameters need to be selected from them. When the amount of input data is large enough, the model uncertainty is very low. In this paper, the training data were large enough, so the model uncertainty was not considered.
Sensor noise and motion noise may cause the observation data to be inaccurate, resulting in random uncertainty. These observation noises cannot be eliminated by large-scale data training. We assume that the data follow a Gaussian distribution when modeling random uncertainty, with the likelihood function shown in Equation (10):

p(D | D*) = (1 / √(2πσ²)) exp(−(D − D*)² / (2σ²)),   (10)

where D represents the depth observation data, D* represents the depth output by the model, and σ² represents the noise variance. Taking the logarithm of both sides of Equation (10) and negating it gives the negative log-likelihood:

−log p(D | D*) = (D − D*)² / (2σ²) + (1/2) log σ² + const.

Heteroscedastic random uncertainty assumes that the noise variance varies with the input; for example, uncertainty is usually higher at object edges and in distant scenes, while other positions are more reliable. Therefore, the objective function of learning is:

L = (1/N) Σ_i [(D_t(i) − D*_t(i))² / (2σ_i²) + (1/2) log σ_i²],

where N represents the number of pixels, D_t(i) and D*_t(i) represent the depth values of the depth maps, and σ_i² represents the variance output at the end of the network.
Depth estimation is a regression task. The most common loss functions for regression are the L2 and L1 losses. The squaring operation makes the L2 loss sensitive to outliers: it optimizes large prediction errors well but is poor at further reducing small prediction errors. The L1 loss optimizes small prediction errors better, while its effect on large errors is moderate, and in actual training the L1 loss performs slightly better. The uncertainty loss function proposed in this paper therefore combines the L1 loss with heteroscedastic random uncertainty. In addition, the linear growth of the L1 loss makes it insensitive to large noise, suppressing adverse effects. The objective function of uncertainty can be expressed as Equation (14):

L_U = (1/N) Σ_i [|D_t(i) − D*_t(i)| / σ_i² + log σ_i²].   (14)

To avoid the denominator being zero and to give the loss function better numerical stability, the uncertainty loss function is transformed into:

L_U = (1/N) Σ_i [exp(−W_i) |D_t(i) − D*_t(i)| + W_i],

where σ_i² still represents the variance output at the end of the network, W_i represents the value log σ_i², and i represents the pixel index.
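The transformation from Equation (14) to the stable form can be sketched as follows (function names are our own; the two forms agree exactly when W_i = log σ_i²):

```python
import numpy as np

def uncertainty_loss_raw(d, d_star, sigma2):
    """Equation (14): L1 residual scaled by the predicted variance."""
    return np.mean(np.abs(d - d_star) / sigma2 + np.log(sigma2))

def uncertainty_loss(d, d_star, w):
    """Stable form with w = log(sigma^2): exp(-w) replaces 1/sigma^2,
    so a variance prediction near zero cannot cause a division by
    zero or an unbounded gradient from the log term."""
    return np.mean(np.exp(-w) * np.abs(d - d_star) + w)
```

In practice the network would output W_i directly, and σ_i² is recovered as exp(W_i) only when the confidence itself needs to be reported.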

Network Architecture
For the depth network, we experimented with DispNet [14], which takes a single RGB image as input and outputs a depth map. For the PoseNet network, we used a network without a mask prediction branch [14]. Training the network with the total loss function proposed in this paper produces relatively ideal results.

Evaluation Index
To objectively evaluate the proposed monocular depth estimation model, this paper uses the following evaluation criteria to quantify the model:

Average relative error (Rel): (1/N) Σ |d_gt − d_p| / d_gt.

Root mean squared error (RMSE): √((1/N) Σ (d_gt − d_p)²).

Average log10 error (log10): (1/N) Σ |log10 d_gt − log10 d_p|.

Accuracy with threshold thr: percentage (%) of pixels satisfying max(d_gt/d_p, d_p/d_gt) = δ < thr.

Here, d_gt and d_p are the ground-truth and predicted depths of pixels, respectively, and N is the total number of pixels in all the evaluated images.
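The criteria above can be sketched as follows (the function name is ours, and the thresholds follow the common choice thr = 1.25, 1.25², 1.25³ used in the depth estimation literature):

```python
import numpy as np

def depth_metrics(d_gt, d_p):
    """Standard monocular-depth evaluation over all valid pixels."""
    rel = np.mean(np.abs(d_gt - d_p) / d_gt)
    rmse = np.sqrt(np.mean((d_gt - d_p) ** 2))
    log10 = np.mean(np.abs(np.log10(d_gt) - np.log10(d_p)))
    delta = np.maximum(d_gt / d_p, d_p / d_gt)   # per-pixel ratio
    acc = {thr: float(np.mean(delta < thr))
           for thr in (1.25, 1.25 ** 2, 1.25 ** 3)}
    return rel, rmse, log10, acc
```

A perfect prediction gives zero error and 100% threshold accuracy; a uniform scale error leaves rel unchanged at each pixel and shifts all δ ratios by the same factor.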

Comparisons with the State-of-the-Art Methods
We evaluated the proposed model on the KITTI dataset [25]. Figure 3 shows the results obtained: the pixels around moving objects are not excessively deviated, and the depth of pixels around moving objects is not blurred. Table 1 provides the comparison between the results of this paper and other algorithms.
The experimental results show that the algorithm proposed in this paper is as effective as state-of-the-art algorithms. Our algorithm is slightly inferior to [16] in terms of SqRel and RMSElog; however, reference [16] used a combination of supervised and unsupervised methods, relying on true depth labels. This also shows that the unsupervised learning method in this paper can approach the accuracy of supervised learning. To better demonstrate the effectiveness of the proposed method, we performed an ablation study, as described in Section 3.5.

Ablation Study
In this section, we verify the two contributions of this paper: the photometric loss and the uncertainty analysis. We used the DispNet network for the ablation study. The input image resolution is 416 × 128 in Table 2 and 832 × 256 in Table 3. The compared methods are: 'Basic', 'Basic + Retinex', 'Basic + Uncertainty', and 'Basic + Retinex + Uncertainty'. Bold entries in Tables 2 and 3 indicate the best results. The results clearly show the overall improvement in monocular depth estimation using the proposed scheme. When the basic network was optimized with Retinex, the error metrics AbsRel, SqRel, and RMS decreased significantly. We attribute this to the reduced error rate of the proposed algorithm in the small regions around objects, which is also the purpose of constructing the loss function L_R. After adding the uncertainty analysis, the overall accuracy of monocular depth estimation increased and the error rate decreased, illustrating the importance of modeling uncertainty to improve prediction accuracy.

Discussion
Unlike a general regression-task loss function, the uncertainty loss function proposed in this paper can not only estimate depth but also provide the confidence of the estimated depth through the predicted variance: the smaller the noise variance, the closer the predicted depth is to the real depth; the larger the noise variance, the greater the deviation between the predicted and real depths. Figure 3 shows a detailed comparison between recent mainstream algorithms and the algorithm in this article. As shown in Figure 4, there is no blurred pulling around objects in the depth estimates of two adjacent frames, indicating that the proposed method effectively handles moving objects in monocular depth estimation. The pulling phenomenon around moving objects is reduced when the network is supervised with the uncertainty analysis. As can be seen from Table 1, there is still room for improvement in accuracy compared with other algorithms.


Conclusions
This paper proposed a monocular depth estimation method based on uncertainty and a photometric loss function based on the Retinex algorithm to supervise the network. The proposed method solves the problem of pixels being pulled around object boundaries due to the presence of moving objects. State-of-the-art performance is achieved on the KITTI dataset. In future work, we will focus on the effectiveness of unsupervised depth estimation in more complex scenarios.
Author Contributions: C.Q. designed the method, performed the experiment, and analyzed the results. C.S. provided overall guidance for the study. F.X. and S.S. reviewed and revised the paper. X.Z. offered crucial suggestions about the experiment and participated in the writing of driver module code and algorithm verification. J.C. put forward the idea and debugged the model in Python. All authors have read and agreed to the published version of the manuscript.