1 Introduction

Autonomous robots are composed of different modules that allow them to perceive, learn, decide and act within their environment. The perception module processes cues that inform the robot about the appearance and geometry of the environment. Particularly when working in outdoor or underground scenarios, these cues must be robust to unseen phenomena. A fully autonomous robot must execute all operations, monitor itself and handle unanticipated events and conditions, such as unexpected objects and debris on the road, unfamiliar environments and adverse weather. Therefore, reliable and robust perception of the surrounding environment is one of the key tasks in autonomous robotics.

Among the main inputs to a perception system are the distances from the robot to multiple points in its environment. This input can be obtained directly from a sensor or estimated by a depth estimation module. Depth estimation can be performed by processing monocular camera images, stereo vision, radar or LiDAR (Light Detection and Ranging) data, among others. Although monocular cameras can only provide depth information up to a scale, they remain an important component of a depth estimation system due to their low price and the rich appearance data they provide. However, while a monocular camera is small, low-cost and energy efficient, it is very sensitive to changes in illumination. Additionally, the accuracy and reliability of depth estimation methods based on monocular images are still far from practical. For instance, state-of-the-art RGB-based depth prediction methods [12, 31, 36] produce an average error (measured by the root-mean-squared error) of over 50 cm in indoor scenarios (e.g. on the NYU-Depth-v2 dataset [61]). Such methods perform even worse outdoors, with at least 4 m of average error on the Make3D and KITTI datasets [18, 58]. 3D LiDAR scanners, on the other hand, provide accurate geometric information about the environment even under changing illumination. However, even though their active nature makes LiDARs robust to dark or overexposed scenes, the generated 3D point cloud is sparse and non-uniformly distributed, which limits its utility for recognition tasks. In general, each type of sensor has its own weaknesses.

To address the fundamental limitations of image-based depth estimation, this paper considers the use of sparse depth measurements, along with RGB data, to reconstruct depth at full resolution. Data fusion techniques have been extensively employed for robust perception systems, where data from different sensors must be fused and aggregated. Although some approaches to robust perception resort to statistical methods for dealing with data outliers [2], the work presented in this paper belongs to the group that tackles the robust perception problem by leveraging the complementary nature of passive and active sensor modalities. Multi-sensor approaches to robotic perception can be categorized according to the level at which the data from the different sensing modalities are fused in order to obtain the estimate of interest. According to [7], data fusion can be performed at the level of symbolic estimates (high-level fusion), at the level of features (medium-level fusion) or at the level of raw data (low-level fusion).

In this paper, a low-level fusion method is developed. It explores the complementary relations between the passive and active sensors at the pixel level [8, 14, 55]. The approaches in [51, 54, 63] follow this intuition but require the fused modalities to have similar coverage densities. Our proposed framework provides a procedure for fusing LiDAR and image data independently of the LiDAR data’s density. The main contribution of this paper is a depth regression model that takes both a sparse set of depth samples and RGB images as the inputs and predicts a full-resolution depth map. This is achieved by modelling the problem of fusing low-resolution depth images with high-resolution camera images as a conditional random field (CRF).

The intuition behind our CRF formulation is that depth discontinuities in a scene often co-occur with changes in colour or brightness in the associated camera image. Since the camera image is commonly available at a much higher resolution, this insight can be used to enhance the resolution and accuracy of the depth image. Our approach produces a depth map from three input features, as illustrated in Fig. 1. The first is an RGB colour image from the camera sensor, top image in Fig. 1 (a). The second is a 2D sparse depth map captured by a LiDAR sensor, middle image in Fig. 1 (b). The third is a surface-normal map generated from the sparse depth samples, bottom image in Fig. 1 (c).

The rest of this paper is organized as follows. Section 2 reviews related work on depth estimation. Section 3 explains how the LiDAR points and the camera images were registered. In Sect. 4, we first introduce our CRF-Fusion framework, and then, we provide a detailed explanation of the proposed model: the energy potentials that compose our CRF model and its inference machine. The experimental validation, performed on the KITTI dataset, is reported in Sect. 5. Finally, conclusions and directions for future work are listed in Sect. 6.

Fig. 1

Input features of our framework. We developed a CRF regression model to predict a dense depth image from a single RGB image and a set of sparse depth samples: (a), (b) and (c) are the input RGB image, the sparse depth samples projected onto the image plane and the projected surface normals, respectively

2 Related work

Depth estimation from monocular images is a long-standing problem in computer vision. Early works on depth estimation using RGB images usually relied on handcrafted features and inference on probabilistic graphical models. Classical methods include shape-from-shading [72] and shape-from-defocus [62]. Other early methods were based on hand-tuned models or assumptions about the orientations of the surfaces [4]. Newer methods treat depth estimation as a machine learning problem, most recently using deep artificial neural networks [13, 68]. For instance, Saxena et al. [57] estimated the absolute scales of different image patches and inferred a depth image using a Markov random field model. Eigen et al. used a multi-scale convolutional network to regress from colour images to depths [12, 13]. Laina et al. used a fully convolutional network architecture based on ResNet [32]. Liu et al. proposed a deep convolutional neural field model that combines deep networks with Markov random fields [38]. Roy et al. combined shallow convolutional networks with regression forests to reduce the need for large training sets [56]. In [69], the proposed attention model is seamlessly integrated with a CRF, allowing end-to-end training of the entire architecture. This approach benefits from a structured attention model which automatically regulates the amount of information transferred between corresponding features at different scales.

The approach of Li et al. [5] combines deep learning features on image patches with hierarchical CRFs defined on a superpixel segmentation of the image. They use pretrained AlexNet [29] features of image patches to predict depth at the centre of the superpixels. A hierarchical CRF refines the depth across individual pixels. Liu et al. [37] also propose a deep structured learning approach that avoids handcrafted features. They presented a deep structured learning scheme which learns the unary and pairwise potentials of a continuous CRF in a unified deep CNN framework. Liu et al. [39] proposed a discrete-continuous CRF model to take into consideration the relations between adjacent superpixels, e.g. occlusions.

Recent work has also shown the benefit of adopting multi-task learning strategies, e.g. for jointly predicting depth and performing semantic segmentation, ego-motion estimation or surface-normal computation [50, 76]. Some recent papers have proposed unsupervised or weakly supervised methods for reconstructing depth maps [21, 30]. With the rapid development of deep neural networks, monocular depth estimation based on deep learning and computer vision techniques has been widely studied and has achieved promising accuracy [75]. However, relying on a single sensor makes the estimates less robust: the aforementioned works explore different network architectures, loss functions and training strategies, but all use only one sensory modality. The architecture proposed in this paper uses two sensory modalities.

Fusing data coming from multiple sensors has the potential to improve the robustness of the depth estimates. Ma et al. [43] use RGB images together with sparse depth information to train a bottleneck network architecture. Compared to imagery-only methods, their approach generates better depth estimation results. Others have investigated depth estimation from colour images augmented with sparse sets of depth measurements using probabilistic graphical models. The techniques described in [3, 11, 71], and [67] are able to fuse the information from both sources to significantly improve the resolution of low-quality and sparse range images. Wang et al. proposed a multi-scale feature fusion method for depth completion [66] using sparse LIDAR data. Ma et al. proposed two methods: a supervised method for depth completion using a ResNet-based architecture and a self-supervised method which uses the sparse LiDAR input along with pose estimates to add additional training information based on depth and photometric losses [42].

Although recent methods have achieved impressive progress in terms of evaluation metrics such as the pixel-wise relative error, most of them neglect the geometric constraints in 3D space. This component is considered in our CRF model, which makes this approach different from previous fusion methods.

Providing strong cues from surface information is relevant for improving the accuracy of depth prediction [33]. Recently, Zhang et al. [73] proposed predicting surface normals and occlusion boundaries with a deep network and used them to aid depth completion in indoor scenes. The works in [53, 70] propose end-to-end deep learning systems that produce dense depth maps from sparse LiDAR data and a colour image taken from outdoor on-road scenes, leveraging surface normals as an intermediate representation. Zhang et al. [74] predicted surface normals by leveraging RGB data, leading to a better prior for depth completion, and ultimately combined these predictions with the sparse depth input to generate a complete depth map. Our method likewise takes advantage of surface normals to improve the performance of the proposed model. The approach in [70], in particular, uses a diffusion layer to refine the completions, and [40] shows that diffusion models can be interpreted as energy-based models (EBMs). CRFs can also be interpreted as energy-based models parametrized as factor graphs. There is therefore a relation between the diffusion and CRF approaches, as both can be viewed as EBMs, albeit with different formulations and assumptions. For the problem of depth estimation, the CRF approach more naturally captures the prior knowledge of the geometric constraints inherent in the physics underlying LiDAR measurements, whereas the diffusion approach relies on a carefully crafted similarity metric, which can make the models unstable and hard to train. In other words, we expect a CRF approach to require less training to achieve performance equivalent to a diffusion-based method.

The problem of fusing LiDAR and image data can be approached as a camera pose estimation problem, where the relation between the 3D LiDAR coordinates and the 2D image coordinates is characterized by camera parameters such as position, orientation and focal length. In [47], an information-theoretic similarity measure was proposed to automatically register 2D optical imagery with 3D LiDAR scans by searching for a suitable camera transformation matrix. The fusion of 3D LiDAR data with stereoscopic images is addressed in [23, 44, 48, 49]. The advantage of stereoscopic depth estimation is its ability to produce dense depth maps of the surroundings by using stereo matching techniques. In [49], for example, a two-stage cascade deep architecture was proposed that first fuses stereo and LiDAR disparity maps and then refines the estimated disparity maps by introducing colour features. In contrast, our method does not rely on a stereo matching algorithm, which tends to be computationally costly.

3 Image and LIDAR point cloud registration

This section gives a brief introduction to the process of aligning an image and a LiDAR point cloud, which allows the projection of LiDAR points onto the image plane. As presented in [19], in a robotic platform equipped with both a LiDAR and a camera, the two sensors are synchronized so that simultaneous scans from both are collected. The camera and LiDAR are cross-calibrated so that the point cloud can be projected onto the image plane [19]. Once projected, the LiDAR points are associated with either pixels or groups of pixels, also called superpixels. This section also briefly describes the Simple Linear Iterative Clustering (SLIC) algorithm, by which these superpixels are obtained.

3.1 Point cloud projection

In order to fuse the image and LiDAR data, it is imperative to find mathematical models that represent the spatial correspondence between pixels and 3D points. These models allow us to project the LiDAR point cloud onto the image plane. The projection of a 3D point \({\textbf{x}}=(x, y, z, 1)^{T}\), given in rectified and rotated camera coordinates, to a point \({\textbf{y}}=(u, v, 1)^{T}\) in the ith camera image is given by

$$\begin{aligned} {\textbf{y}}={\textbf{P}}_{\text {rect}}^{(i)} {\textbf{x}} \end{aligned}$$

with

$$\begin{aligned} {\textbf{P}}_{\text {rect}}^{(i)}=\left( \begin{array}{cccc}{f_{u}^{(i)}} &{} {0} &{} {c_{u}^{(i)}} &{} {-f_{u}^{(i)} b_{x}^{(i)}} \\ {0} &{} {f_{v}^{(i)}} &{} {c_{v}^{(i)}} &{} {0} \\ {0} &{} {0} &{} {1} &{} {0}\end{array}\right) \end{aligned}$$

being the ith projection matrix. Here, \(b_{x}^{(i)}\) denotes the baseline with respect to the reference camera. Note that in order to project a 3D point \({\textbf{x}}\), given in the reference camera coordinates, to a point \({\textbf{y}}\) on the ith image plane, the rectifying rotation matrix of the reference camera, \({\textbf{R}}_{\text {rect}}^{(0)}\), must be considered as well:

$$\begin{aligned} {\textbf{y}}={\textbf{P}}_{\text {rect}}^{(i)} {\textbf{R}}_{\text {rect}}^{(0)} {\textbf{x}} \end{aligned}$$

Here, \({\textbf{R}}_{\text {rect}}^{(0)}\) has been expanded into a \(4\times 4\) matrix by appending a fourth zero row and column and setting \({\textbf{R}}_{\text {rect}}^{(0)}(4,4)=1\). We also need to register the laser scanner with respect to the camera's coordinate system. The rigid-body transformation from LiDAR coordinates to camera coordinates is given by

$$\begin{aligned} {\textbf{T}}_{\text {velo}}^{\text {cam}}=\left( \begin{array}{cc}{{\textbf{R}}_{\text {velo}}^{{\text {cam}}}} &{} {{\textbf{t}}_{\text {velo}}^{\text {cam}}} \\ {0} &{} {1}\end{array}\right) \end{aligned}$$

Finally, a 3D point \({\textbf{x}}\) in the LiDAR coordinate system is projected to a point \({\textbf{y}}\) in the ith camera image:

$$\begin{aligned} {\textbf{y}}={\textbf{P}}_{\text {rect}}^{(i)} {\textbf{R}}_{\text {rect}}^{(0)} {\textbf{T}}_{\text {velo}}^{\text {cam}} {\textbf{x}} \end{aligned}$$

Subsequently, as a preprocessing step, points with a negative z value are removed. The remaining points can then be projected onto the image plane using the full projection matrix:

$$\begin{aligned} \left[ x^{\prime }\; y^{\prime }\; z^{\prime }\right] ^{T}={\textbf{P}}_{\text {rect}}^{(i)} {\textbf{R}}_{\text {rect}}^{(0)} {\textbf{T}}_{\text {velo}}^{\text {cam}} \left[ x_{p}\;y_{p}\; z_{p}\; 1\right] ^{T} \end{aligned}$$

The pixel coordinates of the projected LiDAR points are then obtained by perspective division:

$$\begin{aligned} {[}u, v]=\left[ \frac{x^{\prime }}{z^{\prime }}, \frac{y^{\prime }}{z^{\prime }}\right] \end{aligned}$$
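
As an illustration of this projection chain, the following minimal numpy sketch applies the matrices defined above to a raw LiDAR point cloud. The function name and the assumption that the calibration matrices have already been loaded (e.g. from the KITTI calibration files) are ours; filtering on the camera-frame depth is one reading of the preprocessing step described above.

```python
import numpy as np

def project_lidar_to_image(points_velo, P_rect, R_rect, T_velo_cam):
    """Project Nx3 LiDAR points (LiDAR frame) onto the image plane.

    P_rect:     3x4 rectified projection matrix of the target camera
    R_rect:     4x4 rectifying rotation of the reference camera (padded)
    T_velo_cam: 4x4 rigid-body transform from LiDAR to camera coordinates
    Returns Mx2 pixel coordinates [u, v] and the corresponding depths.
    """
    n = points_velo.shape[0]
    x_h = np.hstack([points_velo, np.ones((n, 1))])      # homogeneous, Nx4

    # Full projection: y = P_rect @ R_rect @ T_velo_cam @ x
    y_h = (P_rect @ R_rect @ T_velo_cam @ x_h.T).T        # Nx3

    # Keep only points in front of the camera (positive depth)
    in_front = y_h[:, 2] > 0
    y_h = y_h[in_front]

    # Perspective division: [u, v] = [x'/z', y'/z']
    uv = y_h[:, :2] / y_h[:, 2:3]
    depth = y_h[:, 2]
    return uv, depth
```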

Figure 2 portrays the projection of a point cloud onto the image plane.

Fig. 2

Given an input image (a) and its corresponding depth, where dark blue indicates closer distances, our work focuses on densifying the input sparse depth. We combine the superpixel segmentation (b) with the sparse input depth to obtain our initial superpixel depth. We use a CRF to enhance the resolution and accuracy of the depth image, considering information from the four neighbours of each superpixel. In (d), we can see the significant improvement over the sparse input depth. Each pixel inside a superpixel has the same depth value

3.2 Superpixel segmentation using Simple Linear Iterative Clustering (SLIC)

Considering superpixels rather than individual pixels greatly reduces the complexity of the subsequent image processing tasks. In order to harness the full potential of superpixels, their computation must be fast, easy to use, and produce high-quality segmentations. Superpixel segmentation algorithms provide an effective way to extract information about local image features. Our framework uses an implementation of the SLIC algorithm to group sparse depth measurements which have previously been projected onto the image plane. SLIC is a simple and parallelizable pixel clustering method, based on the k-means algorithm, which decomposes an image into a regular grid of visually homogeneous regions, or so-called superpixels [1]. As a result, SLIC superpixels provide a regular grouping of image pixels according to their distance both spatially and in colour space. We use this superpixel segmentation to assign depth values from the sparse point cloud to all of the pixels within each superpixel. Applying SLIC to the original image provides a segmentation with as many segments as the number of superpixels set as a hyperparameter. Each segment is identified by an ID that allows individual pixels to be assigned to superpixel segments. Additionally, the coordinates of each segment's centroid and those of its neighbours are output by the super-segmentation step. Thanks to the superpixel segmentation, the proposed CRF contains only a small number of nodes, one per segment, rather than the extremely large number of nodes that the individual pixels of a high-definition image would require. As provided by the super-segmentation step, the number of neighbours of a segment may vary, depending on the spatial homogeneity of the colour in the image and the expected number of segments. After segmentation, the grid-like structure of the original image is lost, whereas the grid structure of a CRF requires that each node, and therefore each segment, be associated with only four neighbours. Thus, it is necessary to find the four nearest neighbours of each segment.
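
The segmentation and depth-aggregation step can be sketched as follows with scikit-image's SLIC implementation. Taking the mean of the LiDAR depths falling inside each superpixel is one plausible aggregation choice, and the function name is illustrative.

```python
import numpy as np
from skimage.segmentation import slic

def superpixel_depth(rgb, sparse_depth, n_segments=5500):
    """Segment the image with SLIC and assign each superpixel the mean of the
    sparse LiDAR depths that project into it (0 where no point falls inside).

    rgb:          HxWx3 colour image
    sparse_depth: HxW array, 0 where no LiDAR point was projected
    Returns (segments, depth_per_segment, centroids).
    """
    segments = slic(rgb, n_segments=n_segments, compactness=10, start_label=0)
    n_sp = segments.max() + 1

    depth_per_segment = np.zeros(n_sp)
    centroids = np.zeros((n_sp, 2))
    for s in range(n_sp):
        mask = segments == s
        d = sparse_depth[mask]
        d = d[d > 0]                              # keep valid measurements only
        if d.size > 0:
            depth_per_segment[s] = d.mean()       # aggregate LiDAR observations
        ys, xs = np.nonzero(mask)
        centroids[s] = [ys.mean(), xs.mean()]     # segment centroid (row, col)
    return segments, depth_per_segment, centroids
```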

In order to determine which superpixels are the four closest neighbours, the angles between the centroid of each segment and the centroids of its neighbours are calculated. We select as neighbours the four segments whose angles differ least from 0, 90, 180 and 270 degrees, respectively. For two example nodes, depicted as red points, Fig. 3 illustrates their closest neighbours, whose centroids are represented by yellow dots.

Fig. 3

Two examples of superpixels and their neighbours. The red dots in A and B are nodes assigned to two different superpixels, while the green lines represent their corresponding nearest neighbours. Note that the super-segmentation used allows a superpixel node to have more than four neighbours

Figure 4 shows the four closest neighbours selected, represented by dark green lines and dark yellow dots. Note that for superpixels located at the corners or on the edges, only the two or three closest neighbours, respectively, need to be found.
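
A minimal sketch of the angle-based selection described above. The adjacency structure `neighbours` is assumed to come from the super-segmentation step, and the angle convention in the image plane is illustrative.

```python
import numpy as np

def four_nearest_neighbours(centroids, neighbours):
    """For each superpixel, pick at most four neighbours whose centroid
    directions are closest to 0, 90, 180 and 270 degrees.

    centroids:  Ns x 2 array of (row, col) superpixel centroids
    neighbours: dict {segment_id: iterable of adjacent segment_ids}
    Returns dict {segment_id: list of selected neighbour ids}.
    """
    selected = {}
    targets = np.array([0.0, 90.0, 180.0, 270.0])
    for i, nbrs in neighbours.items():
        nbrs = list(nbrs)
        if not nbrs:
            selected[i] = []
            continue
        # Angle of each neighbour centroid relative to the current centroid
        d = centroids[nbrs] - centroids[i]
        angles = np.degrees(np.arctan2(d[:, 0], d[:, 1])) % 360.0
        picks = []
        for t in targets:
            diff = np.abs((angles - t + 180.0) % 360.0 - 180.0)  # circular distance
            j = int(np.argmin(diff))
            if nbrs[j] not in picks:               # corners/edges may yield < 4
                picks.append(nbrs[j])
        selected[i] = picks
    return selected
```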

Fig. 4

Selected 4 nearest neighbours for superpixel nodes (red dots) in Fig. 3. Dark green lines connect nodes with their selected neighbours

4 CRF-based camera–LIDAR fusion for depth estimation

In this paper, depth estimation is formulated as a superpixel-level inference task on a modified conditional random field (CRF). Our proposed model is a multi-sensor extension of the classical pairwise CRF. In this section, we first briefly introduce the CRF model. Then, we show how to fuse the information of an image and a sparse LIDAR point cloud with our novel CRF framework.

4.1 Overview

The conditional random field (CRF) is a type of undirected probabilistic graphical model which is widely used for solving labelling problems. Formally, let \({\textbf{X}}=\left\{ X_{1}, X_{2}, \ldots , X_{N}\right\} \) be the set of random variables to be inferred from an observation or input tensor \({\textbf{Y}}\), which in turn is composed of the observation variables \(c_{i}\) and \(y_{i}\), where i is an index over superpixels. For each superpixel i, the variable \(c_{i}\) corresponds to an observed three-dimensional colour value and \(y_{i}\) is an observed range measurement. The goal of our framework is to infer the depth of each pixel in a single image depicting general scenes. Following the work of [11, 45], we make the common assumption that an image is composed of small homogeneous regions (superpixels) and consider a graphical model composed of nodes defined on superpixels. Note that our framework is flexible and can estimate depth values on either pixels or superpixels. The remaining question is how to parametrize this undirected graph. Because the interaction between adjacent nodes in the graph is not directed, there is no reason to use a standard conditional probability distribution (CPD), in which one represents the distribution over one node given the others. Rather, we need a more symmetric parametrization. Intuitively, we want our model to capture the affinities between the depth estimates of the superpixels in a given neighbourhood. These affinities can be captured as follows: Let \({\tilde{P}}(X,Y)\) be an unnormalized Gibbs joint distribution parametrized as a product of factors \(\Phi \), where

$$\begin{aligned} \Phi =\left\{ \phi _{1}\left( D_{1}\right) , \ldots , \phi _{k}\left( D_{k}\right) \right\} , \end{aligned}$$

and

$$\begin{aligned} {\tilde{P}}(X,Y)=\prod _{i=1}^{k} \phi _{i}\left( D_{i}\right) . \end{aligned}$$

We can then write a conditional probability distribution of the depth estimates X given the observations Y using the previously introduced Gibbs distribution, as follows:

$$\begin{aligned} Pr(X | Y)=\frac{{\tilde{P}}(X, Y)}{Z(Y)} \end{aligned}$$

where

$$\begin{aligned} Z(Y)=\sum _{X} {\tilde{P}}(X, Y). \end{aligned}$$

Here, Z(Y), also known as ‘the partition function’, works as a normalizing factor which marginalizes X from \({\tilde{P}}(X,Y)\), allowing the calculation of the probability distribution P(X|Y):

$$\begin{aligned} P(X | Y)=\frac{1}{\sum _{X} {\tilde{P}}(X, Y)} {\tilde{P}}(X,Y). \end{aligned}$$

Therefore, similar to conventional CRFs, we model the conditional probability distribution of the data with the following density function:

$$\begin{aligned} {\text {P}}({\textbf{X}} | {\textbf{Y}})=\frac{1}{\textrm{Z}({\textbf{Y}})} \exp (-E({\textbf{X}}, {\textbf{Y}})) \end{aligned}$$

where E is the energy function and Z is the partition function defined by

$$\begin{aligned} \textrm{Z}({\textbf{Y}})=\int _{{\textbf{X}}} \exp \{-E({\textbf{X}}, {\textbf{Y}})\}\, \textrm{d}{\textbf{X}}. \end{aligned}$$

Since the depth variables are continuous and the energy defined below is quadratic in them, this integral can be evaluated analytically; in the discrete case, approximation methods would need to be applied. To predict the depths of a new image, we solve the following maximum a posteriori (MAP) inference problem:

Fig. 5

Illustration of the proposed model. On the top left is a fused view of the image and LiDAR point cloud on superpixels. On the top right are the surface-normal map and RGB inputs used in the pairwise potentials. In the top middle is the graph structure of the CRF: the yellow nodes represent the centroids of the image superpixels, and the green branches represent the connections between them. The outputs of the unary part and the pairwise part are then fed to the CRF structured loss layer, which minimizes the corresponding energy function. On the bottom left are the probabilistic outputs: a dense depth map and an uncertainty estimation map (see text for details)

To simplify the solution, one can take the negative logarithm of both sides of the expression for the probability distribution Pr(X|Y); the problem of maximizing the conditional probability then becomes an energy minimization problem. Maximizing \({\text {Pr}}({\textbf{X}} | {\textbf{Y}})\) is therefore equivalent to minimizing the corresponding energy function:

$$\begin{aligned} {\textbf{x}}^{\star }=\arg \min _{{\textbf{x}}} E({\textbf{X}}, {\textbf{Y}}). \end{aligned}$$

We formulate the energy function as a typical combination of unary potentials U and pairwise potentials V over the nodes (superpixels) \({\mathcal {N}}\) and edges \({\mathcal {S}}\) of the graph:

$$\begin{aligned} E({\textbf{X}}, {\textbf{Y}})=\sum _{p \in {\mathcal {N}}} U\left( x_{p}, {\textbf{y}}\right) +\sum _{(p, q) \in {\mathcal {S}}} V\left( x_{p}, x_{q}, {\textbf{y}}\right) \end{aligned}$$

The unary term U aims to regress the depth value from a single superpixel. The pairwise term V encourages neighbouring superpixels with similar appearances to take similar depths [11, 23].

4.2 Potential functions

The proposed multi-modal depth estimation model is composed of unary and pairwise potentials. For an input image, which has been over-segmented into n superpixels, we define a unary potential for each superpixel. The pairwise potentials are defined over the four-neighbour vicinity of each superpixel. The unary potentials are built by aggregating all LiDAR observations inside each superpixel. The pairwise part is composed of similarity vectors, each with K components, that measure the agreement between different features of neighbouring superpixel pairs. Therefore, we explicitly model the relations between neighbouring superpixels through pairwise potentials. In the following, we describe the details of the potentials involved in our energy function.

4.2.1 Unary potential

The unary potential is constructed from the LiDAR sensor measurements by considering the least square loss between the estimated \(x_{i}\) and observed \(y_{i}\) depth values:

$$\begin{aligned} \Phi ({\textbf{x}}, {\textbf{y}})= & {} \sum _{i \in {\mathscr {L}}} \sigma _{i}\left( x_{i}-y_{i}\right) ^{2}\\ \Phi ({\textbf{x}},{\textbf{y}})= & {} \Vert {\textbf{W}} ({\textbf{x}}-{\textbf{y}})\Vert ^{2} \end{aligned}$$

where \({\mathscr {L}}\) is the set of indices for which a depth measurement is available, and \(\sigma _{i}\) is a constant weight placed on the depth measurements. This potential measures the quadratic distance between the estimated range X and the measured range Y, where available. Finally, in order to write the unary potential in a more efficient matrix form, we define the diagonal matrix W with entries

$$\begin{aligned} {\textbf{W}}_{i, i}= \left\{ \begin{array}{ll} {\sigma _{i}} &{} \quad {\text {if } i \in {\mathscr {L}}} \\ {0} &{} \quad {\text {otherwise }} \end{array}\right. \end{aligned}$$
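
A minimal sketch of how the diagonal matrix W and the unary term could be assembled, assuming the observed superpixel depth vector y contains zeros where no LiDAR measurement is available and a single constant weight on measured nodes (both assumptions of this sketch).

```python
import numpy as np
from scipy import sparse

def unary_weight_matrix(y, sigma=1.0):
    """Diagonal matrix W for the unary potential ||W(x - y)||^2.

    y: length-N vector of observed superpixel depths, 0 where unobserved.
    """
    w = np.where(y > 0, sigma, 0.0)   # sigma_i on measured nodes, 0 otherwise
    return sparse.diags(w)

def unary_energy(x, y, W):
    r = W @ (x - y)
    return float(r @ r)               # quadratic distance where measurements exist
```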

4.2.2 Colour pairwise potential

We construct a pairwise potential from K types of similarity observations, each of which enforces smoothness by exploiting colour consistency features of the neighbouring superpixels. This pairwise potential can be written as

$$\begin{aligned} \Psi ^{c}({\textbf{x}}, {\textbf{I}})= & {} \sum _{i} \sum _{j \in {\mathscr {N}}(i)} e_{i, j}\left( x_{i}-x_{j}\right) ^{2}\\ \Psi ^{c}({\textbf{x}}, {\textbf{I}})= & {} \Vert {\textbf{S}} {\textbf{x}}\Vert ^{2} \end{aligned}$$

where I is an RGB image, \({\mathscr {N}}(i)\) is the set of horizontal and vertical neighbours of i, and each row of S represents the weighting factors for pairs of adjacent range nodes. As the edge strength between nodes, we use an exponentiated \(L_{2}\) norm of the difference in pixel appearance.

$$\begin{aligned} e_{i, j}=\exp \left( -\frac{\left\| {\textbf{c}}_{i} -{\textbf{c}}_{j}\right\| ^{2}}{\sigma _{d}^{2}}\right) \end{aligned}$$

where \({\textbf{c}}_{i}\) is the RGB colour vector of pixel i and \(\sigma _{d}\) is a tuning parameter. A small value of \(\sigma _{d}\) increases the sensitivity to changes in the image. Thanks to this potential, the lack of content or features in the RGB image is considered by our model as indicative of a homogeneous depth distribution, in other words, a planar surface.
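
One possible construction of the sparse matrix S is shown below: one row per pair of neighbouring superpixels, scaled by the square root of \(e_{i,j}\), so that \(\Vert {\textbf{S}}{\textbf{x}}\Vert ^2\) reproduces the weighted sum above. Using per-superpixel mean colours and the particular value of sigma_d are assumptions of this sketch.

```python
import numpy as np
from scipy import sparse

def colour_edge_matrix(colours, edges, sigma_d=0.1):
    """Sparse matrix S with one row per edge, so that ||S x||^2 equals
    sum_{(i,j)} e_ij (x_i - x_j)^2 with e_ij = exp(-||c_i - c_j||^2 / sigma_d^2).

    colours: Ns x 3 mean RGB per superpixel (scaled to [0, 1])
    edges:   list of (i, j) index pairs of neighbouring superpixels
    """
    rows, cols, vals = [], [], []
    for k, (i, j) in enumerate(edges):
        e_ij = np.exp(-np.sum((colours[i] - colours[j]) ** 2) / sigma_d ** 2)
        w = np.sqrt(e_ij)
        rows += [k, k]
        cols += [i, j]
        vals += [w, -w]
    n_sp = colours.shape[0]
    return sparse.csr_matrix((vals, (rows, cols)), shape=(len(edges), n_sp))
```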

4.2.3 Surface-normal pairwise potential

The mathematical formulation of this potential is similar to the previous colour potential. However, the surface-normal potential considers surface-normal similarities instead of colour. The weighting factors \(nr_{i, j}\) are in this case formulated using the cosine similarity, a measure of the similarity between two nonzero vectors of an inner product space based on the cosine of the angle between them. The cosine of 0 is 1, and it is less than 1 for any other angle in the interval \((0, \pi ]\) radians; it is thus a measure of orientation rather than magnitude [60]. The cosine of the angle between two nonzero vectors can be found using the Euclidean dot product formula:

$$\begin{aligned} {\textbf{A}} \cdot {\textbf{B}}=\Vert {\textbf{A}}\Vert \Vert {\textbf{B}}\Vert \cos \theta \end{aligned}$$

Therefore, the cosine similarity can be expressed by

$$\begin{aligned} \cos (\theta )=\frac{{\textbf{A}} \cdot {\textbf{B}}}{\Vert {\textbf{A}}\Vert \Vert {\textbf{B}}\Vert } = \frac{\sum _{i=1}^{n} A_{i} B_{i}}{\sqrt{\sum _{i=1}^{n} A_{i}^{2}} \sqrt{\sum _{i=1}^{n} B_{i}^{2}}} \end{aligned}$$

where \(A_i\) and \(B_i\) are the components of vectors A and B, respectively. Finally, we define our surface-normal potential by the following equations.

$$\begin{aligned} \Psi ^{n}({\textbf{x}}, \textbf{In})= & {} \sum _{i} \sum _{j \in {\mathscr {N}}(i)} nr_{i, j}\left( x_{i}-x_{j}\right) ^{2}\\ \Psi ^{n}({\textbf{x}}, \textbf{In})= & {} \Vert {\textbf{P}} {\textbf{x}}\Vert ^{2}\\ nr_{i, j}= & {} \frac{\sum _{t=1}^{n} In_{i,t}\, In_{j,t}}{\sqrt{\sum _{t=1}^{n} In_{i,t}^{2}} \sqrt{\sum _{t=1}^{n} In_{j,t}^{2}}} \end{aligned}$$
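
Here, \(In_{i,t}\) denotes the t-th component of the (mean) surface normal of superpixel i. An analogous edge matrix P can be built from these cosine similarities; clamping negative similarities to zero, so that near-anti-parallel normals simply contribute no smoothing, is our assumption rather than something specified above.

```python
import numpy as np
from scipy import sparse

def normal_edge_matrix(normals, edges):
    """Edge matrix P analogous to S, weighted by the cosine similarity
    nr_ij between the (mean) surface normals of neighbouring superpixels.

    normals: Ns x 3 array of per-superpixel surface normals
    edges:   list of (i, j) index pairs of neighbouring superpixels
    """
    rows, cols, vals = [], [], []
    for k, (i, j) in enumerate(edges):
        a, b = normals[i], normals[j]
        nr_ij = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
        w = np.sqrt(max(nr_ij, 0.0))   # assumption: negative similarities clamped to 0
        rows += [k, k]
        cols += [i, j]
        vals += [w, -w]
    return sparse.csr_matrix((vals, (rows, cols)),
                             shape=(len(edges), normals.shape[0]))
```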

4.2.4 Depth pairwise potential

This pairwise potential encodes a smoothness prior over depth estimates which encourages neighbouring superpixels in the image to have similar depths. Usually, pairwise potentials are only related to the colour difference between pairs of superpixels. However, depth smoothness is a valid hypothesis which can potentially enhance depth inference. To enforce depth smoothness, a distance-aware Potts model was adopted. Neighbouring points with smaller distances are considered to be more likely to have the same depth. The mathematical formulation of this potential is similar to the colour pairwise potential, as it follows the Potts model:

$$\begin{aligned} \Psi ^{d}({\textbf{x}}, {\textbf{D}})= & {} \sum _{i} \sum _{j \in {\mathscr {N}}(i)} dp_{i, j}\left( x_{i}-x_{j}\right) ^{2}\\ \Psi ^{d}({\textbf{x}}, {\textbf{D}})= & {} \Vert {\textbf{D}} {\textbf{x}}\Vert ^{2} \end{aligned}$$

and the weighting factor \(dp_{i, j}\) for this case is formulated as

$$\begin{aligned} dp_{i, j}=\exp \left( -\frac{\left\| {\textbf{p}}_{i} -{\textbf{p}}_{j}\right\| ^{2}}{\sigma _{p}^{2}}\right) \end{aligned}$$

where \({\textbf{p}}_{i}\) is the 3D location vector of the LiDAR point i and \(\sigma _{p}\) is a parameter controlling the strength of enforcing close points to have similar depth values.
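
The depth-smoothness edge matrix D follows the same pattern, with the distance-aware weights \(dp_{i,j}\) computed from a representative 3D LiDAR point of each superpixel; the choice of representative point and the value of sigma_p are illustrative assumptions of this sketch.

```python
import numpy as np
from scipy import sparse

def depth_edge_matrix(points3d, edges, sigma_p=1.0):
    """Edge matrix D for the depth-smoothness potential, with weights
    dp_ij = exp(-||p_i - p_j||^2 / sigma_p^2) between neighbouring superpixels.

    points3d: Ns x 3 representative 3D LiDAR point per superpixel
    edges:    list of (i, j) index pairs of neighbouring superpixels
    """
    rows, cols, vals = [], [], []
    for k, (i, j) in enumerate(edges):
        dp_ij = np.exp(-np.sum((points3d[i] - points3d[j]) ** 2) / sigma_p ** 2)
        w = np.sqrt(dp_ij)
        rows += [k, k]
        cols += [i, j]
        vals += [w, -w]
    return sparse.csr_matrix((vals, (rows, cols)),
                             shape=(len(edges), points3d.shape[0]))
```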

4.2.5 Uncertainty potential

Depth uncertainty estimation is important for refining depth estimates [16, 65] and in safety-critical systems [28]. It allows an agent to identify unknowns in its environment in order to reach optimal decisions. Our method provides uncertainties for the pixel-wise depth estimates by taking into account the number of LiDAR points present in each superpixel. The uncertainty potential is similar to the unary potential: it is constructed from the number of LiDAR points projected onto a superpixel and employs the following least-squares loss:

$$\begin{aligned} U^{c}({\textbf{x}}, {\textbf{y}})= & {} \sum _{i \in {\mathscr {L}}} \sigma _{i}\left( x_{i}-unc_{i}\right) ^{2}\\ U^{c}({\textbf{x}}, {\textbf{y}})= & {} \Vert {\textbf{W}} ({\textbf{x}}-\textbf{unc})\Vert ^{2} \end{aligned}$$

where \(\textbf{unc}\) is defined as follows:

$$\begin{aligned} \textbf{unc}_{i, i}= \left\{ \begin{array}{ll} {\sigma _{i}} &{} \quad {\text {if no LiDAR point is projected onto superpixel } i}\\ {\psi _{i}}&{} \quad {\text {if exactly one LiDAR point is projected onto superpixel } i}\\ {\text {mean}} &{} \quad {\text {otherwise}}\end{array}\right. \end{aligned}$$

In locations with accurate and sufficiently many LiDAR points, the model produces depth predictions with high confidence. This uncertainty estimate indicates how confident the model is about the depth estimation, which results in better overall performance, since estimates with high uncertainty can be disregarded by the higher-level tasks that use the estimated depth maps as input.
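
A sketch of the piecewise uncertainty target defined above. The constants sigma and psi are illustrative, and reading the 'mean' branch as the mean of the depths projected onto the superpixel is our interpretation.

```python
import numpy as np

def uncertainty_vector(point_counts, depth_means, sigma=1.0, psi=0.5):
    """Piecewise uncertainty target per superpixel: sigma if no LiDAR point
    falls in the superpixel, psi if exactly one does, and the mean observed
    depth otherwise (sigma and psi are illustrative constants)."""
    unc = np.empty_like(depth_means, dtype=float)
    for i, n in enumerate(point_counts):
        if n == 0:
            unc[i] = sigma
        elif n < 2:                  # exactly one projected point
            unc[i] = psi
        else:
            unc[i] = depth_means[i]  # assumption: mean of the projected depths
    return unc
```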

4.3 Optimization

With the unary and the pairwise potentials defined, we can now write the energy function as

$$\begin{aligned} E({\textbf{X}}, {\textbf{Y}})= \alpha \, \Phi ({\textbf{x}}, {\textbf{y}})+\beta \, \Psi ^{c}({\textbf{x}}, {\textbf{I}}) +\gamma \, \Psi ^{n}({\textbf{x}}, \textbf{In})+\delta \, \Psi ^{d}({\textbf{x}}, {\textbf{D}}) \end{aligned}$$
(1)

The scalars \(\alpha \), \(\beta \), \(\gamma \), \(\delta \) \(\in \) [0,1] are weightings for the four terms. We may further expand the unary and pairwise potentials to

$$\begin{aligned}{} & {} \Phi ({\textbf{x}}, {\textbf{y}})= \alpha ({\textbf{x}}^{\textrm{T}} {\textbf{W}}^{\textrm{T}} {\textbf{W}} {\textbf{x}}-2 {\textbf{y}}^{\textrm{T}} {\textbf{W}}^{\textrm{T}} {\textbf{W}} {\textbf{x}}+{\textbf{y}}^{\textrm{T}} {\textbf{W}}^{\textrm{T}} {\textbf{W}} {\textbf{y}}) \end{aligned}$$
(2)
$$\begin{aligned}{} & {} \Psi ^{c}({\textbf{x}}, {\textbf{I}}) = \beta ({\textbf{x}}^{\textrm{T}} {\textbf{S}}^{\textrm{T}} {\textbf{S}} {\textbf{x}}) \end{aligned}$$
(3)
$$\begin{aligned}{} & {} \Psi ^{n}({\textbf{x}}, \textbf{In})=\gamma ({\textbf{x}}^{\textrm{T}} {\textbf{P}}^{\textrm{T}} {\textbf{P}} {\textbf{x}}) \end{aligned}$$
(4)
$$\begin{aligned}{} & {} \Psi ^{d}({\textbf{x}}, {\textbf{D}})=\delta ({\textbf{x}}^{\textrm{T}} {\textbf{D}}^{\textrm{T}} {\textbf{D}} {\textbf{x}}) \end{aligned}$$
(5)

We shall pose the problem as one of finding the optimal range vector \({\textbf{x}}^{*}\) such that:

$$\begin{aligned} {\textbf{x}}^{*}=\underset{{\textbf{x}}}{{\text {argmin}}}\left\{ E({\textbf{X}}, {\textbf{Y}}) \right\} \end{aligned}$$

Substituting equations 2, 3, 4 and 5 into 1 and solving for x reduces the problem to: \( \textbf{A x}={\textbf{b}}\) where

$$\begin{aligned} {\textbf{A}}= & {} \alpha ({\textbf{W}}^{\textrm{T}}{\textbf{W}})+ \beta ({\textbf{S}}^{\textrm{T}}{\textbf{S}}) + \gamma ({\textbf{P}}^{\textrm{T}}{\textbf{P}}) + \delta ({\textbf{D}}^{\textrm{T}}{\textbf{D}})\\ {\textbf{b}}= & {} \alpha ({\textbf{W}}^{\textrm{T}} {\textbf{W}} {\textbf{y}}) \end{aligned}$$

All we need to do to perform the optimization is solve a large sparse linear system. Methods for solving sparse systems fall into two categories: direct and iterative. Direct methods are robust but require large amounts of memory as the size of the problem grows. Iterative methods, on the other hand, offer better performance but may exhibit numerical problems [10, 20]. In the present paper, the fast conjugate gradient squared algorithm, which builds on the conjugate gradient method of Hestenes and Stiefel [25, 64], is employed to solve the energy minimization problem.
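
Putting the pieces together, the quadratic energy is minimized by solving the sparse linear system A x = b. The sketch below uses SciPy's plain conjugate gradient solver rather than conjugate gradient squared, and the weights alpha to delta are illustrative rather than the tuned values used in the experiments.

```python
from scipy.sparse.linalg import cg

def solve_depth(W, S, P, D, y, alpha=1.0, beta=0.5, gamma=0.5, delta=0.5):
    """Minimize the quadratic energy by solving A x = b with conjugate gradient.

    W: diagonal unary weights; S, P, D: edge matrices of the colour,
    surface-normal and depth pairwise potentials; y: observed superpixel depths.
    A is symmetric positive semi-definite, so a CG-type solver is appropriate.
    """
    A = (alpha * (W.T @ W) + beta * (S.T @ S)
         + gamma * (P.T @ P) + delta * (D.T @ D)).tocsr()
    b = alpha * (W.T @ W @ y)
    x, info = cg(A, b, atol=1e-8)
    if info != 0:
        raise RuntimeError("conjugate gradient did not converge")
    return x                                  # dense superpixel depth estimate
```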

4.4 Pseudo-code

Algorithm 1 provides the complete pseudo-code for our proposed framework, illustrated earlier in Fig. 5. In this algorithm, lines 1 to 5 perform the preprocessing, which includes gathering the multi-modal raw data, building a connection graph between pairs of adjacent superpixels and projecting the clustered LiDAR points onto the image space. Lines 6 to 11 constitute the core of the approach: they construct the cost function from the different (unary and pairwise) potentials to obtain the complete CRF for depth estimation. The objective of the pairwise potentials is to smooth the depth regressed by the unary part based on the neighbouring superpixels. The pairwise potential functions are based on standard CRF vertex and edge feature functions studied extensively in [52] and other papers. Our model uses both the content information of the superpixels and the relations between them to infer the depth.

Algorithm 1

UAOFusion Network: A multi-modal CRF-based method for Camera–LiDAR depth estimation

5 Results and discussion

We evaluate our approach on the raw sequences of the KITTI benchmark, which is a popular dataset for single image depth map prediction. The sequences contain stereo imagery taken from a car driving in an urban scenario. The dataset also provides 3D laser measurements from a Velodyne laser scanner, which we use as ground truth measurements (projected into the stereo images using the given intrinsics and extrinsics in KITTI). This dataset has been used to train and evaluate the state-of-the-art methods and allows quantitative comparisons. First, we evaluate the prediction accuracy of our proposed method with different potentials in Sect. 5.2. Second, in Sect. 5.3 we explore the impact on the depth estimation of the number of sparse depth samples and the number of superpixels. Third, Sect. 5.5 compares our approach to state-of-the-art methods on the KITTI dataset. Lastly, in Sects. 5.6 and 5.7, we demonstrate two use cases of our proposed algorithm, one for creating LiDAR super-resolution from sensor data provided by the KITTI dataset and another one for a dataset collected in the context of this work.

5.1 Evaluation metrics

We evaluate the accuracy of our method in depth prediction using the 3D laser ground truth on the test images. We use the following depth evaluation metrics: root-mean-squared error (RMSE), mean absolute error (MAE) and mean absolute relative error (REL). Among these, RMSE is the most important indicator and is used to rank submissions on the leaderboard, since it measures error directly on depth and penalizes errors at larger distances, where depth measurement is more challenging. These metrics were used in [13, 17, 30, 53] to assess the accuracy of monocular depth prediction.

$$\begin{aligned} RMSE= & {} \sqrt{\frac{1}{|T|} \sum _{d \in T}\Vert {\hat{d}}-d\Vert ^{2}}\\ MAE= & {} \frac{1}{|T|} \sum _{d \in T}\left| {\hat{d}}-d\right| \\ REL= & {} \frac{1}{|T|} \sum _{d \in T}\left( \frac{\left| {\hat{d}} -d\right| }{{\hat{d}}}\right) \end{aligned}$$

Here, d is the ground truth depth, \({\hat{d}}\) is the estimated depth, and T denotes the set of all points in the test set images. In order to compare our results with those of Eigen et al. [13] and Godard et al. [21], we crop our image to the evaluation crop applied by Eigen et al. We also use the same resolution of the ground truth depth image and cap the predicted depth at 80 m [21].
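
For reference, the three metrics can be computed as follows for a predicted and a ground-truth depth map; the validity mask and the 80 m cap follow the protocol described above, while the function name is ours.

```python
import numpy as np

def depth_metrics(d_hat, d, cap=80.0):
    """RMSE, MAE and REL over pixels with valid ground truth, with predictions
    capped at `cap` metres as in the evaluation protocol described above."""
    valid = d > 0
    d_hat = np.minimum(d_hat[valid], cap)
    d = d[valid]
    err = d_hat - d
    rmse = np.sqrt(np.mean(err ** 2))
    mae = np.mean(np.abs(err))
    rel = np.mean(np.abs(err) / d_hat)   # relative error as defined above
    return rmse, mae, rel
```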

5.2 Architecture evaluation

This section presents an empirical study of how different choices of potential functions and hyperparameters affect the accuracy of the depth prediction. In the first experiment, we compare the impact of sequentially adding our proposed pairwise potentials. We first evaluate a model with only the unary and colour pairwise potentials; then we add the surface-normal pairwise potential, and finally we include the depth pairwise potential. As shown in Table 1, the RMSE improves after adding each pairwise potential.

Table 1 Depth completion errors [mm] after adding pairwise potentials (lower is better)
Fig. 6

Qualitative evaluation of the impact of the pairwise potentials defined as CRF terms. In row order: (1) pairwise potential I penalizes dissimilar depth estimates of neighbouring pixels which have similar colours in the RGB image; (2) pairwise potential II penalizes depth differences between neighbouring superpixels whose surface-normal vectors have large cosine similarities; and (3) pairwise potential III penalizes neighbouring superpixels with large observed depth differences

5.3 The number of superpixels

In this section, we explore how the prediction accuracy depends on the number of available depth samples and on the number of superpixels.

Fig. 7

Visual comparison of dense depth maps produced by the CRF framework when varying the number of superpixels. From top to bottom: 1200, 2400 and 5500 superpixels

As displayed in Table 2, a greater number of superpixels yields lower errors. Although a larger number of sparse depth observations improves the quality of the depth map, the performance converges once the number of superpixels exceeds 5000, which is about 1.5% of the total number of pixels. We ran an exhaustive evaluation of our method for different numbers of superpixels; Fig. 8 clearly shows that our method's error decreases as the number of superpixels increases.

Table 2 Depth completion errors [mm] for different numbers of superpixels (lower is better)
Fig. 8

Convergence for different numbers of superpixels

5.4 Sub-sampling 3D depth points

We performed a quantitative analysis of the impact of the sparsity of the observed 3D points on the error of our proposed method by decreasing the number of 3D points considered during inference. As shown in Fig. 9, our method's error decreases as the number of 3D depth points increases, enabling a better dense depth map estimate; the number of 3D depth points is thus crucial for an accurate depth map. Even though the estimation error increases with the sparsity of the depth observations, our method provides state-of-the-art performance even when the depth observations are subsampled down to 40%.

Fig. 9

Convergence for different numbers of 3D depth points projected onto the 2D RGB images. The number of 3D depth points used is expressed as a percentage (\(\%\)); 100% means that all the 3D depth points projected onto the 2D image are used

5.5 Algorithm evaluation for depth completion

The KITTI dataset is more challenging for depth estimation than other datasets, such as NYU-Depth-v2, because the distances involved are considerably larger. The performance of our method and of other existing methods on the KITTI dataset is shown in Table 3, which indicates that the proposed method outperforms other depth map estimation approaches that are well accepted in the robotics community. Our model relies on the number of superpixels and the resolution of the input data sources: its performance increases if we increase the number of superpixels, the image resolution and the density of the LiDAR data.

Table 3 Depth completion errors [mm] by different methods on the test set of KITTI depth completion benchmark (lower is better)
Fig. 10

Depth completion and uncertainty estimates of our approach on the KITTI raw test set. From top to bottom: RGB and raw depth projected onto the image; high-resolution depth map; raw uncertainty; and estimated uncertainty map

5.6 Algorithm evaluation for LiDAR super-resolution

We present another demonstration of our method: super-resolution of LiDAR measurements. 3D LiDARs have a low vertical angular resolution and thus generate vertically sparse point clouds. We use all measurements in the sparse depth image, together with the RGB image, as input to our framework. An example is shown in Fig. 11; the cars are much more recognizable in the prediction than in the raw scans.

Fig. 11

LiDAR super-resolution. Creating dense point clouds from sparse raw measurements. From top to bottom: RGB image, raw depth map, predicted depth and ground truth depth map. Distant cars are almost invisible in the raw depth map, but are easily recognizable in the predicted depth map

On the other hand, starting from a LiDAR super-resolution map we can generate a 3D reconstruction of the scene. The reconstruction of three-dimensional (3D) scenes has many important applications, such as autonomous navigation [24], environmental monitoring [46] and other computer vision tasks [26]. A dense and accurate model of the environment is therefore crucial for autonomous vehicles; in fact, imprecise representations of the vehicle's surroundings may lead to unexpected situations that could endanger the passengers. In this paper, the 3D model is generated from a combination of image and range data, a sensor fusion approach that exploits the strengths of each modality to overcome the limitations of the other. Images normally have higher resolution and richer visual information than range data, whereas range data are noisy, sparse and visually poorer but already contain 3D information. The qualitative and quantitative results presented here suggest that our system provides 3D reconstructions of reasonable quality. Following [43], we use a random subset of 2000 images from the test sequences for evaluation. We evaluate on the bottom \(912\times 228\) crop, since no depth is available in the top area, and only on pixels with ground truth. The performance of our approach and of state-of-the-art depth completion methods is presented in Table 4.

Table 4 Depth estimation errors [m] by different methods on the test set of KITTI depth estimation benchmark (lower is better)

In Table 4, the RMSE values of the sparse-to-dense, CSPN and S2DNet methods are slightly better (lower) than that of our approach. All three methods solve the depth estimation task with deep learning algorithms, which require large amounts of data, which can be very costly in practice, and additional techniques such as data augmentation to improve their performance. For example, S2DNet uses a dataset composed of 1,070,568 images, while our approach only used 42% of that amount of data plus the LiDAR information. Additionally, these deep learning methods rely on advanced computing resources, which limits their use in real-world applications. As shown in Table 4, our method is very competitive with deep learning models without demanding a large amount of data or additional strategies. Moreover, we highlight the fact that our method's evaluation does not consider the full image data. As displayed in Fig. 8, our method converges to a better solution when we use 6000 superpixels instead of 5500. We did not evaluate further because the trend is clear: the more superpixels we use, the better the predicted dense depth map.

5.7 Depth estimation with very sparse LiDAR data: The UAO LiDAR-RGB Dataset

Thus far, we have sampled the depth from high-quality LiDAR depth maps, but in practice sparse depth inputs may come from less reliable sources. Therefore, we provide a qualitative evaluation of this model on our own well-calibrated LiDAR and RGB dataset. We use a 16-beam LiDAR along with a Stereo Labs Zed Mini camera at \(1280\times 720\) resolution. This dataset enables us to demonstrate the stability and robustness of the proposed model in particularly challenging scenarios: the scenes were recorded at a lower camera and LiDAR resolution than in the KITTI benchmark. Notably, the proposed algorithm is able to estimate a dense depth map of indoor and outdoor environments using colour and sparse depth data. The experimental results are shown in Figs. 12, 13 and 14. Dark red indicates farther distances, and dark blue indicates closer distances.

Despite the lower number of LiDAR channels, the proposed method provides accurate depth information even under challenging outdoor conditions, as shown in Fig. 14. In this scene, there is a lot of variability in the light and shadows produced by the environment and the weather. A close look at Figs. 12, 13 and 14 reveals that no LiDAR depth observations are available at the top and bottom of the colour image. After inference, the depth estimates at these locations, shown in the bottom images, are consistent with the information provided by the image. We can conclude that the framework proposed here works reliably for the depth prediction task. Additionally, it also solves the depth completion problem, as it is able to deal with highly sparse input point clouds projected onto the image space.

Fig. 12

Indoor LiDAR super-resolution. Creating dense point clouds from sparse raw measurements and colour. From top to bottom: RGB image, raw depth map and predicted depth

Fig. 13

Outdoor LiDAR super-resolution. Creating dense point clouds from sparse raw measurements and colour. From top to bottom: RGB image, raw depth map and predicted depth

Fig. 14

Outdoor LiDAR super-resolution. Creating dense point clouds from sparse raw measurements and colour. From top to bottom: RGB image, raw depth map and predicted depth

6 Conclusions and future work

In this paper, we described an innovative approach to fuse information from different sensor modalities, e.g. cameras and LiDAR, in order to probabilistically estimate a dense point cloud. Our approach achieves good performance in single image depth map prediction on the popular KITTI dataset. It is able to predict detailed depth maps on thin and distant objects. It also reasonably estimates the depth in parts of the image in which there is no ground truth available for supervised learning. The qualitative and quantitative results presented here suggest that our system provides 2D depth maps of reasonable quality, which depends on the density of the laser measurements and the number of superpixels selected. We believe that this method opens up an important avenue for research into multi-sensor fusion and the more general 3D perception problems, which might benefit substantially from sparse depth samples.

There are various variants of CRFs, each with its own pros and cons. A higher-order CRF, for example, considers interactions between more than two variables at a time, which can lead to more accurate predictions. A dynamic CRF is another type of CRF that is designed to work with sequences of data, such as time series or videos; this can be useful for depth estimation using Camera–LiDAR fusion, as it can take temporal information into account. However, both higher-order and dynamic CRFs require more computational resources and are more difficult to train than pairwise CRFs (also known as Markov CRFs), which only consider interactions between adjacent variables. Since our work focuses on robotics applications, in which computational performance is a hard constraint, instead of trying to capture long-term dependencies in the data we leverage the potential of modelling both appearance and geometric constraints from multi-modal sensor observations. A promising future research direction is to tackle the challenges of developing efficient, real-time inference for higher-order CRFs in the context of multi-modal depth estimation. Another promising research avenue is a parallel implementation of our method, which would bring benefits when deploying it on real robotic platforms. Additionally, the inclusion of other sensor modalities, such as radar [35, 41], could be explored as a way of improving our system's robustness to challenging environmental conditions.