Fusion Based Holistic Road Scene Understanding

This paper addresses the problem of holistic road scene understanding based on the integration of visual and range data. To this end, we propose an approach that jointly tackles object-level image segmentation and semantic region labeling within a conditional random field (CRF) framework. Specifically, we first generate semantic object hypotheses by clustering 3D points, learning their prior appearance models, and using a deep learning method to reason about their semantic categories. The learned priors, together with spatial and geometric contexts, are incorporated into the CRF. With this formulation, visual and range data are fused thoroughly, and the coupled segmentation and semantic labeling problem can be inferred efficiently via Graph Cuts. Our approach is validated on the challenging KITTI dataset, which contains diverse and complicated road scenarios. Both quantitative and qualitative evaluations demonstrate its effectiveness.


Introduction
Road scene understanding plays an important role in various computer vision applications, ranging from autonomous driving to urban modeling. It commonly involves multiple tasks, such as drivable road surface detection [1,2], pedestrian and vehicle detection [3,4,5,6], semantic region labeling [7,8,9,10,11,12], geometric context reasoning [13,14], and so on. Each individual task is notoriously difficult due to the complexity of natural scenarios. As in the typical example presented in Fig. 1(b), a road scene may contain severe lighting variation and a cluttered roadside background, together with varying numbers of vehicles and pedestrians. These challenges have led to a large number of studies tackling each problem.
Most existing work addresses the above-mentioned tasks individually. However, these problems are coupled. For example, semantic region labeling becomes easier if we know where the ground plane and moving objects are; likewise, geometric context helps to detect objects and label regions. These observations inspire our research. To take advantage of such correlations, this paper proposes to solve the problems jointly. In addition, considering that cameras and ranging sensors are often used together on today's autonomous vehicles, we build our work upon the fusion of visual and range data.
Specifically, this paper proposes a holistic approach that exploits appearance, geometry and contextual information to jointly tackle object-level image segmentation and semantic region labeling, from which it is straightforward to locate drivable road surfaces and moving objects in both images and 3D point clouds, as illustrated in Fig. 1 (f)-(i). Holistic road scene understanding is consequently achieved, providing robots with a deeper understanding of the whole scene.
The proposed approach distinguishes itself from other holistic scene understanding techniques in several aspects. First, our approach generates semantic object hypotheses by simply clustering a 3D point cloud into object candidates, learning their prior Gaussian mixture models (GMMs), and using a deep learning method to reason about their semantic categories. This procedure involves no sophisticated feature extraction and requires almost no tedious pixel-wise hand labeling. Second, we perform bimodal data fusion at multiple stages, hierarchically, from image-guided depth map upsampling to RGB-D image patch based object classification and holistic inference in a conditional random field (CRF). Thus, both visual and range information are thoroughly utilized. Last but not least, to the best of our knowledge, this research is one of the first studies on holistic road scene understanding. The effectiveness of our approach is validated on the challenging KITTI dataset [39].

[Figure 1: An overview of the tasks achieved in this work. Given an aligned 3D point cloud (a) and a color image (b), we first obtain a dense depth map (c) by a guided upsampling technique. Then, the 3D point cloud is clustered to generate object hypotheses (d). The bounding cuboids are projected onto the image to get object candidates (e). Both object-level image segmentation (f) and semantic region labeling (i) are obtained simultaneously by our proposed approach. From them, we directly get the object detection results on the image (h) and on the point cloud (g). Note that the colors in the second row have no semantic meaning: different colors denote different object instances. The colors in the third row represent the corresponding semantic categories, as shown in the legend.]
The remainder of this paper is organized as follows. Section 2 briefly reviews both fusion-based and holistic scene understanding techniques. Section 3 introduces the method for generating semantic object hypotheses. Section 4 presents the proposed holistic CRF framework, which incorporates the learned priors, together with lidar point pivoted hard constraints and geometric context, to jointly solve the problems. Experiments are presented in Section 5, followed by conclusions in Section 6.

Related Work
A huge body of work relates to our problem, as it encompasses multiple extensively studied tasks. In this section, we focus on the two most relevant aspects: fusion-based and holistic scene understanding. The former emphasizes the fusion of multi-modal data, while the latter aims to solve multiple tasks jointly.

Fusion Based Scene Understanding
With the advent of ranging sensors, it is now quite convenient to capture synchronized range and visual data. This convenience has motivated a great number of studies on fusing the two modalities for scene understanding tasks. Compared with a camera- or lidar-only scheme, fusion dramatically increases accuracy and robustness in various applications.
Generally speaking, fusion is conducted at either the feature or the decision level. Feature-level methods fuse the two modalities by extracting both appearance and geometric features and concatenating them for subsequent processing. Typically, these methods first segment RGB-D data into superpixels [15], divide a colored 3D point mesh into spatially adjacent regions [16,17], or map both pixels and 3D points into cells [18,19]. Then, sophisticated appearance features, such as textons [20], SIFT and HOG [21], and kernel descriptors [15], as well as geometric features, such as surface normals, angular moments, and average height, are extracted from each unit for tasks including object detection, 3D point segmentation [16,22], terrain classification [18,19], semantic 3D modeling [17], and scene parsing [9,23]. Among these studies, RGB-D oriented work is mostly limited to indoor scene parsing because a large portion of such data is obtained by Kinect-like sensors (although RGB-D data can also be obtained by upsampling lidar data [24]). In contrast, 3D point clouds collected by lidar are more suitable for outdoor applications.
In contrast to feature-level fusion, a decision-level method analyzes each modality individually and then combines the analysis results through a fusion scheme. For instance, Zhao et al. [23] utilize a fuzzy logic inference framework to combine the classification results of lidar data with those of images for scene parsing.
Beyond these two separate fusion schemes, the use of deep learning, a powerful architecture that merges feature- and decision-level fusion into a whole, has surged recently. It learns feature representation and classification simultaneously to solve tasks such as RGB-D based object recognition [26] and demonstrates promising results.
In contrast, our approach integrates visual and range information at multiple stages. More specifically, low-level fusion is first conducted to produce dense depth maps using an image-guided depth upsampling technique [25] previously proposed by us. The resulting RGB-D image patches are then fed into a deep learning method to reason about semantic categories. Finally, in the proposed holistic conditional random field framework, besides the learned appearance and geometric priors, lidar points are integrated as hard constraints to guide image segmentation. Our fusion is therefore conducted hierarchically, making thorough use of the bimodal information.

Holistic Scene Understanding
While substantial progress has been made in numerous computer vision tasks over the last few decades, most previous works tackle each particular problem in isolation. In recent years, however, more researchers have started to exploit the dependencies between different tasks and to solve two or more problems jointly. For example, Bleyer et al. [27], Ladicky et al. [28], and Hane et al. [29] combine stereo reconstruction with object segmentation to improve the performance of both. The problems of classification and segmentation are also addressed simultaneously in [30]. In light of these successes, researchers have stepped further toward the grand goal of holistic scene understanding [31,32,33,34].
Holistic scene understanding aims to fully interpret a scene by jointly solving the tasks of image segmentation, object detection, 3D reconstruction, scene classification, etc. A critical problem here is how to infer mutual information between the tasks. We roughly categorize the inference techniques into two groups. The first develops a general framework, such as Cascaded Classification Models (CCM) [31] and feedback enabled CCMs [33], to combine different tasks. These techniques treat the components of each task as black boxes and rely upon complicated inference algorithms, so it is hard to incorporate potentials specific to particular problems [34].
A more flexible alternative formulates the joint problem as inference within a Markov or conditional random field (CRF) framework [27,28,30,32,34,36,37]. Each node in the graph represents a segmentation or category label associated with a pixel, superpixel, or 3D point. Potentials encode unary information and pairwise or higher-order inter- or intra-task relations. Inference within the random field is done by message passing [34], fusion moves [27], or more efficient Graph Cuts algorithms [28,30,36,38] when the energy functions satisfy the submodularity constraint. In summary, the differences among CRF-based works lie in the problems to be solved, the construction of the graphical models, the incorporated priors, and the inference techniques.
Our work follows the second line in order to thoroughly exploit priors specific to road scenes and to hierarchically fuse the bimodal data. The proposed holistic CRF graphical model allows us to jointly solve the object-level image segmentation and semantic region labeling problems. Our CRF encodes the priors learned from the bimodal data, together with lidar point pivoted hard constraints and geometric context, in the unary potentials. Meanwhile, pairwise potentials exploit the spatial dependencies within each task, as well as the coherency between the two tasks. All designed unary and pairwise potentials satisfy the submodularity constraint, so Graph Cuts can be used for efficient inference.

Semantic Object Hypotheses Generation
Before integrating all information within a CRF framework, the first stage is to generate initial object hypotheses, learn their prior models, and reason about their semantic categories. Considering that geometric information is more reliable than visual cues for discovering objects, we start by partitioning a 3D point cloud into clusters to obtain object hypotheses. Once the clustered points are obtained, their registered pixels, also referred to as seeds, are used to build prior models of the objects. Moreover, each RGB-D image patch registered to the bounding cuboid of a 3D cluster is fed into a convolutional-recursive neural network (CRNN) [26] to determine its semantic category. The details of each step are given below.

Data Preprocessing
The data we process are aligned image-lidar pairs collected, respectively, by a camera and a lidar mounted on a vehicle [39]. With the intrinsic and extrinsic parameters of both sensors calibrated, it is straightforward to register a 3D point set and an image to each other. By registration, we obtain a sparse depth map in which the seeds are assigned the corresponding depth values and the remaining pixels carry no depth information. For the convenience of subsequent processing, the sparse depth map is upsampled by a guided depth enhancement technique [25], which generates a dense depth map by integrating the sparse one with the color image. An example result is illustrated in Fig. 1(c).
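To make the registration step concrete, the following minimal numpy sketch projects lidar points into the image plane to form the sparse depth map. It assumes a single 3 × 4 projection matrix P mapping homogeneous lidar coordinates to pixel coordinates (on KITTI, such a matrix can be composed from the provided calibration files); the function name and interface are ours, not from the paper.

```python
import numpy as np

def project_to_sparse_depth(points, P, h, w):
    """Project lidar points into the image to form a sparse depth map.

    points : (N, 3) lidar points
    P      : (3, 4) projection matrix combining intrinsics and extrinsics
    h, w   : image height and width
    Returns an (h, w) depth map; pixels with no lidar return hold np.inf.
    """
    pts_h = np.hstack([points, np.ones((len(points), 1))])  # homogeneous coords
    proj = pts_h @ P.T                                      # (N, 3)
    z = proj[:, 2]
    valid = z > 0                                           # keep points in front of camera
    u = np.round(proj[valid, 0] / z[valid]).astype(int)
    v = np.round(proj[valid, 1] / z[valid]).astype(int)
    z = z[valid]
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    depth = np.full((h, w), np.inf)
    # keep the nearest return when several points hit the same pixel
    for ui, vi, zi in zip(u[inside], v[inside], z[inside]):
        depth[vi, ui] = min(depth[vi, ui], zi)
    return depth
```

The seeds mentioned above are exactly the pixels that receive a finite depth here; the guided upsampling of [25] then fills the remaining pixels.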

Generating Object Hypotheses
As pointed out by Douillard et al. [40], ground extraction significantly improves clustering performance. Therefore, before 3D point clustering, we first estimate the ground plane. Since the ground is commonly the dominant plane in most road scenes, we use the Random Sample Consensus (RANSAC) algorithm [41] to estimate it. However, in scenarios such as a narrow street with buildings on both sides, the estimated dominant plane may lie on a building wall. To avoid such mistakes, we define a rough height range according to where the lidar is mounted on the vehicle, and only the 3D points within this range are considered for ground plane estimation.
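A minimal sketch of this height-gated RANSAC plane estimation is given below; the height band, iteration count, and inlier tolerance are illustrative assumptions, not values reported in the paper.

```python
import numpy as np

def fit_ground_plane(points, z_min=-2.5, z_max=-1.0,
                     n_iters=200, inlier_tol=0.15, rng=None):
    """RANSAC plane fit restricted to a height band around the expected ground.

    points : (N, 3) lidar points; z_min/z_max bound the height band in meters
             (hypothetical values; in practice derived from the lidar mount).
    Returns (n, d) of the plane n.x + d = 0 and an inlier mask over all points.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    band = points[(points[:, 2] > z_min) & (points[:, 2] < z_max)]
    best_count, best_model = 0, None
    for _ in range(n_iters):
        sample = band[rng.choice(len(band), 3, replace=False)]
        n = np.cross(sample[1] - sample[0], sample[2] - sample[0])
        norm = np.linalg.norm(n)
        if norm < 1e-9:
            continue  # degenerate (collinear) sample
        n /= norm
        d = -n @ sample[0]
        count = (np.abs(band @ n + d) < inlier_tol).sum()
        if count > best_count:
            best_count, best_model = count, (n, d)
    n, d = best_model
    inlier_mask = np.abs(points @ n + d) < inlier_tol
    return n, d, inlier_mask
```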
After detecting the ground plane, we leave out the corresponding points and use a simple but effective Euclidean clustering method to partition the remainder into object candidates. This method [42] is based on a nearest neighbor scheme and is implemented with a kd-tree data structure, making it quite efficient. Moreover, it produces good object clusters, especially for well-separated objects on the road.
Note that our clustering is performed on the original sparse 3D lidar points, rather than on the denser points reconstructed from the upsampled depth map. The reason is that upsampling techniques are prone to generate artifacts, especially near object boundaries and in large invalid regions, and these errors might propagate to later stages.
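The sketch below illustrates the kd-tree based Euclidean clustering on the sparse points. The actual experiments use the PCL implementation [42,43], so this Python version built on scipy's cKDTree is only a functional approximation, with illustrative parameter values.

```python
import numpy as np
from scipy.spatial import cKDTree

def euclidean_clusters(points, tol=0.5, min_size=30):
    """Greedy nearest-neighbor clustering of non-ground lidar points.

    tol      : neighbor distance threshold in meters (illustrative)
    min_size : clusters smaller than this are discarded
    Returns a list of index arrays, one per cluster.
    """
    tree = cKDTree(points)
    unvisited = np.ones(len(points), dtype=bool)
    clusters = []
    for seed in range(len(points)):
        if not unvisited[seed]:
            continue
        # grow a cluster by repeatedly absorbing points within tol
        queue, members = [seed], []
        unvisited[seed] = False
        while queue:
            idx = queue.pop()
            members.append(idx)
            for nb in tree.query_ball_point(points[idx], r=tol):
                if unvisited[nb]:
                    unvisited[nb] = False
                    queue.append(nb)
        if len(members) >= min_size:
            clusters.append(np.asarray(members))
    return clusters
```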

Learning Object Priors
Once the ground and the other object clusters are produced, the corresponding seeds are taken as samples to learn their prior models. In our work, we take only the RGB color and 3D location of each seed as features; no other sophisticated features are considered. For each object instance, a Gaussian mixture model (GMM) of the 6D feature (R, G, B, X, Y, Z) is therefore built. Note that the sky model is built differently: since the sky cannot be sampled from lidar data, sky regions in a set of images are manually labeled to learn a color GMM for it.
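A minimal sketch of the prior learning step with scikit-learn follows; the five-component setting matches the experiments in Sec. 5, while the function names are ours. The negative log-likelihood helper anticipates the likelihood term used in the object potential of Sec. 4.1.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def learn_object_prior(seed_rgb, seed_xyz, n_components=5):
    """Fit a GMM over the 6D (R, G, B, X, Y, Z) features of an object's seeds.

    seed_rgb : (N, 3) colors of the registered pixels (seeds)
    seed_xyz : (N, 3) corresponding 3D locations
    """
    feats = np.hstack([seed_rgb, seed_xyz]).astype(float)
    return GaussianMixture(n_components=n_components,
                           covariance_type='full').fit(feats)

def neg_log_likelihood(gmm, feat):
    """Negative log-likelihood of one pixel's 6D feature under an object prior,
    usable as the likelihood part of the object potential in Sec. 4.1."""
    return -gmm.score_samples(feat.reshape(1, -1))[0]
```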

Reasoning Semantic Categories
This step determines the semantic category of each image patch registered to a 3D cluster. To avoid a complicated feature extraction step, we simply apply a deep learning method. More specifically, we adopt a convolutional-recursive neural network (CRNN) [26], which takes an RGB-D image patch as input. Within the CRNN, a convolutional neural network (CNN) layer, with weights trained by k-means clustering, first extracts low-level features from the patch. The resulting feature maps are then passed to several recursive neural networks (RNNs) to obtain higher-order combinational features. The weights of the RNNs are randomly assigned, which is very efficient and has been shown to work well [26]. Finally, the RNNs' outputs are fed into a softmax classifier for recognition. The CRNN thus associates each image patch with a set of scores indicating the confidence that the patch belongs to each category.
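To make the recursive stage concrete, below is a numpy sketch of one random-weight RNN in the spirit of [26], using the sizes reported in Sec. 5 (128 feature maps of 27 × 27, a 3 × 3 receptive field, four RNNs); the initialization scale and function names are our assumptions.

```python
import numpy as np

def rnn_reduce(fmap, W):
    """One recursive layer: merge each non-overlapping 3x3 block of feature
    vectors into one vector using a fixed random weight matrix and tanh.

    fmap : (K, s, s) feature maps with s divisible by 3
    W    : (K, 9 * K) random weights, shared across blocks and layers
    """
    K, s, _ = fmap.shape
    out = np.empty((K, s // 3, s // 3))
    for i in range(s // 3):
        for j in range(s // 3):
            block = fmap[:, 3*i:3*i+3, 3*j:3*j+3].reshape(-1)  # (9K,)
            out[:, i, j] = np.tanh(W @ block)
    return out

K = 128
rng = np.random.default_rng(0)
W = rng.standard_normal((K, 9 * K)) * 0.1   # random, never trained
fmap = rng.standard_normal((K, 27, 27))     # stand-in for the CNN output
for _ in range(3):                          # 27 -> 9 -> 3 -> 1
    fmap = rnn_reduce(fmap, W)
feature = fmap.reshape(K)                   # one RNN's 128-dim contribution
# with four such RNNs (four random W's), the softmax input is 128 * 4 dims
```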

Holistic CRF Model
In this section, we formulate road scene understanding as a labeling problem, which associates each pixel with two types of labels: one indicates an object instance that the pixel belongs to and the other tells its semantic category. To this end, we construct a holistic CRF model consisting of two hidden layers. The model also integrates observed features of the pixels, together with the 3D lidar points and geometric contextual information to boost the accuracy of both object-level segmentation and semantic region labeling. Fig. 2 illustrates our constructed model.
Formally, given an image I, we construct a graph G = (V, E). The vertex set V = {V_O, V_C} consists of two sets of random variables and the edge set E = {E_OO, E_CC, E_OC} contains three types of edges. More specifically, a random variable o_i ∈ V_O is associated with the i-th pixel and takes a value from {0, ..., O + 1} to represent its object label, where O is the total number of object hypotheses generated in Sec. 3.3, 0 stands for the ground, and O + 1 for the sky. Likewise, a random variable c_i ∈ V_C takes a value from {1, 2, ..., C} to indicate its category, where C is the total number of semantic categories. With such a graphical model, an optimal solution of joint object-level segmentation and semantic region labeling is obtained by maximizing the following probability:

$$P(\mathbf{o}, \mathbf{c} \mid I) = \frac{1}{Z}\exp\Big(-\lambda_1\sum_{i \in V}\psi_O(o_i) - \lambda_2\sum_{i \in V}\psi_C(c_i) - \lambda_3\!\!\sum_{(i,j) \in E_{OO}}\!\!\psi_{OO}(o_i, o_j) - \lambda_4\!\!\sum_{(i,j) \in E_{CC}}\!\!\psi_{CC}(c_i, c_j) - \lambda_5\sum_{i \in V}\psi_{OC}(o_i, c_i)\Big), \tag{1}$$

where Z is the partition function. There are five types of potentials: ψ_O(o_i) and ψ_C(c_i) are unary potentials associated with the object label and the category label, respectively; ψ_OO(o_i, o_j) is a pairwise potential exploiting the dependency of neighboring object labels; ψ_CC(c_i, c_j) is a pairwise potential capturing the dependency of neighboring category labels; ψ_OC(o_i, c_i) captures the mutual information between object and category labels; and λ_1, ..., λ_5 are scaling factors. The details of each potential are explained below.
With appropriate design, this graphical model can be inferred with the efficient Graph Cuts algorithm [38].

Object Potential
The object potential evaluates the confidence for a pixel to be labeled as the o_i-th object. Commonly, it is designed in terms of the likelihoods under the learned GMMs, as follows [38,44]:

$$\psi_O(o_i) = -\log P(f_i \mid o_i), \tag{2}$$

where f_i is the 6D feature of pixel v_i and P(f_i | o_i) is its likelihood under the GMM of the o_i-th object. This likelihood-based potential is unreliable when two objects share similar features; for instance, strong shadows on the ground and nearby bushes are prone to be mistakenly labeled as the same object. In contrast, 3D point clustering performs better; at least it is invariant to illumination changes. Therefore, we place high confidence [44] on the seeds. Let us denote the entire set of seeds by S and the set of seeds belonging to the o-th object by S_o. The object potential with hard constraints (HC) is then defined by

$$\psi_O(o_i) = \begin{cases} \alpha_o, & v_i \in S_{o_i} \\ \beta_o, & v_i \in S \setminus S_{o_i} \\ -\log P(f_i \mid o_i), & \text{otherwise,} \end{cases} \tag{3}$$

where α_o is a small positive value and β_o is a large positive value, both set experimentally to enforce the constraints. With these hard constraints, the labels of the registered pixels are forced to be consistent with the point clustering results.
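A sketch of how the hard constraints of Eq. (3) can be imposed on a precomputed likelihood volume is given below; array shapes and names are ours, and the α_o, β_o values follow those reported in Sec. 5.

```python
import numpy as np

def object_unary(nll, seeds, seed_labels, alpha_o=1.0, beta_o=500.0):
    """Object unary of Eq. (3): GMM likelihoods plus lidar-seeded hard constraints.

    nll         : (H, W, O + 2) negative log-likelihood under each object GMM
    seeds       : list of (row, col) pixels registered to lidar points
    seed_labels : cluster label of each seed
    """
    psi = nll.copy()
    for (r, col), lab in zip(seeds, seed_labels):
        psi[r, col, :] = beta_o     # heavily penalize every other object label...
        psi[r, col, lab] = alpha_o  # ...and favor the point-clustering label
    return psi
```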

Category Potential
The category potential indicates the confidence for a pixel to be of the c_i-th category. This potential incorporates the classification result obtained by the CRNN together with the learned prior models and geometric contextual information for better reasoning.
Specifically, for simplicity, let us first divide the semantic categories into three groups: C_SG, C_B, and C_O. C_SG stands for the ground and sky categories, C_B contains the background category, and C_O holds the remaining categories, such as pedestrians, vehicles, etc. The latter two groups are recognized by the CRNN. We define a confidence score f(P_k, c) for an image patch P_k to be of category c:

$$f(P_k, c) = \begin{cases} s(P_k, c) \cdot g(P_k), & c \in C_O \\ s(P_k, c), & \text{otherwise,} \end{cases} \tag{4}$$

where k ∈ {1, ..., O} denotes the k-th object hypothesis and, as before, k = 0 stands for the ground and k = O + 1 for the sky. Note that there is no patch for the ground or the sky; for a uniform formulation, we define the ground patch P_0 as the part of the image under the horizon line [2] and the sky patch P_{O+1} as the rest of the image. Here, s(P_k, c) is the score obtained by the CRNN and g(P_k) is a term introducing geometric properties. Although more complicated geometric relations could be taken into account, we only exploit a quite straightforward observation: except for the ground, the sky, and the background, all objects must lie on the ground. The constraint is therefore designed as

$$g(P_k) = \begin{cases} 1, & \text{bottom\_height}(P_k) < T_h \\ 0, & \text{otherwise,} \end{cases} \tag{5}$$

where bottom_height(P_k) denotes the bottom height of the corresponding object cuboid, which should be lower than a threshold T_h. Upon these, we define our category potential as

$$\psi_C(c_i) = \begin{cases} \min\limits_{k \in M_1(c_i)} \big( -\log f(P_k, c_i) + \psi_O(k) \big), & \text{if pixel } i \text{ falls into a patch of } M_1(c_i) \\ \alpha_c, & \text{otherwise,} \end{cases} \tag{6}$$

where M_1(c_i) denotes the set of object instances identified as the c_i-th category, ψ_O(k) is the object potential of pixel i taking object label k, and α_c is a large positive value assigned to pixels that do not fall into any object patch. Combining the category recognition confidence f(P_k, c_i) with the object-level segmentation confidence ψ_O(o_i) yields semantic labeling results with better object boundaries. An illustration of this term is presented in Fig. 3.
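The following sketch assembles the confidence score of Eqs. (4)-(5) from CRNN scores and cuboid geometry. The multiplicative gating and the threshold value are our assumptions for illustration; the paper does not report T_h.

```python
import numpy as np

def category_confidence(scores, bottom_height, exempt, T_h=0.5):
    """Confidence f(P_k, c) of Eqs. (4)-(5): CRNN scores gated by geometry.

    scores        : (O + 2, C) softmax scores s(P_k, c) per patch and category
    bottom_height : (O + 2,) bottom height of each object cuboid (meters)
    exempt        : (C,) boolean mask of categories exempt from the ground
                    test (ground, sky, background), i.e. C_SG and C_B
    T_h           : hypothetical ground-contact threshold
    """
    g = (bottom_height < T_h).astype(float)  # on-the-ground test, Eq. (5)
    f = scores * g[:, None]                  # gate scores of object categories
    f[:, exempt] = scores[:, exempt]         # no gating for C_SG and C_B
    return f
```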

Object Coherency Potential
The object coherency potential exploits the dependence between neighbors. It encourages two neighboring pixels to take the same object label if their associated features are similar to each other. This potential can smooth out isolated labels, leading to piecewise coherent results.
Specifically, for a pixel v_i and each of its 4-connected neighbors v_j, this potential is defined as

$$\psi_{OO}(o_i, o_j) = T(o_i \neq o_j) \cdot \exp\big(-\|f_i - f_j\|_2^2 / \sigma\big), \tag{7}$$

where ||f_i − f_j||_2 is the L_2 norm of the difference between the features f_i and f_j, σ is a scaling parameter, and T(·) is an indicator whose value is 1 when its argument is true and 0 otherwise. This term encodes that the more similar the features are, the more likely the two pixels belong to the same object.
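A small sketch of the contrast-sensitive weights underlying Eqs. (7)-(8), computed once for the right and down neighbors of the 4-connected grid; σ = 625 follows the setting reported in Sec. 5, and the layout of the weight array is our choice.

```python
import numpy as np

def pairwise_weights(features, sigma=625.0):
    """Contrast-sensitive edge weights exp(-||f_i - f_j||^2 / sigma).

    features : (H, W, D) per-pixel feature image
    Returns (H, W, 2): channel 0 holds the weight to the right neighbor,
    channel 1 the weight to the neighbor below (border entries stay 0).
    """
    w = np.zeros(features.shape[:2] + (2,))
    d_right = ((features[:, :-1] - features[:, 1:]) ** 2).sum(-1)
    d_down = ((features[:-1, :] - features[1:, :]) ** 2).sum(-1)
    w[:, :-1, 0] = np.exp(-d_right / sigma)
    w[:-1, :, 1] = np.exp(-d_down / sigma)
    return w
```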

Category Coherency Potential
The category coherency potential encourages neighboring pixels to take the same category label. Likewise, it is defined by

$$\psi_{CC}(c_i, c_j) = T(c_i \neq c_j) \cdot \exp\big(-\|f_i - f_j\|_2^2 / \sigma\big). \tag{8}$$

Object-Category Coherency Potential
This potential exploits the dependency between the object and category labels of the same pixel. More specifically, the category label of a pixel should agree with the recognition result of the object that the pixel belongs to. It is therefore designed as

$$\psi_{OC}(o_i, c_i) = T\big(c_i \neq M_2(o_i)\big), \tag{9}$$

where M_2(o_i) is a function determining the category of an object instance, defined as

$$M_2(o_i) = \arg\max_{c} f(P_{o_i}, c). \tag{10}$$
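Putting the pieces together, the sketch below evaluates the energy corresponding to Eq. (1) for given label maps (the negative log of the unnormalized probability, which Graph Cuts minimizes). It reuses the weight layout of the previous sketch, and the λ values follow Sec. 5; the function is for inspecting candidate solutions, not an inference routine.

```python
import numpy as np

def joint_energy(o, c, psi_O, psi_C, pair_w, M2,
                 lam=(0.5, 1.0, 10.0, 10.0, 10.0)):
    """Total CRF energy of Eq. (1) for a given labeling.

    o, c   : (H, W) integer object / category label maps
    psi_O  : (H, W, O + 2) object unaries, Eq. (3)
    psi_C  : (H, W, C) category unaries, Eq. (6)
    pair_w : (H, W, 2) contrast-sensitive weights (right / down neighbors)
    M2     : (O + 2,) object-to-category lookup realizing Eqs. (9)-(10)
    """
    l1, l2, l3, l4, l5 = lam
    H, W = o.shape
    ii, jj = np.mgrid[:H, :W]
    E = l1 * psi_O[ii, jj, o].sum() + l2 * psi_C[ii, jj, c].sum()
    # pairwise Potts terms on the 4-connected grid, Eqs. (7)-(8)
    E += l3 * (pair_w[:, :-1, 0] * (o[:, :-1] != o[:, 1:])).sum()
    E += l3 * (pair_w[:-1, :, 1] * (o[:-1, :] != o[1:, :])).sum()
    E += l4 * (pair_w[:, :-1, 0] * (c[:, :-1] != c[:, 1:])).sum()
    E += l4 * (pair_w[:-1, :, 1] * (c[:-1, :] != c[1:, :])).sum()
    # object-category coherency: penalize pixels whose category label
    # disagrees with the recognized category of their object instance
    E += l5 * (c != M2[o]).sum()
    return E
```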

Experiments

KITTI Dataset
To validate the proposed approach, we conducted a series of experiments on the KITTI vision benchmark suite [39], which provides numerous color images and 3D point clouds. The data are captured by a Point Grey Flea 2 video camera and a Velodyne HDL-64E 3D lidar jointly mounted on a vehicle. Each image has a resolution of 1242 × 375, and each 3D point cloud contains roughly 100,000 points covering a 360° field of view (FOV); only the points falling within the camera's FOV are taken into consideration. The two modalities are registered to each other according to the sensor parameters provided on KITTI's website.
Experiments are conducted on the 'City', 'Residential', and 'Road' datasets, which contain a variety of complex scenarios on urban and highway roads, with the presence of vehicles, cyclists, pedestrians and other objects. The total number of images is 18529, among which 13765 images are randomly selected for the CRNN and the remaining 4764 images are used for evaluation. The details of the evaluation are stated below.

Evaluation of CRNN
The semantic reasoning step via the CRNN is critical for our final results, so we first evaluate its performance. The input of the CRNN is an image patch obtained as introduced in Sec. 3. More specifically, we use the nearest neighbor clustering algorithm in the Point Cloud Library (PCL) [43] to generate initial object hypotheses. Clusters containing very few points or lying far away are discarded for robustness. Then, the image patches registered to the clustered 3D points are fed into the CRNN.
Each patch is resized to 67 × 67. In the CRNN [26], we set the CNN filter size to 8 × 8 and the number of filters to 128. The CNN filters are pre-trained by k-means clustering on 300,000 patches randomly sampled from our training set. Average pooling with pooling regions of size 8 and stride 2 produces 128 feature maps of size 27 × 27. The RNN receptive field is set to 3 × 3, by which each feature map is recursively reduced to 9 × 9, then 3 × 3, and finally 1 × 1. With four RNNs, the final feature vector for classification has 128 × 4 dimensions.
We manually label all the patches extracted from the 13765 images into seven object categories; the categories and their corresponding patch counts are listed in Table 1. In each category, we randomly select 70% of the patches for CRNN training and use the rest for testing. We also horizontally flip the patches in the 'Cyclist', 'Pedestrian', and 'Sitter' categories to double their training samples.
In this section, a set of comparative experiments is designed to investigate the performance of the CRNN under different input configurations. For instance, we compare the performance of the CRNN using RGB-D patches versus RGB only. Moreover, although rectangular patches are fed into the CRNN, our algorithm is actually able to extract object regions, so we also compare the performance for patches with and without masks. The average recognition accuracy of each configuration is shown in Table 2. The CRNN performs best when depth information is considered and the background is masked out.
In addition, we present the confusion matrices in Fig. 4 to analyze the recognition performance further. They confirm that the masked RGB-D configuration achieves the least confusion in most categories. We also make the following observations. First, among all categories, 'Vehicle', 'Roadside', and 'Sitter' are recognized with high accuracy, followed by 'Cyclist', 'Pole', and 'Greenbelt'; the 'Pedestrian' category is confused most often. Second, all categories are prone to be misclassified as 'Roadside'. The reason is that the 'Roadside' category is extremely diverse, containing various objects such as trees, buildings, building windows, roadside barriers, mailboxes, and so on; without global information, many patches of other categories are easily mistaken for 'Roadside' even by humans. Third, 'Pedestrian' is prone to be misclassified as 'Cyclist', 'Pole', or 'Roadside' due to similarity in shape. Overall, the confusions are reasonable and the CRNN performs well.

Evaluation of Holistic Understanding
Before evaluating the performance of holistic understanding, let us first introduce the implementation details. The parameters involved in the joint problem are empirically set as follows: the scaling factors defined in Eq. (1) are λ_1 = 0.5, λ_2 = 1, and λ_3 = λ_4 = λ_5 = 10; in Eq. (3), α_o = 1 and β_o = 500; in Eq. (6), α_c = 50; and in Eq. (7), σ = 625. Each Gaussian mixture model has five components. The algorithm is implemented in mixed Matlab/C and run on a desktop with an Intel Core i5 2300 and 12 GB memory. Our implementation has not yet been optimized for efficiency; the whole process takes about 50 s per frame. Roughly, it takes about 5 s to load and register a 3D point cloud, 1 s for point clustering, 13 s for building the GMMs, 4 s for the CRNN, and 22 s for Graph Cuts inference.
Experiments are performed on the 4764 images that have not been used in the CRNN. In order to quantitatively evaluate the proposed approach, we randomly select 140 images and manually label them with both object-level segmentation and semantic category labels. When evaluating object-level segmentation, we choose the global consistency error (GCE) and the local consistency error (LCE), which are two criteria proposed by Martin et al. [45] for measuring consistency between two segmentation results. These criteria are designed to be tolerant to different numbers of segments arising from different perceptual levels when observing complex scenarios. For semantic labeling, the average accuracy, precision, recall, and F-measure are computed.
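For reproducibility, the GCE and LCE of Martin et al. [45] can be computed directly from two label maps, as in the sketch below; the vectorized formulation via a joint histogram is ours.

```python
import numpy as np

def gce_lce(seg_a, seg_b):
    """Global / local consistency error between two segmentations [45].

    seg_a, seg_b : (H, W) maps of 0-based integer segment labels.
    The local refinement error at pixel p is
        E(S1, S2, p) = |R(S1,p) \\ R(S2,p)| / |R(S1,p)|,
    where R(S, p) is the segment of S containing p.
    """
    a, b = seg_a.ravel(), seg_b.ravel()
    n = a.size
    # joint histogram: joint[i, j] = #pixels in segment i of A and j of B
    joint = np.zeros((a.max() + 1, b.max() + 1))
    np.add.at(joint, (a, b), 1)
    size_a = joint.sum(1)   # |R(S1, p)| per segment of A
    size_b = joint.sum(0)   # |R(S2, p)| per segment of B
    # per-pixel refinement errors in both directions
    e_ab = (size_a[a] - joint[a, b]) / size_a[a]
    e_ba = (size_b[b] - joint[a, b]) / size_b[b]
    gce = min(e_ab.sum(), e_ba.sum()) / n   # one direction for the whole image
    lce = np.minimum(e_ab, e_ba).sum() / n  # best direction per pixel
    return gce, lce
```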
To investigate the performance, a group of comparative experiments is conducted. First, we are interested in how much improvement is achieved by incorporating depth information into the GMM features and by integrating lidar point pivoted hard constraints (HC) into the object potential (Sec. 4.1). According to whether location information is used and whether the HC are imposed, we denote the algorithm variants by RGB, RGBXYZ, RGB HC, and RGBXYZ HC. For instance, RGBXYZ HC represents the variant using both color and location features together with hard constraints, and likewise for the others. Table 3 lists the quantitative comparison results and shows that incorporating depth and hard constraints greatly improves performance. A typical example demonstrating how these configurations behave is shown in Fig. 5; the segmentation, semantic labeling, and 3D reconstruction results there visually confirm the improvement brought by each component.

Finally, we investigate the performance of our holistic framework compared to implementing segmentation and semantic labeling separately. The quantitative comparison of object-level segmentation and average semantic labeling accuracy is listed in Table 3 (referring to 'Separate RGBXYZ HC' and 'Holistic RGBXYZ HC'), which shows that the holistic method achieves better performance in both tasks. To get a deeper insight, we also compare the precision and recall of each object category for semantic labeling, as listed in Table 4. The object categories include the seven introduced for the CRNN, together with 'Road' and 'Sky'. The percentage of pixels that each category holds is also listed for reference; the total number of pixels is 140 × 1242 × 375. The table shows that both the recall and precision of 'Pedestrian', 'Pole', and 'Greenbelt' increase with the holistic approach. For the other categories, recall and precision move in different directions, which makes it difficult to judge the relative performance directly. Therefore, the F-measure, the harmonic mean of precision and recall, is also provided; it is improved by our holistic approach for all categories except 'Sky' and 'Sitter'.

Fig. 6 demonstrates typical examples of how the holistic approach corrects both segmentation and semantic labeling results compared to the separate method. The improvements appear in two aspects. On the one hand, the holistic approach can correct some errors produced by object-level segmentation. For instance, as shown in rows E to G, the separate method segments part of the roadside regions wrongly, and these segmentation errors are inevitably propagated to the semantic labeling procedure. Rows H to J show that this type of error is corrected by jointly tackling the two tasks. Such improvement benefits from the coherency enforced between segmentation and semantic labeling in the holistic framework.
On the other hand, the holistic approach can also correct some recognition errors of the CRNN. For example, parts of the roadside are recognized as 'Car' and 'Pedestrian' in Fig. 6(b)F-G and Fig. 6(c)F-G, respectively; with geometric context considered in our holistic framework, these recognition errors are corrected, as shown in rows I to J. More experimental results of the holistic approach are presented in Fig. 7. From these examples, we observe that, although the scenarios are extremely diverse, our approach correctly segments and recognizes most of the objects, such as cyclists, pedestrians, cars, poles, and backgrounds, and the segmented objects have precise boundaries.

Discussion
As presented above, we have conducted several sets of comparative experiments. These comparisons show that integrating color and depth information substantially improves the performance of both segmentation and semantic reasoning, and that our holistic approach boosts the performance further. Of course, there is still room for improvement. For instance, overly bright building walls are easily segmented and labeled as 'Sky', and parts of car windows are often missed in segmentation and category labeling. These errors are mainly caused by missing lidar data and might be reduced if the guided depth upsampling algorithm performed better in large invalid regions.
In our experiments, we have not compared our algorithm with other work. The main reason is that, although object detection evaluation platforms are available on KITTI's website, to the best of our knowledge, no existing work addresses the object-level segmentation and semantic labeling tasks while integrating images and sparse lidar data.

Conclusions and Future Work
In this paper, we have presented an approach for holistic road scene understanding by integrating visual and range information. The approach has been validated by extensive experiments on the challenging KITTI dataset. Both qualitative and quantitative evaluations show that our algorithm is promising. In the future, besides improving the algorithm in the aspects discussed above, we plan to apply this work to large-scale semantic urban modeling.