3D OBJECT DETECTION BY FEATURE AGGREGATION USING POINT CLOUD INFORMATION FOR FACTORY OF THE FUTURE

ABSTRACT: Nowadays, object detection is an unavoidable aspect of almost any robotic application, especially in industrial settings where robots and vehicles interact closely with humans and objects, and a high level of safety for workers and machines is therefore required. This paper proposes an object detection framework suitable for automated vehicles in the factory of the future. It utilizes only point cloud information captured by LiDAR sensors. The system divides the point cloud into voxels and learns features from the calculated local patches. The aggregated feature samples are then used to iteratively train a classifier to recognize object classes. The framework is evaluated using a new synthetic 3D LiDAR dataset of objects that simulates large indoor point cloud scans of a factory model. It is also compared with other methods on the SUN RGB-D benchmark dataset. The evaluations reveal that the framework achieves promising object recognition and detection results, which we report as a baseline.


INTRODUCTION
Interpretation of point cloud data is an inevitable step in the development of the perceptual component of most recent robotic applications. It can provide useful information about the surrounding environment, such as the location of objects and obstacles. Unlike image-based object detection, object detection based on point clouds can determine the exact 3D coordinates of objects and help to plan subsequent steps such as object manipulation or obstacle avoidance in a navigation scenario. Such a system can help to produce smart industrial robots for factories of the future, scaling down ergonomic concerns while improving the productivity and quality of the working environment.
Recent advancements in remote sensing technologies have resulted in sensors that capture 3D point clouds with higher accuracy, which makes them relatively robust against challenging scene characteristics (illumination, noise, etc.). In contrast, point clouds have irregular representations that make them unsuitable inputs for typical CNNs. To avoid the problem of irregularity in point clouds, most current object detection approaches rely on methods based on 2D detectors. Such methods either extend 2D RGB detectors from images to detect objects in 3D or generate 2D images from point clouds in order to feed them to the detection network. For instance, in (Song, Xiao, 2016) and (Hou et al., 2019), extended versions of Faster and Mask R-CNN are applied in 3D. Usually, 3D irregular point clouds are voxelized and converted to regular grids, where a 3D detector is applied to those grids. These methods are usually subject to high computational costs caused by costly 3D convolutions.
On the other hand, methods such as (Chen et al., 2017) (Zhou, Tuzel, 2018) project 3D point clouds to 2D bird's eye view (BEV) images and, similar to regular 2D cases, apply 2D detectors on the images. Despite being useful in some scenarios, these methods overlook enormous amounts of geometric information that can be critical, particularly in indoor environments. In such scenarios, as there is a lot of clutter and many occluded objects, using a bird's eye view is not an effective option. For instance, Frustum PointNet (Qi et al., 2018) is a two-step detection network in which the 3D detections of the second step completely rely on the 2D detections of the first step. Using front-view RGB images, in the first step, objects are localized and their 2D bounding boxes are retrieved. In the second step, 3D frustums produced from the 2D boxes are utilized to localize objects in point clouds. These methods are highly dependent on 2D detectors: if the 2D detector cannot detect an object, the 3D detector will also miss the object entirely in the point cloud. Some architectures (Qi et al., 2017b) (Wang, Posner, 2015) exploit the sparsity of the cloud by considering only the sensed points and converting them into regular structures. Directly processing the points can also prevent the information loss caused by quantization procedures. Therefore, a straightforward way to perform object detection is to represent the whole point cloud with such architectures (similar to the 2D detectors) and produce object proposals directly from the learned features. These approaches work in the two-dimensional case since the object center is a visible pixel in the image. However, this is not an advantageous solution in the presence of sparsity, as is the case in 3D point clouds. In point clouds, the object center is not visible and is usually far from the captured surface points. Such networks therefore have difficulty aggregating the features in the local neighborhood of the
object centers (unlike in 2D cases). Increasing the detection range does not help either, as it adds further clutter and points from adjacent objects to the detection result. To mitigate this issue, methods such as Hough VoteNet (Qi et al., 2019) propose a voting mechanism that generates points to estimate the object centers, which are later aggregated to produce object proposals. This method is efficient and well adapted to indoor scenarios, where there are many occlusions and methods based on the bird's eye view will fail. Motivated by these studies, in this work we propose a new framework for object detection based on feature aggregation in the context of the factory of the future. The framework is adapted to improve the perceptual capabilities of Automated Ground Vehicles (AGVs) on the factory floor. To this end, first, the acquired point cloud is passed to a preprocessing module that prunes the point cloud and generates the features. Then the point cloud is segmented into local patches, where features are aggregated and constitute the final object proposals. An object classifier iteratively trained through hard negative mining is used to determine object classes. Our contribution in this paper is three-fold. Firstly, we propose a new framework (Section 3) that is capable of object detection by feature aggregation from large point clouds, using only point cloud information, in cluttered indoor scenes. Secondly, we provide a challenging, partially-labeled synthetic object classification and detection dataset (Section 4) suitable for testing object detection frameworks. Our dataset introduces new challenges specific to indoor industrial environments. Finally, using our developed framework, we produce baseline object recognition and detection results on the collected dataset (Section 5). In addition, we evaluate its performance on the challenging SUN RGB-D dataset using only geometric point clouds as input. We provide a detailed comparison to available methods in the literature that use only geometry or both geometry and RGB information.

3D Object Detection using Point Cloud
Previously, proposed object detection methods were mostly extensions of 2D approaches. Some methods use template-based detection approaches (Nan et al., 2012) (Li et al., 2015), and others extended sliding window detection (Song, Xiao, 2014) or feature engineering (Ren, Sudderth, 2016) to 3D. Object detection becomes especially challenging in terms of instance-based segmentation, where most studies (Gupta et al., 2010) (Reitberger et al., 2009) (Wu et al., 2013) were applied in urban environments in order to detect bounding boxes of individual objects in a cluster of similar objects, such as trees alongside road corridors.
Recently, there has been a burst of interest in using deep neural networks, which has resulted in a torrent of studies achieving state-of-the-art performance in object detection. These networks perform efficiently when they work directly with 3D sparse inputs. In (Song, Xiao, 2016), the point clouds are divided into voxels and a 3D CNN is applied to learn features and generate 3D bounding boxes. Another voxel-based method is VoxelNet (Zhou, Tuzel, 2018), which is an end-to-end deep network.
It divides a point cloud into equally spaced 3D voxels and transforms the group of points within each voxel into a unified feature representation, known as Voxel Feature Encoding (VFE). The encoded point cloud is given to a region proposal network (RPN) to produce detections. PointPillars (Lang et al., 2019) is also an end-to-end architecture with 2D convolutional layers. It uses a novel encoder that learns features from pillars (vertical columns of the point cloud). First, the raw point cloud is converted into stacked pillar tensors and pillar index tensors. The encoder uses the stacked pillars to learn a set of features that can be scattered back to a 2D pseudo-image for a convolutional neural network. The features from the backbone are used by a detection head to predict 3D bounding boxes for objects. PointPillars runs at 62 Hz and is currently the fastest available method. Transforming, rotating and scaling raw point clouds is easy and straightforward. SECOND (Yan et al., 2018) uses this property to perform data augmentation on the point cloud, which boosts the performance of the network and speeds up convergence. It also introduces a new angular loss function that resolves the large loss values produced when the angular difference between a predicted bounding box and the ground truth bounding box is equal to π. The framework takes a raw point cloud as input and converts it through two voxel feature encoding (VFE) layers and a linear layer. Finally, a region proposal network produces the detections.
Given the increased complexity of working directly with 3D point cloud data, especially in environments with a large number of points, other methods (Ku et al., 2018, Ren et al., 2018) use projection to reduce the complexity. A popular approach in such methods is to first project the 3D data into a bird's eye view before moving the data forward in the pipeline. They exploit the sparsity in the activations of CNNs to speed up inference. Similarly, Complex-YOLO (Simon et al., 2018) converts a point cloud into an RGB BEV map, where the RGB map is encoded as a grid map processed by the CNN. Inspired by the YOLO architecture for 2D object detection (Redmon et al., 2016), the network predicts five boxes per grid cell. Each box prediction is composed of regression parameters and object scores with a general probability and class scores.
However, bird's eye view projection, voxelization and, in general, 2D image-based proposal generation suffer from information loss due to data quantization. These methods might fail in challenging indoor scenarios, because such scenes can only be fully observed from 3D space and there is no way for a 3D object box estimation algorithm to recover from these kinds of failures.

Learning Point Cloud Representations for Detection and Segmentation
In contrast to the above-mentioned methods that represent point clouds as voxels or in multimodal formats, the following methods learn features directly from the raw point cloud. This direct utilization of the point cloud is extremely efficient and increases the speed of classification and segmentation architectures. PointNet (Qi et al., 2017a) learns a higher-dimensional spatial feature representation for each 3D point and then aggregates all the points within a small 3D volume (typically an occupancy grid cell) in order to model a 3D neighborhood context. However, PointNet lacks the ability to capture local context at different scales.
The follow-up work PointNet++ (Qi et al., 2017b) introduced a hierarchical feature learning framework that improved the feature extraction quality by considering the local neighborhood of the points. The direct interaction of these methods with the raw point data allows them to run efficiently in real time.
Most of these methods target outdoor autonomous driving scenarios to detect cars and pedestrians and use BEV to represent the point clouds. Therefore, they are not a suitable choice for an indoor scenario, where there is a lot of occlusion and clutter in the scene. In a factory, there are many racks on which objects are occluded, and a BEV representation will therefore cause an enormous loss of 3D information, which in turn leads to poor detection results. In addition, many methods utilize RGB information as input. Although images are rich in perceptual information, our proposed framework favours efficiency and confidentiality in industrial factories by using only point cloud information. Moreover, it is flexible in using various feature extraction backbones. The feature aggregation module can use any kind of feature that is provided by the feature extraction module of the framework.

PROPOSED FRAMEWORK
The proposed framework consists of five main components (see Figure 1): a preprocessing block takes a point cloud as input and performs several operations such as pruning, cropping, calculating grids and producing a sparse representation. The resulting voxels are used by a feature extraction network to calculate features that represent the geometry of the contained points. Based on the calculated features, a segmentation method partitions the patches of the calculated local features into individual object segments. The aggregation block takes these segments and calculates the final object proposal feature by concatenating the features of the constituent voxels. Finally, a classifier based on a deep convolutional network identifies the class of the proposal. The classifier is initially trained with a set of labeled object instances as well as a small set of a background class. The classifier makes two types of errors: false positives and false negatives. Over several iterations, the false positives are incrementally collected into the background set. The classifier is thus iteratively trained with on-the-fly generated instances.

Pre-processing
The input for the ground surface detection algorithm is a translated point cloud Di from each scan. The point cloud is translated so that the ego-vehicle is the origin of the coordinate system. Therefore, Pi can be considered as the set of 3D points captured at time i, where ∪ combines the points from each sensor k = 1, ..., K. A crop operator (Crop(Di)) is also introduced that crops the dense point cloud within a given range of the ego-vehicle. This way, a range value can be assigned by considering different criteria.
For instance, a range very close to the vehicle can be considered as the critical region, and the object detection algorithm can be used in this cropped region with higher accuracy. A pruning operator (Prune(Di, r)) takes the given point cloud and prunes it at a given resolution r. This way, we sub-sample the point cloud to a lower number of points to gain efficiency. By default, we prune the point clouds at 30 cm. Finally, the point cloud is discretized into fixed-dimensional grid voxels. Compared to RGB images, 3D point clouds are sparse and most of the space in such data is unoccupied. To exploit the sparsity of the point cloud, only occupied grids are kept and the rest are discarded. This is critical for the feature extraction step, as it maps unoccupied grids to a zero feature vector.
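As a concrete illustration, the Crop and Prune operators and the occupancy-grid discretization described above can be sketched as follows. This is only an illustrative NumPy implementation under simplifying assumptions (the ego-vehicle at the origin, pruning realized as voxel-grid sub-sampling); the function names mirror the text, but the bodies are not the authors' code.

```python
import numpy as np

def crop(points, r):
    """Keep points within range r of the ego-vehicle (assumed at the origin)."""
    return points[np.linalg.norm(points, axis=1) <= r]

def prune(points, r=0.30):
    """Sub-sample by keeping one point per r-sized voxel (30 cm by default)."""
    _, idx = np.unique(np.floor(points / r).astype(np.int64),
                       axis=0, return_index=True)
    return points[idx]

def voxelize(points, cell=0.2):
    """Discretize into fixed-size grid cells; only occupied cells are kept."""
    ijk = np.floor(points / cell).astype(np.int64)
    occupied = {}
    for key, p in zip(map(tuple, ijk), points):
        occupied.setdefault(key, []).append(p)
    return {k: np.asarray(v) for k, v in occupied.items()}
```

Keeping only occupied cells in a dictionary is one simple way to realize the sparse representation: unoccupied cells are never stored and therefore implicitly map to the zero feature vector.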

Feature Extraction
We refer to the feature vector of a grid cell G at location (i, j, k) by f_{i,j,k}. If each grid has a dimension of (Nx, Ny, Nz), then we can define Φ as the set of grid indices in the feature grid, which keeps track of occupied and unoccupied grid cells. Therefore, ϕ ∈ Φ indicates a specific set of voxel indices in the whole grid. If a voxel is unoccupied, its feature set is mapped to zero (f_ϕ = 0) in order to generate a sparse feature set. Our framework is flexible and can utilize any feature extraction head for the occupied voxels. In this work, we use a modified version of (Ben-Shabat et al., 2017) for feature extraction. This object recognition network is adapted to extract features from a large point cloud grid with a given grid dimension and to create its sparse grid set ϕ, which is used for feature aggregation in later steps. The utilized network generalizes the Fisher Vector representation (Sánchez et al., 2013) to describe 3D point cloud data uniformly, so that it can be used as input for deep convolutional architectures. To extract the features of a single voxel, let X = {p_t, t = 1, ..., T} be the set of T 3D points in voxel G_i of ϕ in a given point cloud. We can define a mixture of Gaussians with K components and determine the likelihood of a single point p_t of X under the k-th component as

u_k(p_t) = (2π)^(−D/2) |Σ_k|^(−1/2) exp( −(1/2) (p_t − µ_k)^T Σ_k^(−1) (p_t − µ_k) ),

where D is the number of dimensions of the data points, and µ_k and Σ_k are the mean and covariance of the k-th component, respectively. The likelihood of all points belonging to G_i under the mixture of all K Gaussian components is then

p(p_t | λ) = Σ_{k=1}^{K} w_k u_k(p_t),

with w_k the weight of the k-th Gaussian component. Therefore, given a specific GMM with parameters λ = {w_k, µ_k, Σ_k} and the point set X, we can calculate its Fisher vector G_λ^X as the sum of normalized gradients

G_λ^X = Σ_{t=1}^{T} ∇_λ log p(p_t | λ).

The soft assignment of each point in the voxel to the GMM components yields the normalized gradient components (G_{w_k}^X, G_{µ_k}^X, G_{σ_k}^X), which are concatenated to produce the final Fisher vector representation of the voxel:

G^X = (G_{w_1}^X, ..., G_{w_K}^X, G_{µ_1}^X, ..., G_{µ_K}^X, G_{σ_1}^X, ..., G_{σ_K}^X).

The feature vector is normalized by the sample size. This is a hybrid representation that combines the discrete nature of the grids with the continuous structure of the Gaussian mixtures. The calculated features create unique representations for grids independent of the number of points, as well as invariant to both permutation and feature sizes.
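A minimal sketch of the per-voxel Fisher vector computation, assuming an isotropic GMM (one scalar standard deviation per component, in the spirit of the 3DmFV-style representation); the exact normalization used by (Ben-Shabat et al., 2017) may differ from this illustration:

```python
import numpy as np

def voxel_fisher_vector(X, mu, sigma, w):
    """X: (T, D) points of one voxel; mu: (K, D) means; sigma: (K,) isotropic
    std devs; w: (K,) mixture weights. Returns (G_w, G_mu, G_sigma) concatenated."""
    T, D = X.shape
    # Gaussian likelihood of each point under each component (isotropic covariance).
    d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)            # (T, K)
    u = np.exp(-0.5 * d2 / sigma**2) / ((2 * np.pi) ** (D / 2) * sigma**D)
    gamma = w * u
    gamma /= gamma.sum(axis=1, keepdims=True)                       # soft assignments
    # Normalized gradient statistics per component, normalized by sample size T.
    G_w = (gamma.sum(0) - T * w) / (T * np.sqrt(w))
    diff = (X[:, None, :] - mu[None, :, :]) / sigma[:, None]
    G_mu = (gamma[:, :, None] * diff).sum(0) / (T * np.sqrt(w)[:, None])
    G_sig = (gamma[:, :, None] * (diff**2 - 1)).sum(0) / (T * np.sqrt(2 * w)[:, None])
    return np.concatenate([G_w, G_mu.ravel(), G_sig.ravel()])
```

For D = 3 the descriptor has length K + 3K + 3K = 7K per voxel, independent of the number of points T, which matches the permutation- and size-invariance property claimed above.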

Segmentation into Local Patches
For segmentation of the point cloud into individual objects, we follow the greedy ℓ0-cut approach (Landrieu, Obozinski, 2016), a graph-cut method based on the min-cut algorithm (Kohli, Torr, 2005). It first over-segments a given point cloud into various partitions using its nearest-neighborhood graph. It then iteratively calls the min-cut algorithm to minimize the segmentation error. If we describe the point cloud's geometrical structure with an unoriented graph G = (V, E) (nodes as points in the point cloud and edges as their adjacency relationships), we can achieve an optimal segmentation by minimizing the Potts segmentation energy over a labeling g, which produces the individual object segments:

g* = argmin_g Σ_{i∈V} ||g_i − f_i||² + ρ Σ_{(i,j)∈E} [g_i ≠ g_j],

where f_i are the acquired local geometric features and [·] is the indicator function. This formulation is beneficial as its regularization parameter ρ implicitly determines the number of clusters. There is also no need to set a maximum segment size, so objects of various sizes can be retrieved. By solving the optimization problem, the algorithm generates K non-overlapping partitions S = (S_1, ..., S_K), which are used as input for the aggregation block of the framework. Figure 2 illustrates the process of generating local object segments.
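The ℓ0-cut optimization itself is involved; as an illustrative stand-in, the sketch below builds a k-nearest-neighbour graph and greedily merges neighbours whose local features are closer than a threshold rho. This captures the spirit of the Potts formulation (spatial adjacency plus feature homogeneity) but is not the iterated min-cut algorithm of (Landrieu, Obozinski, 2016):

```python
import numpy as np

def segment_potts_greedy(points, feats, k=5, rho=0.5):
    """Greedy stand-in for the l0-cut Potts optimization: build a k-NN graph and
    union-find-merge neighbours whose feature distance is below rho. Returns one
    integer segment label (0..K-1) per point."""
    n = len(points)
    parent = list(range(n))

    def find(i):  # union-find root with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    d = np.linalg.norm(points[:, None] - points[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    for i in range(n):
        for j in np.argsort(d[i])[:k]:                    # k nearest neighbours
            if np.linalg.norm(feats[i] - feats[j]) < rho:  # feature-homogeneous edge
                parent[find(i)] = find(int(j))
    labels = np.array([find(i) for i in range(n)])
    _, labels = np.unique(labels, return_inverse=True)     # relabel to 0..K-1
    return labels
```

The O(n²) distance matrix is only acceptable for small examples; a real implementation would use a spatial index for the k-NN graph.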

Feature Aggregation
Given the sparse feature set f_ϕ and the generated segments S, object proposals can be generated for the classification step. For each generated segment S_i, we retrieve the non-sparse (occupied) grid cells that it occupies. The final representation of each proposal is then calculated by aggregating the features of the cells that comprise the segment, which is simply a normalized, weighted combination of the occupied cells' Fisher vector descriptors:

F_i = (1 / Σ_j w_j) Σ_{j∈S_i} w_j f_j,  i = 1, ..., n,

where n is the number of segments, j is the occupied-cell index and w_j is the weight of the features, which can be determined based on the feature type or the location of the cell in the grid.
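Concretely, the aggregation reduces to a weighted mean over the occupied cells of a segment, followed by normalization; in this sketch the weighting scheme is left generic, since the text does not fix one:

```python
import numpy as np

def aggregate_segment_feature(fisher_vectors, weights=None):
    """Aggregate the Fisher vectors of one segment's occupied cells into a single
    normalized proposal descriptor. `weights` may encode feature type or cell
    location, as described in the text; uniform weights are the default."""
    F = np.asarray(fisher_vectors)                  # (n_cells, feat_dim)
    w = np.ones(len(F)) if weights is None else np.asarray(weights, float)
    agg = (w[:, None] * F).sum(0) / w.sum()         # weighted mean over cells
    norm = np.linalg.norm(agg)
    return agg / norm if norm > 0 else agg
```

Because unoccupied cells carry a zero feature vector, skipping them entirely (as here) changes nothing in the sum but keeps the computation sparse.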

Classification
The aggregated point cloud features are fed to the convolutional network architecture proposed in (Ben-Shabat et al., 2017), and the object models are trained through backpropagation by optimizing the standard cross-entropy loss function with batch normalization. The training starts with a small initial set of background class instances and continues with the standard hard negative mining technique inspired by common image-based object detectors (e.g. (Felzenszwalb et al., 2009)). In each iteration, the classifier is evaluated on the training set and all false positive detections are stored. These negative samples are then sorted in decreasing order of score, and the first N samples are added to the background class samples. Finally, the classifier is trained with the updated training instances, and this process is repeated for several iterations until the desired result is achieved.
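The iterative loop above can be sketched generically as follows; `train_fn` and `evaluate_fn` are placeholders (not the authors' code) for training the 3DmFV-style network and for running detection on the training set to collect scored false positives:

```python
def hard_negative_mining(train_fn, evaluate_fn, positives, background,
                         n_hard=10, iterations=5):
    """Generic hard-negative-mining loop: after each training round, the N
    highest-scoring false positives are appended to the background class.
    train_fn(pos, neg) -> model; evaluate_fn(model) -> list of (score, sample)
    false positives found on the training set."""
    model = None
    for _ in range(iterations):
        model = train_fn(positives, background)
        false_positives = evaluate_fn(model)
        # keep the N hardest (highest-scoring) false positives
        hardest = sorted(false_positives, key=lambda s: -s[0])[:n_hard]
        background = background + [sample for _, sample in hardest]
    return model, background
```

In the experiments reported below the number of mining iterations is set to 50; here it is kept small purely for illustration.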

Post-processing
The detection process can result in multiple overlapping object bounding boxes because of the dense nature of the dataset that includes lots of objects placed in a close range from each other.
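One standard way to resolve such overlapping boxes is 3D non-maximum suppression; a minimal sketch, assuming axis-aligned boxes for simplicity (the actual detection boxes may be oriented):

```python
import numpy as np

def iou_3d(a, b):
    """Axis-aligned 3D IoU; boxes given as (xmin, ymin, zmin, xmax, ymax, zmax)."""
    lo = np.maximum(a[:3], b[:3])
    hi = np.minimum(a[3:], b[3:])
    inter = np.prod(np.clip(hi - lo, 0, None))
    vol = lambda box: np.prod(box[3:] - box[:3])
    return inter / (vol(a) + vol(b) - inter)

def nms_3d(boxes, scores, iou_thresh=0.6):
    """Visit boxes in descending score order; keep a box only if its IoU with
    every already accepted box stays at or below the threshold."""
    order = np.argsort(scores)[::-1]
    keep = []
    for i in order:
        if all(iou_3d(boxes[i], boxes[j]) <= iou_thresh for j in keep):
            keep.append(i)
    return keep
```

This mirrors the greedy score-ordered acceptance procedure described in the following paragraph.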
To avoid this problem, we employ a 3D non-maximum suppression approach, which has been widely used in many detection tasks (Neubeck, Van Gool, 2006, Felzenszwalb et al., 2009). First, the detection bounding boxes are listed in descending order of their detection scores. By comparing those at the top of the list with the currently accepted object list, we decide whether to keep them. If the overlap of a bounding box with a previously accepted one is no more than a given threshold, it is kept. The bounding box overlap is calculated with the intersection over union (IoU) metric.

DATASETS

Synthetic Dataset

We introduce a novel simulated dataset representing a scenario in a modeled factory. In the simulation, a ground vehicle starts from a specific point in the 3D model of a factory, navigates throughout the factory and comes back to its initial position.
Figure 3 shows a point cloud retrieved from a scan captured from the simulated 3D factory model. As can be partially seen from the point cloud, there are many racks in the model, and at each row of a rack various object types are located.
The objects are interconnected with the racks and also with other nearby objects. This introduces difficulties for segmentation and makes object detection in this dataset a very challenging task. As shown in Figure 4, twelve sensors (S1 to S12) are mounted on the vehicle. The 3D sensors have a 120° horizontal and 45° vertical field-of-view. Each sensor has its own yaw, pitch and roll angles and records with a maximum scan range of about 100 m.
We evaluate our framework on this dataset. The dataset contains 3 minutes and 20 seconds of simulation consisting of 6389 scans.
The scans are recorded at a rate of 30 scans per second, and each scan includes on average 80 000 points. For the evaluations, we selected 2/3 of the scans (4260 frames) for training and the rest for testing. For object recognition/detection, around 344 objects are annotated in the global map of the factory.
Considering all the scans, the total number of annotated object instances in the dataset is 69 738. Table 1 shows the details of the dataset. Unlike other recent datasets recorded for autonomous driving scenarios, whose main focus is on cars, pedestrians and bikes, our dataset includes seven different object classes emphasizing the industrial environment, which are not addressed by most current solutions.

SUN RGB-D
SUN RGB-D is a 3D scene understanding benchmark dataset consisting of 37 object classes. The dataset includes around 10K RGB-D indoor images with accurately annotated object categories and orientations. Prior to feeding the dataset into our framework, the RGB-D images are converted to point cloud data. For evaluation, we follow the standard protocol provided in the original study, and for comparison we use the 10 commonly used object categories. This dataset is used to evaluate the performance of the proposed framework in real-world scenarios, enabling a comparison with the state of the art.

EXPERIMENTS
To evaluate the performance of the framework, we use the mean average precision (mAP) and F-score metrics, where F-score = 2 × (Precision × Recall) / (Precision + Recall). The intersection over union threshold for the reported best results is set to 0.6 for the synthetic dataset and to 0.25 for the SUN RGB-D dataset. Table 2 shows the results of object recognition using the trained final classifier. For training the network, the batch size is set to 64, the decay rate to 0.7 and the learning rate to 0.001. For the representation of the feature vectors, the number of Gaussian functions and their variances are empirically chosen as 5 and 0.04, respectively. The minimum number of points for an object is 2048. For optimization, the Adam optimizer is used.
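For reference, the F-score used above can be computed directly from IoU-matched detection counts (true positives, false positives and false negatives):

```python
def precision_recall_f(tp, fp, fn):
    """Precision, recall and F-score from IoU-matched detection counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f = 2 * precision * recall / (precision + recall)
    return precision, recall, f
```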
For training, only 50 epochs were used on an Nvidia GeForce RTX 2060, and each training iteration took about two hours on the synthetic dataset (and 10 hours on SUN RGB-D). The obtained average class accuracy is 99.84% (mean loss 0.010).
As can be noticed, the trained classifier achieves a very high recognition accuracy. It is even possible to further increase the accuracy by increasing the number of epochs at each iteration. Table 3 and Figure 8 show the object detection results on our dataset (Figure 8 presents qualitative results of the proposed detection framework on a sample scene from the synthetic dataset). The table presents the results obtained after the final training iteration (the number of hard negative mining iterations is set to 50). We achieve satisfactory performance in most of the classes. However, for some classes, such as "Barrel" and "Truck", the detection results are not desirable. This is related to several problems. In some scenes (especially those including "Barrels"), the object points are so interconnected that the segmentation algorithm fails to separate the objects and therefore generates improper proposals. Moreover, the provided dataset is partially annotated. Thus, for some classes, such as "Box" and "Bobbin", the classifier performs correct detections (TP) of objects that are not annotated in the ground truth ("Bobbin" instances on the back row of the racks on the right side of Figure 8). Those instances can be added to the background class in the training set and reduce the accuracy of the classifier and the detection. The dataset is also unbalanced, and for some classes, such as "Cone" and "Truck", the training data is not sufficient. The Precision-Recall curve in Figure 6 underlines that, despite all these challenges, the overall accuracy of the baseline framework is promising. The resolution of the grid cells is an important factor in producing accurate detections. Figure 7 depicts the influence of this parameter on performance. As the resolution of the grid increases, the performance increases until it peaks at a grid resolution of 0.2 m³. Increasing the grid resolution further makes the feature blocks bigger and hence less detailed. Accordingly, the feature aggregation is
affected in a negative way, which decreases the performance. We compare the performance of our framework on the synthetic dataset with (Pepik et al., 2012), an extension of 2D deformable part models to 3D. Table 3 and Figure 6 show that our framework significantly outperforms this baseline (by 0.39 in F-score). The higher performance of the 3D-DPM method (which is not a CNN-based method) on the "Barrel" category explains the low performance of our method on classes with a low number of instances, on which the network has difficulty converging. We also compare with prior state-of-the-art methods on the SUN RGB-D dataset. To be able to compare with these methods, we use the mean Average Precision (mAP) metric. Notice that we use the same set of hyperparameters to train the object detection network on both datasets. Deep Sliding Shapes (DSS) (Song, Xiao, 2016) is a 3D CNN based method that uses RGB images together with 3D information. Cloud of Oriented Gradients (COG) (Ren, Sudderth, 2016) is a sliding window based detection method that produces HOG-like descriptors for 3D detectors. 2D-driven (Lahoud, Ghanem, 2017) and F-PointNet (Qi et al., 2018) use 2D detectors to reduce the search space for 3D detection. VoteNet (Qi et al., 2019) is the current state-of-the-art method, which uses Hough voting for 3D object detection using only point cloud information. As shown in Table 4, we achieve competitive results compared to the current best methods in the literature. Out of ten categories, we achieve the best performance in one category ("bed") and the second best performance in three other categories. Our method achieves the third best overall performance; however, notice that our method and VoteNet are the only two methods that use only geometric information, while the other methods use both RGB and geometry information to perform object detection. Figure 9 presents qualitative examples of our framework on the SUN RGB-D dataset. Despite the various challenges introduced in these scenes (e.g. cluttered, attached objects), our framework produces robust detection bounding boxes. For example, in the bottom right and bottom left of Figure 9, the chairs and the sofa are attached to the tables; however, the framework is able to distinguish between them and produce correct detection boxes. It is interesting to see that interconnected objects cause more trouble for the network in the synthetic dataset, as the density of the point clouds is much higher than in a real scene, so the framework fails to produce correct proposals and misses those objects. Similar to the synthetic dataset, where the framework was able to detect several similar attached objects in a scene (such as "Bobbin"s and "Boxes"), it also detects the vast majority of similar objects, such as chairs, in real scenes (top right of Figure 9). Like other methods, the proposed framework has some limitations, among them difficulties in segmenting very dense and attached objects and false hallucinations of occluded objects. Future work will target these issues to further increase the performance on the object detection task.

CONCLUSION
In this paper, a novel object detection framework was proposed with a focus on indoor factory environments. Unlike most recent state-of-the-art methods that use images and point clouds together, it only uses point cloud information. The features of the grid representation are computed and stored as a sparse representation. These features are restored and aggregated after local segmentation of the objects. A deep classifier trained with the generalized Fisher representation is employed to learn object models. The developed framework was evaluated on a dataset that simulates a factory of the future setting, as well as on the SUN RGB-D object detection benchmark dataset. Based on the evaluations, the proposed framework achieves competitive performance in the object recognition as well as the object detection task.

Figure 1 .
Figure 1. Architecture of the proposed method illustrating the object detection procedure. P = {Pi−m, ..., Pi−1, Pi} denotes the set of the current and last m point clouds. Similarly, N = {Ni−m, ..., Ni−1, Ni} denotes the translation and rotation parameters of the vehicle in Euclidean space, where each Ni = [Ri|Ti] consists of a three-by-three rotation matrix and a one-by-three translation vector.

Figure 2 .
Figure 2. Result of over-segmentation for generating local object patches before applying min-cut optimization to final object segmentation

Figure 3 .
Figure 3. A point cloud scan from the factory model in the synthetic dataset

Figure 5 .
Figure 5.The simulated dataset from the factory model

Figure 9 .
Figure 9. Examples of qualitative detection results on the SUN RGB-D dataset. In each set of images, from left to right: an RGB image from the test set, the 3D bounding boxes predicted by our framework, and the annotated ground truth.
To transform the point cloud captured at the current time from vehicle coordinates to global coordinates, we can apply the transformation Ri × Pi + Ti to the point cloud Pi. Moreover, we can combine several scans to create a dense point cloud, since 12 sensors are available in the recorded dataset. When the transformation parameters among the sensors are available, we can merge points from different sensors by aligning each source cloud to the target point cloud coordinates to construct the dense point cloud. This can be done efficiently with the Iterative Closest Point (ICP) algorithm (Rusinkiewicz, Levoy, 2001). The output is a set of tightly aligned point clouds. Finally, a combined dense point cloud is obtained knowing the correct transformation from ICP (Nc = [Rc|Tc]) to our target reference frame (the ego-vehicle) and the parameters of each scan Ni.
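The rigid transformation and scan merging described above can be sketched as follows (the ICP refinement itself is omitted; the transforms are assumed known):

```python
import numpy as np

def to_global(points, R, T):
    """Vehicle-to-global transform of one scan: Ri x Pi + Ti, applied row-wise."""
    return points @ R.T + T

def merge_scans(scans, poses):
    """Merge the K sensor scans into one dense cloud given per-scan (R, T);
    in the paper the alignment is refined with ICP, which is not shown here."""
    return np.vstack([to_global(P, R, T) for P, (R, T) in zip(scans, poses)])
```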

Table 1 .
Details of the annotated objects in the synthetic dataset

Table 2 .
Results of object recognition on the synthetic dataset using the trained deep classifier

Table 3 .
Results of the object detection on the synthetic dataset using the proposed framework. The initial instances in the background class are around 2000. With the background class, in total, there are 8 object classes in the dataset. The annotated objects consist of at least 2048 points. Every frame in the dataset is accompanied by an RGB image. Figure 5 depicts instances of the available objects in the dataset.

Table 4 .
Results of 3D object detection on the SUN RGB-D validation set