A Sparse Octree-Based CNN for Probabilistic Occupancy Prediction Applied to Next Best View Planning

This work proposes OcLe-CNN, a sparse octree-based Convolutional Neural Network (CNN) for 3D occupancy prediction. Occupancy prediction is the inference of the occupancy probability of unobserved space. OcLe-CNN processes an octree-like data structure, reducing memory usage, as resources are allocated predominantly in the most detail-rich regions of the environment. Moreover, a novel loss function is introduced which produces smaller octrees than the state-of-the-art Structure and Task loss. The proposed CNN was integrated with a probabilistic robot Next Best View (NBV) planner, where the octree-like data structure also speeds up the ray casting stage. The integration resulted in a lower total computation time. The method was implemented for both quadtrees and octrees, and it was validated on 2D and 3D datasets as well as on a real robot manipulator setup.


I. INTRODUCTION
IN SPARSE Convolutional Neural Networks (CNNs), tensors are represented as sets of coordinate-value pairs, instead of the dense multi-dimensional matrices of standard CNNs. Sparse CNNs promise better performance than standard dense convolution, as they can focus resources where details are present. In this letter, a novel CNN is proposed for occupancy prediction: OcLe-CNN (Octree Level CNN). Occupancy prediction is the prediction of the occupancy probability of the unknown regions of space. OcLe-CNN operates on a ternary representation with empty, occupied and unknown regions. OcLe-CNN works on a sparse octree data structure instead of a dense voxel grid, which is an efficient representation as the voxel grid typically contains large regions of unknown or empty space. OcLe-CNN processes the octree using sparse CNN operators and predicts an output octree, which may have a different structure than the input octree. Occupancy probability is predicted for each leaf node of the output octree (Fig. 1).

Fig. 1. OcLe-CNN receives as input a partially complete model of the environment represented by an octree with occupied (red voxels in the top left image), unknown (black voxels, top left image) and empty voxels (not displayed for clarity), and predicts a new octree-based representation where each unknown voxel contains an occupancy probability (with colors ranging from black to blue in the bottom image). Ground truth is displayed in the top right image. Unknown voxels are bordered with a different color depending on size. In the bottom image, unknown voxels with probability lower than 0.1 are not displayed for clarity.
OcLe-CNN follows the standard approach in the literature where octrees are processed using the residual encoder-decoder family of architectures. However, in OcLe-CNN input layers are present at all resolution levels of the encoder module and output layers are present at all levels of the decoder module. Hence, feature vectors can be input and output in leaf nodes at all resolutions, so that large regions with the same value can be processed efficiently. Moreover, a novel loss function is introduced, called Multi-scale loss, which simultaneously optimizes the octree structure and the output values. This approach is fundamentally different from the state-of-the-art, where the loss function is the sum of a Structure loss, which optimizes the output octree structure, and a Task loss, which optimizes the values in the output leaf nodes [1]. The proposed Multi-scale loss produces more compressed octrees without significant loss in quality.
OcLe-CNN was implemented for both quadtrees (2D) and octrees (3D). For simplicity, 3D terminology (octree and voxel grid) will be used, as the same concepts extend to 2D. OcLe-CNN was integrated with a probabilistic robot Next Best View (NBV) planner [2]. NBV planning computes the optimal pose where the sensor should be placed to gather the most new information. OcLe-CNN enhances probabilistic NBV planning by predicting the occupancy probability of the unknown regions. The NBV method was also extended with a probabilistic octree-based ray casting approach, which is known [3] to be more efficient than ray casting on a voxel grid. To the best of our knowledge, this is the first work that uses spatially sparse octree-based CNNs for NBV planning. The method was also validated on a real robot manipulator with an eye-in-hand 3D camera. As a further contribution, the source code is made publicly available.
In summary, the contributions of this letter are: (i) the OcLe-CNN sparse network and the Multi-scale loss function, (ii) a public implementation of OcLe-CNN, and (iii) the integration of the method in a NBV approach and its evaluation in simulation and in a real robot setup. This letter is organized as follows. Section II introduces the related work. The OcLe-CNN network is presented in Section III. Experiments and results are reported in Section IV. Finally, Section V concludes the letter.

II. RELATED WORK
A. Sparse and Octree-Based CNN
Sparse CNNs were introduced to improve the performance of traditional dense CNNs, exploiting the fact that neural activation is often sparse. In this work, spatially sparse approaches are investigated, where sparsity arises from the spatial arrangement of the data, as opposed to sparse networks resulting, for example, from pruning [4]. Indeed, dense approaches to convolution on 3D voxel grids are often inefficient, as 3D data (e.g., point clouds) is sparse in nature [5]. Applications of spatially sparse 3D convolution unrelated to NBV include semantic segmentation [5], generative scene completion [6] and scene completion using transformers [7].
Octree-based CNN approaches are a natural consequence of spatially sparse CNNs, as most networks include an encoder module which iteratively halves the resolution and a decoder module which iteratively doubles the resolution. Several methods such as OctNet [8], OctreeNet [9], OcTr [10], OctFormer [11] and the work by Xiang et al. [12] use octree-based CNNs for point cloud analysis [9], object detection [10] and semantic segmentation [8], [11], [12]. In these approaches, the network produces an output octree with the same structure as the input octree. Conversely, our approach requires an output octree structure different from the input, as in the output details are present also in unknown regions of space, which have uniform value in the input.
The proposed OcLe-CNN network was inspired by RocNet [13] and O-CNN [1], [14], which support generation of the octree structure, 3D shape completion and shape generation, but with two main differences. First, the proposed method predicts an occupancy probability value for all leaf nodes. Conversely, the two approaches [1], [13] predict output values only for maximum-resolution leaf nodes, while Adaptive O-CNN [14] predicts output values at all resolutions, but it focuses on the prediction of polygon mesh surfaces. The second difference is that the Task loss and the Structure loss were replaced with an improved Multi-scale loss function.

B. Sparse CNN Implementation
Support for sparse CNNs in machine learning frameworks such as PyTorch and TensorFlow is limited. Hence, several implementations have been proposed to accelerate sparse CNNs on GPU [1], [15], [16], [17]. Software packages such as spconv [15], the Minkowski Engine [16] and TorchSparse [17] are aimed at spatially sparse convolution in general. Conversely, O-CNN [1] is specifically designed to accelerate octree-based convolution, but it represents octrees as binary blobs, which hinders the integration into a NBV planner. Thus, in this work the Minkowski Engine was adopted.

C. Next Best View Planning
Many approaches have been proposed which integrate learning in NBV planning. Mendoza et al. [18] proposed a supervised learning method where NBV is treated as a classification problem: the network outputs a score for each view candidate and the maximum is selected. A similar method is SCVP [19], where multiple viewpoints are selected at once. The disadvantage of both these approaches is that the set of candidate viewpoints must be fixed at training time. Wu et al. [20] used a CNN to predict occupied voxels for plant phenotyping but, unlike our approach, the method was based on point cloud data and it did not track unknown voxels. In previous work [2], [21] we proposed a hybrid probabilistic NBV approach which used a CNN to estimate the occupancy probability based on learned patterns in the environment. However, it had high memory usage, as it operated on a dense voxel grid. A few exploration frameworks for aerial vehicles, such as SEER [22] and SC-Explorer [23], use learning to predict occupancy and semantic information. These frameworks use voxel grids instead of octrees for NBV planning and they were evaluated at a low resolution (10 cm [22] and 8 cm [23]).
Image generation approaches based on implicit neural representations (neural radiance fields) have also been proposed [24], [25], [26]. However, these approaches require partial re-training of the network after each view and, therefore, require a lot of computational resources. On current hardware, the computation time is in the order of minutes [26] and the sensor resolution is often set to low values (e.g., 200 × 200 pixels [25]).
Traditional probabilistic approaches for NBV planning [27], [28] use ray casting to estimate how much unknown volume will be traversed by each view ray. A few NBV approaches [3], [29], [30], [31] proposed an octree-based data structure, so that ray casting can efficiently skip large uniform volumes. In particular, Vasquez-Gomez et al. [3] first introduced hierarchical ray tracing in an octree and observed a significant speedup. However, these approaches do not use learning for prediction.

III. METHOD
The proposed OcLe-CNN approach is designed to operate on a partially-complete volumetric representation such as the one used in many NBV exploration methods [2], [3], [23]. The 3D representation is ternary: each region of space may be occupied, if the surface of an object has been observed by the sensor; empty, if the sensor view rays traversed the region unobstructed; and unknown, if unobserved. As the ternary representation typically contains large contiguous regions of unknown or empty space, OcLe-CNN operates on an octree-based data structure T_C. In particular, an octree-grid is used, i.e. a voxel grid of octrees (also called a semioctree [31]). Given the maximum octree depth d_max and the leaf size c_min at maximum depth, each octree in the grid represents a cubical region of space with side c_min · 2^(d_max−1). Each octree is initialized to a single node. If the region of space inside a node has a single value, the node is set to a leaf node with that value. Otherwise, the node is recursively split into 8 children, up to the maximum depth d_max, where each leaf has size c_min. The state of a leaf node in T_C is a two-element vector, with value [0, 0] for unknown, [1, 0] for empty and [0, 1] for occupied. The output of OcLe-CNN is a new octree-grid T_P, where each leaf node contains an occupancy probability.
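As a concrete illustration, the recursive splitting described above can be sketched as follows. This is a toy 2D quadtree built in pure Python; the names and grid values are illustrative, not the paper's implementation (the 3D case splits into 8 children instead of 4).

```python
# Illustrative sketch (not the authors' code): building one ternary quadtree
# from a 2D grid, splitting only where cell values disagree.
UNKNOWN, EMPTY, OCCUPIED = 0, 1, 2

def build_tree(grid, x, y, size):
    """Return a leaf value, or a dict of 4 children if the region is mixed."""
    vals = {grid[j][i] for j in range(y, y + size) for i in range(x, x + size)}
    if len(vals) == 1:                        # uniform region -> single leaf
        return vals.pop()
    half = size // 2                          # mixed -> split into children
    return {(dx, dy): build_tree(grid, x + dx * half, y + dy * half, half)
            for dx in (0, 1) for dy in (0, 1)}

grid = [[EMPTY] * 4 for _ in range(4)]
grid[3][3] = OCCUPIED                         # one occupied cell, rest empty
tree = build_tree(grid, 0, 0, 4)              # root splits; 3 leaves stay coarse
```

Note how the three quadrants with uniform value remain single coarse leaves, while only the mixed quadrant is refined down to cell size.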
OcLe-CNN follows the standard residual encoder-decoder family of architectures, with N_conv encoder levels which halve the resolution and N_conv decoder levels which double it. Standard encoder-decoder architectures require the input to be provided only at the first encoder level, at the maximum resolution. Instead, in the octree-grid T_C different parts of the environment are already at different resolutions. Hence, to avoid upsampling, each part is provided as input to the OcLe-CNN encoder level which operates at the corresponding resolution. Moreover, OcLe-CNN is designed to be implemented using standard sparse convolution operators on coordinate-format sparse tensors. Coordinate-format sparse tensors are sets of coordinate-value pairs {(p_i, v_i)}, where the pair (p_i, v_i) indicates that the cell at integer coordinate p_i contains the value v_i. Hence, in our approach the octree-grid T_C is represented using a set of octree levels L_C = {L_C,l}, i.e. d_max sparse tensors L_C,l, one for each level l. Each sparse tensor L_C,l contains one element for each leaf node of T_C at that level (Fig. 2). As only leaf nodes are present in the octree levels, the regions of space represented by the levels are disjoint. We note that, as each octree level L_C,l is input at the l-th encoder level, N_conv must be at least the octree depth d_max.
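The octree-to-levels flattening described above can be sketched as follows: each leaf node becomes one (coordinate, value) pair in the sparse tensor of its level. The nested-dict tree is an assumed toy representation, not the paper's data structure.

```python
# Sketch: flatten a quadtree into per-level coordinate-value pairs (a stand-in
# for coordinate-format sparse tensors). Internal nodes are dicts keyed by
# child offset; leaves are plain values. Leaves at different levels cover
# disjoint regions of space, as noted in the text.
def to_levels(node, coord=(0, 0), level=0, out=None):
    out = out if out is not None else {}
    if isinstance(node, dict):                      # internal node: recurse
        for (dx, dy), child in node.items():
            cx, cy = coord[0] * 2 + dx, coord[1] * 2 + dy
            to_levels(child, (cx, cy), level + 1, out)
    else:                                           # leaf: emit one COO entry
        out.setdefault(level, []).append((coord, node))
    return out

tree = {(0, 0): "empty", (0, 1): "empty", (1, 0): "empty",
        (1, 1): {(0, 0): "empty", (0, 1): "unknown",
                 (1, 0): "unknown", (1, 1): "occupied"}}
levels = to_levels(tree)    # {1: three coarse leaves, 2: four fine leaves}
```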
Similarly, each decoder level of OcLe-CNN outputs one of the octree levels L_P,l of the output occupancy probability octree T_P. Each decoder level classifies each input coordinate-value pair to determine whether it represents a uniform region of space, and therefore must be output at this decoder level, or whether it must be upsampled and forwarded to the next decoder level. Since each input coordinate is either output at this level or forwarded, the octree levels L_P,l represent disjoint regions of space, and they are re-assembled bottom-up into the octree-grid T_P. More details about the OcLe-CNN architecture are given in Section III-A. OcLe-CNN is also trained using a Multi-scale loss function which penalizes the creation of high-resolution octree nodes (Section III-B).

Fig. 2. The octree-grid T_C (left) is split into octree levels, with increasing resolution (top to bottom). Each octree level contains leaf nodes which can be occupied (red dots), empty (white dots) or unknown (black dots). OcLe-CNN produces a new set of octree levels (right) where the former unknown leaves contain occupancy probabilities (dots with multiple shades of blue). Some unknown leaves were split by the network to add details.

Fig. 3. An OcLe-CNN with 3 encoder levels E_l (left) and 3 decoder levels F_l (right). Each red rectangular contour encloses a level.

A. OcLe-CNN Architecture
The architecture of an example OcLe-CNN is displayed in detail in Fig. 3. Each encoder level E_l receives as input the octree level L_C,l and, if present, the output X_{l+1} of the higher-resolution encoder level. The sparse tensor X_{l+1} is first processed by a conv block C_l, which performs convolution operations. Two possible conv blocks have been evaluated in the experiments: a single convolution layer and a sequence of Resblocks. A downsampling layer R_{d,l} halves the resolution by applying stride-2 convolution. The octree level L_C,l is processed by an input convolution layer I_l and then merged with the output of R_{d,l}. As the octree level L_C,l and all the previous octree levels represent disjoint regions of space, the merge operation is the union of the coordinate-value pairs which form the sparse tensors (∪ in Fig. 3). In summary, each encoder level E_l executes:

X_l = I_l(L_C,l) ∪ R_{d,l}(C_l(X_{l+1}))

Each decoder level F_l receives as input the output X_{l−1} of the lower-resolution decoder level. As in standard residual encoder-decoder architectures, X_{l−1} is first added to the skip connection X_l and processed by a conv block D_l, similar to the one in the encoder levels. Then, OcLe-CNN determines whether the value of each coordinate-value pair represents a uniform region of space by using a binary classifier B_l. The classifier B_l outputs a sparse Boolean mask B_l with a true value if the corresponding element should be split into higher-resolution voxels and forwarded to the higher-resolution level, and a false value if it represents a uniform region of space and therefore should be output at this level. If the region is determined to have a uniform output value, the coordinate-value pair is processed by an output block O_l and becomes part of the output octree level L_P,l. Otherwise, the output is upsampled and provided as the input X_l to the higher-resolution decoder level F_{l+1} for further processing.
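The union merge of disjoint coordinate sets in the encoder can be sketched as follows, assuming sparse tensors are dicts mapping integer coordinates to scalar features. The learned convolutions I_l and C_l are replaced by identities here; only the stride-2 downsampling and the union are modeled.

```python
# Toy sketch of one encoder level (not the paper's implementation): merge the
# level's own input with the downsampled features from the finer level.
def downsample(x):
    """Stride-2 'convolution' stand-in: average features sharing a parent."""
    out, counts = {}, {}
    for (i, j), v in x.items():
        p = (i // 2, j // 2)
        out[p] = out.get(p, 0.0) + v
        counts[p] = counts.get(p, 0) + 1
    return {p: out[p] / counts[p] for p in out}

def encoder_level(level_input, finer_features):
    # I_l and C_l are learned convolutions in OcLe-CNN; identity here.
    down = downsample(finer_features)
    assert not set(level_input) & set(down)   # octree levels are disjoint
    return {**level_input, **down}            # union of coordinate-value pairs

X2 = {(2, 2): 1.0, (2, 3): 0.0, (3, 3): 1.0}     # finer-level features
L1 = {(0, 0): 0.0, (0, 1): 0.0, (1, 0): 0.0}     # leaves input at this level
X1 = encoder_level(L1, X2)                        # 4 coordinates, no overlap
```

Because the leaves of different octree levels cover disjoint regions, the union never has to resolve conflicting values at the same coordinate.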
The binary classifier B_l is composed of two convolution layers and a softmax layer. The convolution layers have kernel size 1, i.e. they process each feature vector of the input sparse tensor independently. The first layer outputs 16 channels, while the second layer outputs two channels, i.e. two sparse tensors [β_{l,0}, β_{l,1}]. The two sparse tensors are converted into the Boolean sparse tensor B_l = (β_{l,0} < β_{l,1}) using the element-wise "less than" operator.
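A minimal sketch of this split/output classifier is shown below in pure Python, with made-up weights, two hidden channels instead of 16, and no intermediate activation, purely for illustration. Kernel-size-1 layers act on each coordinate's feature vector independently, and the Boolean mask is the element-wise comparison β_0 < β_1.

```python
import math

# Toy stand-in for the binary classifier B_l (assumed weights, not trained).
def classifier(features, w1, w2):
    mask = {}
    for coord, x in features.items():
        # Two kernel-size-1 "convolutions": per-element linear maps.
        h = [sum(wi * xi for wi, xi in zip(row, x)) for row in w1]
        b0, b1 = (sum(wi * hi for wi, hi in zip(row, h)) for row in w2)
        e0, e1 = math.exp(b0), math.exp(b1)
        beta0, beta1 = e0 / (e0 + e1), e1 / (e0 + e1)   # softmax
        mask[coord] = beta0 < beta1    # True -> split and forward upward
    return mask

feats = {(0, 0): [1.0, 0.0], (1, 1): [0.0, 1.0]}
w1 = [[1.0, 0.0], [0.0, 1.0]]          # first layer (2 channels for brevity)
w2 = [[1.0, 0.0], [0.0, 1.0]]          # second layer -> two logits
mask = classifier(feats, w1, w2)       # one Boolean decision per coordinate
```

Since softmax is monotonic, comparing β_0 < β_1 is equivalent to comparing the two logits directly; the probabilities β_{l,0} are nonetheless needed later to weight the Multi-scale loss.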
The inputs of all decoder levels are also filtered by observed masks L_K,l, to prevent OcLe-CNN from processing regions which are already known to be completely occupied or empty. An observed mask L_K,l is a Boolean sparse tensor which contains an element at position p_i only if the corresponding volume is completely empty or occupied. The observed mask is used to filter the input of each conv block, so that all elements which are in the mask are removed. In Fig. 3 this filtering operation is represented using the ∧ symbol. Therefore, OcLe-CNN only processes and outputs values in the unknown regions of space.
In summary, each decoder level F_l executes operations of the form:

Y_l = D_l((X_{l−1} + X_l) ∧ L_K,l)
L_P,l = O_l(Y_l ∧ ¬B_l)
X_l = R_{u,l}(Y_l ∧ B_l)

where D_l is a conv block, the operator ∧ uses the second operand as a mask for the first, and R_{u,l} is an upsampling block that doubles the resolution.

B. Multi-Scale Loss Computation
In order to reduce the output size, it is desirable that OcLe-CNN produce large, low-resolution leaf nodes in the regions of space where limited information is available and only a coarse estimation of occupancy is possible, i.e. far from the regions of space observed by the sensor. However, this does not occur using the standard Structure and Task loss, because the Structure loss is calculated by comparing against the structure of the ground truth octree, which contains high-resolution details as the whole environment is known. Hence, the network is biased towards the prediction of small nodes in large unknown regions (Fig. 4).
The proposed Multi-scale loss involves the computation of three possible octree levels (L_P,l, L_P,l^next and L_P,l^prev) for each decoder level F_l. All three octree levels represent the same region of space, i.e. the one which is sent to the output at level l. However, octree level L_P,l is predicted as normal, L_P,l^next is predicted after upsampling and then using decoder level F_{l+1}, and L_P,l^prev is predicted after downsampling and then using decoder level F_{l−1}. Hence, L_P,l^next and L_P,l^prev are a simulated result of what would have happened if the higher-resolution decoder level and the lower-resolution decoder level, respectively, had been used to predict the same region of space. If the network is able to make a detailed prediction in the region, then it can be expected that L_P,l^next has a lower error than the other two, as it has greater resolution. Conversely, if the network makes an incorrect prediction, or if the prediction results in a uniform value instead of details, then all three levels have similar error. Hence, to bias the network towards lower resolution in this case, the proposed loss weighs the error more for L_P,l^next and less for L_P,l^prev. The error of each output value is also weighted separately by the corresponding output value β_{l,0} of the binary classifiers B_l. Hence, the binary classifiers learn to predict the regions using the lower-resolution leaf nodes when the error is similar.
In practice, given the ground truth voxel grid G_T, the Multi-scale loss is computed as follows (Fig. 5). Three different predictions are available for the same region at each decoder level: L_P,l for the current level, L_P,l^next for the higher-resolution level, and L_P,l^prev for the lower-resolution level. Each triplet of integer coordinates p_i of level l corresponds to c_l = 8^(d_max−l−1) voxels of G_T. Hence, the squared error v_i^e of the output occupancy probability v_i^p at index p_i is computed as:

v_i^e = E_j[(G_T,j(p_i) − v_i^p)^2] = E_j[G_T,j^2(p_i)] − 2 v_i^p E_j[G_T,j(p_i)] + (v_i^p)^2    (6)

where E_j is an operator which averages over all ground truth voxels G_T,j corresponding to index p_i. To compute (6) efficiently, the mean occupancy probability E_j[G_T,j(p_i)] and the mean squared occupancy probability E_j[G_T,j^2(p_i)] were pre-computed for each 3D index p_i and level l by downsampling the ground truth voxel grid G_T.
By applying (6), three sparse tensors L_E,l^next, L_E,l and L_E,l^prev are computed, which contain one MSE element v_i^e for each of the values v_i^p in L_P,l^next, L_P,l and L_P,l^prev, respectively. Each value in the sparse tensors is increased by a small value κ, to ensure that output leaf nodes have a cost even if the error is zero. Then, to bias the network towards lower-resolution leaf nodes, the error in L_E,l^prev is made less relevant by multiplying it by α, where α is a bias hyperparameter (0 < α < 1). Similarly, the error L_E,l^next is divided by α. Moreover, L_E,l^next, L_E,l and L_E,l^prev are weighted by the binary classifier outputs β_{l,0} through element-wise multiplication with weighting sparse tensors built from them, where "up" is the nearest-neighbor upsampling operator and λ is a leak hyperparameter ensuring that, even when the β_{l,0} are equal to zero or one, a nonzero gradient is able to flow back in back-propagation. Finally, the Multi-scale loss is the sum of all elements of the error sparse tensors for every level l, each multiplied by α^{−l} to further bias towards lower-resolution levels.
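The efficiency trick behind (6) can be checked numerically: the mean squared error over the ground-truth voxels behind one coarse index can be recovered from the two precomputed downsampled quantities E[G] and E[G²], without touching the individual voxels at loss time. The values below are arbitrary.

```python
# Numeric check of the expansion E[(G - v)^2] = E[G^2] - 2*v*E[G] + v^2.
gt = [0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0]   # voxels behind one index p_i
v = 0.3                                          # predicted occupancy there

# Direct per-voxel computation (what (6) avoids doing at every loss step).
direct = sum((g - v) ** 2 for g in gt) / len(gt)

# The two quantities pre-computed once by downsampling the ground truth grid.
mean_g = sum(gt) / len(gt)                       # E_j[G_T,j(p_i)]
mean_g2 = sum(g * g for g in gt) / len(gt)       # E_j[G_T,j^2(p_i)]
expanded = mean_g2 - 2 * v * mean_g + v * v      # matches `direct`
```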

IV. EXPERIMENTS AND RESULTS
A. OcLe-CNN Implementation and Training
The OcLe-CNN architecture was used to implement two different networks, OcLe-EncDec and OcLe-Resnet. In OcLe-EncDec, each conv block C_l is a single convolution layer as in our previous work [2]. Conversely, in OcLe-Resnet, each conv block is composed of 3 resblocks as in O-CNN [1]. In both cases, the downsampling layers R_{d,l} and the upsampling layers R_{u,l} were implemented using a single convolution layer with stride 2, which was transposed in the upsampling layers. The number of encoder levels N_conv was set to 6. The first convolution block outputs 4 channels in 2D environments and 2 channels in 3D environments. Each subsequent downsampling layer doubles the number of channels up to a maximum, which was set to 32 for OcLe-EncDec and to 16 for OcLe-Resnet to prevent overfitting. Similarly, each upsampling layer halves the number of channels. All layers have the leaky ReLU activation function, with the exception of the final layer, which uses the sigmoid activation function to constrain the output probability between 0 and 1. OcLe-CNN was implemented in Python on top of the Minkowski Engine [16], as explained in Section II-B. An additional implementation was also developed where custom sparse convolution operators were built from scratch, using the PyTorch built-in sparse tensors. As baselines, we also implemented an encoder-decoder network as in [2] (EncDec) and a standard Resnet. These networks predict a dense grid using standard dense convolution operators, and they were trained using the standard MSE loss.
Each CNN implementation was trained and tested in 2D and 3D environments. In particular, the network was trained in 2D on the Inria Aerial Image Labeling dataset [32], which contains 180 black-and-white aerial images, where buildings are labeled in white. Images were downsampled to 1250 × 1250. For 3D tests, the network was trained on the synthetic tabletop 3DT dataset [21], which is composed of 180 volumetric grids with resolution 128 × 128 × 96 and cell size 1.2 cm, generated by randomly placing objects on a planar surface. Each environment was partially complete due to the simulation of random views (between 200 and 400 in 2D, and between 2 and 40 in 3D). For each dataset, 120 samples were used for training and 60 for testing. In the 3DT dataset, different objects were used to generate the training and the test samples. Training was carried out using the ADAM optimizer for 360 epochs, with learning rate 0.0001 for EncDec and 0.0005 for Resnet. The Multi-scale loss was configured with λ = 0.5, α = 0.9 and κ = 0.01. When training with the Multi-scale loss, pre-trained weights from the Structure and Task loss were used.

B. OcLe-CNN Evaluation
Table I shows the average RMSE, average prediction time, average number of output octree leaf nodes and memory usage of the networks, when evaluated on the test set. Memory was determined with a binary search by executing the network on each sample while artificially limiting the amount of available memory, up to a precision of 2 MB. Memory usage is reported minus the amount that is allocated by PyTorch at startup regardless of the network. OcLe-EncDec and OcLe-Resnet achieve a lower number of octree nodes when trained with the proposed Multi-scale loss compared to the Structure and Task (Struct+Task) loss (also visible by comparing Figs. 1 and 6). The low number of octree nodes results in a lower memory usage and a lower prediction time, as shown in Table I. There is a trade-off between our sparse PyTorch implementation and the Minkowski Engine, in that the latter is faster, but it requires a very large amount of memory due to its complex caching system. The baseline implementations based on dense operators (i.e. EncDec, Resnet) are faster than both sparse PyTorch and the Minkowski Engine, likely because dense PyTorch is a more mature and optimized implementation. Nonetheless, sparse PyTorch achieves the lowest memory usage.
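The binary-search measurement described above can be sketched as follows; `runs_within` is a hypothetical stand-in for executing the network under an artificial memory cap, not an API from the paper.

```python
# Sketch of binary search for the smallest memory limit at which a run still
# succeeds, to within a fixed precision (2 MB in the paper's measurements).
def min_memory(runs_within, lo_mb=0, hi_mb=4096, precision_mb=2):
    """Assumes success is monotone in the limit: more memory never hurts."""
    while hi_mb - lo_mb > precision_mb:
        mid = (lo_mb + hi_mb) // 2
        if runs_within(mid):
            hi_mb = mid          # succeeded: try a tighter limit
        else:
            lo_mb = mid          # failed: the network needs more memory
    return hi_mb

needed = 700                      # pretend the network needs 700 MB
estimate = min_memory(lambda limit: limit >= needed)
```

Each probe is a full network execution, so the logarithmic number of probes is what makes the measurement practical.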
In terms of RMSE, the dense implementations usually slightly outperform the sparse networks, which is expected as sparsity introduces an approximation. However, it is also noticeable that the networks trained with the Multi-scale loss outperform the corresponding networks trained with the Structure and Task loss. Indeed, by comparing the 2D outputs (Fig. 7), there is also a qualitative difference in that the standard Structure and Task loss generates spurious details in the large unknown regions. Conversely, the output with the Multi-scale loss in the same regions is more similar to the output of a standard dense network (e.g., the Resnet in Fig. 7), albeit with lower resolution.

C. Simulated NBV Experiments
OcLe-CNN was integrated with the NBV system of our previous work on probabilistic NBV [21], which was extended with octree-based ray casting. The 3D reconstruction system represents the environment using a dense grid of cells (a square grid in 2D and a voxel grid in 3D), which was converted into a ternary quadtree/octree T_C for view evaluation. View evaluation operates in two steps: a prediction step and a ray casting step. In the prediction step, the occupancy probability is predicted by OcLe-CNN, which processes T_C and produces the new octree-grid T_P. Then, in the ray casting step, the probabilistic NBV method carries out octree-based ray casting on T_P to estimate the expected information gain of each view, similarly to [3] but using the full sensor resolution. The system is implemented in C++, with OpenCL for GPU acceleration, and it ran on an Intel i9-10900 CPU @ 2.80 GHz with 32 GB RAM and a GeForce RTX 3090 with 24 GB RAM. Communication between the C++ and Python code was achieved using the ROS (Robot Operating System) framework. The source code is available at https://rimlab.ce.unipr.it/RMonica.html, under nbv_3d_cnn_octree.
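The benefit of casting rays through a tree instead of a dense grid can be sketched in one dimension for brevity. This toy structure and the scoring (a plain sum of per-cell probabilities, not the paper's information-gain measure) are assumptions for illustration: a uniform leaf covering many cells is handled in a single step, so the ray skips large regions.

```python
# Sketch of hierarchical (octree-style) ray casting, reduced to 1D.
def make_tree(cells, lo, hi):
    vals = set(cells[lo:hi])
    if len(vals) == 1:
        return (lo, hi, cells[lo])               # uniform leaf over [lo, hi)
    mid = (lo + hi) // 2
    return (lo, hi, make_tree(cells, lo, mid), make_tree(cells, mid, hi))

def expected_gain(node, i, steps):
    """Accumulate a score along the ray from cell i; count traversal steps."""
    lo, hi, *rest = node
    if len(rest) == 1:                           # leaf: whole span in one step
        steps[0] += 1
        return rest[0] * (hi - i)
    left, right = rest
    gain = expected_gain(left, i, steps) if i < left[1] else 0.0
    return gain + expected_gain(right, max(i, right[0]), steps)

cells = [0.0, 0.0, 0.0, 0.0, 0.5, 0.5, 0.9, 0.9]  # per-cell occupancy prob.
tree = make_tree(cells, 0, 8)
steps = [0]
gain = expected_gain(tree, 0, steps)   # 3 leaf visits instead of 8 cell visits
```

Here the four leading zero-probability cells collapse into one leaf, so the ray traverses three nodes rather than eight cells; in 3D the savings grow with the size of the uniform regions.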
Experiments were carried out in simulation on 40 2D environments and 40 3D environments. The 2D environments were generated from the 2D test set by cropping the initial 1250 × 1250 resolution to 400 × 400 at random coordinates. The 2D virtual sensor was configured with a 128-cell maximum range, a 15-cell minimum range, and a resolution of 128 pixels. The 3D environments were synthetic tabletop environments generated using the same objects as the 3DT test dataset, with resolution 256 × 256 × 192. The sensor maximum range in 3D was 256 cells, the minimum range was 20 cells, and the resolution was 640 × 480. During initialization, all cells in the reconstruction were set to the unknown value, and a single random view was simulated. At each iteration, a fixed number of random views were sampled with origin in the currently known empty space. The number of random views was 100000 in 2D environments and 1000 in 3D environments. Each sampled view was then evaluated using the NBV approach. Then, the reconstruction was updated by simulating an observation from the view with the highest information gain. Each experiment was executed for 100 iterations in 2D and 40 iterations in 3D, or until the NBV approach predicted no information gain. Multiple approaches were evaluated, depending on whether the standard dense EncDec network or the proposed OcLe-EncDec is used for prediction, and whether ray casting is carried out on a dense grid or an octree. In particular, four methods were compared: the baseline EncDec-Grid [2], which is based on standard ray casting on the dense grid predicted by the dense EncDec; EncDec-Octree, where the dense grid was converted into an octree and octree-based ray casting was used; OcLe-Octree (Struct+Task), where OcLe-EncDec was used for prediction and an octree for ray casting; and OcLe-Octree (Multi-scale), where the proposed Multi-scale loss was also used.
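The sample-score-select step above can be sketched as follows; `info_gain` is a hypothetical stand-in for the ray-casting-based gain estimate, and the candidate poses are fixed 2D tuples purely for illustration.

```python
# Minimal sketch of the view-selection step: score every sampled candidate
# view by its expected information gain and select the best one as the NBV.
def next_best_view(candidate_views, info_gain):
    """Return (gain, view) for the highest-scoring candidate."""
    scored = [(info_gain(v), v) for v in candidate_views]
    return max(scored)  # highest expected information gain wins

# In the experiments, candidates are sampled with origin in the currently
# known empty space; here they are three fixed poses with a toy score.
views = [(0.0, 1.0), (2.0, 3.0), (1.0, 1.0)]
gain, view = next_best_view(views, lambda v: v[0] + v[1])
```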
The percentage of unknown cells remaining after each simulated view is shown in Fig. 8. With the exception of the Random method, which blindly selects a random view, the evaluated methods display similar numbers of unknown cells. Fig. 8 also shows the RMSE of the predicted occupancy probability. The RMSE always decreases with the number of views, as the prediction is based on a reconstruction which becomes more complete. The RMSE decreases more quickly in the first views, which observe larger regions of space. In the first views, the RMSE is the highest when using OcLe-Octree trained with the Structure and Task loss. Conversely, OcLe-Octree with the Multi-scale loss shows an RMSE which is closer to that of the dense voxel-grid-based methods.
Table II shows the average total NBV computation time for each method and the number of octree nodes during ray casting. The computation time is split into network prediction time, ray casting time, and other processing time, such as voxel-grid-to-octree conversions and data transmission through ROS. The EncDec-Octree method reduces the ray casting time with respect to EncDec-Grid by about 47% in 2D and 32% in 3D. Hence, we confirm that octree-based ray casting is faster than voxel-grid-based ray casting [3]. When using OcLe-Octree, the ray casting time is further decreased by about 25% in 2D and 4% in 3D. This result can be explained by the lower number of nodes in the octree, as OcLe-CNN outputs an octree where nodes with similar values are already merged into larger leaf nodes. The other processing time is lower for OcLe-Octree than for EncDec-Octree, because in OcLe-Octree the network directly predicts an octree, so no conversion from a voxel grid is necessary. As expected from Section IV-B, network prediction in OcLe-Octree is slower than in EncDec-Grid and EncDec-Octree, as OcLe-Octree uses a sparse CNN. Nonetheless, for 3D environments, the ray casting and other processing times are more significant than the prediction time, hence the proposed OcLe-Octree method is faster overall.

D. Experiments With a Real Robot
The method was tested in tabletop 3D reconstruction experiments using a COMAU Smart Six robot manipulator with an eye-in-hand Orbbec Astra-S camera. KinectFusion was used as the volumetric 3D representation. The goal of the experiment was the 3D reconstruction of an unknown volume of 1.88 × 1.07 × 0.5 meters which encompassed the objects on top of the table. The table border is assumed already known from an initial scan operation. At each iteration, a volume of 2.2 × 1.27 × 1.3 meters (187 × 108 × 110 cells) on the table top was converted into an octree and the proposed method was used to compute the NBV. Motion planning was performed using MoveIt. The robot then moved the camera to the selected NBV and the 3D representation was updated.
Two experiments were carried out to compare the EncDec-Octree method and the proposed OcLe-Octree method trained with the Multi-scale loss. The experiments were stopped after 10 iterations each. Fig. 9 shows the predicted occupancy probability octree and the corresponding NBV selected by the proposed method for some iterations. The average number of octree nodes using the OcLe-Octree method was about 150 k, while that of EncDec-Octree was about 299 k. Hence, OcLe-Octree resulted in a lower memory usage. The average NBV ray casting time for OcLe-Octree was 3.93 seconds, slightly lower than that of EncDec-Octree (4.00 s). The experiments are also shown in the attached multimedia material.

V. CONCLUSION
This work presented a sparse octree-based CNN for occupancy prediction. The octree-based CNN resulted in a reduced memory usage compared to dense CNN approaches. A Multi-scale loss was also proposed to further reduce the output octree size. The CNN was integrated into a probabilistic NBV approach, which was tested in 2D and 3D simulated experiments and in a real robot setup. In 3D, the integration resulted in a lower total computation time. Future work will investigate faster sparse CNN implementations.

Fig. 4. Example of a 2D input octree-grid (quadtree) with empty (white), occupied (red) and unknown (black) leaf nodes (left). Nodes are bordered with a different color depending on size. Example output of OcLe-CNN trained with the standard Structure and Task loss (center) and with the proposed Multi-scale loss (right), where unknown leaf nodes have been filled with blue according to occupancy probability. With the proposed Multi-scale loss, the network predicts larger nodes in the unknown regions.

Fig. 5. Decoder level l, augmented during training with the loss computation.

Fig. 6. An example 3D output of OcLe-EncDec trained with the Structure and Task loss. A larger number of high-resolution leaf nodes (bordered in yellow) are present compared to the output of the same network trained with the Multi-scale loss (Fig. 1, bottom).

Fig. 7. Input 2D environment (top-left) with empty (white), occupied (red) and unknown (black) cells. Ground truth occupied cells are also displayed in red. The other images show the occupancy probability (from black to blue) predicted by some of the evaluated approaches.

Fig. 8. Percentage of unknown grid cells (top) and RMSE of the predicted occupancy probability (bottom) after each view, averaged over 40 environments, for 2D (left) and 3D (right) simulated experiments.

TABLE I. AVERAGE RMSE, PREDICTION TIME (MS), OUTPUT LEAF NODES, AND GPU MEMORY USAGE (MB) FOR EACH NETWORK, LOSS AND ENGINE

TABLE II. AVERAGE NETWORK PREDICTION TIME (MS), RAY CASTING TIME, OTHER PROCESSING TIME, TOTAL TIME, AND NUMBER OF OCTREE NODES