Three-dimensional shape reconstruction of objects from a single depth view using deep U-Net convolutional neural network with bottle-neck skip connections

Three-dimensional (3D) shape reconstruction of objects requires multiple scans and complex reconstruction algorithms. An alternative approach is to infer the 3D shape of an object from a single depth image (i.e. single depth view). This study presents such a 3D shape reconstructor based on a U-Net 3D-convolutional neural network (3D-CNN) with bottle-neck skip connection blocks (U-Net BNSC 3D-CNN) to infer the 3D shapes of objects from only a single depth view. The BNSC block is a fully convolutional block that uses skip connections to improve the performance of the sequential 3D-convolutional layers of U-Net. The primary advantage of U-Net BNSC 3D-CNN is improved shape reconstruction accuracy at a reduced computational load. The evaluation of the proposed U-Net BNSC 3D-CNN uses unseen views from trained and untrained objects with two public databases, ShapeNet and the Grasp database. Our reconstructor achieves 72.17% and 69.97% accuracy in terms of the Jaccard similarity index for trained and untrained objects, respectively, with the ShapeNet database, whereas a previous reconstructor based on a 3D-CNN achieves 66.43% and 58.35%. With the Grasp database, our reconstructor achieves 87.03% and 85.35%, whereas the 3D-CNN achieves 76.52% and 76.02%. Also, our U-Net BNSC 3D-CNN reduces the computational load of the standard 3D-CNN reconstructor by 6.67% in computation time and by 98.69% in the number of trainable parameters.


| INTRODUCTION
In general, three-dimensional (3D) shape reconstruction of various objects involves scanning the objects from multiple view angles with RGB-D sensors [1][2][3][4], a laser scanner [5], or stereo cameras [6,7]. The views are then patched together via reconstruction algorithms to generate the 3D shape of an object [8,9]. Using multiple scanning and patch-up algorithms to reconstruct a 3D shape is time-consuming and effort intensive. The initial works for 3D shape reconstruction used point clouds or meshes to generate a rough 3D shape of an object by assuming symmetry [10][11][12][13][14][15]. Thus, those approaches have difficulty reconstructing irregularly shaped objects (i.e. asymmetric objects). Other works reconstructed 3D shapes by classifying an object according to its similarity with a database of CAD models [16][17][18][19][20]. Those approaches cannot reconstruct objects without a matching model in the dataset. It would be beneficial to have an intelligent method by which to reconstruct 3D objects from unseen views of learned (i.e. trained) objects, and it would be even more beneficial if the method could reconstruct 3D objects from unseen views of unlearned (i.e. untrained) objects.
Recently, 3D shape reconstruction via deep learning (DL) artificial intelligence from a single depth view of an object has offered an alternative solution [21,22]. These approaches require less time and effort than conventional 3D shape reconstruction techniques [8,9]. The basic idea is to train DL intelligence with a variety of learning objects and then use that intelligence to infer or reconstruct 3D shapes from unseen views of both learned and unlearned objects. Han et al. [23] presented an algorithm built in two modules that could reconstruct the 3D shapes of six objects. The first module was a 3D fully convolutional network combined with long short-term memory. It was used to infer the global structure of an object from an incomplete point cloud and multi-view depth information. The second module, based on an autoencoder, generated the complete 3D shape of the object from the output of the first module. That reconstruction algorithm achieved 96.1% completeness with 50% of the complete volume as input. They defined completeness as the fraction of points in the reconstructed volume within a distance α (i.e. 0.001 times the maximum shape diameter) of any point in the ground-truth. The work of Smith and Meger [21] used a variational autoencoder and generative adversarial networks (GANs). Their system focused on 3D shape reconstruction from a single depth view of the object. However, their GAN architecture suffered from instability, especially when training involved multiple views from multiple objects [21]. Also, they provided only qualitative results showing some sample reconstructed objects from the training database and an actual Kinect scan. In their latest work, Varley et al. [22] implemented a 3D-convolutional neural network (3D-CNN) for 3D shape reconstruction from a single depth view of an object. Their 3D-CNN is a large neural network with 148.4 million trainable parameters.
Also, their system offered limited performance in reconstructing untrained objects. The reconstruction accuracy (Jaccard similarity index) dropped from 74.86% with trained objects to 64.96% with untrained objects.
In this study, we present a novel 3D shape reconstructor to infer the 3D shapes of seen and unseen objects. The proposed shape reconstructor uses only a single depth view of an object to infer or reconstruct its full 3D shape. The design of our reconstructor is based on U-Net CNN because its connections between the contractive and expansive paths provide more feature information to the expansive path for 3D shape reconstruction [24]. In our architecture, the U-Net structure is extended from a 2D-CNN to a 3D-CNN to increase the receptive field through a 3D-CNN kernel [25]. In addition, we add two novel components to the design of the reconstructor: a skip connection block (SC block) and bottle-neck layers. The skip connections are pathways between non-consecutive layers that propagate the training error in the backward pass [26]. The propagation of the training error improves the training and thus the 3D reconstruction performance. To reduce the computational load, bottle-neck layers are added to the SC block (BNSC block). The bottle-neck layers reduce the training time of the SC block by reducing the number of trainable parameters and computational cost.
We evaluated our U-Net 3D-CNN with BNSC blocks (U-Net BNSC 3D-CNN) against three baseline networks including 3D-CNN, 3D-CNN with SC block, and 3D-CNN with BNSC block. The evaluations were performed using unseen views from seen (learned) objects and unseen views from unseen (unlearned) objects. The results show that our 3D shape reconstructor achieves superior Jaccard similarity index values (i.e. reconstruction accuracy) for both trained and untrained objects compared with the baseline networks.

| METHODS
Our proposed 3D shape reconstructor is shown in Figure 1. The input of our reconstructor is a single depth view of an object expressed in voxels. The proposed U-Net BNSC 3D-CNN combines a U-Net 3D-CNN with skip connections and bottle-neck layers. The output of the proposed U-Net BNSC 3D-CNN is an inferred and reconstructed 3D shape of an object, also expressed in voxels.

| Baseline networks
The 3D-CNN is a neural network based on the classic CNN structure that contains some convolutional and pooling layers followed by fully connected layers and output layers [27]. The shape reconstruction algorithm in Varley et al. [22] is based on such a 3D-CNN with three 3D-convolutional layers and two dense layers. The output layer is a dense layer with 64,000 nodes used to reconstruct the shape of the object in a voxel grid of 40 × 40 × 40.
In this study, we implement a 3D-CNN with five 3D-convolutional layers, three fully connected layers (i.e. dense layers), and one 2D-convolutional layer as the output layer. Figure 2a shows our implemented 3D-CNN. Each 3D-convolutional layer extracts 64 features (@64) with a kernel size of 4 × 4 × 4. A drop-out layer was added right after the last 3D-convolutional layer to prevent overfitting. The output layer is a 2D-convolutional layer (instead of a dense layer with 64,000 nodes) to reduce the number of trainable parameters from that found in Varley et al. [22]. The 2D-convolutional layer extracts 160 features (@160) with a kernel size of 3 × 3.

| Skip connection block 3D-CNN
Skip connections, called shortcut connections in Ref. [28], are connections between non-consecutive layers in a neural network. The key idea of a skip connection is to improve the training, especially in deep or wide networks [29][30][31]. Skip connections are pathways that transfer the training error back to the initial layers (i.e. backpropagation error), which produces a training improvement [26,32].
One of the critical requirements for 3D shape reconstruction is reducing the computational load because conventional 3D shape reconstructors generally use very deep networks (i.e. many trainable parameters) with prolonged training times. The SC block is constructed as shown in Figure 2b with the following characteristics. First, although classic residual blocks (i.e. SC blocks) are generally used in very deep networks [28], our SC block was modified to skip one convolutional layer instead of two, similar to Huang et al. [32]. Second, different operations for combining the output of X2 with the output of X1 were explored: addition [28], concatenation [32], and average [33,34]. In our initial tests, the average operation performed slightly better than the others in terms of reconstruction accuracy.
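The three combination operations differ in how they treat the feature (channel) dimension. A minimal NumPy sketch, with illustrative shapes, makes the difference concrete:

```python
import numpy as np

# Two feature maps with matching shapes, standing in for the outputs of the
# skipped path (x1) and the convolutional path (x2) of an SC block.
x1 = np.random.rand(8, 8, 8, 64)
x2 = np.random.rand(8, 8, 8, 64)

added = x1 + x2                                   # addition [28]
averaged = (x1 + x2) / 2.0                        # average [33, 34]
concatenated = np.concatenate((x1, x2), axis=-1)  # concatenation [32]

print(added.shape)         # (8, 8, 8, 64)  - feature count unchanged
print(averaged.shape)      # (8, 8, 8, 64)  - feature count unchanged
print(concatenated.shape)  # (8, 8, 8, 128) - next layer sees twice the features
```

Addition and average keep the feature count fixed, so the following layer's parameter count is unchanged, whereas concatenation doubles the input features of the next layer and therefore grows the network.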
To evaluate the effect of the SC block for 3D shape reconstruction, we performed evaluations with and without the skip connections, as shown in Figure 2b,a, respectively.

| BNSC block 3D-CNN
We designed our BNSC block to reduce the computational load of the SC block by adding bottle-neck layers. A bottle-neck layer is a convolutional layer with a kernel size of 1 × 1 × 1 that reduces or restores the feature map [28,35]. The bottle-neck layers reduce the number of trainable parameters and computational cost through their 1 × 1 × 1 kernels. Figure 2c shows our BNSC block, which has two bottle-neck layers. The first bottle-neck layer reduces the feature maps from @64 to @16. At the end of the block, another bottle-neck layer increases (i.e. restores) the feature maps from @16 to @64. Even though the number of convolutional layers increases from two in the SC block to four in the BNSC block, the training time of our BNSC block should be less than that of the SC block because of the reduction in the computational cost. Also, the BNSC block has fewer trainable parameters than the SC block.
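A 1 × 1 × 1 kernel mixes features without looking at any spatial neighbourhood, so a bottle-neck layer is equivalent to applying one small matrix at every voxel. A NumPy sketch of this equivalence (illustrative shapes and random weights):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((10, 10, 10, 64))  # feature map: 10^3 voxels, @64
w = rng.standard_normal((64, 16))          # 1x1x1 kernel weights: @64 -> @16
b = rng.standard_normal(16)                # one bias per output feature

# A 1x1x1 convolution is a per-voxel linear map over the feature axis.
y = x @ w + b                              # shape (10, 10, 10, 16)

# Same result computed voxel by voxel, to show the equivalence.
y_loop = np.empty((10, 10, 10, 16))
for i in range(10):
    for j in range(10):
        for k in range(10):
            y_loop[i, j, k] = w.T @ x[i, j, k] + b

assert np.allclose(y, y_loop)
print(y.shape)  # (10, 10, 10, 16): features reduced from @64 to @16
```

Because the kernel has no spatial extent, the layer costs only 64 × 16 weights per voxel, which is the source of the parameter and computation savings quantified below with Equations (1) and (2).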
The number of trainable parameters can be computed using Equation (1) [36,37]:

Trainable Parameters = (Kernel Size × Input Features + Bias) × Output Features,  (1)

where Input Features corresponds to the size of the feature map at the input of the convolutional layer; Kernel Size is the size of the convolved kernel (1 × 1 × 1 or 4 × 4 × 4); Output Features is the size of the feature map at the output of the layer, which is equal to the number of convolved kernels in the layer; and Bias is the parameter added in each neuron to shift the activation function (one per output feature).
The computational cost is defined as the number of computations in a convolutional layer, as calculated using Equation (2) [37,38]:

Computational Cost = Input Volume × Kernel Size × Input Features × Output Features,  (2)

where Input Volume is the size of the input information for the 3D-convolutional layer (40 × 40 × 40 before the first pooling layer). For instance, the first SC block in Figure 2 has two convolutional layers with a kernel size of 4 × 4 × 4. The input features are @1 and @64 for the two convolutions, and the output features of both convolutional layers are @64. Using Equation (1), the number of trainable parameters at the end of the first SC block is 266,368. The input volume for all convolutional layers in the SC and BNSC blocks is 40 × 40 × 40. According to Equation (2), the computational cost of the SC block is 17,039 million computations. In contrast, the BNSC block has four convolutional layers with kernel sizes of 4 × 4 × 4, 1 × 1 × 1, 4 × 4 × 4, and 1 × 1 × 1. The input features are thus @1, @64, @16, and @16, and the output features are @64, @16, @16, and @64. Using Equation (1), the number of trainable parameters at the end of the BNSC block is 22,688, 11.74 times fewer than in the SC block. Using Equation (2), the computational cost of our BNSC block is 1441 million computations, 11.82 times fewer than in the SC block. Thus, the BNSC block has fewer trainable parameters to optimize than the SC block, and its computational cost is also lower. Comparison details in terms of the number of trainable parameters and computational cost are given in Tables 1 and 2, respectively. The BNSC 3D-CNN has the same network architecture as the SC 3D-CNN; however, the first two blocks are replaced with our BNSC block, as shown in Figure 2c.
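The counts quoted above can be reproduced directly from Equations (1) and (2); the short script below, with layer shapes taken from Figure 2b,c, checks both blocks:

```python
# (kernel volume, input features, output features) per convolutional layer
SC_BLOCK = [(4**3, 1, 64), (4**3, 64, 64)]
BNSC_BLOCK = [(4**3, 1, 64), (1**3, 64, 16), (4**3, 16, 16), (1**3, 16, 64)]
INPUT_VOLUME = 40**3  # 40 x 40 x 40 voxel grid before the first pooling layer

def trainable_params(layers):
    # Equation (1): (kernel size * input features + bias) * output features,
    # with one bias term per output feature.
    return sum((k * fin + 1) * fout for k, fin, fout in layers)

def computational_cost(layers, volume=INPUT_VOLUME):
    # Equation (2): input volume * kernel size * input features * output features
    return sum(volume * k * fin * fout for k, fin, fout in layers)

print(trainable_params(SC_BLOCK))      # 266368 parameters
print(trainable_params(BNSC_BLOCK))    # 22688 parameters (11.74x fewer)
print(computational_cost(SC_BLOCK))    # 17,039 million computations
print(computational_cost(BNSC_BLOCK))  # 1,441 million computations (11.82x fewer)
```

The 11.74× and 11.82× reductions fall out of the same arithmetic: both bottle-neck layers replace expensive 4 × 4 × 4 convolutions over @64 features with cheap per-voxel mixing.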

| Proposed U-Net BNSC 3D-CNN
U-Net is a network based on convolutional layers with two paths: a contractive one that captures context information from the features extracted by the convolutional layers and an expansive path that reconstructs the object shape based on the features extracted by the contractive path [39]. Furthermore, U-Net uses some skip connections to connect the layers from the contractive path with their equivalents in the expansive path, providing more feature information for shape reconstruction.
In our reconstructor, only 3D-convolutional layers are used in the U-Net because their larger 3D receptive field improves the performance [25]. Furthermore, the BNSC block is used because BNSC 3D-CNN outperforms 3D-CNN in 3D shape reconstruction. Finally, to propagate the feature information to the whole network, extra connections are added in the contractive path of our U-Net BNSC 3D-CNN reconstructor, as shown in Figure 3. Figure 3 shows the proposed U-Net BNSC 3D-CNN. The input layer is a 3D-convolutional layer that extracts 64 features (@64) with a kernel size of 4 × 4 × 4. Then, two BNSC blocks and two pooling layers are used in the contractive path, along with two skip connections, as in Ref. [40]. However, we use an addition operation instead of the concatenation operation used in Ref. [40]. In the middle of our U-Net BNSC 3D-CNN, a third BNSC block is used before the first up-sampling layer. The expansive path uses two BNSC blocks, two up-sampling layers, and skip connections between the contractive and expansive paths with a concatenate operation, as in Refs. [41,42]. The output layer is a 3D-convolutional layer with a kernel size of 4 × 4 × 4 and one feature, corresponding to the reconstructed 3D object shape expressed in voxels. Previous 3D shape reconstructors are based on an autoencoder with the U-Net connections between the encoder and decoder [43,44]. These previous reconstructors [43,44] differ from the original U-Net architecture because of the dense layers at the end of the encoder. The proposed U-Net BNSC 3D-CNN differs from the 3D autoencoder with the U-Net connections because our reconstructor uses BNSC blocks and skip connections in the contractive path.
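The spatial bookkeeping of the two paths can be sketched in plain NumPy, using 2 × 2 × 2 max pooling and nearest-neighbour up-sampling as illustrative stand-ins for the actual pooling and up-sampling layers. The sketch shows why a contractive feature map and its expansive counterpart have matching spatial sizes and can therefore be concatenated:

```python
import numpy as np

def max_pool3d(x):
    """2x2x2 max pooling over a (D, H, W, C) feature map."""
    d, h, w, c = x.shape
    return x.reshape(d // 2, 2, h // 2, 2, w // 2, 2, c).max(axis=(1, 3, 5))

def upsample3d(x):
    """2x nearest-neighbour up-sampling over a (D, H, W, C) feature map."""
    return x.repeat(2, axis=0).repeat(2, axis=1).repeat(2, axis=2)

x = np.random.rand(40, 40, 40, 64)  # features after the input layer (@64)
c1 = max_pool3d(x)                  # contractive path: (20, 20, 20, 64)
c2 = max_pool3d(c1)                 # bottom of the U:  (10, 10, 10, 64)
e1 = upsample3d(c2)                 # expansive path:   (20, 20, 20, 64)

# U-Net skip connection: concatenate matching contractive/expansive maps.
merged = np.concatenate((c1, e1), axis=-1)
print(merged.shape)  # (20, 20, 20, 128)
```

Each pooling step halves every spatial dimension and each up-sampling step doubles it, so the expansive map at a given depth always lines up spatially with the contractive map at the same depth.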
The loss function used to train the proposed U-Net BNSC 3D-CNN is a weighted version of the cross-entropy function in Ref. [44]. It was designed to update the values with more emphasis when the voxel grid has information about the object. Equation (3) describes the weighted loss function:

L = −(1/N) Σ_n [γ y_n log(ŷ_n) + (1 − y_n) log(1 − ŷ_n)],  (3)

where N represents the total number of voxel grids times the number of samples in each training batch; y_n is the true voxel grid value; ŷ_n is the predicted voxel grid value; and γ is the weight value (three in our case). For the final output of the 3D shape reconstructor, all voxel grid values with a probability of less than 0.75 are converted to 0 and those with a probability greater than 0.75 are converted to 1, producing a reconstructed 3D object shape.
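A minimal NumPy sketch of this training objective and the output binarisation, assuming the weight γ multiplies the occupied-voxel term of the cross-entropy and using the 0.75 threshold described above:

```python
import numpy as np

GAMMA = 3.0  # weight emphasising voxels that contain the object

def weighted_cross_entropy(y_true, y_pred, gamma=GAMMA, eps=1e-7):
    """Weighted cross-entropy over voxel grids: occupied voxels (y = 1)
    contribute gamma times more to the loss than empty ones."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)  # numerical safety for log()
    loss = -(gamma * y_true * np.log(y_pred)
             + (1.0 - y_true) * np.log(1.0 - y_pred))
    return loss.mean()

def binarise(y_pred, threshold=0.75):
    """Final output: probabilities above the threshold become occupied voxels."""
    return (y_pred > threshold).astype(np.uint8)

y_true = np.array([1.0, 1.0, 0.0, 0.0])
good = np.array([0.9, 0.8, 0.1, 0.2])   # close to the ground truth
bad = np.array([0.2, 0.3, 0.8, 0.9])    # far from the ground truth
assert weighted_cross_entropy(y_true, good) < weighted_cross_entropy(y_true, bad)
print(binarise(np.array([0.9, 0.74, 0.76, 0.1])))  # [1 0 1 0]
```

In practice the same function would be expressed with TensorFlow ops so it is differentiable during training; the NumPy form only illustrates the arithmetic.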

| Databases
To train and evaluate all the implemented 3D shape reconstructors, we used two public databases. First, the Grasp database [45] offers 590 objects as mesh models. The objects include tools, toys, groceries, drugstore products, and household objects. Second, the ShapeNet database [46] offers 3135 categories of CAD models, classified into warehouse objects such as toys and household objects such as furniture. The public databases were chosen because of their variety of objects and popularity in the research community. All objects are expressed in voxels as binvox files [22] and can be read using Binvox software, which rasterizes 3D object models into binary 3D voxel grids [47,48].
The binvox files were created using the following guidelines [22]. First, the meshes collected from the Grasp database were placed in Gazebo [49], which is 3D software for robotic simulations. Using a virtual depth camera, several partial depth views of the imported mesh objects were created. Second, using the partial depth views and Binvox, an occupancy grid of the visible pixels was created, with 1s assigned to object voxels and 0s to the background. Finally, the shape corresponding to the partial view was down-sampled to a 3D voxel grid of 40 × 40 × 40 and saved as a Binvox file.
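The final down-sampling step can be approximated in NumPy by block-wise reduction: the high-resolution occupancy grid is split into equal blocks, and a target voxel is marked occupied if any source voxel in its block is. This is an illustrative sketch; Binvox's own rasterisation details differ:

```python
import numpy as np

def downsample_occupancy(grid, target=40):
    """Down-sample a cubic binary occupancy grid to target^3 voxels.

    A target voxel is set to 1 if any voxel in its source block is 1,
    which preserves thin structures better than averaging would.
    """
    n = grid.shape[0]
    assert n % target == 0, "source size must be a multiple of the target"
    f = n // target
    blocks = grid.reshape(target, f, target, f, target, f)
    return blocks.any(axis=(1, 3, 5)).astype(np.uint8)

# An 80^3 grid with a single occupied voxel survives down-sampling to 40^3.
src = np.zeros((80, 80, 80), dtype=np.uint8)
src[3, 5, 7] = 1
dst = downsample_occupancy(src)
print(dst.shape, dst.sum())  # (40, 40, 40) 1
assert dst[1, 2, 3] == 1     # the voxel maps to block (3//2, 5//2, 7//2)
```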

| Evaluation methodologies
To evaluate the baseline networks (i.e. 3D-CNN, SC 3D-CNN, and BNSC 3D-CNN) and our proposed U-Net BNSC 3D-CNN, two evaluation methodologies were used. First, we conducted unseen views testing (unseen views of learned objects) to evaluate the performance of our reconstructors with trained objects. Second, we conducted unseen objects testing (unseen views of unlearned objects) to evaluate the performance of the implemented networks with untrained objects. Both evaluation methodologies were implemented using Grasp and ShapeNet databases.
From the Grasp database, three datasets were derived. First, the training dataset includes 250 partial views from 76 different objects. Second, the unseen views dataset contains 50 views, not included in the training, of each of the same 76 objects used in the training dataset. Third, the unseen objects dataset contains 50 unseen views of each of 15 objects not included in the training dataset.
From the ShapeNet database, we used the same categories of objects evaluated by Yang et al. [44]. The categories include chair, couch, table, bench, faucet, firearm, car, and plane. From the categories of chair, couch, table, and bench, the training dataset was derived using 2400 partial views of each category. The unseen views dataset contains 1000 partial views of the same categories included in the training dataset. The unseen objects dataset contains 1000 partial views of the categories not included in the training dataset (i.e. faucet, firearm, car, and plane).
To assess the implemented 3D shape reconstructors, we used the Jaccard similarity index, as expressed in Equation (4):

J(A, B) = |A ∩ B| / |A ∪ B|,  (4)

where A and B indicate two volumes, as in Refs. [5,[50][51][52]. A Jaccard value (J) of 0 represents no intersection between the ground-truth and the reconstructed object, and a J of 1 represents a perfect match between the ground-truth and reconstructed object. The computational load of the reconstructors was evaluated based on the training time and network size. The training time was defined as the time needed by the reconstructors to process all the data in the training dataset one time (i.e. one epoch). The network size was defined as the total number of trainable parameters (i.e. weights and biases) in each reconstructor.
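Equation (4) is the standard intersection-over-union measure; a direct NumPy implementation on binary voxel grids:

```python
import numpy as np

def jaccard(a, b):
    """Jaccard similarity index between two binary voxel grids:
    |A intersect B| / |A union B|, in [0, 1]."""
    a, b = a.astype(bool), b.astype(bool)
    union = np.logical_or(a, b).sum()
    if union == 0:  # both grids empty: define as a perfect match
        return 1.0
    return np.logical_and(a, b).sum() / union

gt = np.zeros((40, 40, 40), dtype=np.uint8)
gt[10:20, 10:20, 10:20] = 1        # ground-truth cube of 1000 voxels
rec = np.zeros_like(gt)
rec[10:20, 10:20, 15:25] = 1       # reconstruction shifted by 5 voxels

print(jaccard(gt, gt))             # 1.0 (perfect match)
print(round(jaccard(gt, rec), 3))  # 0.333 (overlap 500 / union 1500)
```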

| Implementation
All networks presented in this work were built using Keras and Tensorflow and trained with no pre-training or fine-tuning. Training and testing were done with an Intel Xeon CPU E5-2620 v3 2.40 GHz and two NVIDIA GeForce GTX 1080 GPUs, each with 8 GB of RAM.
To train our networks, we used weighted cross-entropy except for Varley's, which used cross-entropy [22]. The Adam optimizer was used to optimize the weights with a learning rate of 0.00003. The number of training steps (epochs) was 160 for Varley's, 200 for 3D-CNN, 170 for SC 3D-CNN and BNSC 3D-CNN, and 140 for U-Net BNSC 3D-CNN. The number of training steps for each network was chosen to prevent overfitting after the network achieved the smallest loss value.

| Conventional 3D reconstructors
Our proposed U-Net BNSC 3D-CNN is compared against two previous works. First, the 3D-CNN implemented in Varley et al. [22] (i.e. Varley's), which used the reconstructed 3D objects for object grasping. Second, Yang et al. [44] introduced a 3D reconstructor based on a 3D encoder-decoder network combined with a GAN (i.e. 3D-RecGAN++). 3D-RecGAN++ was trained and evaluated using the same categories of the ShapeNet database used in our work, so for 3D-RecGAN++, the original trained weights [44] were utilized. To perform a fair comparison with our proposed U-Net BNSC 3D-CNN and the baseline networks, the output of 3D-RecGAN++ was down-sampled to a voxel grid of 40 × 40 × 40.

| Results with the Grasp database
Table 3 shows the performance of our proposed U-Net BNSC 3D-CNN in terms of the Jaccard similarity index compared with that of Varley's and the baseline networks. The Unseen Views of Seen Objects column indicates the average Jaccard index value for all objects in the unseen views dataset. The avocado, banana, and box columns are example objects from the unseen views dataset; each column value is the average Jaccard index value of 50 unseen views of each object. The Unseen Views of Unseen Objects column is the average Jaccard similarity index value for all objects in the unseen objects dataset. The book, cell phone, and toy columns are example objects from the unseen objects dataset; each column value is the average Jaccard similarity index value of 50 unseen views of each object.
As shown in Table 3, 3D-CNN performed slightly better than Varley's, 1.3% better with the unseen views dataset and 1.2% better with the unseen objects dataset. SC 3D-CNN outperformed Varley's and 3D-CNN, 4% better than Varley's with the unseen views dataset and 2.2% with the unseen objects dataset. BNSC 3D-CNN outperformed Varley's by 3%-10% and 3D-CNN by 2%-4% in all tests. Comparing the proposed U-Net BNSC 3D-CNN against Varley's, the Jaccard similarity index improved from 6% to 23% in all evaluations. U-Net BNSC 3D-CNN also outperformed the baseline networks by 4%-16% in all tests.
The reconstructed shapes of some example objects from the unseen views and unseen objects tests are presented in Figure 4. The proposed U-Net BNSC 3D-CNN clearly produced the best results.
As shown in Table 4, BNSC 3D-CNN is slightly smaller than SC 3D-CNN and 3D-CNN because the BNSC block in BNSC 3D-CNN has fewer trainable parameters than the SC block in SC 3D-CNN. The largest reduction is shown by our proposed U-Net BNSC 3D-CNN, which is 98.69% smaller than Varley's network. Figure 5 summarizes the unseen views test, unseen objects test, and network size. The network size (millions [M] of trainable parameters) of each 3D reconstructor is shown using bubble size. The X-axis represents the Jaccard similarity index value in the unseen objects test, and the Y-axis the Jaccard similarity index value in the unseen views test. As shown in Figure 5, the proposed U-Net BNSC 3D-CNN outperforms the other networks in both the unseen views and unseen objects testing. Figure 5 also shows a significant reduction in network size: our U-Net BNSC 3D-CNN has 93.76% fewer trainable parameters than all baseline networks and 98.69% fewer trainable parameters than Varley's.

| Results with the ShapeNet database
Table 5 shows the performance of the proposed U-Net BNSC 3D-CNN in terms of the Jaccard similarity index compared with the baseline networks and the previous works. The Unseen Views of Seen Objects column is the average Jaccard index for all the object categories in the unseen views dataset. The chair and couch columns show the average Jaccard index over 1000 objects from each category. As shown in Table 5, U-Net BNSC 3D-CNN outperforms Varley's reconstructor by 22%, 3D-RecGAN++ by 6%, 3D-CNN by 12%, SC 3D-CNN by 10%, and BNSC 3D-CNN by 7%. The Unseen Views of Unseen Objects column is the average Jaccard index for all the object categories included in the unseen objects dataset. The faucet and car columns are the average Jaccard index of 1000 objects from each category.
Likewise, with the unseen objects dataset, U-Net BNSC 3D-CNN outperforms all previous networks: by 28% over Varley's, 11% over 3D-RecGAN++, 21% over 3D-CNN, and 20% over SC 3D-CNN and BNSC 3D-CNN. Figure 6 provides the results of a qualitative evaluation of all 3D reconstructors with ShapeNet. The proposed U-Net BNSC 3D-CNN reconstructs the 3D volumes most similar to the ground truth compared with all other networks, especially for the unseen objects.

| DISCUSSION
Typically, the performance of neural networks (e.g., CNNs) improves when more layers are added. 3D-CNN slightly outperforms Varley's in terms of reconstruction accuracy (Jaccard similarity index) because 3D-CNN has more convolutional layers than Varley's. As shown in Figure 2a,b, SC 3D-CNN has the same number of convolutional layers as 3D-CNN; the only difference in their architectures is the skip connections. The 3D reconstruction performance of our U-Net BNSC 3D-CNN improved upon that of classical reconstructors based on 3D-CNN because of the BNSC blocks and the skip connections in our U-Net architecture. Our experiments show that BNSC 3D-CNN outperformed 3D-CNN in 3D reconstruction. For instance, with the Grasp database, the reconstruction performance increased from 77.9% with 3D-CNN to 81.63% with BNSC 3D-CNN in the unseen views test and from 77.2% to 79.08% in the unseen objects test; with the ShapeNet dataset, it increased from 50.94% to 55.86% in the unseen views test and from 48.99% to 49.83% in the unseen objects test. Adding the skip connections between the BNSC blocks increased the reconstruction performance because the skip connections create pathways that propagate feature information through the whole network. For instance, the skip connections in the contractive path of our U-Net BNSC 3D-CNN propagate features to the second and third BNSC blocks, and the skip connections between the contractive and expansive paths propagate features to the fourth and fifth BNSC blocks. To preserve and use the features from the contractive path in the 3D reconstruction, the skip connections between the contractive and expansive paths use the concatenate operation. As a result of this feature propagation, U-Net BNSC 3D-CNN is 6% more accurate than BNSC 3D-CNN in shape reconstruction with the Grasp database and 17% more accurate with the ShapeNet dataset, as shown in Tables 3 and 5, respectively.
It is important to mention that the training time of U-Net BNSC 3D-CNN depends on the operation used with the skip connections in the contractive path. In our work, we used addition operations because addition does not increase the number of features at the input of the next BNSC blocks. Using concatenation instead of addition increases the network size and computational cost because of the growth in the feature maps at the input of the second, third, fourth, and fifth BNSC blocks. Consequently, the training time increases, as in the case of the training time of the SC block compared with the BNSC block.
The 3D reconstruction algorithms based on 3D-convolutional layers produce a large number of trainable parameters because of the 3D-kernel. To reduce the network size, we use the BNSC block, which adds bottle-neck layers to the SC block. The BNSC block has the same advantages as the SC block (i.e. training is improved by skip connections), but a reconstructor based on BNSC blocks is faster and has fewer trainable parameters than a reconstructor based on SC blocks because of the bottle-neck layers, as shown in Table 4.
Our U-Net BNSC 3D-CNN is only 20 s faster than Varley's in training time, despite the large difference in network size between the two reconstructors. The training time does not reflect the network sizes because of the parallel processing with GPUs. However, the 20 s difference in training time is magnified when training is extended over many epochs or when the training dataset includes more objects or views. The main advantage of our proposed reconstructor, compared with Varley's network, is its improvement in the Jaccard index without a penalty in the computational load. The network size of 3D-RecGAN++ is 153.2 million trainable parameters, slightly more than Varley's. Thus, our U-Net BNSC 3D-CNN greatly reduces the network size relative to 3D-RecGAN++. Because of its network size, our U-Net BNSC 3D-CNN can be implemented on a graphics card with less memory consumption than previous works. The difference in network size between our proposed U-Net BNSC 3D-CNN and the previous works is caused by the dense layers in Varley's and 3D-RecGAN++. Replacing the last dense layers with convolutional layers decreased the number of trainable parameters, as shown in Table 4 for the baseline networks. Therefore, our U-Net BNSC 3D-CNN has no dense layers.
The 3D reconstruction performance of our U-Net BNSC 3D-CNN might be improved by making some changes in its architecture. For instance, adding more BNSC blocks in the contractive and expansive paths would extract more features for 3D reconstruction. Also, increasing the number of feature maps extracted in each convolutional layer might provide more information for 3D reconstruction because the input would be decomposed into higher-dimensional features. More investigation is also needed regarding the kernel size. In our experiments, we observed that a larger kernel size (e.g., 6 × 6 × 6 or 5 × 5 × 5) improved 3D reconstruction of the general shape of the object, whereas a small kernel size (e.g., 3 × 3 × 3) reconstructed the details of the object better. However, an increment in training time should be considered because increasing the number of layers (i.e. more BNSC blocks), feature maps, or kernel size will increase the computational load of the reconstructor.
The reconstruction performance with untrained objects depends on the partial view. For instance, if the partial view provides information about the object details, U-Net BNSC 3D-CNN reconstructs those details with higher reconstruction accuracy, as shown by the arms and legs of a toy in Figure 4 or the tires of a car in Figure 6. However, the reconstruction accuracy (i.e. Jaccard similarity index) for untrained objects would be affected if the partial view did not show any information about the object details.

| CONCLUSION
In terms of reconstruction accuracy (i.e. Jaccard similarity index), the proposed U-Net BNSC 3D-CNN outperforms the baseline and previous reconstructors tested in this work by 6%-11% in the unseen views test with the Grasp database and by 6%-22% with the ShapeNet database. In the unseen objects test, our proposed reconstructor improves by 6%-9% with the Grasp database and by 10%-28% with the ShapeNet database. This improvement is because the skip connections propagate the error and feature information to the whole network. Our U-Net BNSC 3D-CNN reduces the network size by 98.69% compared with Varley's and by 93.76% compared with the baseline networks because it is a fully convolutional network with bottle-neck layers. As a result of the reduction of the computational load, our proposed U-Net BNSC 3D-CNN is faster than the other reconstructors. The performance of our proposed U-Net BNSC 3D-CNN with untrained objects suggests that the DL-based 3D shape reconstructor could play a key role in various applications, including robotic navigation [53], robotic grasp planning [54], self-driving vehicles [5], and augmented reality [55], where the information of 3D shape and volume of an object is essential.