An Efficient 3D Model Retrieval Method Based on Convolutional Neural Network



Introduction
Recently, three-dimensional (3D) models have been widely used in computer-aided design (CAD), virtual reality (VR), 3D animation and film, medical diagnosis, 3D online games, machinery manufacturing, and other fields. In particular, with the development of 3D printing, 3D models have become an indispensable technical means in many fields. As more 3D modeling and digitizing tools are developed for an ever-increasing number of applications, a large number of 3D models have become available on the Web [1]. Through the Internet, users can download free 3D models according to their needs. Modifying these models and designing incrementally on top of them can not only reduce product cost and shorten design time, but also effectively improve product reliability and quality. However, it is very difficult to find the needed 3D model quickly and accurately among the massive number of available models. 3D model retrieval techniques address this problem and have therefore become a research hotspot.
One important issue in 3D model retrieval is representing models as descriptors. The descriptors describe the 3D model accurately and efficiently to support model classification, index building, and similarity matching. 3D model descriptors can be mainly divided into four categories: geometry-based [2], statistical analysis-based [3], topology-based [4], and projective view-based descriptors [5]. For the geometry-based 3D model descriptors, the 3D model is divided into many grids, and then the features of the 3D model are extracted by different mathematical transformations of the grid model. The earliest work on this approach is 3D ShapeNets [6], which learns a convolutional deep belief network that outputs probability distributions of binary occupancy voxel values. After that, Maturana and Scherer propose a similar approach, which builds VoxNet for real-time object recognition [7]. Li et al. adopt field probing neural networks (FPNNs) to extract features of 3D models. In this method, the 3D models are first represented as volumetric fields, and then field probing filters are employed to extract features from them [8]. Wu et al. propose a novel framework named the 3D Generative Adversarial Network (3D-GAN), which generates 3D objects from a probabilistic space by leveraging recent advances in volumetric convolutional networks and generative adversarial nets. This method achieves impressive performance on 3D object recognition [9].
Statistical analysis-based 3D model descriptors are a good choice for nonrigid 3D model retrieval. The earliest work is the 3D rotation invariant spherical harmonic representation of 3D shape descriptors (SPH), which reduces the dimensionality of the descriptor and provides a more compact representation [10]. Sun et al. propose the heat kernel signature (HKS) descriptor to describe the local characteristics of nonrigid 3D models. It is based on diffusion scale-space analysis and characterized by the heat transfer process on the 3D surface [11]. The HKS descriptor is invariant under isometric deformations and stable under perturbations of the model. It has achieved good performance in nonrigid 3D model retrieval. However, it is sensitive to scale changes of the 3D model. Aubry et al. propose the wave kernel signature (WKS) descriptor, which describes the average probability of measuring a quantum mechanical particle at a position on a nonrigid 3D model surface. The WKS descriptor explains the relationship between the points on the different spatial scales and the rest of the model surface, and its discriminative ability is higher than that of the HKS descriptor [12]. Zeng et al. use WKS and HKS to represent the 3D model, then construct two convolutional neural networks for the HKS distribution and the WKS distribution separately, and use a multifeature fusion layer to connect them. This multifeature fusion learning method can achieve good performance [13].
The topology-based 3D model descriptors analyze the topological structure of the 3D model to extract the topological connections and structural relations among different components. At present, this type of method mainly includes the attribute adjacency graph (AAG) [14], feature dependency graph (FDG) [15], skeleton graph [16], and Reeb graph [17,18]. The current trend is to combine the topological structure of the 3D model with multiple views. For example, Su et al. propose the multiview CNN (MVCNN), which takes multiview images of an object as input. This method also shows potential in sketch-based shape retrieval [19].
The descriptors based on projective views are the most promising because they transform 3D models into images, which allows image processing methods to be used for retrieval. Among this type of descriptor, the light field descriptor (LFD) is the most popular because it is robust to transformations, noise, and model degeneracy [20]. In the LFD, a 3D model is projected to generate 100 binary images, which are rendered from different views for each model. This descriptor represents 3D models better than other descriptors, but its time cost is high because the number of images used for matching is large. Recently, methods that combine projective views and deep learning have achieved good performance. In these methods, deep learning models are trained to extract features from the 2D views and perform classification. For example, Johns et al. propose a pairwise method to bring CNNs to generic multiview recognition, by first decomposing an image sequence into a set of image pairs, classifying each pair independently, and then learning an object classifier by weighting the contribution of each pair [21]. Ma et al. propose a method which extracts 2D Zernike moments from 2D projective views as the view saliency.
Then the view saliency is used to boost a multiview CNN (VS-MVCNN) for 3D object recognition [22]. In DeepPano, a panoramic view is used to represent the 3D model, and the CNN is designed to learn deep representations directly from the panoramic view [23]. A similar method is PANORAMA-NN [24], which also uses a panoramic view. In addition, Hegde and Zadeh use FusionNet to combine the representation of 2D projective views and the representation of model volume to learn new features, which yields a significantly better classifier than using either of the representations in isolation [25]. Qi et al. make a comprehensive study on the voxel-based CNNs and multiview CNNs for 3D object classification [26]. Elhoseiny et al. explore CNN architectures for combining object classification and pose estimation learned with multiview images, and this method takes a single image as input for its prediction [27]. Kanezaki et al. improve this method by aggregating predictions from multiple images captured from different viewpoints [28].
We can see that many methods have been effectively applied to 3D model recognition. However, several problems remain to be solved. First, current methods do not consider the similarity of 2D views when representing 3D models as 2D views. If the cameras around a 3D model are sparse, the projective views cannot fully describe the model. If the cameras are dense, redundant views are generated, resulting in heavy time and space complexity. Second, a fixed number of projective views is used for similarity matching, which also leads to high computational complexity. To solve the above problems, we propose a novel 3D model retrieval method, which is improved in both index building and model retrieval. In index building, the 3D models in the library are first converted into 2D projective views using the proposed projection method. Then representative views are selected from these 2D projective views by the proposed method based on K-means.
This method can reduce redundant views and improve the retrieval accuracy and efficiency. After that, the representative views are input into the learned CNN to extract features, which are organized as indexes by their labels. In retrieval, the input 3D model is first processed in the same way as in index building to obtain representative views. Then all representative views of a model are classified into one category by the CNN and a voting algorithm, and only the features of this category, rather than all categories, are chosen for similarity matching with these representative features. In addition, we propose a novel similarity matching method, in which the number of views used for retrieval is gradually increased until the evidence is enough to determine a 3D model. Therefore, model retrieval efficiency is improved substantially.

The Proposed Methodology
2.1. The Overall Scheme. As shown in Figure 1, the whole process of the proposed method can be divided into three steps: (1) 3D model representation and CNN training; (2) 2D representative view extraction and index building; (3) model retrieval. In the first step, 3D models are first converted into 2D projective views, and then these 2D projective views are used to train the CNN. In this part, a projection method is proposed to generate views. In the second step, 3D models are first converted into 2D projective views using the same projection method as that used in training. These views are then selected by the proposed method based on K-means. The views which are closest to the centers of their own categories are selected as the representative views. Finally, these representative views are input into the learned CNN for feature extraction and index building. In the third step, the input can be an image or a 3D model. If the input is an image, the classification and retrieval are carried out directly. If the input is a 3D model, representative views are generated first by our projection method and representative view selection method, and then the representative views are input into the learned CNN for classification and feature extraction. All representative views of the 3D model can be classified into the same category through the voting algorithm. Finally, the result model is found through the variable view matching method.

3D Model Representation and CNN Training.
Nowadays, CNNs have been used widely for object detection, scene recognition, texture recognition, and fine-grained classification. The CNN is also used in the proposed method because it outperforms other methods in our task, and the number of views projected from 3D models is large enough to learn a good CNN.

Multiview Representation of 3D Model.
Representing 3D models as 2D projective views is a key step. The two main factors in obtaining the projective views are the projection method and the rendering mode.
Through experiments, we adopt the projection method based on region division and the rendering method based on multiple light sources. The steps are described as follows:
(1) Model preprocessing: the purpose of model preprocessing is to normalize the 3D model by limiting it to the unit sphere. First, the maximum and minimum values in the three coordinate directions are obtained by collecting the boundary information of the model and traversing the coordinates of all points. Then, the scaling and the position center of the model are calculated. Finally, the model is translated and scaled. The model preprocessing is shown in Figure 2.
(2) Selection of projective points: cameras are deployed on the sphere centered on the center of the 3D model. The spherical surface is divided into four uniform regions, with one camera deployed at the center of each region. The other cameras are located on bisectors which pass through the region center, with equal angles between adjacent bisectors. Each camera on a bisector is placed at the midpoint between the region center and the region boundary. The lens of each camera should point to the sphere center. The placement of the cameras in each region is shown in Figure 3. In the proposed method, 40 projective views are used.
A comparison of the proposed method and the LFD is shown in Figure 5, where the view generated by our method is shown in Figure 5(a) and that by the LFD is shown in Figure 5(b). We can see that the projective view obtained by our method is a grayscale image with information entropy of 0.462. In contrast, the projective view of the LFD is a binary image with information entropy of 0.287. Therefore, the view generated by our method contains more detailed features.
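The model preprocessing in step (1) can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the position center is the center of the axis-aligned bounding box and that the scale factor maps the farthest vertex onto the unit sphere. The function name `normalize_to_unit_sphere` and the toy box are hypothetical.

```python
import math

def normalize_to_unit_sphere(vertices):
    """Translate and scale a list of [x, y, z] vertices to fit the unit sphere."""
    # Minimum and maximum values in the three coordinate directions.
    mins = [min(v[k] for v in vertices) for k in range(3)]
    maxs = [max(v[k] for v in vertices) for k in range(3)]
    # Assumption: the position center is the bounding-box center.
    center = [(lo + hi) / 2.0 for lo, hi in zip(mins, maxs)]
    centered = [[v[k] - center[k] for k in range(3)] for v in vertices]
    # Assumption: scale so that the farthest vertex lies on the unit sphere.
    radius = max(math.sqrt(sum(c * c for c in p)) for p in centered)
    scale = 1.0 / radius if radius > 0 else 1.0
    return [[c * scale for c in p] for p in centered]

# Toy model: the eight corners of a 2 x 4 x 6 box.
box = [[x, y, z] for x in (0, 2) for y in (0, 4) for z in (0, 6)]
normalized = normalize_to_unit_sphere(box)
norms = [math.sqrt(sum(c * c for c in p)) for p in normalized]
print(max(norms))
```

After normalization, no vertex lies outside the unit sphere, so the cameras placed on the sphere see every model at a comparable scale.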

CNN Training.
In recent years, CNNs have been widely used for image classification. At present, there are many CNN architectures, such as VGG, GoogleNet, ResNet, and DenseNet. ResNet has been reported to achieve good performance on ImageNet. It adopts a unique "shortcut connection" which can effectively avoid vanishing gradients and ensure the training accuracy [30]. In our experiments, ResNet50 achieved better performance than other types of deep neural networks, so it was used for feature extraction and classification. ResNet50 consists of 49 convolutional layers and one fully connected layer. The structure of ResNet50 is shown in Table 1.

Index Building.
Building indexes is very important for improving the efficiency of model retrieval. In this section, representative view selection is presented first, and then the index building based on the CNN is introduced.
In current methods, a large number of cameras are evenly distributed on the surface of the unit sphere to obtain 2D views. This approach does not take the differences in model surface complexity into account. In fact, the part of a 3D model with large surface complexity needs more views to represent, while the part with small surface complexity can be well represented with fewer views. The 2D views projected by current methods include a large number of similar views, which causes considerable redundancy. Therefore, it is necessary to keep only one view from each group of similar views to make the views more representative. In this paper, we propose a method to extract 2D representative views. In this method, K-means is adopted to classify views into different categories according to their similarity, and then one representative view is chosen from each category. In this way, different 3D models may yield different numbers of 2D representative views.
As an unsupervised classification method, clustering partitions unlabeled datasets into several clusters [31]. One widely used clustering algorithm is K-means [32]. Its advantages are simplicity and guaranteed convergence to a local minimum. However, it has the shortcoming that the number of clusters must be set manually. For each 3D model, the proposed method based on K-means is implemented as follows:
Step 1: convert the 3D model into 40 2D projective views by the projection method proposed in Section 2.2.1
Step 2: cluster these 2D projective views using K-means
Step 3: select the views which are closest to the centers of their own categories as the representative views
When the 2D views are clustered by K-means, the number of categories K must be determined first. According to the experiment, 10-20 views can obtain good performance. Therefore, K is roughly set as 10-20, and then the elbow method [33] is used to determine the final value of K. If the 2D views of a 3D model are divided into K categories, K 2D representative views are obtained for the representation of the 3D model.
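Steps 2 and 3 can be illustrated with a small, dependency-free sketch. It is not the authors' code: the feature vectors being clustered, the farthest-point initialization, and the names `kmeans` and `representative_views` are assumptions made for this example, and the elbow-based choice of K is left out (K is passed in directly).

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign(points, centers):
    """Index of the nearest center for every point."""
    return [min(range(len(centers)), key=lambda c: sq_dist(p, centers[c]))
            for p in points]

def kmeans(points, k, iters=50):
    """Lloyd's algorithm with a deterministic farthest-point initialization."""
    centers = [list(points[0])]
    while len(centers) < k:
        far = max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        centers.append(list(far))
    for _ in range(iters):
        labels = assign(points, centers)
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            # Keep the old center if a cluster happens to be empty.
            updated.append([sum(col) / len(members) for col in zip(*members)]
                           if members else centers[c])
        if updated == centers:
            break
        centers = updated
    return centers, assign(points, centers)

def representative_views(features, k):
    """Pick, for each non-empty cluster, the view closest to its center."""
    centers, labels = kmeans(features, k)
    reps = []
    for c in range(k):
        members = [i for i, lab in enumerate(labels) if lab == c]
        if members:
            reps.append(min(members, key=lambda i: sq_dist(features[i], centers[c])))
    return sorted(reps)

# Toy 2D "view features": three groups of mutually similar views.
features = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
            [5.0, 5.0], [5.1, 5.0], [5.0, 5.2],
            [10.0, 0.0], [10.2, 0.0]]
reps = representative_views(features, k=3)
print(reps)
```

In the toy data, each group of near-duplicate views collapses to a single representative, which is the redundancy reduction the method aims for.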

Index Building Based on CNN.
The indexes of 3D models are built by inputting the 2D representative views into ResNet50 and then organizing the output features according to their categories. For an input model $\mathrm{Model}_i$, its representative views $W_{i1}, W_{i2}, \ldots, W_{in}$ are first generated. Then, these representative views are input into the learned ResNet50. The outputs of the 49th layer of ResNet50 are the features of these representative views, denoted by $F_{i1}, F_{i2}, \ldots, F_{in}$. The outputs of the 50th layer of ResNet50 are the labels of these representative views. In this method, the task of 3D model classification is transformed into the classification of views. The index building process is shown in Figure 6.
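One simple way to organize the extracted features by category is a two-level mapping from label to model to view features. The sketch below is only an illustration under that assumption; the `build_index` helper, the model identifiers, and the short feature vectors are hypothetical, and in practice the features and labels would come from the CNN.

```python
from collections import defaultdict

def build_index(entries):
    """Group view features as index[label][model_id] -> list of features.

    entries: iterable of (model_id, label, feature) triples, where label is
    the category predicted for the view and feature is its feature vector.
    """
    index = defaultdict(lambda: defaultdict(list))
    for model_id, label, feature in entries:
        index[label][model_id].append(feature)
    return index

# Hypothetical representative views of three library models.
entries = [
    ("chair_01", "chair", [0.1, 0.9]),
    ("chair_01", "chair", [0.2, 0.8]),
    ("chair_02", "chair", [0.4, 0.7]),
    ("table_01", "table", [0.9, 0.1]),
]
index = build_index(entries)
print(sorted(index.keys()))
print(len(index["chair"]["chair_01"]))
```

With this layout, a query that has been assigned to one category only needs to touch `index[label]`, which is what restricts similarity matching to a single category.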

Model Retrieval.
The task of similarity matching is to find the most similar 3D model in the dataset according to the input. The input can be an image or a 3D model. If the input is an image, the features are directly extracted and the category is determined through the learned CNN. Within that category, the output 3D model is found via the following equation:
$$i^{*} = \arg\min_{1 \le i \le m} \; \min_{1 \le j \le n_i} \operatorname{dis}(W, F_{ij}),$$
where $\operatorname{dis}()$ is the function to compute the Euclidean distance, $W$ is the features of the input image, $F_{ij}$ is the features of the $j$th view of the $i$th model, $1 \le i \le m$, $m$ is the number of models in a category, $1 \le j \le n_i$, and $n_i$ is the number of representative views of the $i$th model. The model $i^{*}$ is the output result.
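For an image query, the search within one category amounts to finding the library view with the smallest Euclidean distance to the query features and returning the model that owns it. A minimal sketch, assuming the category is stored as a mapping from a model identifier to its view features (the names `nearest_model` and `category_index` are hypothetical):

```python
import math

def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_model(query, category_index):
    """Return (model_id, distance) of the closest view in one category.

    category_index: dict mapping model_id -> list of view feature vectors.
    """
    best_id, best_dist = None, math.inf
    for model_id, views in category_index.items():
        for feature in views:
            d = euclidean(query, feature)
            if d < best_dist:
                best_id, best_dist = model_id, d
    return best_id, best_dist

# Toy category with two models and two representative views each.
category_index = {
    "chair_01": [[0.0, 0.0], [1.0, 0.0]],
    "chair_02": [[3.0, 4.0], [5.0, 5.0]],
}
best_id, best_dist = nearest_model([2.9, 4.1], category_index)
print(best_id, round(best_dist, 4))
```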
If the input is a 3D model, the model retrieval is realized in three steps: (1) generating 2D representative views; (2) inputting these views into CNN for feature extraction and classification. All representative views of a model may not be classified into the same category because of misclassification, so we adopt voting algorithm to determine one category for views of a model; (3) performing similarity matching. In order to improve the matching efficiency, we propose a similarity matching method which uses variable view numbers.
Let Category_Vector denote the category vector, with the $c$th element indicating the number of views classified into the $c$th category. Category_Vector is initialized as follows:
$$\mathrm{Category\_Vector} = [0, 0, \ldots, 0],$$
where the dimension of Category_Vector equals the number of categories in the model library. When a representative view is assigned to the $c$th category, this vector is updated by
$$\mathrm{Category\_Vector}[c] = \mathrm{Category\_Vector}[c] + 1.$$
Finally, the category of the model is determined by
$$\mathrm{category} = \arg\max_{c} \mathrm{Category\_Vector}[c].$$
After classification, the retrieval procedure is summarized in Algorithm 1. In order to improve retrieval efficiency, we design a flexible retrieval strategy: (1) if the distance between an input view and a view of a model in the library is small enough, i.e., $\mathrm{dis} < \eta$, this model is taken as the output model; (2) if $C_{threshold}$ representative views are matched to the same model in the same category, this model is taken as the output model; (3) if the representative views are matched with different models of the same category, the cumulative distance value is calculated, and the model with the minimum cumulative distance is the output model.
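The voting step and the three stopping conditions can be sketched as below. Because the body of Algorithm 1 is not reproduced in the text, this is one possible reading rather than the authors' procedure: it assumes input views are processed one at a time, that a view is "matched" to the model holding its nearest library view, and that the cumulative distance of a model sums, over the input views, the distance to that model's closest view. All function and variable names are hypothetical.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def vote_category(view_labels, num_categories):
    """Majority vote over the per-view labels (integers in [0, num_categories))."""
    category_vector = [0] * num_categories      # Category_Vector = [0, ..., 0]
    for c in view_labels:
        category_vector[c] += 1                 # one vote per representative view
    return max(range(num_categories), key=lambda c: category_vector[c])

def variable_view_match(query_views, category_index, eta, c_threshold):
    """Return (model_id, views_used, condition) for one input model.

    category_index: dict mapping model_id -> list of view features.
    """
    count = {m: 0 for m in category_index}
    cumulative = {m: 0.0 for m in category_index}
    for used, w in enumerate(query_views, start=1):
        # Distance from this input view to the closest view of every model.
        per_model = {m: min(euclidean(w, f) for f in views)
                     for m, views in category_index.items()}
        best = min(per_model, key=per_model.get)
        if per_model[best] < eta:               # condition (1): a very close view
            return best, used, 1
        count[best] += 1
        if count[best] >= c_threshold:          # condition (2): enough agreeing views
            return best, used, 2
        for m, d in per_model.items():
            cumulative[m] += d
    # Condition (3): all views used, take the minimum cumulative distance.
    return min(cumulative, key=cumulative.get), len(query_views), 3

library = {"A": [[0.0, 0.0], [0.0, 1.0]], "B": [[5.0, 5.0], [6.0, 5.0]]}
views = [[1.0, 1.0], [1.0, 0.0], [0.0, 2.0]]
voted = vote_category([2, 2, 1, 2, 0], num_categories=3)
r1 = variable_view_match([[0.1, 0.0]], library, eta=0.5, c_threshold=2)
r2 = variable_view_match(views, library, eta=0.5, c_threshold=2)
r3 = variable_view_match(views, library, eta=0.5, c_threshold=5)
print(voted, r1, r2, r3)
```

The three calls exercise conditions (1), (2), and (3) in turn; the number of views consumed grows only when the earlier, cheaper conditions fail to fire.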

Experiments and Results
The proposed method is evaluated on the following two aspects: model classification and model retrieval.

Model Classification Evaluation.
In this section, we compare the proposed method with the state-of-the-art methods. The evaluation is made on the following 3D model databases: McGill 3D Shape Benchmark [34] (a nonrigid 3D model dataset) and ModelNet10 and ModelNet40 [35] (two rigid 3D model datasets). Table 2 shows the detailed information of these datasets. We follow the training and testing split included in ModelNet10 and ModelNet40. ModelNet10 consists of 4899 models in 10 categories, of which 3991 are used as the training dataset and 908 as the test dataset. ModelNet40 consists of 12311 models in 40 categories, of which 9843 are used as the training dataset and 2468 as the test dataset. McGill contains 255 models, of which 179 are randomly selected for training and the remaining 76 are used as the test dataset. The model pretrained on ImageNet is used to initialize the parameters of ResNet50. The learning rate is set as 0.01. The batch size is set as 32 according to GPU memory and training efficiency. To allow the loss function to converge, the number of epochs is set as 200.

Representative View Selection.
In the proposed projection method, each 3D model is represented as 40 views. In order to improve the efficiency of classification and retrieval, representative views are selected from the 40 projective views by the method proposed in Section 2.3.1. The number of representative views K has a great influence on the classification accuracy. In experiments, K is set as 5, 10, 20, and 30, respectively. The misclassified models of the proposed method for different K are shown in Table 3.
In McGill, whatever K is, there is no misclassified model. The number of misclassified models in ModelNet10 and ModelNet40 decreases as K becomes larger. When K is 5, the number of misclassified models is the largest. When K is more than 20, the number of misclassified models decreases slowly. According to this result, we set the range of K as [10,20]. The performance of the proposed method under different datasets and different conditions is shown in Table 4. Taking ModelNet10 as an example, there are 908 models in its test set, and each model has 40 2D views before representative view selection, so the number of 2D views is 36320 (908 × 40). After representative view selection, each model has about 14 2D views, so the number of 2D views is 12742. The classification accuracy remains the same before and after the representative view selection method is used.
We can see from Table 4 that our representative view selection method does not cause performance degradation on McGill and ModelNet10. The classification accuracy on ModelNet40 decreases by only 0.9% after our representative view selection. It should be noted that our representative view selection reduces the number of views to about 1/3. A smaller number of views leads to higher efficiency of 3D model classification and retrieval. The experiments in the following sections adopt representative views for model classification and retrieval. For each model, about 14 projective views are enough to obtain good performance.

Comparison of Classification Algorithms Based on Views.
We compare the proposed method with several traditional methods, and the results are shown in Table 5. We can see that our proposed method has achieved the best performance in ModelNet10, with a recognition accuracy of 94.10%. In addition, it has achieved a recognition accuracy of 92% in ModelNet40, which is just 0.9% lower than that of VS-MVCNN. Although VS-MVCNN outperforms the proposed method, it needs 80 views.
Our proposed method can achieve 100% recognition accuracy in McGill (shown in Figure 7). This indicates that the proposed method performs well on both rigid and nonrigid 3D datasets.

Classification Result Analysis.
The confusion matrix of the proposed method in ModelNet10 is shown in Figure 8. We can see that the proposed method achieves an accuracy of 100% in the classes of bed, chair, and monitor, an accuracy of more than 90% in the classes of bathtub, desk, sofa, and toilet, and an accuracy of less than 90% in the classes of dresser, night_stand, and table (88%, 84%, and 83%, respectively). The accuracy in the table class is the worst, with 15% of models being misclassified as the desk class and 2% of models being misclassified as the night_stand class. The reason is that the models in the table class and the models in the desk class are extremely similar to each other.
We can see from Figure 9 that the models in the dresser class and the night_stand class are extremely similar, which leads to misclassification. The misclassification of these models does not matter for users because the two models are either the same or similar enough. The advantage of our method is that it can obtain high accuracy given a small number of views. In particular, the recognition accuracy on McGill is 100%. The reason is that there are great differences between the classes of McGill, and multiple views can better represent 3D models from different angles, leading to superior performance. However, on ModelNet10 and ModelNet40, the proposed method does not perform well on some classes, such as the table class and desk class, or the night_stand class and dresser class. The reason is that there is no obvious difference between these classes in ModelNet10 and ModelNet40, so any classification method is prone to mistakes on them.

Retrieval Experiment.
Our retrieval method is based on the classification results. The input is classified before similarity matching. The advantage is that similarity is calculated between the input and the models in one category rather than all categories, so it can greatly reduce the search scope and computational complexity. In the following sections, the similarity matching method is evaluated and analyzed on the rigid datasets and the nonrigid dataset, respectively.

Retrieval Experiment for Rigid Datasets.
Our shape descriptors are compared against the spherical harmonics descriptor (SPH) [10], LFD [20], 3D ShapeNets [6], DeepPano [23], PANORAMA-NN [24], View Inter-Prediction GAN (VIPGAN) [36], and Ma et al.'s method [37]. The mean average precision (MAP) results are shown in Table 6. We can see that the MAP of our proposed method is higher than those of the other methods. There are two reasons for this: (1) classification is performed before retrieval, and the accuracy of the proposed classification method is high enough to ensure good retrieval accuracy, and (2) the voting mechanism is adopted. Some views of an input model are easily misclassified due to their high similarity. Through the voting mechanism, these misclassified views can be reclassified correctly. The precision-recall curves are shown in Figures 10 and 11. We can see that our method outperforms other state-of-the-art methods. The precision-recall curve of the proposed method is stable, while those of the other methods gradually decrease with the increase of recall. Taking Figure 10 as an example, when the recall rate is less than 0.2, PANORAMA-NN and Ma et al.'s method perform better than our method. However, when the recall rate is larger than 0.9, the precision rates of the two methods decrease rapidly. In particular, the precision rate of Ma et al.'s method decreases to 0.1 when the recall rate is close to 1. The precision-recall curves of DeepPano and VIPGAN are similar to that of the proposed method when the recall rate is less than 0.9. However, their precision rates decrease rapidly when the recall rate is close to 1. The SPH performs the worst. The LFD is slightly better than the SPH. The 3D ShapeNets is in the middle of these eight methods. The precision rates of these three methods decrease from 1 to 0 with the increase in recall.

Figure 9: Similar models in different categories.

Table 6: MAP (%) of different methods on ModelNet10 and ModelNet40.
Method              ModelNet10   ModelNet40
SPH [10]            45.9         34.4
LFD [20]            49.8         40.9
3D ShapeNets [6]    69.2         59.9
DeepPano [23]       84.2         76.8
PANORAMA-NN [24]    87.4         83.5
VIPGAN [36]         90.6         89.2
Ma et al. [37]      93.1         84.3
Ours                94.1         92.0

Algorithm 1 (input): $W_l$ is the features of the representative views of the input model, $l = 1, 2, 3, \ldots, p$; $F_{ij}$ is the features of the $j$th view of the $i$th model in the dataset; $m$ is the number of models in a category; $n_i$ is the number of representative views of the $i$th model; $\eta$ is the minimum distance; Distance_Vector is the distance vector, Distance_Vector = [0, 0, ..., 0]; Count is the counting vector, used to record the number of views that are classified into each category.
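The text does not spell out how MAP is computed, so the following sketch assumes the standard definition: for each query, average the precision at every rank where a relevant model appears, then average over all queries. It also assumes that each ranking covers every relevant model of the query. The function names and the toy rankings are hypothetical.

```python
def average_precision(ranked_relevance):
    """Average precision of one ranked result list.

    ranked_relevance: list of booleans, True where the retrieved model
    belongs to the same class as the query.
    """
    hits, precision_sum = 0, 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            precision_sum += hits / rank    # precision at this recall point
    return precision_sum / hits if hits else 0.0

def mean_average_precision(all_rankings):
    """Mean of the per-query average precision values."""
    return sum(average_precision(r) for r in all_rankings) / len(all_rankings)

ap_first = average_precision([True, False, True])
map_value = mean_average_precision([[True, False, True], [False, True]])
print(ap_first, map_value)
```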
We can see from Table 7 that our proposed method achieves the best performance on the NN, FT, ST, and DCG measures. Moreover, the performance of the proposed method on the nonrigid dataset is better than that on the rigid datasets.
The reason is that we use the well-trained CNN to classify the models in McGill. The classification accuracy is 100%, so the retrieval accuracy is also 100%. In summary, our method obtains good performance on both rigid and nonrigid datasets.

Retrieval Efficiency Analysis.
Experiments show that similarity matching consumes the most time during 3D model retrieval. Taking ModelNet10 as an example, there are 908 models in the test set and 3991 models in the training set. Each model has 40 views, so the test set contains 36320 views and the training set contains 159640 views. If all views are used for similarity matching, the time cost is large. Table 8 shows the comparison of the number of views before and after representative view selection in ModelNet10.
We can see that the view number in the test set decreases from 36320 to 12742 and the view number in the training set decreases from 159640 to 56613 through representative view selection. The view number is reduced by about 2/3 after representative view selection, so this method can effectively reduce redundant views and greatly improve the retrieval efficiency.
In ModelNet10, the training set consists of 3991 models, and these models are divided into 10 classes, with each class consisting of 399 models on average. After applying representative view selection, the number of similarity matching operations per input model is reduced from 638400 (40 × 399 × 40) to 78204 (14 × 399 × 14), a reduction of 87.8%. The variable view matching method can further improve the matching efficiency. In this paper, η is defined as the distance between two views generated by two adjacent projective points of the same model. We call η the adjacent view distance. The smaller η is, the higher the accuracy is. We take ModelNet10 as an example to analyze η under our projection method. Adjacent projection points are shown in Figure 12.
Experiments show that the adjacent view distances of different pairs of views are different. In the same category, the minimum adjacent view distance is chosen as the representative to form the list of adjacent view distances. Table 9 shows the average adjacent view distances when different numbers of models are selected for each category. Taking the bathtub class as an example, when the model number is 1, the minimum adjacent view distance is 1.705. When the model number is 20, the average adjacent view distance is 1.995. The adjacent view distance of the table class is the smallest and that of the bed class is the largest. The reason is that the model complexity is different. The models in the table class are simple, while the models in the bed class are more complex than others. The last row of Table 9 shows the average adjacent view distance of all categories with model numbers of 20, 10, 5, and 1. We can see that when the number of models is 1, the average adjacent view distance is the smallest at 1.6418. When the number of models is 20, the average adjacent view distance is the largest at 1.8572. In order to improve the efficiency and accuracy of 3D model retrieval, η is set as 1.5.
In Algorithm 1, there are three conditions to finish similarity matching. The view numbers used under the three conditions are 1, 5, and 14, respectively, i.e., $C_{threshold}$ is set as 5. The results on ModelNet10 are shown in Table 10, where η is 1.5. For example, in bathtub, there are 3 models under condition 1. That is to say, these 3 models can be retrieved by using only one view. There are 4 models under condition 2 and 43 models under condition 3. If we do not use variable view matching, all models are retrieved by using 14 views. In ModelNet10, if we use variable view matching, the total number of views is 10267 (132 + 405 + 9730), while that of the traditional method is 12742. The number of views is reduced by 2475. That is to say, the average number of views for the retrieval of each model is reduced to about 11. Through variable view matching, the average number of similarity matching operations for each model is approximately 61446 (11 × 399 × 14). Compared with using only the representative view selection method, the number of similarity matching operations is further reduced by 21.4%.
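The counts quoted in this section can be checked with a few lines of arithmetic. All input numbers below are taken from the text; the only added assumption is that every model finishing under conditions 1, 2, and 3 uses exactly 1, 5, and 14 views, which lets the per-condition view totals be converted back into model counts.

```python
models_per_class = 399      # average number of library models per class

# Similarity matching operations per input model.
full_matches = 40 * models_per_class * 40   # all 40 views on both sides
rep_matches = 14 * models_per_class * 14    # about 14 representative views
var_matches = 11 * models_per_class * 14    # about 11 input views on average

rep_reduction = 100 * (1 - rep_matches / full_matches)   # percent, quoted as 87.8%
var_reduction = 100 * (1 - var_matches / rep_matches)    # percent, quoted as 21.4%

# Views used on the ModelNet10 test set under the three conditions.
views_by_condition = {1: 132, 2: 405, 3: 9730}
views_per_model = {1: 1, 2: 5, 3: 14}       # assumption: exact view counts
total_views = sum(views_by_condition.values())
saved_views = 12742 - total_views
models_by_condition = {c: views_by_condition[c] // views_per_model[c]
                       for c in views_by_condition}

print(full_matches, rep_matches, var_matches)
print(rep_reduction, var_reduction)
print(total_views, saved_views, total_views / 908)
print(models_by_condition, sum(models_by_condition.values()))
```

Under the stated assumption, the recovered model counts add up to the 908 models of the test set, which is consistent with the totals reported for Table 10.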

Conclusion
With the increase of 3D models, the degradation of retrieval accuracy and efficiency becomes a serious problem for 3D model retrieval systems. An efficient 3D model retrieval method is proposed in this paper. The efficiency of the proposed method is improved in three aspects: (1) Efficient indexes are built through representative view selection and feature extraction with the CNN, and the features are then organized via their labels. In this way, the 3D models are represented more efficiently and the number of used views is reduced substantially. (2) The number of similarity matching operations is reduced by classification before retrieval. In retrieval, the 2D views of the input model are classified into one category with the CNN and the voting mechanism, and then only the features of this category, rather than all categories, are chosen for similarity matching. (3) A variable view matching method is proposed, so the retrieval of some models can be terminated ahead of time.
The accuracy of our proposed method is improved in two aspects: (1) The classification of input models is made before retrieval. Our classification method obtains good performance, so the retrieval accuracy and efficiency are guaranteed. (2) The voting mechanism is used to classify input 3D models. Through the voting mechanism, the misclassified views can be reclassified correctly.
Although the proposed 3D model retrieval method demonstrates great improvement in both accuracy and efficiency, similar 3D models are easily misclassified. Therefore, we will study how to improve the discrimination of the model representation in our future work.

Data Availability
Previously reported ModelNet10 and ModelNet40 data are used to support this study and are available at http://modelnet.cs.princeton.edu/. These prior studies (and datasets) are cited at relevant places within the text as reference [35].
The McGill 3D Shape Benchmark data are used to support this study and are available at http://www.cim.mcgill.ca/~shape/benchMark/. These prior studies (and datasets) are cited at relevant places within the text as reference [34]. We also called it McGill and McGill10 in our paper.