TGPS: dynamic point cloud down-sampling of the dense point clouds for Terracotta Warrior fragments

Abstract: The dense point clouds of Terracotta Warriors obtained by a 3D scanner contain a large amount of redundant data, which reduces the efficiency of transmission and subsequent processing. To address the problems that points generated by conventional sampling methods cannot be learned through the network and are irrelevant to downstream tasks, an end-to-end, task-driven, learnable down-sampling method named TGPS is proposed. First, a point-based Transformer unit embeds the features, and a mapping function extracts the input point features to dynamically describe the global features. Then, the inner product of the global feature and each point feature is used to estimate the contribution of each point to the global feature. The contribution values are sorted in descending order for different tasks, and the point features with high similarity to the global features are retained. To further learn rich local representations, the Dynamic Graph Attention Edge Convolution (DGA EConv), which combines the graph convolution operation with a neighborhood graph, is proposed for local feature aggregation. Finally, the networks for the downstream tasks of point cloud classification and reconstruction are presented. Experiments show that the method realizes down-sampling under the guidance of the global features. The proposed TGPS-DGA-Net for point cloud classification achieves the best accuracy on both the real-world Terracotta Warrior fragments and the public datasets. © 2023 Optica Publishing Group under the terms of the Optica Open Access Publishing Agreement


Introduction
The Terracotta Warriors are a critical carrier of Chinese history and culture, and their virtual restoration has important social and economic value for the digital protection of cultural relics and the study of cultural inheritance. With the emergence of 3D laser scanner technology, point clouds are easily obtained and widely used in computer vision (CV), in fields such as automatic driving [1], remote sensing [2], and cultural heritage protection [3,4,5,6,7]. The Terracotta Warrior fragments can be digitized by a 3D scanner and restored virtually. However, the dense point clouds contain many redundant points and require excessive storage space and post-processing time, which hinders the storage, transmission, calculation, and display of cultural relics. Therefore, how to simplify the dense point clouds accurately and effectively while maintaining the feature points of the models is still a challenging problem for the virtual restoration of cultural relics.
Due to the sparseness, unstructuredness, and irregularity of point clouds, the traditional Convolutional Neural Network (CNN) cannot be applied to point clouds directly. As a pioneering work, PointNet [8] analyzes point clouds directly with deep neural networks. To learn richer local structures, many follow-up works have been proposed, which can generally be divided into neighborhood feature fusion [9,10], graph-based convolution [11,12,13], and kernel-based convolution [14,15]. PointNet++ [9] builds a hierarchical feature learning structure. To collect critical points from the upper layer, a down-sampling method is required. PointNet++ first uses Farthest Point Sampling (FPS) to select centroids at different levels, and then queries the neighbor points of the centroids within a specific radius to form local regions. However, FPS relies only on the Euclidean distance, has high time complexity, and does not account for the sparsity and semantic information of point clouds. Subsequently, some researchers have attempted to apply deep learning to point cloud down-sampling [16,17,18,19,20]. However, the existing approaches are not suitable for the large and dense point clouds of the Terracotta Warrior fragments, which have tens of thousands of points. At the same time, the performance of these task-independent sampling networks still needs further improvement.
To address the problems mentioned above, an end-to-end learnable down-sampling method named TGPS for the Terracotta Warrior fragments is proposed. The overall framework of the method is illustrated in Figure 1. First, we divide the dense point clouds into irregular local patches (which may overlap) using FPS and the K-Nearest Neighbor (KNN) algorithm. Second, the TGPS produces sampled points that are related to the different downstream tasks. Moreover, the DGA EConv aggregates the features of a set of points according to their contribution coefficients to further enrich the local region information. Third, the TGPS-DGA Module, which consists of the TGPS and the DGA EConv, is followed by a folding-based decoder [11] (or a fully connected (FC) layer), forming a complete network for point cloud reconstruction (or classification). Finally, the resulting model is evaluated on the public datasets and the real-world datasets. Experiments demonstrate that the proposed method achieves comparable performance with other approaches in point cloud reconstruction and classification. The main contributions are summarized as follows:
• We propose an end-to-end, task-driven, learnable down-sampling approach (TGPS) based on Transformers, and present the sampling results for the downstream tasks of 3D Terracotta Warrior reconstruction and classification. Experiments show that our method performs well on the large and dense point clouds of the Terracotta Warrior fragments.
• We propose a DGA EConv for local feature aggregation, which exploits the attention mechanism on a local graph to capture accurate and robust geometric details of Terracotta Warrior fragments.
• We propose a TGPS-DGA-Net, which is equipped with the TGPS-DGA Module and the FC layer, for 3D object classification; it achieves the best accuracy on the real-world Terracotta Warrior fragments and the public datasets.

Point cloud sampling
FPS [9] and SO-Net [10] are both offline, task-independent down-sampling methods. Subsequent methods have improved on them to a certain extent. Yang et al. [16] utilized a Self-Attention (SA) mechanism to learn the relationship between points and used Gumbel Subset Sampling (GSS) to improve the accuracy. S-Net [17] generated sampled points optimized for a specific task. However, its matching operation was not differentiable, so gradients could not be propagated through the neural network. Furthermore, Lang et al. [18] proposed a differentiable method to approximate point cloud sampling. To address the problem that existing down-sampling methods did not consider the importance of the sampled points to the final output, Nezhadarya et al. [19] presented an adaptive down-sampling layer named Critical Points Layer (CPL), which could simplify unordered point clouds while preserving critical points. However, the sampling task was trained separately and the performance was limited. Wang et al. [20] proposed a trainable method (GDS) for the point cloud classification task without repetitive sampling and introduced the Chamfer distance (CD) loss function to make the sampling results as uniform as possible. Our method can be regarded as an improvement on GDS to some extent, and its sampling results are more relevant to downstream tasks.

Transformers
Transformers were first proposed in the field of Natural Language Processing (NLP), where they achieved great success [21,22]. They have since shown promising performance on various vision tasks [23,24,25,26,27]. More recently, there have also been attempts to apply Transformers to point cloud data. PT [28] and PCT [29] demonstrated the superior performance of visual Transformers in point cloud analysis. Subsequently, some scholars extended Transformers to various applications, such as 3DETR [30] and Pointformer [31] for 3D object detection, PTTR [32] for 3D target tracking, and PST-NET [33] for point cloud simplification. Yet a learned point cloud down-sampling method based on Transformers, subject to a subsequent task objective, has not been proposed before. Our TGPS can be seen as a new attempt in this direction.

Graph convolution methods
Graph convolutional networks can be categorized as spectral approaches [34,35,36] and non-spectral approaches [12,37,38]. The eigendecomposition of the Laplacian operator is computationally expensive, and spectral approaches require a large number of parameters and make spatial localization difficult. In contrast, non-spectral methods describe the spatial relationship of each point with vertices and edges, and construct a convolution kernel by aggregating edge features and vertex features. Simonovsky et al. [37] proposed edge convolution operations that are performed on graph signals in the spatial domain. However, although the edge label generation process is dynamic, the irregular local point distribution is not considered. DGCNN [12] adopted EdgeConv to capture the local relationship and dynamically updated the relationship graph. LDGCNN [38] increased the number of network layers and used residual connections to improve the model performance, but directly using max-pooling leads to the loss of some important information. Therefore, we propose the DGA EConv, which can learn local relations between points.

Transformer unit
We follow the coordinate-based input embedding method in [34]. The input points are first embedded into a high-dimensional space through an MLP, and then the Neighbor embedding cell (NEC) is used to extract local features. Next, two consecutive Improved Offset-Attention (IOA) cells produce the global features, which capture the rich semantic features of each point. The output feature vector of the Transformer unit is obtained by concatenating the local features and the global features. Hence, the Transformer unit can learn more geometric and semantic information and improve the ability to perceive local features.

Neighbor embedding cell
The neighbor embedding cell is used to strengthen the capability of local feature extraction. After data augmentation, the input is the point cloud $P$ with $N$ points and corresponding features $F$, and the sampled point cloud $P_S \subseteq P$ is obtained by down-sampling. For each point $p_S$ of the sampled point cloud, the neighborhood $\mathcal{N}(p_S) = \{p_S^j \mid j = 1, 2, \ldots, k\}$ is gathered by the KNN. Then, the difference between the features of the center point $p_S$ and each neighbor point $p_S^j$ is calculated as the local feature, and it is combined with the point feature and passed through the MLP before neighborhood aggregation. Finally, the aggregated feature $F_{NS}(p_S)$ of the neighbor embedding cell is generated by max-pooling.
In this computation, $F(p_S)$ is the corresponding output feature of the point $p_S$, $C(\cdot)$ denotes the concatenation operation, $R(F(p_S), k)$ is the operation that repeats the input $F(p_S)$ $k$ times to construct a matrix, and $M(\cdot)$ is the max-pooling operation.
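The steps of the neighbor embedding cell can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' implementation: a single linear layer with ReLU stands in for the shared MLP, neighborhoods are built for every input point instead of only the sampled points, and the names `neighbor_embedding` and `weight` are hypothetical.

```python
import numpy as np

def neighbor_embedding(coords, feats, k, weight):
    """coords: (N, 3), feats: (N, d), weight: (2d, d_out), a stand-in for the shared MLP."""
    sq_dist = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(axis=-1)
    knn_idx = np.argsort(sq_dist, axis=1)[:, :k]        # (N, k), includes the point itself
    diff = feats[knn_idx] - feats[:, None, :]            # neighbor minus center, (N, k, d)
    repeated = np.repeat(feats[:, None, :], k, axis=1)   # R(F, k): center feature repeated k times
    edge = np.concatenate([diff, repeated], axis=-1)     # C(.): concatenation, (N, k, 2d)
    hidden = np.maximum(edge @ weight, 0.0)              # assumed MLP: one linear layer + ReLU
    return hidden.max(axis=1)                            # M(.): max-pooling over the k neighbors

rng = np.random.default_rng(0)
coords = rng.normal(size=(64, 3))
feats = rng.normal(size=(64, 8))
weight = rng.normal(size=(16, 16))
local_feats = neighbor_embedding(coords, feats, k=16, weight=weight)  # shape (64, 16)
```

Max-pooling over the neighbor axis makes the output independent of the order in which the neighbors are listed, which is why it is a common aggregation choice for unordered point sets.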

Improved offset-attention (IOA)
As a core component of Transformers, the SA is used to establish the inner links between features so that similar features can be selectively aggregated together. In our TGPS-DGA Module, an improved Offset-Attention (IOA) is introduced to replace the original SA, which can improve the performance of the network. First, the query, key, and value matrices $Q$, $K$, and $V$ are produced by learnable linear transformations of the input feature $F_{NS}(p_S)$, where $d_u$ is the dimension of the matrices $Q$ and $K$, and $d_i$ is the dimension of the matrix $V$. To reduce computation, the dimension $d_i$ is set to $d_u/4$. We compute the dot products of the query points with the key points, scale them by $\sqrt{d_u}$, and apply a softmax function to obtain the weights on the values. The higher the attention weight, the stronger the relevance of the feature. Next, the sum of the values in the matrix $V$ weighted by the attention weights is calculated, and the final output features are denoted by $F_{sa}(p_S)$.
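As an illustration of the attention computation described above, the following sketch implements standard scaled dot-product self-attention in NumPy. The weight matrices `w_q`, `w_k`, and `w_v` are hypothetical stand-ins for the learnable linear transformations, and the dimensions are chosen only so that $d_i = d_u/4$.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(feats, w_q, w_k, w_v):
    """feats: (N, d); w_q, w_k: (d, d_u); w_v: (d, d_i). Returns (N, d_i) and the (N, N) weights."""
    q, k, v = feats @ w_q, feats @ w_k, feats @ w_v
    scores = q @ k.T / np.sqrt(q.shape[1])    # dot products scaled by sqrt(d_u)
    weights = softmax(scores, axis=1)         # each row sums to 1
    return weights @ v, weights               # weighted sum of the values

rng = np.random.default_rng(0)
feats = rng.normal(size=(10, 16))
w_q = rng.normal(size=(16, 16))
w_k = rng.normal(size=(16, 16))
w_v = rng.normal(size=(16, 4))                # d_i = d_u / 4
f_sa, attn = self_attention(feats, w_q, w_k, w_v)
```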
The IOA cell can be built upon different relational functions between the SA features $F_{sa}(p_S)$ and the input features $F_{NS}(p_S)$ of each point. To avoid vanishing gradients during training, the features are passed through the MLP, and the final output features $F_{ioa}(p_S)$ of the IOA are obtained through a skip connection with the feature $F_{NS}(p_S)$.
Here, $F_{ioa}(p_S)$ denotes the output feature of a single-layer IOA and $\theta(\cdot, \cdot)$ is a relational function.
As shown in Figure 3, the numbers 1 to 5 denote five different relational functions: Summation, Subtraction, Concatenation, Hadamard product, and Dot product, respectively.
In these functions, $o(\cdot)$ and $\mu(\cdot)$ are trainable transformations, for example linear functions. As shown in Table 6, the relational function $\theta(\cdot, \cdot)$ gives the best result with Concatenation; see Section 4.4.2 for further discussion and analysis. Finally, the output feature $F_G(p_S)$ of two consecutive IOA cells is expressed as Eq. (7).
where $IOA_i$ denotes the $i$th improved offset-attention layer, each attention layer has the same output dimension as its input, and $W_o$ is the weight of the linear layer.
The output feature $F_{LG}(p_S)$ of the Transformer unit is formed by concatenating the local feature $F_L(p_S)$ and the global feature $F_G(p_S)$.
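The five relational functions named in this subsection can be illustrated with the sketch below. The exact forms used in the paper are not recoverable from this text, so the definitions here are assumptions: $o(\cdot)$ and $\mu(\cdot)$ are modeled as plain linear maps `w_o` and `w_mu`, and the argument order in the subtraction is a guess. The function name `relate` is hypothetical.

```python
import numpy as np

def relate(f_sa, f_in, w_o, w_mu, kind):
    """Combine SA features f_sa and input features f_in, both of shape (N, d)."""
    a = f_sa @ w_o     # o(.): assumed linear transformation
    b = f_in @ w_mu    # mu(.): assumed linear transformation
    if kind == "summation":
        return a + b
    if kind == "subtraction":
        return a - b                                   # sign convention is an assumption
    if kind == "concatenation":
        return np.concatenate([a, b], axis=-1)         # doubles the channel dimension
    if kind == "hadamard":
        return a * b                                   # element-wise product
    if kind == "dot":
        return (a * b).sum(axis=-1, keepdims=True)     # one scalar per point
    raise ValueError(f"unknown relational function: {kind}")

rng = np.random.default_rng(0)
f_sa = rng.normal(size=(10, 8))
f_in = rng.normal(size=(10, 8))
w_o = rng.normal(size=(8, 8))
w_mu = rng.normal(size=(8, 8))
outputs = {kind: relate(f_sa, f_in, w_o, w_mu, kind)
           for kind in ["summation", "subtraction", "concatenation", "hadamard", "dot"]}
```

Note that only Concatenation keeps both transformed inputs intact; the other four functions merge them into a single vector or scalar per point.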

GPcS layer
As shown in the GPcS layer in Figure 2(a), the output feature $F_G(p_S)$ of the Transformer unit is fed into a mapping function $\phi$, which consists of the MLP followed by a global max-pooling, to learn a score vector $F_\mu(p_S)$ that is used to estimate each point's significance $\gamma(p_S)$. The score vector $F_\mu(p_S)$ can be seen as an optimizable variable related to the description of the input cloud during training. Therefore, the score vector can be expressed as $F_\mu(p_S) = \phi(F_G(p_S))$. The significance $\gamma(p_S)$ of each point is calculated as the inner product between $F_G(p_S)$ and $F_\mu(p_S)$, which can be regarded as the semantic similarity between each point and the global feature. The features of important points and their coordinates are preserved for global shape description by retaining the points with high scores. As shown in Eq. (10), we gather the indices of the top-$m$ points ranked by $\gamma(p_S)$, and the corresponding features form $F_g$. The features of the sampled points can be obtained by Eq. (11).
As the sampling rate gradually increases, a large amount of local geometric detail is lost. Therefore, the spatial neighborhood features of the down-sampled points need to be aggregated. To enrich the feature information, the features of the sampled points from the previous layer are included. The simplest solution is to look up the input feature $F_G(p_S)$ of each sampled point in the input point cloud. The sampled point feature $F_g(p_S)$ and the local geometric details aggregated by the KNN algorithm are concatenated to obtain rich local features $F_n$, as shown in Eq. (12).
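The scoring and top-$m$ selection of the GPcS layer can be sketched as follows. This is an illustration under stated assumptions, not the paper's code: the mapping function $\phi$ is reduced to one linear layer with ReLU followed by global max-pooling, the output width of that layer equals the feature width so that the inner product is defined, and the names `global_guided_sampling` and `weight` are hypothetical.

```python
import numpy as np

def global_guided_sampling(coords, feats, m, weight):
    """coords: (N, 3), feats: (N, d), weight: (d, d). Keeps the m highest-scoring points."""
    mapped = np.maximum(feats @ weight, 0.0)       # assumed MLP part of phi
    score_vec = mapped.max(axis=0)                 # global max-pooling -> score vector, shape (d,)
    gamma = feats @ score_vec                      # inner product of each point feature with it
    top = np.argsort(-gamma, kind="stable")[:m]    # indices sorted by descending significance
    return top, coords[top], feats[top], gamma

rng = np.random.default_rng(0)
coords = rng.normal(size=(100, 3))
feats = rng.normal(size=(100, 8))
weight = rng.normal(size=(8, 8))
idx, sampled_xyz, sampled_feats, gamma = global_guided_sampling(coords, feats, m=25, weight=weight)
```

Because the selection is an index gather, every sampled point is an original input point and no point is selected twice, which matches the non-repetitive sampling described in the text.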

Dynamic graph attention edge convolution
For a given graph $G = (V, E)$ with vertex set $V$ and edge set $E$, weights are assigned to the edge features formed by the k-nearest neighbor graph so that the interference of distant points is weakened and the features of close points are relatively strengthened, which helps the network learn suitable local features. To assign these weights, a shared attention convolution kernel $r: \mathbb{R}^{3+d} \to \mathbb{R}^{d'}$ is constructed, which is obtained by learning the features of the center point $p_i$ and its neighbor point $p_{ij}$. The attention edge coefficient $w'_{ij}$ represents the importance of the neighbor point to the center point and is calculated by the attention mechanism $r$ from the concatenation $C(\cdot)$ of $(p_{ij} - p_i)$ and $(\omega_g(h_{ij}) - \omega_g(h_i))$. Here, $\omega_g$ is a feature mapping function, an MLP that maps input point features to high-dimensional features; $(p_{ij} - p_i)$ represents the relative spatial relationship between point $p_i$ and its neighbor point $p_{ij}$; $h_{ij}$ represents the input feature of the neighborhood point $p_{ij}$; and $(\omega_g(h_{ij}) - \omega_g(h_i))$ represents the feature difference between the two points. From Eq. (13), the attention edge coefficient assigns more attention to neighborhood points that are close to the center point in both Euclidean distance and feature space. To make the edge coefficients comparable across neighborhoods of different scales at different points, they are normalized using the softmax function.
In the normalized coefficients, $w'_{ij,k}$ represents the attention weight of $p_{ij}$ corresponding to $p_i$ on the $k$th feature channel. The shared attention convolution $r$ can be implemented by an MLP. Expanding Eq. (14) gives the final normalized edge coefficients, in which $\omega_\alpha$ stands for the MLP. The normalized edge coefficient $w_{ij}$ is used to assign a weight to each edge, and the weighted sum of the neighborhood point features of the central point $p_i$ is calculated as the final output feature for each point. The DGA EConv operator is shown in Figure 4. The final updated feature of point $p_i$ is defined in terms of the feature mapping function $\omega_g(h_j)$ and the corresponding bias term $b_i$.
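A rough NumPy sketch of the attention edge convolution described in this subsection is given below. Several details are assumptions because the displayed equations are not available here: $\omega_g$ and the attention kernel are each modeled as a single linear map (`w_g`, `w_a`), the attention input concatenates the 3D offset with the mapped feature difference, the center point is excluded from its own neighborhood, and no activation is applied to the output. All names are hypothetical.

```python
import numpy as np

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dga_edge_conv(coords, feats, k, w_g, w_a, bias):
    """coords: (N, 3), feats: (N, d), w_g: (d, d2), w_a: (3 + d2, d2), bias: (d2,)."""
    sq_dist = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(axis=-1)
    knn_idx = np.argsort(sq_dist, axis=1)[:, 1:k + 1]      # k nearest neighbors, self excluded
    mapped = feats @ w_g                                    # assumed feature mapping omega_g
    rel_pos = coords[knn_idx] - coords[:, None, :]          # p_ij - p_i, (N, k, 3)
    rel_feat = mapped[knn_idx] - mapped[:, None, :]         # feature difference, (N, k, d2)
    edge = np.concatenate([rel_pos, rel_feat], axis=-1)     # C(.), (N, k, 3 + d2)
    raw = edge @ w_a                                        # unnormalized coefficients per channel
    weights = softmax(raw, axis=1)                          # normalize over the k neighbors
    out = (weights * mapped[knn_idx]).sum(axis=1) + bias    # weighted sum plus bias term
    return out, weights

rng = np.random.default_rng(0)
coords = rng.normal(size=(50, 3))
feats = rng.normal(size=(50, 8))
w_g = rng.normal(size=(8, 16))
w_a = rng.normal(size=(19, 16))
bias = np.zeros(16)
updated, edge_weights = dga_edge_conv(coords, feats, k=10, w_g=w_g, w_a=w_a, bias=bias)
```

The softmax is taken along the neighbor axis separately for each feature channel, so the coefficients of every neighborhood sum to one regardless of how spread out that neighborhood is.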

Loss
The CD loss $L_{cd}$ forces the distribution of the sampled point cloud $P_S$ to be close to the distribution of the input point cloud $P$. The repulsion loss $L_{rep}$ keeps the sampled point cloud $P_S$ close to the input point cloud $P$ while pushing each sampled point $p'$ away from the other sampled points in its neighborhood. Accordingly, a joint loss $L_{joint}$ that combines $L_{cd}$ and $L_{rep}$ is designed to train the network to adjust the distribution of the sampled points, which effectively ensures that the sampling results are evenly distributed in both the overall and the local regions.
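For reference, one common form of the Chamfer distance is sketched below. The exact variant used in the paper (squared or unsquared distances, sums or means) is not stated in this text, so the choice of mean squared nearest-neighbor distances in both directions is an assumption, as is the function name `chamfer_distance`.

```python
import numpy as np

def chamfer_distance(a, b):
    """Symmetric Chamfer distance between point sets a: (n, 3) and b: (m, 3)."""
    sq_dist = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)   # (n, m) squared distances
    a_to_b = sq_dist.min(axis=1).mean()    # each point of a to its nearest point in b
    b_to_a = sq_dist.min(axis=0).mean()    # each point of b to its nearest point in a
    return a_to_b + b_to_a

sampled = np.array([[0.0, 0.0, 0.0]])
original = np.array([[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]])
loss = chamfer_distance(sampled, original)   # 1.0 + (1.0 + 4.0) / 2 = 3.5
```

When the sampled set is a subset of the input, the sampled-to-input term is zero, so the loss is driven entirely by how well the sampled points cover the input cloud.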

Experiments and results
To evaluate the performance of the proposed method, a series of experiments is conducted. In this section, we present the results of our approach on point cloud reconstruction and classification.
The dense point clouds used for training include the Stanford Bunny and the real-world Terracotta Warrior fragments. The classification task is benchmarked on the ModelNet, ScanObjectNN, and Terracotta Warrior fragment datasets. ModelNet is a synthetic dataset, while both ScanObjectNN and the Terracotta Warrior fragments are real-world datasets.

Architecture
In TGPS-DGA-Net, the augmented data first goes through the NEC cell to extract local features with 16 dimensions, then passes through two consecutive 16-dimensional IOA layers, and finally the outputs of the IOA layers are concatenated. To enrich local relations between points, a feature with 512 dimensions is formed by the DGA EConv. In the down-sampling GPcS layer, the sampling ratio k is set to 256. Finally, the output passes through two DGA EConv layers with dimensions of [40,41]. For classification, the network ends with two linear layers, each followed by batch normalization and a LeakyReLU activation function with a negative slope of 0.2, while the final layer directly outputs predictions. For reconstruction, we use the same encoder architecture as in classification, and the decoder parameters follow FoldingNet [11]. Our proposed method optimizes the joint loss to balance the distribution of the sampled points. When the weight is set to 1, that is, $L_{joint} = L_{cd} + L_{rep}$, the point cloud distribution is the most uniform. The network is trained for 250 epochs on an NVIDIA GTX 1080Ti GPU with PyTorch v1.2, using the Adam optimizer without weight decay. The initial learning rate is 0.01, and the initial momentum of the Batch Normalization layers is 0.9. The batch size is 24 for classification and 20 for reconstruction. The neighborhood value k is set to 16 for classification and 20 for reconstruction.

Reconstruction
The folding-based decoder is added to the DGA EConv for the reconstruction task. It should be noted that the dense point clouds of the Terracotta Warrior fragments were scanned by trained students in our lab using a Creaform VIU handy scanner. Several of the dense point cloud models used are shown in Figure 5. To verify the validity of the sampled points, Geomagic software is used to obtain the triangular patch surface and fit the results of the point cloud simplification. Figure 6 and Figure 7 show the simplified results and the triangular patch surface reconstruction results of the Bunny and the Terracotta fragments. In addition, the quantitative results are shown in Table 1.
The simplification rate of our TGPS and GDS [20] is usually $1/n^2$, while the method in [3] needs to extract feature points and simplify non-feature points separately. Therefore, there is no guarantee that its number of points is the same as that of our TGPS and GDS. For fairness, the number of points produced by [3] should be close to that of the other two simplified results. For the Bunny, the original point cloud has 35974 points, so the number should be 8994 when the simplification rate is 25%; the method in [3] produces 9201 points, which is approximately a 25% simplification rate. As shown in Figure 6, the feature points of the ear in [20] are sparsely distributed, and there are holes in the surface reconstruction results. The point cloud simplification results and reconstruction results of our TGPS and the method in [3] are both similar to the original point clouds, and the simplified results of our TGPS are obtained by network learning without manual intervention. In summary, our TGPS is more effective. Figure 7 shows the different simplification and reconstruction results of G10-19-hand and G10-19-head. Figure 7 (a1) and (c1) are the dense point clouds of the G10-19-hand and G10-19-head, with 26692 and 40859 points, respectively. (a4) and (c4) are our simplification results, with 6773 and 10214 points; (a2) and (c2) are the simplification results of the method in [3], with 6886 and 10103 points; (a3) and (c3) are the simplification results of GDS [20], with the same numbers of points as our TGPS; (b1-b4) and (d1-d4) are the triangular patch surface reconstruction results obtained with Geomagic Wrap.
For the G10-19-hand, the simplification rates of the three methods all remain at about 25%. In Figure 7(b3), holes can be seen in the wrist of the hand model. The method in [3] and our method both obtain better surface reconstruction results, which indicates that their simplified results are better. For the G10-19-head model, the feature points of the nose, eyes, mouth, and ears of the simplified model obtained by our TGPS are well preserved, and the feature points are distributed relatively uniformly in the relatively flat parts of the forehead, cheeks, and neck. However, the feature points of the hairline above the forehead are lost in the simplified model obtained by the method in [3], so this part is relatively flat and the feature outline is not obvious when it is converted into a triangular mesh model. In summary, our TGPS has the best simplification results and surface reconstruction results.
To evaluate the accuracy of the simplified point clouds, the geometric error between the original and simplified point clouds should be measured. The quantitative errors are shown in Table 1, where $\Delta_{max}$ and $\Delta_{avg}$ represent the maximum and average geometric errors proposed by Shi et al. [42]. In Table 1, the method in [3] has the largest errors among the three methods in terms of both $\Delta_{max}$ and $\Delta_{avg}$, which is consistent with the experimental results in Figure 6 and Figure 7. The simplification results of our TGPS are significantly better than those of the methods in [3] and [20], with minimal differences from the original point clouds while retaining a similar number of points; that is, the simplified results are closer to the original point clouds, which is beneficial for surface reconstruction. These quantitative results indicate that our TGPS outperforms the compared simplification methods and can better preserve the feature points of the models.
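The point counts quoted for the Bunny can be checked with a few lines of arithmetic. The helper names and the choice of rounding up are assumptions made for this illustration; the text does not state which rounding convention was used.

```python
import math

def target_count(num_points, rate):
    """Number of points to keep at a given simplification rate (rounded up, an assumption)."""
    return math.ceil(num_points * rate)

def achieved_rate(num_kept, num_points):
    """Fraction of the original points that a method actually kept."""
    return num_kept / num_points

bunny_points = 35974
bunny_target = target_count(bunny_points, 0.25)     # 35974 * 0.25 = 8993.5, rounded up to 8994
rate_of_ref3 = achieved_rate(9201, bunny_points)    # 9201 / 35974, roughly 0.256
```

The method in [3] therefore keeps roughly 25.6% of the Bunny points, close to the 25% target used for the other two methods.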

Classification
In this section, to demonstrate the effectiveness and efficiency of the proposed TGPS-DGA-Net, we conduct experiments on three public datasets (ModelNet10, ModelNet40, ScanObjectNN) and a real-world dataset (Terracotta Warrior fragments). The results show that our proposed framework outperforms the compared methods. We also perform a series of experiments to analyze the importance of each component of our method.

Datasets and implementation details
ModelNet: ModelNet40 contains 12,311 CAD models from 40 man-made object categories, of which 9,843 objects are for training and 2,468 objects are for testing. As a subset of ModelNet40, ModelNet10 contains 4,899 models from 10 categories and is split into 3,991 for training and 908 for testing. Each sample retains only 1024 uniformly distributed points as input, and only the coordinates (x, y, z) of the sampled points are used in the experiment.
ScanObjectNN: ScanObjectNN is a real-world point cloud object dataset based on indoor scene scan data, containing 15 categories and a total of 2880 objects. The training set contains 2304 objects and the test set contains 576 objects.
Terracotta Warrior fragments: The simplified models of the Terracotta Warriors are obtained with the simplification method described above. The current Terracotta Warrior fragments dataset was produced by randomly slicing 40 whole Terracotta Warriors with the Blender software, which yields 3250 fragments in four categories (Arm: 800, Body: 810, Head: 810, and Legs: 830).
Figure 8 illustrates the fragments of the kc02f02-Arm model randomly sliced by Blender. As shown in Figure 8(b), these fragments vary in size and in the number of points they contain. Hence, the point clouds of the sliced fragments need to be down-sampled or interpolated so that each fragment has a fixed number of 2048 points. However, the number of Terracotta Warrior fragments is far from enough for training deep neural networks. To improve the robustness of the network, the sliced fragment point clouds are resampled into four non-overlapping point clouds to generate the extended dataset. Of these, 10144 patches are used for training (Arm: 2656, Body: 2720, Head: 2272, Leg: 2496) and the remaining 1852 for testing (Arm: 476, Body: 504, Head: 428, Leg: 444). For training, we sample 1024 points and normalize them into a unit ball as input. Figure 9 shows the processed Terracotta Warrior fragments.
(1) Comparing with existing classification methods
In Table 2, the first four methods, PointNet++ [9], SpiderCNN [14], PointCNN [15], and DGCNN [12], are general networks for point cloud learning, and the down-sampling method they use is FPS.
Table 2 shows the results of the different methods in terms of overall accuracy (OA) and per-class mean accuracy (CA). On this highly competitive dataset, our model achieves the highest OA, 93.23%. Compared with PointNet++, our TGPS-DGA-Net applies the attention graph convolution to construct edge features and improves the accuracy by 2.53%. Compared with DGCNN, our method is better by 1.39%. As a model based on Transformers, our TGPS-DGA-Net is higher than PST-NET by 4.03%. As a task-driven sampling network, our method outperforms PST-NET, S-Net, and SampleNet by 3%. When the DGA EConv in our TGPS-DGA-Net is replaced by the general EdgeConv, named GTS-EC-Net, the result drops by 1.29%, which further illustrates the effectiveness of the proposed DGA EConv.
Table 2 also lists the results of the methods on the ScanObjectNN dataset. Our TGPS-DGA-Net, with 86.16%, outperforms PointNet++ and DGCNN by 3.86% and 3.36%, respectively. This indicates that the TGPS-DGA-Net is more robust to the interference caused by the confusion between foreground and background points, because the TGPS can select key points. Experiments demonstrate that our TGPS-DGA-Net achieves comparable performance with other approaches in point cloud classification and also generalizes well to complex scenes.
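The two metrics reported in Table 2 differ in how they weight the classes: OA is the fraction of all test samples classified correctly, while CA averages the per-class accuracies so that small classes count as much as large ones. A minimal sketch follows; the function name is hypothetical and the toy labels are made up purely for illustration.

```python
import numpy as np

def overall_and_class_accuracy(y_true, y_pred, num_classes):
    """Return (OA, CA) for integer class labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    overall = float((y_true == y_pred).mean())
    per_class = [
        float((y_pred[y_true == c] == c).mean())   # accuracy restricted to samples of class c
        for c in range(num_classes)
        if np.any(y_true == c)                      # skip classes absent from the test set
    ]
    return overall, float(np.mean(per_class))

# Toy example: three samples of class 0 (all correct) and one of class 1 (wrong).
oa, ca = overall_and_class_accuracy([0, 0, 0, 1], [0, 0, 0, 0], num_classes=2)
# oa = 3/4 = 0.75, ca = (1.0 + 0.0) / 2 = 0.5
```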
(2) Comparing with existing sampling methods
To verify the feasibility and effectiveness of the proposed TGPS, we conduct two groups of experiments. The first group examines the accuracy of our TGPS at different sampling rates. From the last row in Table 3, we can see that the accuracy of our proposed method tends to increase first and then decrease as the sampling rate decreases. When the number of sampled points M is set to 256, the TGPS-DGA-Net achieves the best classification accuracy.
The accuracy differs by only 0.24% between M = 1024 and M = 64. The experiments further show that the TGPS remains effective in retaining feature points with fewer sampled points. The second group of experiments compares the proposed TGPS with previous methods. For fairness, all the reported methods are given 512 points. In Table 2, our TGPS achieves the best sampling results. Among the compared methods, FPS and RS are simple, and their accuracy is far below that of our TGPS. SampleNet and PST-NET are task-driven sampling methods, but their classification accuracy still needs to be improved. Our TGPS obtains up to 2.5% and 0.03% improvement over WCPL and GDS, respectively. Furthermore, Figure 10 shows the simplified results of WCPL, GDS, and our TGPS, each with 256 sampled points. From the comparison, we can see that the sampled points obtained by our TGPS are evenly distributed, while feature points and contour points are effectively preserved.

Experiments on Terracotta Warrior dataset
We benchmark our TGPS-DGA-Net on the 3D Terracotta Warrior fragments and compare its performance with existing methods. As shown in Table 4, our method achieves the highest accuracy. Compared with the traditional method [43], our method is better by 10.04%. Moreover, our method outperforms the method in [4], which used both XYZ coordinates and RGB information, by 6.27%. Compared with the method in [5], which uses richer feature information (with normal vectors) as input, our method improves the accuracy by 1.46%. These results further demonstrate the effectiveness of the DGA EConv in extracting local features.
To further verify the effect of the number of IOA layers in the Transformer unit on feature extraction, 1-layer, 2-layer, and 4-layer IOA modules are compared on the validation set, and the results are shown in Table 8. With 1 layer, the accuracy is 91.77%, which is the lowest among the three configurations. With 2 layers, the accuracy improves by 1.46%. With 4 layers, the result is close to that with 2 layers. Considering that more IOA layers increase the model complexity, the 2-layer IOA is the best choice.
To verify the effect of different loss functions on the classification results, Table 9 shows the classification accuracy for each loss. As indicated in Table 9, the joint loss $L_{joint}$ achieves the best performance, which further shows that our joint loss helps to extract the points that are important for classification. In addition, as shown in Figure 11, the sampled points obtained with the CD-only loss are relatively concentrated on the contour of the 3D object. The result of the Rep-only loss is more even than that of the CD-only loss, but there are still obvious holes in the lower right corner of the guitar and in the fuselage area of the aircraft. When the joint loss $L_{joint}$ is adopted, the distribution of the sampled points is more uniform in both the overall and the local regions, and the contour features are well preserved.

Conclusion
Due to the dense point clouds of the cultural relic collected by the 3D scanner requiring excessive storage and time costs, we propose an end-to-end point cloud simplification method for the Terracotta Warrior fragments.In this paper, we propose a learnable point cloud down-sampling method, which uses the importance of each point to perform non-repetitive point down-sampling under the guidance of global features, and retains important points related to downstream tasks.Among them, the TGPS can retain important point features and their coordinates by training the network.In addition, to further learn rich local representation, the DGA EConv proposed to perform local feature aggregation for the neighborhood graph.Experiments demonstrate that our TGPS can effectively simplify the models and capture more detailed features according to the point cloud classification and reconstruction tasks.
However, a limitation of our method is that it is relatively sensitive to outliers and noise. In the future, we could propose a differentiable adaptive sampling method that readjusts the spatial distribution of the sampled points in a data-driven manner to further improve the robustness of the network. Furthermore, combined with knowledge distillation, we will propose an efficient and lightweight model that achieves network acceleration.
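The importance-guided selection summarised in this conclusion can be illustrated with a minimal sketch in plain Python. It scores each point by the inner product of its feature with a global feature and keeps the M highest-scoring points without repetition. The channel-wise max pooling used here as the mapping to the global feature is an assumption for illustration, as are the function names; in the actual network the per-point features come from the trained Transformer unit.

```python
def global_feature(features):
    """Assumed mapping function: channel-wise max pooling over all point features."""
    return [max(channel) for channel in zip(*features)]


def tgps_select(points, features, m):
    """Keep the m points whose features contribute most to the global feature."""
    g = global_feature(features)
    # Contribution of each point: inner product of its feature with the global feature.
    scores = [sum(fc * gc for fc, gc in zip(f, g)) for f in features]
    # Sort indices by descending contribution and keep the first m (no repetition).
    order = sorted(range(len(points)), key=lambda i: scores[i], reverse=True)
    keep = order[:m]
    return [points[i] for i in keep], [features[i] for i in keep], keep


points = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1)]
features = [[1.0, 0.0], [0.0, 2.0], [0.5, 0.5], [3.0, 1.0]]
kept_points, kept_features, kept_idx = tgps_select(points, features, 2)
print(kept_idx, kept_points)
```

Because each index appears at most once in the sorted order, the selection never returns the same point twice.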
Given a dense point cloud D ∈ R^{G×3}, a set of local patches {P_i} ⊆ R^{N×3}, i = 1, 2, ..., C, is obtained by FPS and KNN, where C is the number of local patches. To alleviate overfitting, each local patch of N points first undergoes conventional data augmentation (e.g., rotation or jittering) to improve the diversity of the training samples while preserving the structure of the raw data. Next, the features are encoded by the Transformer unit, and the sampled points are obtained by the TGPS; the number of sampled points in each patch is M. Then, the DGA EConv aggregates features at each center point according to the importance of its neighbor points to enrich the local region. Finally, the TGPS-DGA Module is combined with FC layers to form the TGPS-DGA-Net for 3D object classification. The overall framework of TGPS-DGA-Net is illustrated in Figure 2. We evaluate our pretrained models on the real-world Terracotta Warrior fragments and on public datasets, e.g., ModelNet and ScanObjectNN, and achieve the best accuracy among the compared methods.
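The patch extraction step (FPS to choose C patch centers, then KNN to gather N points around each center) can be sketched as follows. This is a minimal plain-Python illustration under the assumption that C and N do not exceed the number of input points; the function names and the fixed starting index are hypothetical, and a practical implementation would use vectorised tensor operations.

```python
import math


def farthest_point_sampling(points, c, start=0):
    """Return the indices of c centres chosen by farthest point sampling."""
    chosen = [start]
    # Distance from every point to its nearest already-chosen centre.
    min_dist = [math.dist(p, points[start]) for p in points]
    while len(chosen) < c:
        nxt = max(range(len(points)), key=lambda i: min_dist[i])
        chosen.append(nxt)
        min_dist = [min(d, math.dist(p, points[nxt])) for d, p in zip(min_dist, points)]
    return chosen


def knn_patch(points, centre_idx, n):
    """Return the indices of the n points nearest to the given centre (centre included)."""
    centre = points[centre_idx]
    order = sorted(range(len(points)), key=lambda i: math.dist(points[i], centre))
    return order[:n]


def extract_patches(points, c, n):
    """Build c local patches of n point indices each."""
    return [knn_patch(points, ci, n) for ci in farthest_point_sampling(points, c)]


cloud = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (10, 0, 0), (11, 0, 0), (12, 0, 0)]
patches = extract_patches(cloud, c=2, n=3)
print(patches)
```

On this toy cloud the two centers fall at opposite ends, so each patch covers one of the two clusters.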

4.4.3. Effect of edge feature
As the core unit for extracting local information, the DGA EConv constructs the nearest neighbor graph for the sampled points. In this group of experiments, different combinations of the center point, the nearest neighbor point, and the Euclidean distance information are used to represent the edge features, and the features are combined by concatenation. As shown in Table 7, Model B achieves the best results. The reason is that the edge features of Model A consider only global features, while the accuracy of Model C decreases because of information redundancy. Model B, by contrast, includes both the local and global features of the model. Experiments show that the proposed DGA EConv can effectively retain the key points and important features related to the downstream classification task, thus improving the network performance.
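As an illustration of how such edge-feature variants can be assembled, the sketch below concatenates any subset of the three ingredients named above: the center point, the neighbor point, and their Euclidean distance. The function name and flags are hypothetical, and the sketch deliberately does not claim which combination corresponds to Model A, B, or C in Table 7.

```python
import math


def edge_feature(centre, neighbour, use_centre=True, use_neighbour=True, use_distance=False):
    """Concatenate the selected components into one edge feature vector."""
    parts = []
    if use_centre:
        parts.extend(centre)
    if use_neighbour:
        parts.extend(neighbour)
    if use_distance:
        # Scalar Euclidean distance between the centre and its neighbour.
        parts.append(math.dist(centre, neighbour))
    return parts


p_i = (0.0, 0.0, 0.0)
p_ij = (3.0, 4.0, 0.0)
print(edge_feature(p_i, p_ij))                     # 6 values
print(edge_feature(p_i, p_ij, use_distance=True))  # 7 values
```

Each added component widens the edge feature, which is one way to see why a combination can become redundant without adding new geometric information.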
p_jk} and the set of edges is E ⊆ V × V. For computational efficiency, we use the KNN algorithm to construct a neighborhood graph for the center point p_i and the neighborhood points p_ij. The edge set E is composed of the directed edges {(p_i, p_i1), (p_i, p_i2), ..., (p_i, p_ik)}. To make the center point p_i pay more attention to the points with larger contributions, the point features must be aggregated according to the importance of the neighbor points. Inspired by [12,37,39], we propose the DGA EConv, which takes into account the local geometric structure of the points. The DGA EConv extracts features in three steps: (1) construct a graph using the KNN algorithm; (2) for each point, calculate the attention scores of its neighbor points; (3) calculate the weighted average of the neighbor points' features according to the obtained attention scores. As a core component of the proposed framework, the DGA EConv supports 3D point cloud classification and reconstruction based on local attention. The main purpose of dynamic graph attention convolution is to learn a function g : R^d → R^{d′}, where d and d′ are the feature dimensions. The input feature H
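The three steps listed above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the attention score is taken to be a softmax over the inner product of the center and neighbor features, and the learned transformations that map R^d to R^{d′} are omitted, so the output keeps the input dimension. All function names are hypothetical.

```python
import math


def knn_graph(points, k):
    """Step 1: for each point, the indices of its k nearest neighbours (self excluded)."""
    graph = []
    for i, p in enumerate(points):
        others = sorted((j for j in range(len(points)) if j != i),
                        key=lambda j: math.dist(p, points[j]))
        graph.append(others[:k])
    return graph


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def attention_aggregate(points, features, k):
    """Steps 2 and 3: score each neighbour, then take the attention-weighted average."""
    out = []
    for i, neighbours in enumerate(knn_graph(points, k)):
        # Assumed score: inner product between centre and neighbour features.
        raw = [sum(a * b for a, b in zip(features[i], features[j])) for j in neighbours]
        weights = softmax(raw)
        dim = len(features[i])
        out.append([sum(w * features[j][c] for w, j in zip(weights, neighbours))
                    for c in range(dim)])
    return out


pts = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (5, 0, 0)]
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
aggregated = attention_aggregate(pts, feats, k=2)
print(aggregated[0])
```

Because the weights of each neighborhood sum to one, every aggregated feature is a convex combination of the neighbor features, with larger weights on neighbors that score higher against the center.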

Table 5. The effect of different feature dimensions (%)
Experiments are conducted to evaluate the effectiveness of different relationship functions with 256 sampled points, and the comparison results are shown in Table 6. Concatenation achieves the highest accuracy, 0.52% higher than Summation, while Subtraction is 1.41% lower than Concatenation. The simplified point cloud should remain similar to the original model, but Subtraction may increase the loss of contextual information. In contrast, Concatenation aggregates the input features with the attention features, which makes the network more robust. Concatenation also outperforms the Hadamard product and the Dot product by 2.02% and 20.54%, respectively. Therefore, all experiments in Sec. 3.1 and Sec. 3.2 adopt the self-attention structure based on Concatenation.
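For reference, the five relationship functions compared above can be written out on a pair of feature vectors as in the sketch below. The function name, the mode strings, and the choice of operands (an input feature x and an attention feature a of equal length) are assumptions for illustration; the sketch shows only the arithmetic of each operator, not how it is wired into the self-attention structure.

```python
def relate(x, a, mode):
    """Combine an input feature x with an attention feature a using the named operator."""
    if mode == "concatenation":
        return list(x) + list(a)
    if mode == "summation":
        return [xi + ai for xi, ai in zip(x, a)]
    if mode == "subtraction":
        return [xi - ai for xi, ai in zip(x, a)]
    if mode == "hadamard":
        return [xi * ai for xi, ai in zip(x, a)]
    if mode == "dot":
        # The dot product collapses both vectors into a single scalar.
        return [sum(xi * ai for xi, ai in zip(x, a))]
    raise ValueError(f"unknown relationship function: {mode}")


x = [1.0, 2.0]
a = [3.0, 4.0]
for mode in ("concatenation", "summation", "subtraction", "hadamard", "dot"):
    print(mode, relate(x, a, mode))
```

In this sketch, Concatenation is the only operator that keeps both inputs intact (its output has twice the input dimension), while the Dot product reduces them to one value.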