MSaD-Net: A Mix Self-Attention Network for 3D Point Cloud Denoising

In the process of acquiring 3D point cloud data, environmental interference and unstable scanning equipment often introduce noisy points. Recently, with the development of neural networks for point clouds, deep learning-based point cloud denoising has made great progress. However, most existing methods adopt a PointNet-like structure to predict point offsets, and the simple pooling operation in such structures discards much important information, including local neighborhood information and global information. This loss of information makes these algorithms ineffective in complex cases. To solve these problems, we propose a self-attention-based point cloud denoising network architecture that uses the Transformer structure to establish long-range dependencies among points. In addition, we propose a local information embedding module that quickly selects meaningful points and serves as the input of the Transformer. We also consider the correlation between channels of point cloud features and further introduce a channel attention module. Extensive experiments show that our method outperforms existing methods while maintaining a high running speed.


I. INTRODUCTION
As a 3D data format for representing the real world, the point cloud is widely used in measurement and autonomous driving. In recent years, the rapid development of various 3D scanning devices has generated a large amount of point cloud data, which in turn drives the development of point cloud deep learning models such as PointNet [1], [2]. However, the noise introduced by the acquisition process degrades the accuracy of these deep learning models, so denoising such point cloud data is beneficial to the effectiveness of subsequent models. Traditional point cloud denoising methods design various filtering algorithms [3], [4] according to specific features. However, such methods have severe limitations: each can only remove a certain kind of noise, and they struggle with large-scale point cloud data. In autonomous driving, where point clouds are most widely used, the acquired data often come from complex environments, and traditional denoising methods can no longer meet practical requirements. Learning-based point cloud denoising models have made great progress in recent years. Such models learn a denoising model from a large number of training samples and achieve good denoising results even on large-scale, complex scenes. At present, most deep learning-based models use a PointNet-like structure to predict the offset point by point. The pooling operation in the PointNet structure compresses the features of all points into a single global feature vector. Although these methods are fast, they lose much important feature information. As shown in Fig. 1, compared with previous methods, we strengthen the extraction of local information and propose a local information embedding module.
It can quickly extract important information points and build local neighborhood information. Building local neighborhood information also reduces the number of input points, which enables us to use Transformers to establish the long-range dependencies that form global information. The Transformer is widely used in natural language processing and computer vision [5]; it can establish relationships between the various parts of its input, optimize feature information, and improve model performance. In addition, to establish the relationship between the channels of point cloud features, we further construct a channel attention module to optimize the denoising effect of the network. In general, the contributions of this paper are threefold: i. We propose a Mix Self-attention Network for 3D point cloud denoising (MSaD-Net). Compared with previous methods, MSaD-Net enhances the extraction of local information.

II. RELATED WORK
In this section, the point cloud denoising method will be briefly reviewed. These methods include optimization-based point cloud denoising and deep learning-based point cloud denoising.

A. Optimization-Based Denoising
Optimization-based denoising methods attempt to transform the denoising problem into an optimization problem constrained by geometric information. These methods can be further classified into methods based on local surface fitting, methods based on sparsity, and methods based on graphs. Local surface fitting methods achieve point cloud denoising by projecting the noisy points onto a fitted surface, and several methods based on this principle have been proposed. Alexa et al. [6] denoised the point cloud with the moving least squares method. Cazals et al. [7] designed a jet fitting method by encoding local geometric quantities such as normals. Lipman et al. [8] produced a set of points describing the underlying surface with the locally optimal projection (LOP). Huang et al. [9] proposed the weighted locally optimal projection (WLOP) as an improvement on LOP. However, local surface fitting methods easily over-smooth the denoising results, leading to poor performance. To solve this problem, sparsity-based methods were proposed by Avron et al. [10] and Mattei et al. [11]. These methods predict the clean coordinates of points based on normals reconstructed with sparse representation theory, but they face the opposite problem of over-sharpening the denoising results. Finally, graph-based methods apply graph signal processing theory to point cloud denoising [12]; these methods [13], [14] mainly denoise point clouds represented as graphs. Existing optimization-based methods mainly rely on geometric priors, and it is difficult for them to balance preserving sharp features with achieving good denoising performance.

B. Deep Learning-Based Denoising
In recent years, with the rapid development of deep learning [15], [16] for point clouds, point cloud denoising has expanded from traditional optimization-based methods to learning-based methods. An edge-aware point cloud consolidation network [17] was proposed to remove trivial low-level noise. Pistilli et al. [18] proposed a graph convolutional denoising network. Roveri et al. [19] proposed PointProNets, a fully differentiable CNN-based denoising network. Wang et al. [2] introduced the differentiable surface splatting approach, transforming noisy points into images and reconstructing surfaces guided by the denoised images. Rakotosaona et al. [20] estimated a denoising displacement for every point and eliminated outliers in the point cloud. Huang et al. [21] proposed a multi-offset denoising network (MODNet), which exploits a patch-scale selection mechanism. Zhang et al. [22] proposed an encoder-decoder-based framework, called Pointfilter, which is sensitive to sharp features. However, Pointfilter only implements point cloud filtering based on PointNet; it does not further explore the design of a point cloud denoising network structure that takes point cloud patches as input.

III. METHODS
The pipeline of MSaD-Net is shown in Fig. 2; it incorporates an attention mechanism to encode long-range dependencies between tokens. Compared with previous point cloud denoising neural networks, our method has the following advantages: i) a lightweight local information embedding module is introduced, which converts point cloud features into tokens suitable as Transformer input and reduces the computational cost; ii) a Transformer module is introduced to encode the feature information of the point cloud tokens; iii) a channel attention module is further introduced to compensate for the Transformer's inability to establish correlations between feature channels. In the following, we elaborate on these novelties one by one.

A. Network Input and Output
MSaD-Net is similar to Pointfilter [22] in taking a patch of the point cloud as input. We traverse each point in the complete point cloud and construct a patch with that point as the center. Defining the input point cloud as I ∈ R^{n×3}, we only use the 3D coordinates of the points as input. MSaD-Net outputs a (1 × 3) vector as the offset of the center point. Once every point has been offset, the denoising operation is complete.
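As a concrete illustration, the patch-in, offset-out interface described above can be sketched as follows. This is a minimal numpy sketch under our own assumptions: the patch size `k`, the nearest-neighbor patch construction, and the scale normalization are not specified in the text and are purely illustrative.

```python
import numpy as np

def extract_patch(points, center_idx, k=500):
    # Gather the k nearest neighbors of the chosen center point as one patch.
    # k and the normalization scheme are assumptions for illustration.
    center = points[center_idx]
    d2 = np.sum((points - center) ** 2, axis=1)
    nn_idx = np.argsort(d2)[:k]
    patch = points[nn_idx] - center              # center the patch at the origin
    scale = np.linalg.norm(patch, axis=1).max()  # normalize to the unit ball
    return patch / scale, center, scale

def apply_offset(center, offset, scale):
    # Denoised position = patch center + predicted (1 x 3) offset,
    # mapped back to the original coordinate frame.
    return center + offset * scale
```

Running this over every point of the cloud, with the network supplying `offset`, completes one denoising pass.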

B. Local Information Embedding
The Transformer can help the input points establish long-range dependencies. If the computational cost were not a concern, the original point cloud P ∈ R^{n×3} could be used directly as the Transformer input. However, the time complexity of the Transformer is O(N²), where N is the number of points, so computation becomes slow when N is large. In addition, previous methods lack the extraction of local information, resulting in poor model performance. To solve these problems, we propose local information embedding.
The local information embedding module is similar to the SA module of PointNet++ [15], including a sampling layer and a grouping layer. The difference is that in the sampling layer we use probabilistic features to select target points, a scheme named Geometry-aware Attention Sampling (GAAS). As shown in Fig. 3, given an input feature of size N × c, we use an MLP layer combined with a sigmoid layer to learn a probability feature (N × 1). Then, according to this probability feature, we select the k points with the highest probabilities as the centers for local information construction, where k < N. Finally, we construct local feature information using the grouping layer of PointNet++. The probability feature and the local feature information are then combined to complete the local information embedding.
The whole process can be defined by the following formula: I′ = PE(I), where I ∈ R^{n×c} is the input points and features, I′ ∈ R^{n′×c′} is the output points and features, with n′ < n and c′ > c, and PE(·) is the local information embedding module. After passing through the PE layer, the number of points decreases, which reduces the computational cost of the subsequent Transformer module.
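A minimal numpy sketch of the GAAS sampling and grouping steps may help. The linear scoring weights `w`, `b` stand in for the paper's learned MLP and are purely illustrative, as is the radius-based grouping:

```python
import numpy as np

def gaas_sample(features, w, b, k):
    # Geometry-aware Attention Sampling (sketch): a learned layer + sigmoid
    # scores each point, and the k highest-scoring points become centers.
    scores = 1.0 / (1.0 + np.exp(-(features @ w + b)))  # (N, 1) probabilities
    order = np.argsort(-scores[:, 0])
    return order[:k], scores

def group_neighbors(points, center_idx, radius):
    # PointNet++-style grouping: collect the neighbors within `radius`
    # of every sampled center to form local neighborhoods.
    return [np.where(np.linalg.norm(points - points[i], axis=1) < radius)[0]
            for i in center_idx]
```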

C. Transformer Layers
After the PE layer, we obtain the input sequence of the point cloud, I′ ∈ R^{n′×c′}. We use a Multilayer Perceptron (MLP) to map I′ to query (Q), key (K), and value (V) vectors, respectively. To fully learn multiple different attention relations, we perform multi-head self-attention in parallel with H independent attention heads, as shown in Fig. 2. Self-attention can be expressed as Attention(Q, K, V) = Softmax(QKᵀ / √d_head) V, where d_head is the head dimension and Softmax is the activation function. After multi-head attention, we merge the features of all heads and use a feedforward layer to map the features back to the same size as the input features (R^{n′×c′}). We stack multiple Transformer layers to encode the relationships between token patches.
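The attention computation above is the standard scaled dot-product formulation; a numpy sketch with projection matrices as placeholders for the learned MLPs is:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(tokens, wq, wk, wv, num_heads):
    # tokens: (n', c'); wq/wk/wv: (c', c') projection matrices (placeholders).
    n, c = tokens.shape
    d = c // num_heads                            # d_head
    q = (tokens @ wq).reshape(n, num_heads, d)
    k = (tokens @ wk).reshape(n, num_heads, d)
    v = (tokens @ wv).reshape(n, num_heads, d)
    out = np.empty_like(q)
    for h in range(num_heads):                    # H independent attention heads
        att = softmax(q[:, h] @ k[:, h].T / np.sqrt(d))
        out[:, h] = att @ v[:, h]
    return out.reshape(n, c)                      # heads merged back to (n', c')
```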

D. Channel Attention
Transformer layers model point-wise relationships but do not attend to the channels of point features. To model the channel relationships between point features, we introduce the channel attention module of CBAM [23]. Taking the point cloud feature I′ ∈ R^{n′×c′} as input, we apply max-pooling and avg-pooling respectively to compress the point cloud feature to R^{1×c′}. The two pooled features then pass through a shared MLP layer, are added together, and a sigmoid activation produces the channel feature vector I_c′ ∈ R^{1×c′}: I_c′ = σ(MLP(MaxPool(I′)) + MLP(AvgPool(I′))), where MaxPool(·) and AvgPool(·) are max pooling and average pooling, respectively, and σ(·) is the sigmoid activation function. The process of channel attention can then be expressed as I_c = I_c′ ⊗ I′, where I_c is the point feature after channel attention and ⊗ represents the feature-wise multiplication.
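A numpy sketch of this CBAM-style channel attention, with placeholder shared-MLP weights `w1`, `w2` (the ReLU between the two layers is an assumption):

```python
import numpy as np

def channel_attention(feat, w1, w2):
    # feat: (n', c'). Max-pool and avg-pool over points -> (1, c'),
    # pass both through a shared two-layer MLP, add, apply sigmoid.
    mlp = lambda x: np.maximum(x @ w1, 0.0) @ w2            # shared MLP + ReLU
    mx = feat.max(axis=0, keepdims=True)
    avg = feat.mean(axis=0, keepdims=True)
    weights = 1.0 / (1.0 + np.exp(-(mlp(mx) + mlp(avg))))   # sigmoid -> (1, c')
    return feat * weights                                   # feature-wise multiply
```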

E. Mix Self-Attention and Final Predicted
After the Transformer layers and channel attention, the relationships between points and between channels are established; together they constitute mix self-attention. After a max pooling operation, I_c ∈ R^{n′×c′} becomes I_p ∈ R^{1×c′}. Then, after a series of MLP (128, 64, 3) layers, the feature becomes the offset I_r ∈ R^{1×3}. The offset is added to the center point of the original input patch to complete the denoising of the current point. The process can be expressed as p̂ = P + I_r, where P is the center point of patch I.
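The final prediction head can likewise be sketched in numpy. The layer widths follow the MLP (128, 64, 3) stated above; the weights themselves, and the ReLU between layers, are placeholders:

```python
import numpy as np

def predict_offset(feat, mlps):
    # feat: (n', c') mixed-attention features; mlps: list of (w, b) pairs
    # realizing MLP(128, 64, 3). Max-pool to (1, c'), then map to (1, 3).
    g = feat.max(axis=0, keepdims=True)
    for i, (w, b) in enumerate(mlps):
        g = g @ w + b
        if i < len(mlps) - 1:
            g = np.maximum(g, 0.0)  # ReLU between layers (assumption)
    return g                         # (1, 3) offset added to the patch center
```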

F. Model Loss
The L2 distance is widely used in point cloud denoising networks [20], but it is not a good measure for denoising models. To further improve the performance of MSaD-Net, we follow the loss function of Pointfilter [22], which consists of two terms, a projection loss L_pro and a repulsion loss L_r, balanced by a hyperparameter λ. The projection loss L_pro retains sharp information by taking into account the normal and distance similarity between neighboring points of the ground truth and the denoised point cloud. In L_pro, n_{p_j} represents the ground-truth normal of the point p_j, and ṗ_i represents the filtered point of the point p_i. Λ(ṗ_i − p_j) is a Gaussian function that gives more weight to the points near ṗ_i, and Γ(n_{ṗ_i}, n_{p_j}) allots more weight to those neighboring points whose normals are more similar to that of ṗ_i. ε_n denotes the support angle (here set to 15°), and ε_p = 4Υ/m, where Υ represents the diagonal length of the bounding box of the patch ℘_i and m represents the number of points in the ground-truth patch. The repulsion term L_r makes the denoised point cloud uniformly distributed. During the training phase, we set the hyperparameter λ to 0.97.

IV. EXPERIMENTS
A. Dataset and Implementation Details
Our training dataset consists of 22 clean models provided by [19]. More specifically, we randomly sample 10000, 20000, and 50000 points on each model. Like many existing methods, we assume the noise follows a Gaussian distribution, and we add Gaussian noise with standard deviations of 0.0%, 0.5%, 1%, and 1.5% of the bounding box diagonal length to the ground-truth models. To test the denoising performance of MSaD-Net, our test dataset includes both synthesized point clouds and a real-world dataset. The synthesized point clouds are obtained by adding Gaussian noise with standard deviations of 0.5%, 1%, and 1.5% of the bounding box diagonal length of the ground-truth model. As for the real-world dataset, we employ Paris-rue-Madame [24] for visual comparisons. MSaD-Net is implemented in PyTorch and trained on one RTX 2080Ti GPU (11 GB memory). Adam is used to train MSaD-Net with an initial learning rate of 1e-4; during training, the learning rate drops from 1e-4 to 1e-7. The batch size is 250. The training time of our model is about 15 hours.
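The noise synthesis described above amounts to scaling a Gaussian by the bounding-box diagonal; a small numpy sketch:

```python
import numpy as np

def add_gaussian_noise(points, level, rng):
    # Add Gaussian noise with std = level * bounding-box diagonal length,
    # e.g. level = 0.005 for the 0.5% setting.
    diag = np.linalg.norm(points.max(axis=0) - points.min(axis=0))
    return points + rng.normal(scale=level * diag, size=points.shape)
```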

B. Evaluation Metric
In this paper, we adopt the Chamfer Distance (CD) and point-to-mesh distance (P2M) [25] to quantify the denoising capability of these methods. The CD finds, for each point, the nearest neighbor in the other point cloud and accumulates the squared distances in both directions. The P2M likewise measures the difference between the denoised point clouds and the ground truth.
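The CD as described can be sketched directly. This is a brute-force numpy version that averages the squared nearest-neighbor distances in both directions (real evaluations typically use a KD-tree for speed):

```python
import numpy as np

def chamfer_distance(a, b):
    # Symmetric Chamfer Distance: for each point, the squared distance to its
    # nearest neighbor in the other cloud, averaged over both directions.
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)  # pairwise squared dists
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()
```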

C. Quantitative Comparison
To evaluate our method quantitatively, we compare MSaD-Net to state-of-the-art denoising methods, including WLOP [9], GPD [18], ECN [17], PCN [20], and PF [22], using the Chamfer Distance (CD) and point-to-mesh distance (P2M) [25]. Each method is tested under different noise levels (0.5%, 1.0%, and 1.5%) and three densities (10 K, 20 K, and 50 K points). The comparison results are shown in Table I. Our model obtains the best score on both metrics across all noise densities and levels.
Furthermore, we tested each method on three other noise models, i.e., high-level noise, Laplace noise, and uniform noise. The evaluation results are summarized in Table III. Our method still yields more stable denoising performance across different noise models, and therefore generalizes better.
Meanwhile, we also compare the running time of these methods, as shown in Table II. The optimization-based method (WLOP) appears to be the fastest; however, such traditional methods require parameter tuning to produce acceptable results, so the total denoising runtime of WLOP is longer. The running time of our model is shorter than GPD and PCN, and longer than PF. The main reason is that our method needs to embed neighborhood information for each noisy point. We also test the runtime of MSaD-Net with FPS and find that MSaD-Net with GAAS is faster than MSaD-Net with FPS.

D. Visual Comparison
We also compare visually with other methods on four synthetic point cloud datasets corrupted with 1.5% Gaussian noise. As shown in Fig. 4, WLOP, ECN, PF, and PCN cannot denoise high-level noise well, and GPD leads to shape shrinkage of the results. Our method achieves high-performance denoising and feature preservation.
Furthermore, to demonstrate that MSaD-Net is also robust on real-world data, we verify it on the Paris-rue-Madame [24] dataset corrupted with raw noise. As shown in Fig. 5, MSaD-Net significantly outperforms previous methods on real-world point clouds. GPD tends to produce shape shrinkage, while WLOP and ECN cause some geometric features to disappear while removing the noise. In contrast, our method barely smooths geometric features in real-world point clouds.

E. Ablation Studies
We further evaluate the effectiveness of the essential components in an ablation study, including Geometry-aware Attention Sampling (GAAS), Transformer Layers (Tran), and Channel Attention (CA). All experiments are conducted on the synthesized test dataset, and the results are shown in Table IV. Note that model A is a baseline based on PointNet++ [15]. Based on model A, we add GAAS to replace FPS; from the results, GAAS achieves a significant improvement in network performance. Furthermore, we add the remaining attention modules (Tran, CA) in turn to build new models, denoted C and Full, respectively. These two modules focus not only on the relationships between points but also on the relationships between feature channels. The results show that this structure is effective, especially the improvement from Tran. Therefore, GAAS, Tran, and CA all contribute positively to the performance.
Furthermore, we visualize the point clouds sampled by GAAS and FPS. As shown in Fig. 6, FPS produces a relatively uniformly sampled point cloud and cannot perceive the geometric feature points useful for filtering. The points sampled by our GAAS are mainly distributed in the geometric regions useful for denoising the center point. This illustrates that the trained GAAS perceives the geometric features of the point cloud and achieves key-point sampling that is conducive to denoising.

V. CONCLUSION
In this paper, we propose a novel self-attention-based point cloud denoising network, called MSaD-Net. Through the Transformer structure, it establishes long-range dependencies over the neighborhood structure. We also consider the correlation between channels of point cloud features and further introduce a channel attention module. Extensive experiments show that MSaD-Net outperforms existing methods while maintaining a high running speed.