D3GATTEN: Dense 3D Geometric Features Extraction and Pose Estimation Using Self-Attention

Point-cloud processing for extracting geometric features is difficult due to the highly non-linear rotation variance and the measurement noise corrupting the data. To address these challenges, we propose a new architecture, called Dense 3D Geometric Features Extraction and Pose Estimation Using Self-Attention (D3GATTEN), which allows us to extract strong 3D features that can subsequently be used for point-cloud registration, object reconstruction, pose estimation, and tracking. The key contribution of our work is a new architecture that makes use of the self-attention module to extract powerful features. Thorough tests were performed on the 3DMatch dataset for point-cloud registration and on the TUM RGB-D dataset for pose estimation, achieving 98% Feature Matching Recall (FMR). Our results outperform the existing state of the art in terms of robustness for point-cloud alignment and pose estimation. Our code and test data can be accessed at: https://github.com/tamaslevente/trai/tree/master/d3gatten.


I. INTRODUCTION
Point-cloud alignment is an important task in computer vision, as it forms the basis for many problems. It is fundamental to applications such as 3D reconstruction [1], simultaneous localization and mapping (SLAM) [2], tracking [3], flow estimation [4], AR/VR tracking [5], and pose estimation for data fusion [6]. Point-cloud registration consists of computing the rigid transformation between two overlapping point-clouds in order to align them in 3D space. However, given that point-clouds are unordered, irregular, and often noisy, extracting pointwise correspondences is not a trivial task. In recent years, many methods have contributed to solving these problems, from hand-crafted methods [7], [8], [9], [10] to more recent deep learning-based approaches [11], [12], [13], [14], [15], [16], [17], [18].
Deep learning-based methods that deal with point-cloud registration can be divided into several subcategories. The first class [19], [20], [21], [22] follows the principle of iterative closest point (ICP) [23]-based methods, where pose transformation and correspondence evaluations are performed.
We follow this path and introduce D3GATTEN: Dense 3D Geometric Features Extraction and Pose Estimation Using Self-Attention, for superior feature extraction using the self-attention module [32]. Our architecture builds on D3Feat: Joint Learning of Dense Detection and Description of 3D Local Features [16], which adds a salient point detector to a fully convolutional feature descriptor. In addition to this, we implemented a self-attention module for feature extraction. Our algorithm is significantly more noise resistant after the addition of this module, which is a crucial consideration for applying the algorithm in the real world. The results obtained for point-cloud registration can be seen in Figure 1c.
In summary, the key contributions of our work are as follows.
1) Building on D3Feat [16], we extend it with a separate self-attention mechanism for selecting the most reliable keypoints for the registration of point-clouds;
2) we provide a thorough analysis of different variants of the proposed self-attention extension, covering various backbones, the module's location in the pipeline, and profiling with respect to robustness against noise and runtime;
3) we analyze and describe existing methods, both hand-crafted and deep-learning-based.
Furthermore, to demonstrate that our algorithm performs well, we conducted several tests to evaluate its performance, especially its robustness against noise. The public dataset used for testing was 3DMatch [33]. As a limitation of the proposed method, we observed a lack of outlier rejection ability, a characteristic inherited from the base method [16]. Nevertheless, according to our investigation, our method achieves results comparable to the state of the art in terms of Feature Matching Recall (FMR), close to 98%, with faster runtimes at the same time.

II. RELATED WORK

A. CLASSICAL METHODS
First, we briefly describe the hand-crafted keypoint feature-based methods for point-cloud registration. The early point-cloud local descriptors used hand-made features to describe local geometry. For a better understanding of these methods, we present them briefly.

1) USING SPIN IMAGES FOR EFFICIENT OBJECT RECOGNITION IN CLUTTERED 3D SCENES (SPIN) [34]
Andrew E. Johnson first proposed the spin image as a surface representation approach in [34], and it is used for surface matching and object detection in 3D images. Spin images encode the global attributes of any surface in an object-oriented coordinate system rather than a viewer-oriented one. The method takes advantage of a projection of adjacent points onto the tangential plane. The Spin Image descriptor is then calculated by adding the points in the support area to each bin of a 2D array, as shown in Figure 2a.

2) UNIQUE SIGNATURES OF HISTOGRAMS FOR SURFACE AND TEXTURE DESCRIPTION (SHOT) [10]
The descriptor encodes the surface normal histograms at distinct spatial regions. To begin with, a local reference frame (LRF) is built for the keypoint, and its nearby points in the support area are aligned with it. As illustrated in Figure 2b, the support zone is then separated into various volumes along the radial, azimuth, and elevation axes. A local histogram is constructed for each volume by collecting point counts in bins based on the angles between the normals at nearby points inside the volume and the normal at the keypoint. Finally, the SHOT [10] descriptor is created by concatenating all of the local histograms.
3) UNIQUE SHAPE CONTEXT FOR 3D DATA DESCRIPTION (USC) [8]
This method is a 3DSC [35] enhancement that eliminates the need to compute multiple descriptors at a single keypoint. For each keypoint, an LRF is first created. The local surface is then aligned with the LRF to guarantee rigid-transformation invariance. The keypoint's support area is then separated into numerous bins, as illustrated in Figure 2c. Finally, a USC [8] descriptor is constructed by adding the total number of points for each bin.

4) FAST POINT FEATURE HISTOGRAM (FPFH) [9]
As illustrated in Figure 2d, the first step is to construct a Simplified Point Feature Histogram (SPFH) for each point by computing the associations between the point and its neighbors. FPFH is then calculated as the weighted sum of the SPFH of the feature point and the SPFH of the points in the support area.

FIGURE 2. Geometric feature-based methods [36].

B. LEARNING-BASED METHODS
The introduction of deep learning approaches in image processing has also greatly benefited this field. The learned 3D feature descriptors have recently taken over and now outperform the hand-crafted alternatives. Gojcic et al. [37] developed an end-to-end framework for multiview point-cloud registration by directly learning to register every view of a scene in a uniform manner across all views. TEASER [38] resolves the rotation sub-problem via graduated non-convexity. The technique makes effective use of Douglas-Rachford splitting to certify global optimality and thereby avoids the considerable computational cost of SDP relaxation. Using a set of criteria, Serafin and Grisetti [39] improved the convergence of the minimization function by pruning problematic correspondences using normal and tangent information, while also generating an extended evaluation of each point correspondence.
Many new methods have also been introduced as this area has become very challenging. In the following, we analyze the newest and most effective methods for point-cloud registration.

1) LEARNING COMPACT GEOMETRIC FEATURES [40]
A descriptor is introduced that can be applied directly to a set of unordered points. This design has the advantage that no surface parameterization, volumetric representation, or additional depth image synthesis is required. The feature facilitates nearest-neighbor searches in Euclidean space, allowing dense mappings between point sets in near-linear time. Finally, the design relies on multilayer perceptrons (MLPs) to map hand-crafted features into a compact feature space.
2) LEARNING THE MATCHING OF LOCAL 3D GEOMETRY IN RANGE SCANS (3DMATCH) [33]
3DMatch is built on a fully convolutional siamese network architecture that learns 3D local feature descriptors. It extracts a feature for each interest point in a 3D point-cloud to integrate the local structure near the interest point. In 3DMatch [33], the 3D point-cloud must first be converted into 3D volumetric data, and the local representation is extracted by feeding the volumetric data into the neural network.
3) GLOBAL CONTEXT AWARE LOCAL FEATURES FOR ROBUST 3D POINT MATCHING [17] AND UNSUPERVISED LEARNING OF ROTATION-INVARIANT 3D LOCAL DESCRIPTORS [41]
PPFNet [17] uses the PointNet [42] architecture to learn 3D patch descriptors by combining the characteristics of pairs of points. PPFNet, however, is not truly rotation-invariant. To solve this problem, PPF-FoldNet [41] mainly uses point pair features (PPF) as input, which are inherently rotation-invariant, and integrates a FoldingNet [43] architecture to enable unsupervised training of rotation-invariant descriptors.

4) FULLY CONVOLUTIONAL GEOMETRIC FEATURES (FCGF) [24]
This approach, as the name suggests, is fully convolutional. FCGF [24] generates high-resolution features rapidly and requires neither low-level preprocessing nor 3D patches as input. To expand the receptive field and extract geometric information, a fully convolutional 3D network with metric learning is used. After extracting correspondences between two point-clouds, the transformation is recovered using robust pose estimators.

5) D3FEAT: JOINT LEARNING OF DENSE DETECTION AND DESCRIPTION OF 3D LOCAL FEATURES [16]
The D3Feat [16] design is mainly based on the KPConv [44] architecture, from which the feature extraction over a point-cloud is taken. D3Feat introduced a keypoint selection method that overcomes the intrinsic density changes of 3D point-clouds, as well as a self-supervised detector loss driven by on-the-fly feature matching results during training. D3Feat's contribution to this fully convolutional design is the incorporation of a descriptor with a salient point detector.

6) DeepVCP: AN END-TO-END DEEP NEURAL NETWORK FOR 3D POINT CLOUD REGISTRATION [45]
DeepVCP extracts local features using PointNet++ [46], then filters them using a weighting layer that keeps only the most relevant ones. The feature descriptors are computed using a mini-PointNet structure and then passed into a corresponding-point generation layer, which creates the relevant keypoints in the target point-cloud. To regress the final transformation, two loss functions are coupled, aiming to encode both the local similarities and the global geometric constraints.

1) NgeNet: NEIGHBORHOOD-AWARE GEOMETRIC ENCODING NETWORK FOR POINT CLOUD REGISTRATION [11]
To encode the information of the point-clouds, NgeNet [11] uses siamese shared layers with a new geometric-encoding interaction module applied to the superpoints, as well as multi-scale parallel decoding layers to extract multi-level point-wise characteristics for each point-cloud. On indistinguishable surfaces, a learning-free consistency voting scheme is designed to choose the feature with the appropriate neighborhood for each point and eliminate erroneous features.
2) CoFiNet: RELIABLE COARSE-TO-FINE CORRESPONDENCES FOR ROBUST POINT-CLOUD REGISTRATION [13]
CoFiNet primarily uses a KPConv-based [44] encoder-decoder architecture. For context aggregation, the authors incorporated two attention-based networks into this design. In the first phase, the dense points are down-sampled to evenly distributed nodes, and the features are enhanced before being used to generate the similarity matrix. The confidence matrix is then used to provide coarse node correspondences.
3) GEOMETRIC TRANSFORMER FOR FAST AND ROBUST POINT-CLOUD REGISTRATION [14]
The method learns features at several resolution levels by downsampling the input point-clouds. Using the Geometric Transformer, which iteratively encodes intra-point-cloud geometric structures and inter-point-cloud geometric consistency, the architecture recovers high-quality superpoint correspondences between the source and target point-clouds. Finally, a local-to-global registration approach is used to compute the transformation between the point-clouds.

4) PREDATOR: REGISTRATION OF 3D POINT-CLOUDS WITH LOW OVERLAP [15]
The network is built using the KPConv [44] architecture and can be split into three main steps. In the first step, the two input point-clouds are converted into smaller sets of superpoints and encoded into the associated latent features with shared weights. Then, an overlap-attention module is used to extract co-contextual information. In the last phase, the network predicts dense overlap scores indicating the degree of confidence that the points lie in the overlap regions.

III. PRELIMINARIES

A. PROBLEM STATEMENT
Given the source point-cloud $\mathcal{S} = \{s_i \in \mathbb{R}^3\}_{i=1,2,\cdots,Y}$ and the target point-cloud $\mathcal{T} = \{t_j \in \mathbb{R}^3\}_{j=1,2,\cdots,Z}$, where $Y$ and $Z$ are the cardinalities of the source and target point-clouds, the main idea of point-cloud registration is to find a rigid transformation $T \in SE(3)$ that best aligns the source and target point-clouds. The transformation can be recovered by solving

$$T^{*} = \operatorname*{arg\,min}_{T \in SE(3)} \frac{1}{|\kappa|} \sum_{(s_i,\, t_j) \in \kappa} \left\lVert T(s_i) - t_j \right\rVert_2^2,$$

where $\kappa$ is the set of correspondences and $|\cdot|$ denotes the cardinality of the set.
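For a fixed correspondence set, this minimizer has a well-known closed-form least-squares solution. The NumPy sketch below is a generic Kabsch/SVD helper for illustration, not part of the D3GATTEN pipeline itself.

```python
import numpy as np

def best_rigid_transform(src: np.ndarray, tgt: np.ndarray):
    # src, tgt: (K, 3) corresponding points (the set kappa above).
    c_src, c_tgt = src.mean(axis=0), tgt.mean(axis=0)
    H = (src - c_src).T @ (tgt - c_tgt)   # 3x3 cross-covariance matrix
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T                        # optimal rotation
    if np.linalg.det(R) < 0:              # guard against reflections
        Vt[-1] *= -1
        R = Vt.T @ U.T
    t = c_tgt - R @ c_src                 # optimal translation
    return R, t
```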

B. SELF-ATTENTION MECHANISM
Attention was first presented as an extension of recurrent neural networks (RNNs) [47]. The transformer introduced in [32] is a significant advance in attention research, as it shows that the attention mechanism can obtain state-of-the-art results. The self-attention mechanism was originally developed for natural language processing but was quickly adopted for other tasks, such as processing images [48] and videos [49]. Given three matrices Q (queries), K (keys), and V (values), which are projections of the input of the layer, the output of the attention mechanism is the weighted sum of the values, weighted by the compatibility score between queries and keys. The scores indicate how much attention should be paid to other locations or words in the input sequence. Once we have determined our keys (K), queries (Q), and values (V), we can compute attention as follows. First, the dot product of the query and the key is computed. If we execute this on a large number of queries and keys at the same time, we can express the dot products as a matrix multiplication, $Q \cdot K^T$, where Q is a matrix of queries and K is a matrix of keys. The result is then divided by the square root of the dimension of the key vectors, $\sqrt{d_k}$, to prevent the dot products from becoming too large as the vector length grows. The softmax function rescales the values between 0 and 1 and normalizes them. Finally, the resulting weights are multiplied by the values to lower the relevance of irrelevant terms and focus only on the most significant ones.
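As a minimal illustration of the computation just described, the following PyTorch sketch implements scaled dot-product attention as in [32]; the function name and tensor shapes are our assumptions for the example.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q: torch.Tensor,
                                 k: torch.Tensor,
                                 v: torch.Tensor) -> torch.Tensor:
    # q, k, v: (batch, n, d_k) query, key, and value projections.
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5  # Q · K^T / sqrt(d_k)
    weights = F.softmax(scores, dim=-1)            # rescale rows to [0, 1]
    return weights @ v                             # weighted sum of the values
```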

IV. PROPOSED METHOD
A. NETWORK ARCHITECTURE

Figure 3 depicts the architecture of our method. The segmentation part of KPConv [44] has been adapted. The network consists of 5 convolutional layers. Each layer is made up of two convolutional blocks, the first block of each layer except the first being strided. Our convolutional blocks are constructed similarly to bottleneck ResNet blocks [50], with batch normalization and leaky ReLU activation. The layers of the encoding and decoding parts are connected by skip connections. Extracting the most accurate and powerful features is done using the suggested self-attention module. In the sections below, we present our architecture and the proposed self-attention module. To better understand our approach, we start with a diagram describing the architecture of the proposed self-attention-based model, shown in Figure 4 (D3GATTEN's network architecture: each block is a ResNet [50] block that uses KPConv in place of image convolution; except for the last layer, all layers are followed by batch normalization and ReLU).

B. KEYPOINT DETECTION PIPELINE
Inspired by D2-Net, a recent technique for 2D image matching [51], we develop a single neural network that serves two functions: a dense feature descriptor and a feature detector. However, because of the inconsistent structure and variable sparsity of point clouds, applying the D2-Net concept to the 3D world is not straightforward. Following that, we outline the basic procedures for performing feature description and detection on 3D point clouds with irregular shapes before outlining how the sparsity variation in the 3D domain is handled.

1) DENSE FEATURE EXTRACTION
To perform dense feature extraction, we use KPConv [44] as our backbone network. We will first go over the KPConv formulas briefly.
Given a set of points $P \in \mathbb{R}^{N \times 3}$ and a set of features $F \in \mathbb{R}^{N \times D}$, let $x_i$ and $f_i$ denote the $i$-th point in $P$ and its corresponding feature in $F$. The generic convolution by a kernel $g$ at point $x$ is defined as

$$(F * g)(x) = \sum_{x_i \in \mathcal{N}_x} g(x_i - x)\, f_i,$$

where $\mathcal{N}_x$ denotes the radius neighborhood of point $x$ and $x_i$ denotes a supporting point inside this radius neighborhood. The kernel function is defined as

$$g(y) = \sum_{k=1}^{K} h(y, \hat{x}_k)\, W_k,$$

where $h$ represents the correlation function between the kernel point $\hat{x}_k$ and the supporting point, $K$ is the number of kernel points, and $W_k$ is the weight matrix of the kernel point $\hat{x}_k$. To guarantee that the convolution is sparsity invariant, a density normalization factor accounting for the number of supporting points close to $x$ is added:

$$(F * g)(x) = \frac{1}{|\mathcal{N}_x|} \sum_{x_i \in \mathcal{N}_x} g(x_i - x)\, f_i.$$

Our network produces a dense feature map, a two-dimensional matrix $F$, as its output.
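To make the formulation concrete, the following sketch evaluates the density-normalized convolution at a single query point. The linear correlation $h(y_i, \hat{x}_k) = \max(0,\, 1 - \lVert y_i - \hat{x}_k \rVert / \sigma)$ and the choice of influence distance $\sigma$ are taken from the KPConv paper [44]; all names and shapes are illustrative, not our actual implementation.

```python
import torch

def kpconv_at_point(x, points, feats, kernel_pts, weights, radius):
    # x: (3,) query point; points: (N, 3); feats: (N, D_in)
    # kernel_pts: (K, 3) kernel point positions x_hat_k
    # weights: (K, D_in, D_out) per-kernel-point weight matrices W_k
    diff = points - x
    mask = diff.norm(dim=1) < radius          # radius neighborhood N_x
    y, f = diff[mask], feats[mask]            # relative positions and features
    sigma = radius / 2                        # assumed kernel influence distance
    # Linear correlation h(y_i, x_hat_k), clipped at zero.
    corr = (1 - torch.cdist(y, kernel_pts) / sigma).clamp(min=0)  # (n, K)
    # sum_i g(x_i - x) f_i with density normalization 1 / |N_x|.
    out = torch.einsum('nk,kio,ni->o', corr, weights, f)
    return out / max(len(y), 1)
```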

2) DENSE KEYPOINT DETECTION
In D2-Net [51], the local maxima within and across the channels of the deep feature maps are defined as keypoints, using the same maps that are used for the descriptors. Because of the non-uniform sampling of point clouds, this procedure is extended to 3D by using a radius neighborhood instead. If we used a softmax function to estimate the local maximum in the spatial dimension, local regions with few points (for example, regions close to the boundaries of indoor scenes or far from the Lidar center of outdoor scenes) would receive higher scores. To address this issue, a density-invariant saliency score is introduced to assess a point's saliency relative to its local neighborhood.
Using the dense feature map $F \in \mathbb{R}^{N \times c}$ as input, we can consider it a collection of score maps $D^k$ ($k = 1, \dots, c$):

$$D^k = F_{:k},$$

where $F_{:k}$ is used to indicate the $k$-th column of the two-dimensional matrix $F$. Therefore, a point $x_i$ is a keypoint if

$$k = \operatorname*{arg\,max}_{t} D_i^t \quad \text{and} \quad i = \operatorname*{arg\,max}_{x_j \in \mathcal{N}_{x_i}} D_j^k,$$

where $\mathcal{N}_{x_i}$ is the radius neighborhood of $x_i$. The local-max score in D2-Net [51] is defined as

$$\alpha_i^k = \frac{\exp(D_i^k)}{\sum_{x_j \in \mathcal{N}_{x_i}} \exp(D_j^k)}.$$

This formulation is not sparsity-invariant: because the scores are normalized by a sum, sparse locations naturally receive higher scores than dense ones. As a result, the following density-invariant saliency score is used instead:

$$\alpha_i^k = \ln\left(1 + \exp\left(D_i^k - \frac{1}{|\mathcal{N}_{x_i}|} \sum_{x_j \in \mathcal{N}_{x_i}} D_j^k\right)\right).$$

With this score, a point's saliency is determined by subtracting the mean feature of its local neighborhood from its own feature.

To pick out the most preeminent channel for each point, a channel max score is designed:

$$\beta_i^k = \frac{D_i^k}{\max_t D_i^t}.$$

The final keypoint detection score combines the two scores:

$$s_i = \max_k \left(\alpha_i^k\, \beta_i^k\right).$$

After obtaining the keypoint score map of an input point cloud, we choose the points with the highest scores as keypoints.
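These scores operate directly on the dense feature map; a compact sketch, assuming a precomputed fixed-size neighbor-index lookup and non-negative feature responses, could look as follows.

```python
import torch

def keypoint_scores(F: torch.Tensor, neighbors: torch.Tensor) -> torch.Tensor:
    # F: (N, c) dense feature map; neighbors: (N, m) indices of each
    # point's radius neighborhood (hypothetical precomputed lookup).
    local_mean = F[neighbors].mean(dim=1)           # mean feature over N_{x_i}
    alpha = torch.log1p(torch.exp(F - local_mean))  # density-invariant saliency
    beta = F / F.max(dim=1, keepdim=True).values    # channel max score
    return (alpha * beta).max(dim=1).values         # final score s_i per point
```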

C. PROPOSED SELF-ATTENTION
The fastai implementation of the self-attention layer described in the SAGAN [52] paper is modified and simplified in our approach. We first briefly outline the current approach and then describe our method.
According to the SAGAN [52] paper, to calculate attention, the image features of the previous hidden layer $F \in \mathbb{R}^{C \times N}$ are first transformed into two feature spaces $f$ and $g$, where $f(F) = W_f F$ and $g(F) = W_g F$, with $C$ being the number of channels and $N$ the number of feature locations of the previous hidden layer. For self-attention, $\beta_{j,i}$ is computed as

$$\beta_{j,i} = \frac{\exp(s_{ij})}{\sum_{i=1}^{N} \exp(s_{ij})}, \quad \text{where} \quad s_{ij} = f(F_i)^T g(F_j),$$

and $\beta_{j,i}$ specifies how much attention the model pays to the $i$-th location while synthesizing the $j$-th region. The attention layer's output is $O = (O_1, O_2, \dots, O_N)$, where

$$O_j = v\left(\sum_{i=1}^{N} \beta_{j,i}\, h(F_i)\right), \quad h(F_i) = W_h F_i, \quad v(F_i) = W_v F_i.$$

In the preceding formulation, $W_f$, $W_g$, $W_h$, and $W_v$ are the learned weight matrices, realized as $1 \times 1$ convolutions. In addition, the output of the attention layer is multiplied by a scale parameter and added back to the input feature map. As a result, the final output is given by $y_i = \gamma O_i + F_i$, where $\gamma$ is a learnable scalar initially set to 0.

1) OUR SUGGESTED SELF-ATTENTION MODULE
The original layer takes the image features $x$ of shape $(C, N)$, where $N = H \cdot W$, and transforms them into $f(x) = W_f \cdot x$ and $g(x) = W_g \cdot x$, where $W_f$ and $W_g$ have shape $(C, C')$ and $C'$ is chosen to be $C/8$. These matrix multiplications can be expressed as $1 \times 1$ convolution layers. Then, $S = f(x)^T \cdot g(x)$ is computed. Our first proposed simplification is to combine $W_f^T \cdot W_g$ into a single $(C \times C)$ matrix $W$. Therefore, $S = x^T \cdot W \cdot x = S(x, x)$ (a bilinear form) has shape $(N \times N)$ and represents the influence of each pixel on the other pixels. As a result, instead of learning the weights $W_f$ and $W_g$ for two convolution layers, we just learn the weights $W$ of one convolution layer. The advantages are simplicity, the removal of one design choice ($C' = C/8$), and a matrix $W$ that offers more possibilities than $W_f^T \cdot W_g$. Computing the softmax of the matrix $S$ is the next step in the original layer design. We decided to skip this step entirely and operate with unrestricted weights rather than normalized probability-like weights. The final step in the original version is to calculate $h(x) = W_h \cdot x$, where $W_h$ has shape $(C \times C)$ and is also implemented as a $1 \times 1$ convolution layer. We propose to remove this final convolution layer and let the output be $y_i = \gamma \cdot x \cdot S + x$. If desired, this last convolution can be re-added as a separate layer, implying a new position for the skip connection.
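A minimal PyTorch sketch of the simplified layer, with the single matrix $W$ realized as a $1 \times 1$ convolution, no softmax, and no final projection, is given below; the batch dimension and module name are assumptions of the example rather than our exact implementation.

```python
import torch
import torch.nn as nn

class SimplifiedSelfAttention(nn.Module):
    """Single-matrix self-attention: y = gamma * (x @ S) + x, with S = x^T W x."""

    def __init__(self, channels: int):
        super().__init__()
        # One 1x1 convolution realizing the learned (C x C) matrix W.
        self.w = nn.Conv1d(channels, channels, kernel_size=1, bias=False)
        # Learnable scale, initialized to 0 as in SAGAN [52].
        self.gamma = nn.Parameter(torch.zeros(1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, N) features at N locations.
        wx = self.w(x)                         # W · x, shape (B, C, N)
        s = torch.bmm(x.transpose(1, 2), wx)   # bilinear form x^T W x, (B, N, N)
        out = torch.bmm(x, s)                  # unrestricted weights, no softmax
        return self.gamma * out + x            # scaled skip connection
```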

2) SUGGESTED SELF-ATTENTION MODULE EXPERIMENTS
Compared to the original self-attention layer, the enhanced self-attention layer offers a significant improvement in terms of complexity. The original self-attention layer applies a softmax to an $N \times N$ matrix, which is $O(N^2)$, since a softmax of cost $O(N)$ [53] is performed $N$ times. In terms of a runtime comparison, we put our self-attention model up against ResNet, VGG, and AlexNet models. To make an accurate comparison, we evaluated all models with the same configuration (nr. of epochs = 50, dataset = Imagewoof, image size = 128, and nr. of runs = 20). In the last column of Table 1, we report the wall-clock time.
As shown in Table 1, our suggested self-attention module reduces runtime while producing results that are similar to those of the original self-attention.

D. IMPLEMENTATION DETAILS
In terms of architectural implementation, our network is built in PyTorch [54]. D3Feat [16] served as a starting point for our implementation. We train with momentum SGD, with a fixed learning rate of 0.01, a momentum of 0.98, and a weight decay of $10^{-6}$. For each point-cloud fragment, we augment the data by adding Gaussian noise with a standard deviation of 0.015. Except for the last layer, all layers are followed by batch normalization and ReLU. The encoder section has a fixed number of channels (64, 128, 256, 512, 1024). Residual connections are used between the corresponding layers of the encoder and decoder parts. To obtain the final 32-dimensional features, the output features are passed through a $1 \times 1$ convolution.
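For concreteness, a minimal sketch of the optimizer settings and the Gaussian augmentation described above (function names are ours, for illustration only):

```python
import torch

def make_optimizer(model: torch.nn.Module) -> torch.optim.SGD:
    # Momentum SGD with the fixed hyper-parameters listed above.
    return torch.optim.SGD(model.parameters(), lr=0.01,
                           momentum=0.98, weight_decay=1e-6)

def augment_fragment(points: torch.Tensor) -> torch.Tensor:
    # Add Gaussian noise with a standard deviation of 0.015 to a fragment.
    return points + 0.015 * torch.randn_like(points)
```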

V. EXPERIMENTAL VALIDATION

A. PAIRWISE REGISTRATION EVALUATION
To validate our proposed method, we used the 3DMatch [33] indoor benchmark to test our model for registration and the TUM RGB-D benchmark [55] for pose estimation. Additionally, we compared our method to the most recent point-cloud registration methods. We utilize the same approach to prepare the training and testing data as in [21]. There are 62 scenes, 46 of which are used for training, 8 for validation, and 8 for testing. In 3DMatch, point-cloud pairs have on average 30% overlap.

1) EVALUATION METRICS
We employ three metrics to evaluate the performance under registration for the 3DMatch benchmark: registration recall (RR), feature match recall (FMR), and inlier ratio (IR).

Registration Recall (RR).
The proportion of scan pairs for which the correct transformation parameters are identified with RANSAC [26]. It computes the root mean square error of the ground-truth correspondences $\Omega^*$ under the predicted transformation $\hat{T}$,

$$\mathrm{RMSE} = \sqrt{\frac{1}{|\Omega^*|} \sum_{(x^*,\, y^*) \in \Omega^*} \left\lVert \hat{T}(x^*) - y^* \right\rVert^2},$$

and reports the fraction of alignments with RMSE < 0.2 m, where $\Omega^*$ is the collection of ground-truth pairings in fragments $\{i, j\}$ and $x^*$ and $y^*$ are the 3D coordinates of a ground-truth pair.

Feature Match Recall (FMR). The fraction of pairs with more than $\tau_2 = 5\%$ ''inlier'' matches, where an inlier has a residual below $\tau_1 = 10$ cm under the ground-truth transformation $T^*$:

$$\mathrm{FMR} = \frac{1}{M} \sum_{s=1}^{M} \mathbb{1}\left( \frac{1}{|\Omega_s|} \sum_{(x_i,\, y_i) \in \Omega_s} \mathbb{1}\left( \lVert T^*(x_i) - y_i \rVert < \tau_1 \right) > \tau_2 \right),$$

where $M$ is the total number of point-cloud pairs, $\Omega_s$ is the set of putative correspondences between a pair of point-clouds (source and target), $x_i$ and $y_i$ are the 3D coordinates from the source and target point-clouds, and $T^*$ is the ground-truth transformation.

Inlier Ratio (IR). The proportion of accurate correspondences among the putative matches, complementing the previous two metrics. To estimate the transformation matrix, we use RANSAC with a maximum of 50,000 iterations, as described in [21].
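For illustration, the FMR metric can be computed as in the following sketch; the input format of the correspondence sets is an assumption of the example.

```python
import numpy as np

def feature_match_recall(pairs, tau1=0.10, tau2=0.05):
    # pairs: list of (src, tgt, T_gt) tuples, where src/tgt are (K, 3)
    # arrays of mutually matched keypoints and T_gt is the (4, 4)
    # ground-truth transformation (hypothetical input format).
    hits = 0
    for src, tgt, T_gt in pairs:
        src_h = np.c_[src, np.ones(len(src))]     # homogeneous coordinates
        aligned = (T_gt @ src_h.T).T[:, :3]       # apply ground-truth transform
        residuals = np.linalg.norm(aligned - tgt, axis=1)
        inlier_ratio = np.mean(residuals < tau1)  # tau1 = 10 cm threshold
        hits += inlier_ratio > tau2               # tau2 = 5% inlier ratio
    return hits / len(pairs)
```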
The sensitivity of the feature to the inlier distance threshold $\tau_1$ and the inlier ratio threshold $\tau_2$ is demonstrated in Figure 5. Overall, our techniques outperform other methods in terms of Feature Match Recall across diverse scenes, as well as across a wide range of distance and inlier recall thresholds. In terms of runtime, our method is in the same range as the competitors.

TABLE 2. Feature-matching recall and its standard deviation for the original and rotated data.

3) EVALUATION UNDER DIFFERENT NUMBERS OF KEYPOINTS
We also provide the results when lowering the number of sampled points from 5000 to 2500, 1000, 500, and even 250 to further highlight the advantages of using a self-attention module when the keypoints are selected. As shown in Table 3, our strategy is one of the most effective when employing 5000 points in terms of Feature Matching Recall. When the number of keypoints is lowered, our method produces results that are comparable to those achieved by other methods, but not the best. Because our keypoint selection is based on a self-attention module, this was to be expected: self-attention modules require a sufficiently large number of input points to provide effective results. Outliers at the coarse scale are not explicitly rejected by design; false coarse correspondences can propagate into false point correspondences, which may lead to a lower inlier ratio at finer scales. Despite these flaws, our module produces competitive results in terms of the inlier ratio.

4) ROTATION AND TRANSLATION INVARIANCE
Rotation and translation invariance is one of the most significant aspects of excellent geometric features. Experiments show that, via low-cost data augmentation, a fully convolutional network can empirically obtain significant rotation invariance. We train the model without rotation augmentation and evaluate it on the 3DMatch dataset to show the impact of basic data augmentation on rotation robustness. To test the rotation invariance of our algorithm, we rotate all fragments of the 3DMatch dataset in the same way as [41]: all fragments in the 3DMatch test set are rotated along all three axes by a random angle drawn from a uniform distribution over [0, 2π]. The results are shown in Table 4. We evaluated the model at several numbers of keypoints; the model trained without rotation augmentation is unable to learn rotations from the data, which explains the poor results. The last column represents the result obtained after applying the transformation. As one can see, our algorithm registered the two point-clouds without difficulty, ranking among the top 5 methods with the best FMR for a large number of keypoints.

5) REGISTRATION RESULTS
In order to illustrate the results reported in Table 2, we created a visualization of how our algorithm behaves. As can be seen in Figure 6, the first two columns show the input point-clouds (source and target); the 3rd column shows the two inputs superimposed without applying the transformation calculated with our method; and the last column shows the registration results.

6) VOTE-RELATED INFLUENCE
A sample of how consistent voting works is shown in Figure 7. Consistent voting gives each point the most discriminative characteristic possible, which causes massively mismatched relationships to be pruned and the proper ones to be retained.

7) EXPERIMENTS WITH NOISY DATASET
This section focuses on noise-robustness testing methodologies. We compared our technique with the most competitive current methods, including NgeNet [11], CoFiNet [13], and PREDATOR [15]. Images obtained from a camera are often corrupted by noise. Taking this fact into account, we set out to simulate the real-world environment, so we introduced two types of noise: occlusion noise and Gaussian noise. We also addressed both the setting where training is done on noise-free data and evaluation on noisy data, and the setting where both training and evaluation use noisy data.

a: TRAINING FOR THE NOISY DATASET
The following situations were considered: 1) we trained the model on noiseless training data and tested it on noisy data; the goal of this test is to see how a pre-trained model performs on a real dataset corrupted by noise (sim2real). 2) We trained the model using noise-corrupted training data and evaluated it on the same test set as in the first scenario; this is equivalent to a real2real approach. For each type of noise, two settings were tested: the radius of the occluded zone was set to 30 cm and 50 cm for occlusion noise, while for Gaussian noise the standard deviation was set to 0.5 in the initial phase and then increased to 0.8. The findings are provided in the following subsections, and a sketch of both noise models is given below.
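A minimal sketch of the two corruption models with the parameter values quoted above (the function names and the uniformly random occlusion center are our assumptions):

```python
import numpy as np

def occlusion_noise(points: np.ndarray, radius: float = 0.3) -> np.ndarray:
    # Remove a radial zone (radius 0.3 m or 0.5 m) around a random point.
    center = points[np.random.randint(len(points))]
    keep = np.linalg.norm(points - center, axis=1) > radius
    return points[keep]

def gaussian_noise(points: np.ndarray, std: float = 0.5) -> np.ndarray:
    # Additive Gaussian noise applied independently along X, Y, and Z.
    return points + np.random.normal(0.0, std, size=points.shape)
```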

b: OCCLUSION NOISE
The occlusion test addresses one of the most common problems in the object identification process: the object occlusion problem. It effectively means that the template is only partially visible, i.e., an obscured observation is available for recognition. In our experimental setting, we varied the occlusion by eliminating a radial zone of an object at a random location. The number of points deleted is expressed, as a tuning parameter, as a percentage of the template's point-cloud size. The results obtained after applying the algorithm are visible in Figure 8a and Figure 8b.

c: GAUSSIAN NOISE
The most typical issue with real-world data is sensor noise, which refers to measurement errors caused by physical sensor limits. The non-systematic part of the noise is often modeled using a Gaussian probability distribution. We applied this assumption in our case as well and evaluated the system with varying additive Gaussian noise added independently along the XYZ coordinates. The results obtained after applying Gaussian noise can be seen in Figure 9a and Figure 9b. Table 5 illustrates the outcomes after training on noise-free data and evaluating on noisy data, whereas Table 6 shows the results after training with noisy data. Figures 10 and 11 provide a graphic illustration of the results. After analyzing the results, we can conclude that our algorithm is robust to noise and outperforms the other approaches in both situations. To demonstrate that adding the attention module improves performance, we tested our algorithm both with and without the newly developed self-attention module; the results show that the developed self-attention module improves performance significantly. We are followed by the NgeNet [11] algorithm, which performs well in both scenarios. Additionally, we can say that the sim2real data processing approach works, but the real2real approach is more relevant in real-life scenarios.

B. POSE ESTIMATION EVALUATION
We utilize the TUM RGB-D [55] benchmark dataset to test the extended system with our suggested technique. The dataset contains a number of sequences that include RGB and depth frames captured by an RGB-D sensor, as well as the ground-truth sensor trajectory. We employ the fr1 data sequences, recorded at a typical desk in an office environment. The sequences include challenges such as lighting changes, repeated structures, and translational motions along the primary axes.

1) EVALUATION METRICS
To evaluate performance, we use the benchmark's absolute trajectory error (ATE) and relative pose error (RPE) metrics. The ATE is a global consistency metric that calculates the absolute distances between the corresponding poses of the trajectories to measure the translational drift between the predicted trajectory and the ground truth. The RPE estimates the trajectory's local accuracy over a certain time interval.
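As a simple illustration, the translational ATE reduces to an RMSE over corresponding poses once the two trajectories have been aligned; the alignment step is omitted in this sketch.

```python
import numpy as np

def absolute_trajectory_error(est_t: np.ndarray, gt_t: np.ndarray) -> float:
    # est_t, gt_t: (N, 3) translation parts of already-aligned trajectories.
    d = np.linalg.norm(est_t - gt_t, axis=1)  # absolute pose distances
    return float(np.sqrt((d ** 2).mean()))    # RMSE over the trajectory
```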

2) EVALUATION FOR TUM RGB-D BENCHMARK
We performed two types of tests. In the first, we used the TUM RGB-D dataset without any added noise, and in the second, we added Gaussian noise with a standard deviation of 0.05. Table 8 shows the results of the test in terms of RPE and ATE for noiseless data, while Table 9 shows the results for noisy data generated in the same way as in the previous noise test cases. Figure 12 provides a visual representation of the results obtained with our method.

3) RESULTS
In terms of results analysis, the first test scenario used noiseless data. In this case, NgeNet [11] outperforms the other methods, followed by PREDATOR [15]. Our method is ranked third, while the CoFiNet [13] algorithm is ranked last. The situation is slightly different when Gaussian noise is added to the TUM RGB-D dataset. With this scenario, we want to simulate a sim2real environment. In this case, our method outperforms the other existing algorithms and shows that D3GATTEN is a robust method against noise for pose estimation. We achieved the best results in the translation error estimation, while for the rotation error our method provided the best mean estimation.

C. RUNTIME
In Table 7, we compare the runtime of our method on the 3DMatch dataset with PREDATOR [15], CoFiNet [13], NgeNet [11], and D3Feat [16] to demonstrate its efficiency. A voxel size of 2.5 cm and a batch size of 1 were set for each approach. The tests ran on 4 × Nvidia A100 GPUs with an Intel(R) Xeon(R) Gold 6226R @ 2.90 GHz × 16 and 750 GB RAM. The data loading time was also taken into account when calculating the runtime. According to our analysis, the best runtime is provided by PREDATOR. With this table, we want to show that the addition of the self-attention module does not result in a significantly longer runtime.

VI. CONCLUSION
In this work, we propose a novel self-attention-based keypoint feature extraction method for point clouds. By adapting an efficient self-attention module to the D3Feat method, we obtained a runtime-efficient keypoint feature extraction method. The performance of the proposed method was tested on public datasets for both registration and pose estimation tasks. For registration, the best performance against existing methods was measured for larger numbers of keypoints using the FMR metric, setting a new state of the art on the 3DMatch benchmark with 98.3% feature matching recall. For pose estimation, the method obtained the best results in terms of translation errors.
With a runtime close to the best of the state of the art, we aim in the future to validate the method in a SLAM context on embedded devices.