Optimizing LBVH‐Construction and Hierarchy‐Traversal to accelerate kNN Queries on Point Clouds using the GPU

Processing point clouds often requires information about the point neighbourhood in order to extract, calculate and determine characteristics. We continue the tradition of developing increasingly faster neighbourhood query algorithms and present a highly efficient algorithm for solving the exact neighbourhood problem in point clouds using the GPU. Both, the required data structures and the kNN query, are calculated entirely on the GPU. This enables real‐time performance for large queries in extremely large point clouds. Our experiments show a more than threefold acceleration, compared to state‐of‐the‐art GPU based methods including all memory transfers. In terms of pure query performance, we achieve over 105 answered neighbourhood queries per millisecond for 16 nearest neighbours on common graphics hardware.


Introduction
A multitude of algorithms in various domains rely on finding the k nearest neighbours (kNN) of a query in a data set, including databases, machine learning, computer vision, computational geometry, robotics, computer graphics or physical simulations. The size of data sets has increased enormously in recent years. Improved scanning technologies have led to the ability to scan huge environments or objects in extremely high resolution. Since such point clouds rarely have more information than an additional vertex normal, various algorithms have to be applied to the scan data, e.g. registration of several data sets, calculation or smoothing of surface normals, removal of residual noise and outliers from the scanning process, sub-sampling, segmentation, reconstruction and feature matching. These almost always require finding the k nearest neighbours for each data set point, which is a very expensive and time-consuming step and a challenge for real-time capability. To cope with a large amount of data in as little time as possible, the trend is to use GPUs to accelerate such costly steps, as they have established themselves as a reliable co-processor of the CPU for highly data-parallel applications.
Our contribution is a performance improvement over previous solutions running on the GPU to enable real-time performance for exact kNN queries even in dynamic scenarios. To achieve this, we create and optimize a spatial acceleration data structure for pointclouds and use a SIMD register-memory based priority queue for kNN search. Both, construction of the data structure and the query are executed entirely on graphics hardware, thus, avoiding expensive memory-copy bottlenecks. Our experiments with synthetic as well as real-world data sets show that our approach is scaleable and on average about 3.3 times faster compared to state-of-the-art methods.

Problem Statement
Nearest neighbour search describes the similarity search in data sets. Consider the metric space S, a data set X = (x 1 , . . . , x n ) ⊆ S, a query set Q = (q 1 , . . . , q n ) ⊆ S and a distance metric m : S 2 → . The k-nearest neighbour search then is the task of finding the k closest (w.r.t. m) data points to each q ∈ Q from the data set X.
High-dimensional metric spaces, memory and/or runtime restrictions and the number of query points as well as regular data set changes are challenging for any algorithm. In the last decades, several approaches have been proposed with proven upper bounds of computational complexity [Ben75,AMN*98], that generally seek to reduce the number of distance computations using different types of trees. However, for many methods based on k nearest neighbours the computation time needed to find these for all input points, still remains the bottleneck. For this reason, many algorithms are specially designed for certain application areas.
In this paper, we consider applications that require the k nearest neighbours (kNN) or k approximate neighbours (kANN) where S is a low-dimensional euclidean space (d = 3) and m is the L 2norm. This occurs especially in the context of real-time processing of 3D data. Examples are simulations, e.g. SPH [GJM77], point-cloud registration [BM92,PCS15] and processing or realtime photon-mapping [PDC*03, MLM13], which must find the kNN for at least each individual data point in the data set. In various cases this set changes constantly (e.g registration, simulations, photon-mapping), which also requires any index structures to be updated or rebuilt each time and forbids any expensive off line preprocessing.
Our main contributions addressing these problems in this paper are: • A highly parallel bottom up LBVH optimization step, specialized on point-cloud data. • A stackless BVH traversal algorithm that can find all exact kNN using a register based priority queue.

Related Work
Since the literature for finding the kNN is vast, we limit ourselves to GPU based kNN techniques. The methods developed so far can be divided into two main classes: Brute force approaches and selection based procedures.

Brute-force methods
There are many techniques to solve the kNN problem via a bruteforce approach using the GPU. Generally, for each query these methods initiate a calculation of the distance to all data points in the data set . The two most important approaches for calculating the  distance matrix are a direct implementation with a self-developed  kernel [GDB08, ZHWG08, KZ09, LLWJ09, LWLJ09, LLWJ10,  BGTP10, KH10, BGT*11, KH12, SPS12, ARBM12, DKMD13],  and the use of an already well optimized matrix multiplication routine [BDHK06, GDB08, SPS12, DKMD13, LA15, THE*15, JDJ17,  KD18], e.g using cuBLAS. User-defined direct implementations are typically optimized by tiling, which divides the distance matrix into several submatrices (tiles) of equal size. The used tile size is optimized in such a way, that a group of query and reference points can be stored in fast shared memory and reused by threads within the same compute block. The calculated distances must then be sorted and the nearest k extracted.

1) Squared Distance matrix:
Bustos et al. [BDHK06] were one of the first using GPUs for NN computation. They computed the squared distance matrix using custom shaders and performed multiple texture reductions to obtain the final nearest neighbour. By using the latest GPGPU architecture, it was possible for Garcia et al. [GDB08] to assign one thread per query for distance sorting after the brute force step. Instead of using one thread per query, Sismantis et al. [SPS12] used a parallelized truncated bitonic sort per query, while Dashti et al.
[DKMD13] relied on radix sort from Nvidias thrust library. Li et al. [LA15] decided to integrate a truncated merge sort directly into the matrix multiplication routine, which discards candidates as it becomes clear that they cannot belong to the top k. A k-selection by Tang et al. [THE*15] was accelerated by using a merge queue, a buffered search, and hierarchical partition to better support the SIMD architecture of GPUs. Johnson et al. [JDJ17] were able to handle data sets that are too large for current GPU main memory by relying on cuBLAS and a specialized k-selection algorithm. A multi GPU approach was proposed by Klusek et al. [KD18], which also computes the squared distance matrix with a subsequent sort operation to extract the kNN, but offer a better data distribution among the available GPUs.
2) custom distance kernel: Garcia [GDB08], Zhao [ZHWG08] and Kuang et al. [KZ09] use a custom matrix multiplication, paired with an insertion sort or radix sort algorithm respectively to find the kNN, leveraging the speed of modern sort libraries. Many approaches split the distance computation of each query into blocks. Liang et al. [LLWJ09,LWLJ09] find the local kNN within each block by simultaneously testing each distance against all others. A single thread per query then merges the lists using heap selection. Liang, Kato and Barrientos [LLWJ10, BGTP10, KH10, BGT*11, KH12] proposed different heap-based approaches for the kNN selection process, using one thread with a max heap per query or a heap-based reduction over multiple blocks [BGT*11]. Truncated sort was introduced by Sismanis et al. [SPS12]. Elements are removed from the vector when it is clear that they cannot belong to the set of smallest k. They describe several algorithms and show that their truncated bitonic sort has excellent performance on the GPU. Arefin et al. [ARBM12] maintain an unsorted array of size k for each query and a pointer to the largest element in the array. A single thread maintains this structure on each level with a linear scan. Dashti et al.[DKMD13] also use a radix-sort approach but on the whole distance matrix. Query-distances of the candidates are first sorted collectively and then sorted by search index to separate the results for each query. Then only the top k elements have to be extracted.
All brute-force approaches have in common, that they are only suitable for small data-and query-sets due to high memory requirements and only perform competitively on high dimensional data.

Selection based procedures
Selection based kNN methods are more diverse and can be further divided into either hashing, graph-based or spatial subdivision strategies bringing also some asymptotically more efficient algorithms to the GPU.

1) Hashing based approaches:
Most of the used hashing-based algorithms are variants of locality-sensitive hashing (LSH) [PLM10,PM11,PM12,LŽ15]. LSH uses several hash functions of the same type, which are location-sensitive. This enables neighbouring points to be more likely to fall into the same hash bucket than points that are far apart from each other. During the query stage, the search point is hashed with all hash-functions and then a linear search is performed in the set of data points with which the query matches. Pan et al. [PM11] additionally use a parallel RPtree or random projections [PM12] to pre-partition their data sets into several groups, so that items similar to each other are clustered together. Subsequently Bi-Level LSH is executed on each partition tree and the kNN are then extracted via short-list search. Lukac et al. improve the efficiency of existing LSH approaches by using Multi-Probe LSH [LŽ15], that also scans the nearby hash buckets of the one into which a query point is hashed. Therefore, a smaller number of hash tables is required to achieve a certain accuracy. Choosing the hash function affects performance and must be done carefully. For kANN a deterministic skip-list data structure is used to hold the kANN neighbours indices and distances. Wiechollek et al. [WWSHL16] introduced a two level product quantization tree (built upon a combination of inverted multiindex and hierarchical PQ) for high dimensional data sets. They combine their method with a new re-ranking algorithm based on closest-line projections and a bin ordering heuristic, which results in good performance for very high dimensional data sets. However, most hashing based approaches only provide approximate nearest neighbours.
2) kNN-Graph based algorithms: Fast kNN queries can be achieved by using a kNN-Graph as acceleration data structure. Fu et al. [FC16] use a hierarchic divide-and-conquer algorithm to construct an initial kNN graph using eight randomized truncated k-d trees for graph-initialization. A nearest neighbour descent is then applied to refine the graph. They report fast query and build times, but only produce approximate results.
3) Spatial partitioning methods: Spatial partitioning methods solve the kNN problem by constructing a spatial index. The index generation can be roughly categorized into those that partition the data and those that partition the space. The former try to cluster data on the basis of their spatial proximity as, i.e. Li et al. [LSP*12], which create multiple lists of the data set with shifted points. These are then sorted using a space filling curve. The kANN can be thus found by considering only the k preceding and k succeeding points in each shifted list. Hachisuka et al. [HJ10] use only a single hash list with exactly one data point per hash entry and can therefore only provide approximate kNN results. Space partitioning based approaches on the other hand make use of grids [PDC*03, LTF*09, SBMN16], octrees [GGG08] and k-d trees [ZHWG08,QMN09,ML09]. The kNN are searched by backtracking [QMN09] or heapbased priority search in the underlying tree-structure. Grid based methods [MB17] have the advantage of a very good and fast radius search, but can degenerate to a full search for exact kNN. This in turn is the strength of hierarchical techniques like octrees or kd trees. However, all these methods are limited to low dimensional problems but provide unbeatable performance compared to the previous approaches in these cases.

Design and Implementation
In the following, we discuss design and implementation of our approach. As already stated, spatial subdivision methods are very suitable for low-dimensional data and allow asymptotically efficient queries. Our proposed method is therefore based on a k-d tree.   [Ape14]. Leaf nodes are numbered from 0 to 7 and internal nodes from 0 to 6. Leaves were previously sorted based on their Morton code. The range of Morton keys covered by each node is indicated by a horizontal red bar.

LBVH construction
For the initial spatial acceleration structure we use the fast LBVH construction of Apetrei et al. [Ape14], which is based on ordering primitives along a space-filling curve. Compared to previous methods, this bottom-up construction algorithm is able to generate both tree-hierarchy and enclosing bounding boxes in one single and simple kernel launch as shown in Algorithm 2.
We first compute a Morton code for each item in the data set and sort all points accordingly using a parallel radix-sort. Subsequently, after creating the leaf nodes, an initial LBVH is built in a single bottom-up traversal by choosing the parent and simultaneously computing the bounding box at each step. The resulting tree is shown in Figure 1.
Our implementation of the kernel proposed by Apetrei et al. differs only in that we store explicit parent pointers per node, the sum of all points in the current subtree (used during optimization), and force an explicit synchronization of the global memory write accesses as outlined in line 10 of Algorithm 2. This is required as Nvidia GPUs use a weakly-ordered memory model. The order in which a thread writes data to global (or shared) memory is not necessarily the order in which the data written by another thread is observed. Depending on the graphics card used (memory read/write ordering is different in different architectures), this leads to nondeterministic behaviour and invalid hierarchies since the pointers and keys set by the find-Parent method in line 9 are read directly in the next iteration during bottom up traversal. For implementation details of the listed methods, we refer to [Ape14].

Tree optimization
By using Morton codes for fast hierarchy generation, however, the space is discretized into a grid. Depending on the data set size and Morton code resolution, the point density and distribution within the data set, it may occur that many points share the same Morton key and are thus placed in the same leaf node, while the majority of the remaining leaves contain only one or very few data points. In terms of data-parallel processing this effect is adverse as it causes diverging threads during a tree traversal and thus a performance drop.
In order to alleviate this problem, we try to create a more shallow hierarchy with leaf nodes of ideally equal data density, as outlined in Figure 2. This is achieved by a second bottom-up traversal of the initial LBVH. At each inner node we decide whether to reduce it into a leaf node (depending on the data density of the left and right children) and continue upwards, or to do nothing and stop the traversal. Merged nodes, that are no longer needed are flagged accordingly. To prevent race conditions between two threads coming from a left and right subtree, only one thread is allowed to collapse and continue (see line 6 in Algorithm 3). The procedure is outlined in Algorithm 3. This way we incrementally fuse spatially related parts of the data set without destroying the underlying tree and at the same time can reuse the pre-calculated bounding volumes, which results in extremely fast processing. A visual example of this procedure applied to the tree in Figure 1 is shown in Figure 3. During the bottom-up traversal a heuristic ϕ decides whether an internal node becomes a leaf node. The necessary pointer adjustments are then performed by the makeLeaf(...)-method. In the following both are described in detail: Collapse heuristic. The heuristic ϕ(node) decides whether the current node becomes a leaf node or not. We use a very simple point = #points in current volume (AABB) = threshold points in volume.
This is trivial as we computed and stored the number of total contained points in each node during initial hierarchy buildup.

MakeLeaf.
The key to an efficient hierarchy adjustment is in two aspects: First, during hierarchy construction we temporarily store the number of contained points in each internal node by adding the primitive number of left and right children during bottom up traversal. Second, the data set items are sorted according to their Morton code in memory. An internal node is thus easily turned into a leaf node by simply replacing it with the leftmost leaf node of its subtree.
Since we start the bottom up traversal at the tree leaves, each thread needs to remember the leaf-id it came from. The make-Leaf(...) method then just adjusts pointers, the bounding box and the primitive count, as outlined in Algorithm 4 and visualized on the right. In this example, the depicted leaf with number 5 will contain all stored primitives of the leaves 5 and 6 of the initial tree.

Tree compaction
Due to the fact that entire subtrees collapse into a single leaf node, the memory layout of the LBVH fragments. This impacts query performance, which is why we compact it in a final phase to allow better coalescing memory accesses. As usually only a few parts of the tree are removed and data ordering in memory does not necessarily have to be preserved, we use a highly modified parallel in-place compaction [DMG10]. This allows us to avoid a complete duplicate of the hierarchy and also maximizes memory throughput. It would also be possible to perform a simple compaction by copying the tree to a new memory location and adjusting the pointers on the fly. However, this includes nearly doubling the required memory for the hierarchy and requires expensive de-/allocations. We opted for the in-place method to achieve the fastest possible compaction and a low memory footprint.

Neighbourhood query
The main problem with hierarchical traversal-methods on a GPU is that only the most coherent part of the operations near the tree root are accelerated (everything cached, nearly no thread divergencies). Consequently, a multitude of highly incoherent per-thread workloads have to be processed.
In the following, two key aspects, which lead to increased query performance when combined, are discussed in more detail: traversal scheme and register based priority queue. We will also briefly explain how to modify our approach to allow a radius and an approximate k nearest neighbour search.

Backtracking traversal
Usually, on the CPU side heap-based traversal strategies are preferred, which use the quadratic distance of a bounding volume to the query point as priority key. On a GPU, however, this heap would have to be kept in global or shared device memory. Due to the necessary data-dependent heap updates this leads to random and irregular memory movements. The resulting latencies, warp divergencies and/or bank conflicts then cause the SIMD units to not Algorithm 4. Register Priority Queue being saturated. For this reason, we opted for a backtracking approach [HDW*11]. The main difference between a backtracking or heap-based traversal is that more nodes need to be visited, compared to using a latency-heavy binary heap of nodes not yet visited. In principal, this is the same as a depth-first search while using the information of the current found nearest neighbours to decide whether a subtree needs to be traversed or not. For further implementation details, the reader is referred to the supplemental material.

Register based priority queue
To keep track of the currently found nearest neighbours usually a second (maximum-) heap is used. For small k, a CPU can usually keep the entire heap in L1 cache, which enables extremely low latency and high bandwidth. However, as mentioned above, heaps generally do not show good data parallelism on GPUs.
We were inspired by the idea of utilizing registers for sorting network primitives on the GPU as proposed by Johnson et al. [JDJ17]. Our approach differs in that we use a simple array in device register memory for our currently found neighbours and keep it sorted using insertion sort. For different kNN sizes we use a compile-time unrolled insertion sort as shown in Algorithm 5. Consequently the compiler can create the code directly and we do not have to provide complex sorting networks for every possible array size and also emit less instructions. Each time, before inserting a new point, we test whether its distance is smaller than the current largest in the heap and insert it only if this is the case. Therefore it can safely be overwritten.
With this approach we benefit from vector parallelism, extremely low memory latency and easy implementation. To enable the CUDA compiler into keeping a sorted list in register only, everything must be known at compile time and enough device registers must be available. During runtime we then choose the appropriate kernel.

Radius search
Closely related to the kNN search is the radius search, which returns all nearest neighbours within a specified search radius. Our approach can be easily modified to use this radius as the abort criterion during traversal. The only change to the backtracking algorithm is, that elements which are smaller than or equal to the current search radius are always inserted. The traversal is aborted if no subtree within the search radius is available. This is only limited by the fact, that the size of the kNN heap must be set in advance and remains fixed during the query.

Approximate kNN
If an exact kNN search is not necessary and to further accelerate our approach, our method can be easily extended to support an approximate k nearest neighbour search. This can be achieved by setting a limit for the number of visited nodes during the query phase. If this threshold is exceeded, the current traversal is aborted.

Algorithm parameters
Our approach therefore only depends on two selectable parameters, which influence hierarchy quality and thus LBVH construction and query runtime. In the following, we discuss their effects in detail:

MMorton code resolution
The Morton code resolution (number of quantization bits per dimension) affects the initial bin size of the LBVH leaves. The higher the resolution, the fewer points will fall into the same bin, which increases size and build time of the initial LBVH and more subtrees have to be merged in the subsequent optimization phase. But with very inhomogeneous distributed point clouds or large-scale 3D scans with fixed scanning positions, it often occurs that an extremely high number of points collide into the same Morton cell, causing extremely unbalanced workloads. Per Node point-threshold The selected point threshold during the optimization step influences the final hierarchy and thus the query-and optimization runtime. A small threshold per leaf node leads to more visited tree-nodes and therefore to longer traversal times and thread divergencies. A too large threshold results in a major part of the runtime during traversal being spent on pointless sorting of found points, which are later discarded anyway.

Results
To ensure a performance analysis that is as extensive as possible, we used both synthetic and real-world data sets from different areas of computer vision and graphics in the following tests. The selected data-sets, shown in Figure 4, provide a broad range of distribution patterns: from uniformly to extremely irregularly distributed (e.g. terrestrial laser scan with extreme densities in close proximity, as well as widely scattered individual measurements).  All measurements were performed on an AMD Ryzen TM 7 2700X CPU @ 3.7 GHz, 32 GB RAM with a NVIDIA GeForce TM GTX 2080TI, running under Linux 5.4.14 with NVIDIA driver version 440.44. We implemented and compiled our hierarchy construction and traversal algorithm with CUDA 10.2.

Hierarchy optimization analysis
In the following we discuss the parameter selection of our approach in detail and show that all presented optimizations contribute to the performance of our approach.

Morton code resolution:
To achieve a good initial LBVH quality we tested different quantization resolutions and measured algorithm runtime, as shown in Figure 5. As expected, low resolutions are suitable for small point clouds as this reduces sorting overhead and creates a smaller hierarchy. For larger point clouds or severely inhomogeneous distributed ones, higher quantization resolutions are absolutely necessary. Due to massive differences between the point  [Ape14] vs. our optimization step enabled using two different thresholds. Our proposed optimization step delivers faster and more consistent runtimes throughout the entire benchmark data set.
clouds in terms of their properties and characteristics, there are significant runtime differences visible. For simplicity, we chose a quantization resolution of 10 bits per dimension for all point clouds. If they are larger than 5 * 10 6 , we use 17 bits. This results in relatively balanced and favourable initial hierarchies, which can cope with very inhomogeneous data and can be optimized in the next step. Due to word size restrictions, a 64 bit key was used in these cases. Per node point threshold: To show the improvement of our optimization step, we compared the query performance using the initial hierarchy of Apetrei et al. [Ape14] to ours using different thresholds. Figure 6 shows the total runtime of hierarchy construction, with and without the optimization step and kNN query (with k = 16) for each point in the input data set without memory transfer times. This way it can be easily estimated whether the optimization overhead is justified. Furthermore the plot depicts, that the initial hierarchy does not scale well for point clouds with different characteristics. Especially with highly inhomogeneous data sets, there are abrupt runtime outliers. Our proposed optimization improves exactly these negative aspects and delivers faster results throughout the whole benchmark data set. We use a threshold value of 32, since higher values did not lead to any significant improvement. Hierarchy memory compaction As outlined in Section 4.3 we compact the memory layout after the optimization step using an in-place compaction to prevent memory fragmentation. In the following benchmark, we compare the query performance after the hierarchy optimization with a threshold of 32 using no compaction, a default out of place compaction and our used in-place compaction. Figure 7 shows the total runtime of hierarchy construction, optimization, compaction and kNN query (with k = 16) for each point in the input data set without memory transfer times for the three different profiles.
The graph clearly shows that the successive memory compaction step provides an increased global device-memory bandwidth during the query (and compaction) and accelerates the total runtime noticeably. In most cases, apart from a few negligible exceptions, in-place compaction is the fastest option. This allows for larger models and eliminates the need for further expensive memory de-/allocations.

Traversal optimization analysis
In this subsection, we discuss the performance gains we achieved through our register-based heap approach, compared to a binaryheap that resides in global memory and our decision to use a backtracking traversal scheme. For this purpose, we also implemented a binary heap that resides in global device memory.

Register based kNN-Heap
To demonstrate the advantage of sorting and storing the currently found kNN in register memory instead of global device memory, we compared the runtime of both approaches. The results are shown in Figure 8. As before, we compare the total runtime: the sum of hierarchy construction, optimization and a full kNN query (k = 16) for each point without any memory transfer times. The massive advantage of using device register memory for neighbour search only is evident. Traversal-scheme Heapbased traversal strategies are typically used on CPUs, since they expand and examine fewer nodes overall. This generally leads to a faster runtime compared to backtracking approaches. On the GPU, however, due to its possible size, this node heap must be stored and sorted in slow global device memory, which leads to random mem-  ory accesses, low bandwidth and thus longer runtime. We therefore use a backtracking approach. Figure 9 shows that the overhead of additional expanded nodes can be compensated for by the lower memory latency. The decision for a backtracking approach thus delivers a faster result in almost all cases.

Register limit analysis
Two main components limit the scalability of our approach: • The available number of registers is limited (amount differs from GPU to GPU). If more registers are requested than available the compiler spills addresses to global and slow GPU main memory. • The number of compare-and-swap (CAS) operations increases linearly with the size of the requested neighbourhood (k), which generates a significant compute overhead with large k.
Therefore we compared our register-based method with a heapbased approach using global memory. Figure 10 shows the query runtimes of both methods with increasing k. As benchmark point cloud we used a very inhomogeneously distributed large scale city scan. It can be clearly seen that our algorithm runtime approaches more and more the heap-based method with increasing neighbourhood size. However, the runtime difference remains significant. This is due to the fact that each insert operation of the heap leads to many global memory accesses and incoherent thread loads. The small runtime bumps, occurring at k = 24, 49 and 69 also happen with other data-sets. This indicates a significantly different emitted kernel code. We cannot provide measurements beyond a neighbourhood size of 114, as the kernels did not provide correct results on our graphics card with a neighbourhood larger than that, because of local memory corruptions. This indicates that the compiler cannot allocate enough memory and/or register addresses and represents a hard upper limit of our approach.

Runtime break down
In Figure 11a runtime break down of our approach without any memory transfer times is shown. As usual a kNN-query with k = 16 (for each point in the input cloud) was used on four different point clouds from the benchmark set. Our proposed LBVH optimization takes only a fraction of the total runtime. For a data set the size of 14 million points (Lucy from the Stanford 3D Repository), the LBVH can be constructed and optimized in less than 18 milliseconds. A kNN query with a neighbourhood search size of k = 16 for each of the 14 million points thus takes 79 milliseconds on average. This corresponds to a query performance of over 10 5 answered neighbourhood queries per millisecond. Downloading the results of this query alone would take longer than the kernel needs for the calculation. For a runtime break down including all memory transfer times, we refer the reader to the supplemental material.

Kernel Performance Analysis
The main performance limiting factors of k nearest neighbour calculation using trees on a GPU are SIMD efficiency and latency caused by memory access and instruction dependencies. We profiled our query kernel on Apetrei's [Ape14] as well as our proposed BVH, using different combinations of traversal strategies and a global as well as the proposed register-based kNN-heap. Table 1 shows the number of executed instructions per clock cycle, cache efficiency, and the warp-stall reasons.
The analysis clearly shows that each of our optimization decisions contributes to the overall performance improvement, since the issued instructions per clock cycle (IPC) increase significantly. This is especially evident when switching from the global memory to the register-based kNN-heap. Using the optimized BVH the cache efficiency increases massively. In combination with the registerheap and a backtracking strategy the warp-stall reasons shift from memory-to instruction-dependency. Random memory access and memory latency are still very present but significantly decreased. This is only a logical consequence of the way data is now processed. It is mainly fetched from registers and stored again after the instruction has been executed. Our analysis also shows, that the kernel usage and memory bandwidth changes from 5% and 40%, respectively, for Apetreis BVH with heap-traversal and a global kNN-heap to about 70% and 35% using our proposed approach. The massive increase of compute-usage at almost the same memory bandwidth is our main source of performance improvement.

Comparison
We compared our approach to different existing and publicly available methods and also included one approximate algorithm (GPULSH) to show the high query performance of our exact approach. In the following, we give a brief description of the used reference algorithms and the used parameters or adjustments: FAISS is a library for efficient similarity search and clustering of high dimensional dense vectors, that also delivers GPU implementations as drop-in replacements for their CPU equivalents [JDJ17]. Because we want exact results, and no data set training, we build and use the IndexFlatL2 entirely on the GPU that only performs brute-force L2 distance search. KNNCUDA by Garcia et al. [GDB08] is a kNN brute force technique for high-dimensional feature-data, based on distance matrix calculation and an optimized insertion sort method. The knn_cuda_global-kernel was used, as it was the fastest of the three available implementations. Also, due to the massive GPU memory requirements, we had to split the query vector into several parts depending on its size, so that the entire query could be processed. ExactCUDAkNN proposed by Kłusek et al. [KD18], represents another brute force approach, computing the squared distance matrix with a subsequent sort operation to extract the kNN.
BFKNN is the fourth brute force method to compute the kNNs on a GPU [LA15]. Similar to KNNCUDA, we had to split the query vector into several parts to be able to fit everything into GPU main memory. GPUFLANN [ML09] is a nearest neighbour library for high-and low-dimensional data also featuring a GPU implementation that was written by Andreas Mützel. Is based on a top-down constructed k-d tree and a heap-based traversal. GPULSH presents a GPU-based locality sensitive hashing (LSH) algorithm, to perform an approximate kNN search in high dimensional spaces. The used data-structure of Pan et al. [PM11] avoids expensive operations like sort and attempts to reduce the search space by partitioning it into several groups using LSH.
All methods were tested with the same benchmark data set, shown in Figure 4. To ensure a fair comparison, the runtime was averaged over 10 iterations including a preceding device warm-up (to disable a possible GPU power save state). For all methods the total algorithm runtimes were measured, including memory transfer times, but without IO-timings, which required adjustments for some methods.
Additionally, we validated our results with a brute force approach on the CPU. Our implementation yields identical results except in cases in which adjacent distances are equal and the order is undefined. That is, the returned indices are sorted in the increasing order of the corresponding distances. This also applies to the radius search.

kNN-Search runtime comparison
In the following, we compare the runtimes for the kNN search of our approach for different neighbourhood sizes over a wide range of highly diverse point clouds with the aforementioned reference algorithms. For each single point in each point cloud the k nearest neighbours are searched. Figure 12 shows the achieved runtimes. All brute-force algorithms are in some cases more than one order of magnitude slower than the approximate approach (GPULSH), as well as both spatial subdividing methods, which include GPU-FLANN and our approach which provides the fastest results. Our method is mostly independent of data set characteristics and thus point distribution, respectively point density. We achieve an almost linear runtime behaviour depending on the input data set. Compared to the fastest reference algorithm GPUFLANN we achieve an average speedup of about 3.3. For a detailed speedup analysis we refer the reader to the supplemental material.

Query performance
Many parallelized brute-force methods are well suited to quickly answer a few kNN queries in a huge data set. Figure 13 shows the advantage of our method over existing ones even for answering just a single kNN query.
Despite the required LBVH construction and optimization beforehand, our approach delivers a significant speedup of 4.3 on average. ExactCUDAkNN was not able to answer a single query and the reference algorithms KNNCUDA and BFKNN could not provide valid results for all data sets due to internal errors or memory restrictions.

Radius-search runtime comparison
Another very important query for point cloud driven algorithms is to get all closest neighbours of a point within a given radius. We therefore measured the total algorithm runtime to find all nearest neighbours within a search-radius of 0.5%, 1%, and 5% of the current point cloud bounding box extent. As only GPUFLANN and our implementation offer the possibility to launch a radius-search, we configured both algorithms for a maximum neighbourhood count of 64 for a fair comparison. In Figure 14 the achieved runtimes are plotted. Again, our approach consistently delivers a faster runtime over the whole benchmark set and achieves an average speedup of about 3.3, similar to the kNN query. For a detailed speedup analysis we refer the reader to the supplemental material once more.

Further Tests
In our attempts to further improve our approach, we have tested and implemented additional methods, which have not led to any acceleration. Two important methods are discussed in more detail below.

Hilbert codes
As already stated in Section 4.1, we use Morton keys to sort and organize the input data. Instead of Morton keys we also tested Hilbert codes, as they have a better memory locality of spatially close data [MJFS01]. However, this only led to a minimal runtime improvement during the query phase, which was eliminated by the additional overhead of the Hilbert code calculation. The bottleneck here is not the memory access to the raw point-cloud data itself, but the randomized and highly divergent access to the traversed tree data structure.

Tree optimizations
For this reason we tested further LBVH algorithms, which delivered comparable hierarchy construction times but showed a different storage pattern [Kar12] or methods that perform a tree optimization after initial construction [KA13] and [DP15]. However, none of the tested methods achieved a noticeable acceleration either, since each further tree optimization consumed the previously gained time in the later query phase. Figure 14: Runtime comparison of a radius-query of our approach compared to GPUFLANN [ML09]. All other reference methods do not offer a radius query. As search radius we have chosen 0.5%, 1% and 5% of the bounding box diagonal of each data set. For the times shown here, all neighbours of each point in each input data set are searched. Both our approach and GPUFLANN are configured to return a maximum of 64 (sorted) nearest neighbours within the search radius. Our method is on average over three times faster than GPUFLANN. As with the kNN query, GPUFLANN is not able to process large point clouds with large queries, so there is little measurement data missing.

Overlapping Kernel/Memcpy
We also experimented with overlapping query computation and result transfer. But to allow asynchronous memory transfer, pinned host memory must be allocated. In all of our experiments the massive overhead of allocating pinned memory on our test machine was higher than the achieved speedup during the query. However, with a different operating system or a more advanced hardware configuration this could change in the future, and thus bring a notable performance boost.

Conclusion
We have proposed an optimized massively parallel method for extremely fast k nearest neighbour search in point clouds. We proposed a new technique to optimize an existing LBVH and to accelerate the required kNN-heap during the query phase by using a heap, which is directly implemented on hardware registers instead of using slow off-chip memory.
In order to demonstrate the effectiveness of our optimizations, we performed multiple benchmarks on a data set from very different point clouds and created by different acquisition modalities, to obtain an analysis that was as realistic and diversified as possible. Compared to other state-of-the-art algorithms, our benchmarks show that the proposed method is on average about 3.3 times faster than the fastest reference algorithm, and thus offers almost real-time performance even with extremely large data sets.
Due to the extremely fast hierarchy construction and its optimization, our method is also suitable for dynamic scenes, simulations or real-time ICP. It is also robust against extremely inhomogeneously distributed point sets and can easily handle large neighbourhood queries. In addition, it is possible to optimize our approach to specific point cloud properties using the available parameters.
Due to its robustness and high query performance, our approach represents an important advance in many areas where algorithms rely on fast kNN or radius queries.
One of the main limitations of our presented work is the restricted data set size due to the available GPU memory. This could be elim-inated by using out-of-core streaming techniques, which we want to address in future research. Another disadvantage is the restricted neighbourhood size due to the number of existing device registers, which depends on the GPU used.
A possible future project is to develop a better collapse heuristic, which leads to even more homogeneous hierarchies, and thus faster queries.

Supporting Information
Additional supporting information may be found online in the Supporting Information section at the end of the article.