Fast exact k nearest neighbors search using an orthogonal search tree
Introduction
The k nearest neighbors (kNN) problem is to find the k nearest neighbors of a query point in a given data set. This problem occurs in many scientific and engineering applications, including pattern recognition [1], object recognition [2], data clustering [3], [4], function approximation [5], vector quantization [6], [7], and pattern classification [8], [9].
The intuitive method of finding the k nearest neighbors for a query point Q in a data set S={X1, X2, …, Xn} of n data points is to compute the distances between the query point and all n data points. This method is known as the full search algorithm (FSA). In general, the squared Euclidean distance is used to measure the distance between two points: for a query point Q=[q1, q2, …, qd]^T of dimension d and a data point Xi=[xi1, xi2, …, xid]^T from data set S, the distance between the two points is defined as

d(Q, X_i) = \sum_{j=1}^{d} (q_j - x_{ij})^2.

Clearly, finding the k nearest neighbors of a query point with the FSA is very time consuming. To reduce the computational complexity of the kNN finding process, many algorithms [10], [11], [12], [13], [14], [15], [16], [17], [18], [19], [20], [21], [22], [23], [24], [25], [26] were proposed. These algorithms can be categorized into two classes.
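The full search algorithm above can be sketched in a few lines (a minimal NumPy illustration; the function name is ours):

```python
import numpy as np

def fsa_knn(S, Q, k):
    """Full search: compute the squared Euclidean distance from the
    query Q to every point in S and keep the k smallest."""
    d2 = np.sum((S - Q) ** 2, axis=1)   # n squared distances
    idx = np.argsort(d2)[:k]            # indices of the k nearest points
    return idx, d2[idx]
```

Every query costs n distance computations, which is exactly the O(nd) cost that the tree-based methods surveyed below try to avoid.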
An algorithm in the first class stores the data points in a search tree, and the pre-built tree is then traversed in the kNN finding process using a branch and bound search strategy. Fukunaga and Narendra [10] applied a hierarchical clustering technique to decompose the data points; the result of the decomposition is represented by a tree (referred to as a Ball tree), and a branch and bound search algorithm was applied to reduce the number of distance computations. In general, the Ball tree performs well on high-dimensional problems [11], but its performance is strongly influenced by the clustering algorithm. To further improve the performance of the Ball tree, five Ball tree construction algorithms were proposed by Omohundro [12]. Friedman et al. [13] presented a balanced k-dimensional tree (referred to as a k-d tree) based on spatial decomposition, and a refined version of the k-d tree method was given by Sproull [14]. The k-d tree performs better than the Ball tree when the dimension is small. Kim and Park [15] created a multiple-branch search tree using the ordered partition method and a branch and bound search algorithm to find nearest neighbors. Mico et al. [16] proposed a different branch and bound search algorithm that uses a pre-stored distance table to reject more impossible nodes of a pre-built binary tree. McNames [17] introduced a fast kNN search method based on a principal axis search tree (PAT). Wang and Gan [18] integrated the PAT algorithm with projected clusters to reduce the query time for high-dimensional data sets. Chen et al. [19] used a lower-bound tree (LB tree) and a winner-update search method to speed up the nearest neighbors search.
Algorithms in the other class find the nearest neighbors without using a tree structure. Cheng et al. [20] proposed the min–max method to decrease the number of multiplication operations. Bei and Gray [21] presented the partial distortion search (PDS) algorithm, which allows early termination of the distance calculation between the query point and a data point. Ra and Kim [22] utilized the equal-average nearest neighbor search algorithm to eliminate impossible data points using the difference between the mean values of the query point and a data point. Tai et al. [23] applied the projection values of data points to eliminate unlikely data points. Nene and Nayar [24] proposed a projection-value-based search method to find nearest neighbors within a given distance of a query point. Lu et al. [25] used the mean value, variance, and norm of a data point to reject unlikely data points. Lai et al. [26] applied projection values and the triangle inequality to speed up the kNN finding process (referred to as FKNNUPTI).
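The early-termination idea behind PDS can be sketched as follows (an illustrative sketch, not the authors' implementation; the function name is ours):

```python
def pds_distance(x, q, best_so_far):
    """Partial distortion search: accumulate the squared distance term
    by term and abort as soon as the partial sum reaches the best
    distance found so far, since the point can then no longer be a
    nearest neighbor."""
    acc = 0.0
    for xj, qj in zip(x, q):
        acc += (xj - qj) ** 2
        if acc >= best_so_far:
            return None  # rejected early, full distance never computed
    return acc
```

Because the squared distance is a sum of non-negative terms, the partial sum can only grow, which is what makes the early exit safe.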
Among the available methods, the PAT algorithm consistently performs well on many types of benchmark data sets [17], [18], [26]. In the PAT method, a principal axis search tree is built off-line according to the projection values of data points onto the principal axes of the tree nodes. The principal axis of a node is computed from the data points in the node using the principal component analysis (PCA) method [17], [28].
The kNN finding process for a query point using the PAT method prunes impossible nodes from the tree using the projection values of boundary points and a node elimination criterion. Once a node is pruned, the data points belonging to that node cannot be among the k nearest neighbors of the query point, so the distance calculations between the query point and those data points can be skipped. If a leaf node cannot be rejected, the distances between the query point and all data points in that leaf node must be computed. In general, more than 90% of the distance calculations can be saved using the PAT method [26].
To find the kNN for a query point using the PAT method, boundary points and their projection values onto the principal axes of nodes must be evaluated. For a large data set or a deep search tree, the computation time for evaluating this information may not be negligible. To reduce that computation time and to eliminate more impossible data points, a novel method is presented in this paper. The proposed method creates an orthogonal search tree (OST) for a data set using an orthonormal basis evaluated from the data set and a tree construction process similar to that of the PAT method. In the proposed method, an orthogonal vector is chosen from the orthonormal basis for each node, instead of the node's principal axis, to divide the data points of the node into several groups. For a node in an OST, the orthogonal vector selected by the node is perpendicular to the orthogonal vectors chosen by its ancestors. Since the orthogonal vectors chosen by the nodes along a path of an OST are mutually orthogonal and few in number, the proposed method requires no boundary points and spends only a little computation time on evaluating projection values in the kNN finding process. To further reduce the computation load, a point elimination inequality is also presented to reject unlikely data points that cannot be deleted using the node elimination inequality. By applying the orthonormal basis and these two inequalities, the proposed method reduces computation time effectively in comparison with five kNN search methods [10], [13], [17], [19], [26] on data sets from different kinds of data distributions and on real data.
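The pruning principle at work can be illustrated with a small sketch: for mutually orthonormal vectors, the sum of squared projection differences is a lower bound on the true squared distance, so any point whose bound already exceeds the current k-th best distance can be rejected without a full distance computation. This is an illustrative sketch under that general property, not the paper's exact inequalities; the function names are ours.

```python
import numpy as np

def projection_lower_bound(q, x, basis):
    """For orthonormal rows in `basis`, sum_i (v_i . (q - x))^2 is a
    lower bound on ||q - x||^2 (with equality when all d vectors of
    the basis are used)."""
    p = basis @ (q - x)
    return float(np.sum(p ** 2))

def can_prune(q, x, basis, kth_best_sq_dist):
    # Reject x without a full distance computation if the projection
    # lower bound already exceeds the current k-th smallest squared distance.
    return projection_lower_bound(q, x, basis) > kth_best_sq_dist
```

The fewer orthogonal vectors used, the cheaper the bound, at the cost of looser pruning; using all d recovers the exact distance.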
The rest of this paper is organized as follows. In Section 2, the PAT algorithm is briefly reviewed. Our proposed method is presented and described in detail in Section 3. Experimental results and conclusions are given in Sections 4 and 5, respectively.
The principal axis search tree algorithm
The PAT algorithm [17] includes two processes: the principal axis search tree construction process and the k nearest neighbors searching process. The construction process partitions a data set into distinct subsets using the PCA method and a tree structure. The kNN searching process finds the k nearest neighbors for a query point from the constructed PAT. These two processes are introduced in the following sections.
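A compact sketch of the construction step just described — partitioning a node's points by their projections onto its principal axis — follows. This is illustrative only (the actual PAT algorithm maintains additional information, such as boundary points); the function names are ours.

```python
import numpy as np

def principal_axis(points):
    """Principal axis = eigenvector of the (scatter) covariance matrix
    with the largest eigenvalue, i.e. the PCA direction of maximum
    variance of the node's points."""
    centered = points - points.mean(axis=0)
    cov = centered.T @ centered
    w, v = np.linalg.eigh(cov)   # eigenvalues in ascending order
    return v[:, -1]              # eigenvector of the largest eigenvalue

def split_node(points, n_children=2):
    """Sort the points by projection onto the principal axis and cut
    the ordered list into contiguous groups, one per child node."""
    axis = principal_axis(points)
    order = np.argsort(points @ axis)
    return np.array_split(points[order], n_children)
```

Applying `split_node` recursively to each child until nodes are small enough yields the tree; projections along the axis then give the interval each child occupies.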
The orthogonal search tree algorithm
To reduce the computation time spent on evaluating boundary points and projection values in the kNN searching process for a query point, we use the orthonormal basis of a data set to build the search tree. In this paper, the orthonormal basis of a data set is evaluated using PCA [28] and consists of d orthogonal vectors, where d is the dimension of the data set.
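Such a basis can be obtained, for example, as the eigenvectors of the sample covariance matrix (a sketch of one standard way to compute it; the paper's exact procedure may differ):

```python
import numpy as np

def orthonormal_basis(data):
    """Return d mutually orthonormal vectors (one per row) for a data
    set, computed via PCA: the eigenvectors of the sample covariance
    matrix, which are orthonormal because the matrix is symmetric."""
    centered = data - data.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    _, vecs = np.linalg.eigh(cov)   # columns are orthonormal eigenvectors
    return vecs.T                    # one orthogonal vector per row
```

By construction the rows satisfy Vi · Vj = 0 for i ≠ j and ||Vi|| = 1, which is the property the OST relies on.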
Let O={V1, V2, …, Vd} be the orthonormal basis of a data set and Vi denote the ith orthogonal vector in O, X=[x1, x2, …, xd]T and Y=[y1, y
Experimental results
To evaluate the performance of the proposed algorithm, the uniform Markov source [26], [29], the auto-correlated data [19], the clustered Gaussian data [19], [27], one set of real images [26], and the Statlog data set from reference [30] are used. The proposed algorithm OST is compared with the full search algorithm (FSA), Ball tree [10], k-d tree [13], PAT algorithm [17], LB tree [19], and FKNNUPTI method [26] in terms of the average number of distance calculations per data point and total
Conclusions
In this paper, we have presented a novel fast exact kNN search algorithm using the orthogonal search tree to speed up the kNN searching process. The proposed method creates an orthogonal search tree for a data set using an orthonormal basis evaluated from the data set. To find the kNN for a query point from an orthogonal search tree, projection values of the query point onto orthogonal vectors and a node elimination inequality are applied for pruning unlikely nodes. For a node, which cannot be
Acknowledgement
This work was supported by the National Science Council of Taiwan, ROC, under Grant no. NSC-98-2221-E-343-008.
References (31)
Improvement of the fast exact pairwise-nearest-neighbor algorithm, Pattern Recognition (2009)
A near pattern-matching scheme based on principal component analysis, Pattern Recognition Letters (1995)
Image restoration of compressed image using classified vector quantization, Pattern Recognition (2002)
Boosting k-nearest neighbor classifier by means of input space projection, Expert Systems with Applications (2009)
A fast branch and bound nearest neighbor classifier in metric spaces, Pattern Recognition Letters (1996)
Fast and versatile algorithm for nearest neighbor search based on a lower bound tree, Pattern Recognition (2007)
Fast k-nearest-neighbor search based on projection and triangular inequality, Pattern Recognition (2007)
Pattern Recognition (2003)
Visual learning and recognition of 3D objects from appearance, International Journal of Computer Vision (1995)
Optimal cluster preserving embedding of nonmetric proximity data, IEEE Transactions on Pattern Analysis and Machine Intelligence (2003)
Vector Quantization and Signal Compression
Nearest neighbor pattern classification, IEEE Transactions on Information Theory
A branch and bound algorithm for computing k-nearest neighbors, IEEE Transactions on Computers
New algorithms for efficient high-dimensional nonparametric classification, Journal of Machine Learning Research
About the Author – YI-CHING LIAW was born in Taiwan, in 1970. He received his B.S., M.S., and Ph.D. degrees in Information Engineering and Computer Science all from Feng-Chia University, Taiwan, in 1992, 1994, and 2004, respectively. From 1999 to 2004, he was an engineer with Industrial Technology Research Institute, Hsinchu, Taiwan. In 2005, he joined the Department of Computer Science and Information Engineering, Nanhua University, Chiayi, Taiwan as an assistant professor. Since 2008, he has been an associate professor at the same department. His current research interests are in data clustering, fast algorithm, image processing, video processing, and multimedia system.
About the Author – CHIEN-MIN WU was born in Taiwan, ROC, in 1966. He received the B.S. degree in automatic control engineering from Feng-Jea University, Taichung, Taiwan, in 1989, the M.S. degree in electrical and information engineering from Yu-Zu University, Chung-Li, Taiwan, in 1994, and the Ph.D. degree in electrical engineering from National Chung Cheng University, Chia-Yi, Taiwan, in 2004. In July 1994, he joined the Technical Development Department, Philips Ltd. Co., where he was a Member of the Technical Staff. Currently, he is a faculty member of the Department of Computer Science and Information Engineering, NanHua University, Dalin, Chia-Yi, Taiwan, ROC. His current research interests include ad hoc wireless network protocol design, IEEE 802.11 MAC protocols, and fast algorithms.
About the Author – MAW-LIN LEOU was born in Taiwan, in 1964. He received the B.S. degree in communication engineering from National Chiao-Tung University, Taiwan, in 1986, the M.S. degree in electrical engineering from National Taiwan University, Taiwan, in 1988, and the Ph.D. degree in electrical engineering from National Taiwan University in 1999. From 1990 to 1992, he was with the Telecommunication Labs, Ministry of Transportation and Communications in Taiwan, as an Assistant Researcher. From 1992 to 1999, he was an instructor in the Department of Electronic Engineering, China Institute of Technology, Taiwan. From 1999 to 2005, he was an associate professor in the Department of Electronic Engineering, Nan-Jeon Institute of Technology, Taiwan. Since 2005, he has been a faculty member in the Department of Computer Science and Information Engineering, NanHua University, Taiwan. His current research interests include adaptive arrays, adaptive signal processing, wireless communication, and fast algorithms.