
Pattern Recognition

Volume 43, Issue 6, June 2010, Pages 2351-2358

Fast exact k nearest neighbors search using an orthogonal search tree

https://doi.org/10.1016/j.patcog.2010.01.003

Abstract

The problem of k nearest neighbors (kNN) is to find the nearest k neighbors for a query point from a given data set. In this paper, a novel fast kNN search method using an orthogonal search tree is proposed. The proposed method creates an orthogonal search tree for a data set using an orthonormal basis evaluated from the data set. To find the kNN for a query point from the data set, projection values of the query point onto orthogonal vectors in the orthonormal basis and a node elimination inequality are applied to prune unlikely nodes. For a node that cannot be deleted, a point elimination inequality is further used to reject impossible data points. Experimental results show that the proposed method performs well in finding kNN for query points and always requires less computation time than available kNN search algorithms, especially for a data set with a large number of data points or a large standard deviation.

Introduction

The k nearest neighbors (kNN) problem is to find the nearest k neighbors for a query point from a given data set. This problem occurs in many scientific and engineering applications including pattern recognition [1], object recognition [2], data clustering [3], [4], function approximation [5], vector quantization [6], [7], and pattern classification [8], [9].

The intuitive method of finding the nearest k neighbors for a query point Q from a data set S={X1, X2, …, Xn} of n data points is to compute the n distances between the query point and all data points in the data set. This method is known as the full search algorithm (FSA). In general, the squared Euclidean distance is used to measure the distance between two points: for a query point Q=[q1, q2, …, qd]T of dimension d and a data point Xi=[xi1, xi2, …, xid]T from the data set S, the distance between these two points is defined as

$$D(X_i, Q) = \sum_{j=1}^{d} (x_{ij} - q_j)^2.$$

It is obvious that finding the k nearest neighbors for a query point using the FSA is very time consuming. To reduce the computational complexity of the kNN finding process, many algorithms [10], [11], [12], [13], [14], [15], [16], [17], [18], [19], [20], [21], [22], [23], [24], [25], [26] were proposed. These algorithms can be categorized into two classes.
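For illustration, the FSA baseline described above can be sketched as follows. This is a minimal sketch using the squared Euclidean distance; the function name and the array-based interface are our own choices, not anything specified in the paper.

```python
import numpy as np

def full_search_knn(S, Q, k):
    """Full search algorithm (FSA): compute the squared Euclidean distance
    from the query Q to every point in S and keep the k smallest.

    S: (n, d) array of data points; Q: (d,) query point.
    Returns the indices of the k nearest neighbors and their squared distances.
    """
    diffs = S - Q                                 # (n, d) differences
    dists = np.einsum('ij,ij->i', diffs, diffs)   # n squared distances D(Xi, Q)
    nearest = np.argpartition(dists, k - 1)[:k]   # k smallest, in arbitrary order
    order = np.argsort(dists[nearest])            # sort those k by distance
    return nearest[order], dists[nearest[order]]
```

Every query costs n full distance evaluations of d subtractions, multiplications, and additions each, which is exactly the cost that the tree-based and projection-based methods surveyed below try to avoid.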

An algorithm in the first class creates a search tree that stores the data points in a more efficient way, and the pre-created tree structure is searched in the kNN finding process using a branch and bound search strategy. Fukunaga and Narendra [10] applied a hierarchical clustering technique to decompose the data points. The result of the decomposition is represented by a tree (referred to as a Ball tree), and a branch and bound search algorithm was applied to reduce the number of distance computations. In general, the Ball tree has good performance on high-dimensional problems [11]. The performance of the Ball tree is highly influenced by the clustering algorithm. To further improve the performance of the Ball tree, five Ball tree construction algorithms were presented by Omohundro [12]. Friedman et al. [13] presented a balanced k-dimensional tree (referred to as a k-d tree) based on spatial decomposition, and a refined version of the k-d tree method was given by Sproull [14]. The k-d tree performs better than the Ball tree when the dimension is small. Kim and Park [15] created a multiple-branch search tree using the ordered partition method and a branch and bound search algorithm to find nearest neighbors. Mico et al. [16] proposed a different branch and bound search algorithm that uses a pre-stored distance table to reject more impossible nodes from a pre-created binary tree. McNames [17] introduced a fast kNN search method based on a principal axis search tree (PAT). Wang and Gan [18] integrated the PAT algorithm and projected clusters to reduce the query time for high-dimensional data sets. Chen et al. [19] used a lower-bound tree (LB tree) and a winner-update search method to speed up the nearest neighbors search.

The other class finds the nearest neighbors without using a tree structure. Cheng et al. [20] proposed the min–max method to decrease the number of multiplication operations. Bei and Gray [21] presented the partial distortion search (PDS) algorithm, which allows the early termination of a distance calculation between the query point and a data point. Ra and Kim [22] utilized the equal-average nearest neighbor search algorithm to eliminate impossible data points using the difference between the mean values of the query point and a data point. Tai et al. [23] applied the projection values of data points to eliminate unlikely data points. In reference [24], Nene and Nayar proposed a projection-value-based search method to find nearest neighbors within a given distance around a query point. Lu et al. [25] used the mean value, variance, and norm of a data point to reject unlikely data points. Lai et al. [26] applied projection values and the triangle inequality to speed up the kNN finding process (referred to as FKNNUPTI).

Among the available methods, the PAT algorithm consistently performs well on many types of benchmark data sets [17], [18], [26]. Using the PAT method, a principal axis search tree is created off-line according to the projection values of data points onto the principal axes of tree nodes. The principal axis of a node is evaluated from the data points in the node using the principal component analysis (PCA) method [17], [28].

The kNN finding process for a query point using the PAT method deletes impossible nodes from the tree using the projection values of boundary points and a node elimination criterion. Once a node is deleted, the data points belonging to that node cannot be among the kNN of the query point, and distance calculations between the query point and those data points can be skipped. If a leaf node cannot be rejected, the distances between the query point and all data points in that leaf node must be computed. In general, more than 90% of the distance calculations can be saved using the PAT method [26].
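The exact node elimination criterion is derived in [17]; the sketch below only illustrates the general principle that such projection-based pruning relies on, namely that the distance along a single unit-length axis never exceeds the full Euclidean distance. All identifiers are our own, hypothetical choices.

```python
def can_eliminate_node(q_proj, node_lo, node_hi, kth_best_sq_dist):
    """Illustrative projection-based node elimination test (a sketch, not
    the exact criterion of [17]).  q_proj is the query's projection value
    onto the node's principal axis; [node_lo, node_hi] is the interval of
    projection values covered by the node's data points.  For a unit vector
    v, |v . (X - Q)| <= ||X - Q||, so the gap between q_proj and the
    interval is a lower bound on the distance to any point in the node; if
    its square already exceeds the current k-th smallest squared distance,
    the whole node can be pruned.
    """
    if q_proj < node_lo:
        gap = node_lo - q_proj
    elif q_proj > node_hi:
        gap = q_proj - node_hi
    else:
        gap = 0.0                      # query projects inside the interval
    return gap * gap > kth_best_sq_dist
```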

To find the kNN for a query point using the PAT method, boundary points and their projection values onto the principal axes of nodes must be evaluated. For a large data set or a deep search tree, the computation time for evaluating this information cannot be ignored. To reduce this computation time and to eliminate more impossible data points, a novel method is presented in this paper. The proposed method creates an orthogonal search tree (OST) for a data set using an orthonormal basis evaluated from the data set and a search tree construction process similar to that of the PAT method. In the proposed method, an orthogonal vector chosen from the orthonormal basis, instead of the principal axis of the node, is used to divide the data points of a node into several groups. For a node in an OST, the orthogonal vector selected by the node is perpendicular to the orthogonal vectors chosen by the ancestors of the node. Since the orthogonal vectors chosen by the nodes on a path of an OST are mutually orthogonal and their number is small, our proposed method requires no boundary points and only a small amount of computation time for evaluating projection values in the kNN finding process. To further reduce the computational load, a point elimination inequality is also presented to reject unlikely data points that cannot be deleted using the node elimination inequality. By applying the orthonormal basis and these two inequalities, our method effectively reduces computation time compared with five kNN search methods [10], [13], [17], [19], [26] on data sets drawn from different kinds of data distributions and on real data.
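The precise node and point elimination inequalities are developed in Section 3; the sketch below only illustrates the geometric fact they build on: for mutually orthonormal vectors, the sum of squared projection differences is a lower bound (Bessel's inequality) on the squared Euclidean distance. The function name and calling convention are hypothetical.

```python
import numpy as np

def projection_lower_bound(x_projs, q_projs):
    """For mutually orthonormal vectors V1, ..., Vm (m <= d),
    sum_i (Vi.X - Vi.Q)^2 <= ||X - Q||^2 by Bessel's inequality.
    x_projs and q_projs hold the projection values of a data point X and of
    the query Q onto those vectors, so the returned value is a lower bound
    on the squared Euclidean distance D(X, Q)."""
    diff = np.asarray(x_projs) - np.asarray(q_projs)
    return float(np.dot(diff, diff))

# Hypothetical usage: skip the exact distance whenever the bound is too large.
# if projection_lower_bound(stored_projs[i], query_projs) > kth_best_sq_dist:
#     continue
```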

The rest of this paper is organized as follows. In Section 2, the PAT algorithm is briefly reviewed. Our proposed method is presented and described in detail in Section 3. Experimental results and conclusions are given in Sections 4 and 5, respectively.

Section snippets

The principal axis search tree algorithm

The PAT algorithm [17] consists of two processes: the principal axis search tree construction process and the k nearest neighbors searching process. The construction process partitions a data set into distinct subsets using the PCA method and a tree structure. The kNN searching process finds the k nearest neighbors for a query point from the constructed PAT. These two processes are introduced in the following sections.

The orthogonal search tree algorithm

To reduce the computation time spent on evaluating boundary points and projection values in the kNN searching process for a query point, we use the orthonormal basis of a data set to build the search tree. In this paper, the orthonormal basis of a data set is evaluated using PCA [28] and consists of d orthogonal vectors, where d is the dimension of the data set.
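One plausible way to obtain such a basis is sketched below, assuming the number of data points is at least the dimension d; the use of an SVD of the mean-centered data and the function name are our own choices, not details taken from the paper.

```python
import numpy as np

def orthonormal_basis_pca(S):
    """Compute an orthonormal basis of R^d from the data set S via PCA.

    S: (n, d) array with n >= d.  The rows of the returned (d, d) array are
    mutually orthonormal vectors (the principal axes of S), ordered by
    decreasing variance of the data along them.
    """
    centered = S - S.mean(axis=0)                        # remove the mean
    # The right singular vectors of the centered data are its principal axes.
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    return Vt
```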

Let O={V1, V2, …, Vd} be the orthonormal basis of a data set and Vi denote the ith orthogonal vector in O, X=[x1, x2, …, xd]T and Y=[y1, y

Experimental results

To evaluate the performance of the proposed algorithm, the uniform Markov source [26], [29], the auto-correlated data [19], the clustered Gaussian data [19], [27], one set of real images [26], and the Statlog data set from reference [30] are used. The proposed algorithm OST is compared with the full search algorithm (FSA), Ball tree [10], k-d tree [13], PAT algorithm [17], LB tree [19], and FKNNUPTI method [26] in terms of the average number of distance calculations per data point and total

Conclusions

In this paper, we have presented a novel fast exact kNN search algorithm using the orthogonal search tree to speed up the kNN searching process. The proposed method creates an orthogonal search tree for a data set using an orthonormal basis evaluated from the data set. To find the kNN for a query point from an orthogonal search tree, projection values of the query point onto orthogonal vectors and a node elimination inequality are applied for pruning unlikely nodes. For a node, which cannot be

Acknowledgement

This work was supported by a grant from the National Science Council of Taiwan, ROC, under Grant no. NSC-98-2221-E-343-008.

References (31)

  • A. Gersho et al.

    Vector Quantization and Signal Compression

    (1991)
  • T.M. Cover et al.

    Nearest neighbor pattern classification

    IEEE Transactions on Information Theory

    (1967)
  • K. Fukunaga et al.

    A branch and bound algorithm for computing k-nearest neighbors

    IEEE Transactions on Computers

    (1975)
  • T. Liu et al.

    New algorithms for efficient high-dimensional nonparametric classification

    Journal of Machine Learning Research

    (2006)
  • S.M. Omohundro, Five balltree construction algorithms, Technical Report 89-063,...

About the Author – YI-CHING LIAW was born in Taiwan, in 1970. He received his B.S., M.S., and Ph.D. degrees in Information Engineering and Computer Science, all from Feng-Chia University, Taiwan, in 1992, 1994, and 2004, respectively. From 1999 to 2004, he was an engineer with the Industrial Technology Research Institute, Hsinchu, Taiwan. In 2005, he joined the Department of Computer Science and Information Engineering, Nanhua University, Chiayi, Taiwan, as an assistant professor. Since 2008, he has been an associate professor in the same department. His current research interests are in data clustering, fast algorithms, image processing, video processing, and multimedia systems.

About the Author – CHIEN-MIN WU was born in Taiwan, ROC, in 1966. He received the B.S. degree in automatic control engineering from Feng-Jea University, Taichung, Taiwan, in 1989, the M.S. degree in electrical and information engineering from Yu-Zu University, Chung-Li, Taiwan, in 1994, and the Ph.D. degree in electrical engineering from National Chung Cheng University, Chia-Yi, Taiwan, in 2004. In July 1994, he joined the Technical Development Department, Philips Ltd. Co., where he was a Member of the Technical Staff. Currently, he is also a faculty member of the Department of Computer Science and Information Engineering, NanHua University, Dalin, Chia-Yi, Taiwan, ROC. His current research interests include ad hoc wireless network protocol design, IEEE 802.11 MAC protocols, and fast algorithms.

About the Author – MAW-LIN LEOU was born in Taiwan, in 1964. He received the B.S. degree in communication engineering from National Chiao-Tung University, Taiwan, in 1986, the M.S. degree in electrical engineering from National Taiwan University, Taiwan, in 1988, and the Ph.D. degree in electrical engineering from National Taiwan University in 1999. From 1990 to 1992, he was with the Telecommunication Labs, Ministry of Transportation and Communications, Taiwan, as an Assistant Researcher. From 1992 to 1999, he was an instructor in the Department of Electronic Engineering, China Institute of Technology, Taiwan. From 1999 to 2005, he was an associate professor in the Department of Electronic Engineering, Nan-Jeon Institute of Technology, Taiwan. Since 2005, he has been a faculty member in the Department of Computer Science and Information Engineering, NanHua University, Taiwan. His current research interests include adaptive arrays, adaptive signal processing, wireless communication, and fast algorithms.
