Approximate Nearest Neighbor: Towards Removing the Curse of Dimensionality

We present two algorithms for the approximate nearest neighbor problem in high-dimensional spaces. For data sets of size n living in ℝ^d, the algorithms require space that is only polynomial in n and d, while achieving query times that are sub-linear in n and polynomial in d. We also show applications to other high-dimensional geometric problems, such as the approximate minimum spanning tree.


Introduction
The nearest neighbor (NN) problem is defined as follows: given a set P of n points in a metric space defined over a set X with distance function D, preprocess P so as to efficiently answer queries asking for the point in P closest to a query point q ∈ X. A particularly interesting case is that of the d-dimensional Euclidean space, where X = ℝ^d under some ℓ_s norm. This problem is of major importance in several areas, such as data compression, databases, data mining, information retrieval, image and video databases, machine learning and signal processing. The diverse interest in the problem stems from its wide applicability: many large data sets consist of objects that can be represented as vectors of features (i.e., points in ℝ^d); in such cases, finding an object similar to a given one can be achieved by finding a nearest neighbor in the feature space. The number of features (i.e., the dimensionality) ranges anywhere from tens to millions. For example, one can represent a 1000 × 1000 image as a vector in a 1,000,000-dimensional space, one dimension per pixel.
The problem and its variants are among the prototypical questions in computational geometry. Originally posed in the 1960s by Minsky and Papert ([MP69], p. 222), it has been the subject of substantial research efforts since then. Many efficient solutions have been discovered for the case when the points lie in a space of constant dimension. For example, if the points lie in the plane, the nearest neighbor problem can be solved with O(log n) time per query, using only O(n) storage [SH75, LT80].
Unfortunately, as the dimension grows, these algorithms become less and less efficient. More specifically, their space or time requirements grow exponentially in the dimension. In particular, the nearest neighbor problem has a solution with O(d^{O(1)} log n) query time, but using n^{O(d)} space ([Mei93], building on [Cla88]). This is partly because the Voronoi decomposition of P, i.e., the decomposition of ℝ^d into cells such that all points within each cell have the same nearest neighbor in P, has complexity n^{Θ(d)}. Alternatively, if one insists on linear (or near-linear) storage, the best known running time bound even for random point sets is of the form min(2^{O(d)}, dn), which is essentially linear in n even for moderate d. Worse still, the exponential dependence of space and/or time on the dimension (called the "curse of dimensionality") has been observed in practice as well [WSB98].
The lack of success in removing the exponential dependence on the dimension led many researchers to conjecture that no efficient solutions exist for these problems when the dimension is sufficiently large (e.g., see [MP69]). At the same time, it raised the question of whether it is possible to remove the exponential dependence on d if we allow the answers to be approximate. Specifically, in the c-approximate nearest neighbor problem, instead of reporting the point p closest to q, the algorithm is allowed to report any point within distance c times the distance from q to p, for some approximation factor c > 1. The appeal of this approach is that, in many cases, an approximate nearest neighbor is almost as good as the exact one. In particular, if the distance measure accurately captures some notion of quality, then small differences in the distance should not matter much. Moreover, an efficient approximation algorithm can be used to solve the exact nearest neighbor problem, simply by enumerating all approximate nearest neighbors and returning the closest point encountered.
Our results. In this paper, we provide several results for the approximate nearest neighbor problem under the ℓ_s norms for s ∈ [1, 2]. For concreteness, we start with the case of s = 1. Let c = 1 + ε, where ε > 1/n. We show:

1. A deterministic data structure for (1 + γ)(1 + ε)-NN, for any constant γ > 0, with space and preprocessing time bounded by O(n log² n) × O(1/ε)^d, and query time O(d log n).
1b. A randomized data structure for (1 + γ)(1 + ε)-NN, for any constant γ > 0, with space and preprocessing time bounded by n^{O(log(1/ε)/ε²)}, and query time polynomial in d, log n and 1/ε. It is obtained by reducing the dimension d to O(log n/ε²) (using the Johnson-Lindenstrauss lemma [JL84]) and then utilizing the result above.

2. A randomized data structure for c(1 + γ)-NN, for any constant γ > 0, with space that is sub-quadratic in n and query time that is sub-linear in n. Specifically, the data structure uses O(dn + n^{1+1/c} log² n log log n) space and has query time O(dn^{1/c} log² n log log n). The algorithm is based on Locality-Sensitive Hashing; see below for further details.
The results can be further extended to the ℓ_s norm for s ∈ [1, 2]. This is done by using an embedding from ℓ_s^d into ℓ_1^{d'}, for d' = O(d log(1/ε)/ε²), which preserves all distances up to a factor of 1 + ε [JS82]. The reduction increases the query complexity by an additive term of O(dd').
The above results naturally complement each other. The first one (1 and 1b) shows that it is possible to construct approximate data structures that suffer from only a mild form of the "curse of dimensionality". The data structure (1) is the first to support (1 + ε)-NN queries in time logarithmic in both n and 1/ε, while requiring space that is only near-linear in n. If ε is bounded away from zero, then data structure (1b) removes the "curse" completely. Unfortunately, the resulting data structure is mostly of theoretical interest, since the log(1/ε)/ε² term in the exponent makes the data structure highly memory-intensive even for relatively large ε. On the other hand, the second algorithm (2) provides a more modest improvement in the running time, but the space penalty is much less severe. As a result, the second algorithm is much more practical when the dimension d is high.
An additional benefit of the first algorithm is that it in fact provides a low-complexity approximate Voronoi decomposition. Specifically, we show that the algorithm can be used to construct a decomposition of ℝ^d into O(n log² n) × O(1/ε)^{d+1} simple cells, such that for each cell C one can identify a single point in P that is a (1 + ε)-approximate nearest neighbor for any q ∈ C. This is an approximate analog of the Voronoi decomposition, which provides such a structure for the case of the exact nearest neighbor. However, the complexity of our decomposition is much lower than that of the exact one.
Finally, we show how to use our efficient approximate nearest neighbor algorithms to solve other proximity problems. In particular, we show how to approximate the geometric minimum spanning tree (MST) of a given set of n points by performing a near-linear number of calls to an approximate nearest neighbor data structure. By plugging in the second data structure (based on Locality-Sensitive Hashing) we can find a c-approximate MST in time O(dn^{1+1/c} log³ n).
Our techniques. Our approximate nearest neighbor data structures are obtained in two steps. First, we show how to solve a "decision version" of the approximate nearest neighbor problem, which we term the approximate near neighbor problem. Here, we are given a fixed radius r, and the goal is to build a data structure for a given point-set P that, for any query q, does the following: if there is a point in P within distance r from q, then it reports a point p' ∈ P within distance cr from q. We show how to construct two such data structures. One utilizes the aforementioned approximate Voronoi decomposition, leading to a fast lookup time and space that is mildly exponential in d. The second one is based on the notion of locality-sensitive hashing (LSH). Its key idea is to hash the points in a way that the probability of collision is much higher for objects that are close to each other than for those that are far apart. Then, one can determine near neighbors by hashing the query point and retrieving elements stored in buckets containing that point. We show that such families of hash functions exist for the Hamming distance and its variants, and extend them to ℓ_s norms.
In the second step, we show how to reduce the approximate nearest neighbor problem to the near neighbor problem. A simple way of performing this task is to build several data structures for the latter problem, with radii r = Δ, Δ/c, Δ/c², . . ., where Δ is the largest possible distance between the query and the point-set. Unfortunately, in general this approach leads to space complexity that is not bounded by any function of n. We overcome this problem by providing a more efficient reduction, which utilizes only some of the radii r. Our reduction multiplies the space complexity of the near neighbor data structure by a factor of O(log² n), and the query time by O(log n). Composing the reduction with the aforementioned algorithms for the approximate near neighbor problem yields algorithms for the approximate nearest neighbor problem.
Relation to conference papers. Thanks to the amount of time that has passed since the publication of the initial conference papers [IM98a, HP01] on which this paper is based, we were able to simplify some of the arguments considerably. As a result, several algorithms in this paper are simpler, and sometimes more general, than the algorithms in the original papers. In particular, the reduction from the near to the nearest neighbor problem presented here is a simplification of the reduction in [HP01] (which itself was much simpler and more efficient than the reduction in [IM98a]). It works for general metric spaces, and can be performed using a near-linear number of approximate near neighbor queries. In contrast, the preprocessing algorithm in [HP01] was tailored to ℓ_s spaces, while the algorithm in [IM98a] ran in quadratic time.
We note, however, that a side-effect of our reduction (as well as the reduction of [HP01]) is that in the resulting approximate near neighbor instances the ratio of the diameter of the point-set to the search radius r is no longer guaranteed to be small (i.e., polylogarithmic). The latter property is useful, e.g., when designing efficient LSH functions for the ℓ_1^d space. We can ensure a somewhat weaker property directly by exploiting the "randomized bucketing" approach of [Ber93] (Lemma 3.1). Luckily, this weaker property suffices for our purpose.
Another result that was subject to substantial generalization is the algorithm for computing an approximate MST. The algorithm outlined in the manuscript [IM98b] relied on a data structure for maintaining approximately close pairs of points under updates, which was used to simulate Kruskal's MST algorithm. Instead, in this paper we present a general reduction from the approximate MST problem to a dynamic approximate near neighbor data structure, which works in arbitrary metrics. The reduction is still based on Kruskal's algorithm. However, it has a particularly simple form, which was inspired by the algorithm of [BOR99] (see the comments in the next subsection).

Related work
Prior Work. There has been a substantial amount of work on the approximate nearest neighbor problem in the computational geometry literature. However, all work prior to this paper yielded algorithms that involved factors exponential in d in either the space or the query time bound. One of the earliest works on this topic is [AM93] (improved upon in [Cla88] and [Cha97]), who gave an algorithm with query time (1/ε)^{O(d)} · log n and space (1/ε)^{O(d)} · n. Another line of research [AMN+94] resulted in an algorithm with linear space O(dn) but query time (d/ε)^{O(d)} · log n. Other authors [Ber93, Cha97] considered algorithms with approximation factor polynomial in d, and provided algorithms that avoid exponential factors in the space and query time bounds. Finally, two approximation algorithms were given in [Kle97]: one with O(dn log d) space and O(n) query time, and one with n^{O(d)} space and (d + log n)^{O(1)} query time (for fixed ε > 0). The last algorithm, although still suffering from the curse of dimensionality, has significantly influenced the developments in this paper.
The locality-sensitive hashing approach introduced in this paper has several ancestors in the literature, which investigated multi-index hashing-based algorithms for retrieving similar pairs of vectors with respect to the Hamming distance. Although the analytic framework adopted in those papers makes the results generally incomparable to ours, some of the insights are shared. A few of the papers considered the closely related problem of finding all "close" pairs of points in a data set. For simplicity, we translate them into the near neighbor framework, since they can be solved by performing essentially n separate near neighbor queries.
Typically, the hash functions projected the vectors onto a subset of the coordinates {1, . . ., d}. In some papers [PRR95, GPY94] the authors considered a probabilistic model where the data points are chosen uniformly at random, and the query point is a "random" point "close" to one of the points in the data set. A different approach [KWZ95] is to assume that the data set is arbitrary, but almost all points are far from the query point. Finally, the paper [CR93] proposed an algorithm which did not make any assumption on the input, and provided a method for determining the parameters (denoted here by k and L) to achieve a desired level of sensitivity and accuracy.
On a related front, the authors of [Bro97, BGMZ97] considered similarity search between sets (say A, B) using the Jaccard coefficient s(A, B) = |A ∩ B|/|A ∪ B|. They proposed a family of hash functions h(A) such that Pr[h(A) = h(B)] = s(A, B), which can be plugged into our framework. Although their main motivation was to construct short similarity-preserving "sketches" of sets, they also discuss methods for performing similarity search using these functions.
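To make the min-wise hashing idea concrete, here is a minimal Python sketch; it approximates random permutations by random affine maps modulo a prime (a standard simplification), and all function names are ours, not those of [Bro97, BGMZ97].

    import random

    def make_minhash(num_hashes, seed=0):
        # Each "permutation" is approximated by x -> (a*x + b) mod prime.
        rng = random.Random(seed)
        prime = 2_000_003  # a prime larger than the largest element id we expect
        coeffs = [(rng.randrange(1, prime), rng.randrange(prime)) for _ in range(num_hashes)]

        def signature(item_set):
            # h(A) is the minimum permuted id of A; Pr[h(A) = h(B)] ~ |A and B| / |A or B|.
            return [min((a * x + b) % prime for x in item_set) for (a, b) in coeffs]

        return signature

    def estimated_jaccard(sig_a, sig_b):
        # Fraction of hash functions on which the two signatures agree.
        return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

    sig = make_minhash(num_hashes=200)
    A, B = {1, 2, 3, 4, 5}, {3, 4, 5, 6}
    print(estimated_jaccard(sig(A), sig(B)))  # should be close to 3/6 = 0.5
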
Concurrent developments. In parallel to our conference paper [IM98a], the paper [KOR98] presented an algorithm with bounds similar to our result (1b). Specifically, it provides a data structure that, in the case of the ℓ_1^d norm, achieves O(d(log n + 1/ε)^{O(1)}) query time using space (dn)^{O(1/ε²)}. For the ℓ_2 norm, the query time becomes O(d²(log n + 1/ε)^{O(1)}). The probabilistic guarantee provided by their data structure is somewhat stronger: if the construction procedure is correct (which happens with a controlled probability), then the data structure returns a correct approximate nearest neighbor for all queries q. Although the technical development is somewhat different, the general approach is similar, in that a form of randomized dimensionality reduction is used to reduce the dimension to O(log n/ε²).
In another parallel development, the paper [BOR99] presented a (1 + ε)-approximation algorithm for the MST (for ε < 1), with running time O(dn · n^{1−aε²}) for some absolute constant a > 0.
Further developments. Since the conference versions of this paper appeared in [IM98a, IM98b, HP01], there have been many further developments on approximate nearest neighbor search. In this section, we briefly discuss some of those results. For a more in-depth treatment of the material, see [Ind04].
The reduction from the approximate nearest to the near neighbor problem, and the resulting approximate Voronoi decompositions, were further improved in [AM02] by using a well-separated pairs decomposition [CK95] to generate the cells needed to construct the approximate Voronoi diagram (note that, unlike our construction, the number of cells is linear in n but has exponential dependency on the dimension). This construction uses the same framework as suggested in [HP01]. The result was later extended to improve the tradeoff between the dependency on ε of the query time and the space used; see [AMM09] for further details and related work. This construction results in an approximate Voronoi diagram of complexity n/ε^{O(d)}.
An alternative construction was suggested in [SSS06]. Building on the reduction from nearest-neighbor search to near neighbor search suggested in [HP01], they reduced the space requirements by a logarithmic factor. By slightly changing the near-neighbor problem, they reduce the space bound to linear, which yields an approximate Voronoi diagram of complexity n/ε^{O(d)}.
The exponent in the space bound achieved by the algorithm (1b) is likely to be close to optimal. Specifically, [AIP06] showed that any (1 + ε)-approximate data structure in Hamming space that makes only a constant number of memory accesses needs n^{Ω(1/ε²)} storage (our data structure makes only one memory access). There has been plenty of work on lower bounds for the exact and approximate nearest neighbor problem; see [Ind04] for an overview.
There has been substantial progress in designing LSH functions for various measures and in understanding their limitations; see [AI08] for an overview. In particular, it is known that for the ℓ_1 norm, the 1/c bound for the running time exponent given in this paper is tight ([OWZ09], building on [MNP06]). Lower bounds in more general models of computation were provided in [PTW10]. In contrast, for the ℓ_2 norm, one can further reduce the exponent to 1/c² ([AI06], building on [DIIM04]). In addition, a different algorithm exploiting LSH functions was proposed in [Pan06]; this data structure achieves near-linear storage, at the price of increasing the exponent by a constant factor.
From a more general perspective, the existence of fast approximate nearest neighbor algorithms for high-dimensional ℓ_1 spaces enabled solving this problem for other metrics, by embedding them into ℓ_1 with low distortion. This approach has been useful for variants of the edit distance [MS00, CMS01b, CMS01a, Cor03, CK06, OR07], the earth-mover distance [Cha02, IT03, GD05, AIK08] and other metrics. Lower bounds for such embeddings have been investigated as well [KN05].
Finally, the LSH algorithm and its variants have become popular practical algorithms for similarity search in high dimensions. They have been successfully applied to computational problems in a variety of areas, including web clustering [HGI00], computational biology [Buh01, BT01], computer vision (see selected articles in [SDI06]), computational drug design [DGJC06] and computational linguistics [RPH05].

Notation
We use (X, D) to denote a metric space defined over X with the distance function D. Typically, we will use X = ℝ^d for some dimension d > 0 and D(p, q) = ‖p − q‖_s for some s ≥ 1. In those cases, we will assume that d ≤ n, where n = |P|. For convenience, we assume that n is even.
For any point p ∈ X and r > 0, we use B(p, r) to denote the set {q ∈ X : D(p, q) ≤ r}. For a parameter r > 0, let UB_P(r) = ∪_{p∈P} B(p, r) denote the union of equal balls of radius r centered at the points of P. Moreover, let CC_P(r) be the partitioning of P induced by the connected components of UB_P(r). That is, two points p, q ∈ P belong to the same set in the partitioning if there is a path contained in UB_P(r) connecting p to q; that is, there is a sequence p = p_1, . . ., p_k = q of points of P such that D(p_i, p_{i+1}) ≤ 2r for i = 1, . . ., k − 1.
Define D_P(q) = min_{p∈P} D(q, p). For a parameter c ≥ 1, a point p' ∈ P is a c-approximate nearest neighbor of q if D(q, p') ≤ c·D_P(q).

Definition 1.1. The c-approximate nearest neighbor problem (or c-NN) with failure probability f is to construct a data structure over a set of points P in a metric space (X, D) supporting the following query: given any fixed query point q ∈ X, report a c-approximate nearest neighbor of q in P with probability 1 − f. We use NearestNbr(P, c, f) to denote a data structure solving this problem. We skip the argument f if it is equal to some absolute constant from (0, 1).
Definition 1.2. The (c, r)-approximate near neighbor problem (or (c, r)-NN) with failure probability f is to construct a data structure over a set of points P in a metric space (X, D) supporting the following query: given any fixed query point q ∈ X, if D_P(q) ≤ r, then report some point p' ∈ P ∩ B(q, cr), with probability 1 − f. We use NearNbr(P, c, r, f) to denote a data structure solving this problem. We skip the argument f in this or other definitions involving it if it is equal to some absolute constant from (0, 1).
Note that, according to the above definition, if q ∉ UB_P(r) then the algorithm is allowed to report any point in P, or no point at all (i.e., p' is null). For convenience, if the reported point p' is further than cr from q (a condition that the algorithm can check), then we set p' to null.
Let n = |P|. If the data structure NearNbr(P, c, r) is defined by the context, we will use T(n, c, r, f) and Q(n, c, r, f) to denote the construction and query time of the data structure, respectively. We will also use S(n, c, r, f) to denote its space complexity. Additionally, if the data structure supports updates (insertions or deletions of points to or from P), we denote the update time by U(n, c, r, f). The time to perform a deletion alone is denoted by D(n, c, r, f). We will skip the argument r if the time and space functions do not depend on it, and the argument f if it is assumed to be equal to some small absolute constant (e.g., 1/3).
For the algorithms discussed in this paper, we assume that T(n, c, r, f) = Ω(nd), S(n, c, r, f) = Ω(nd) and Q(n, c, r, f) = Ω(d), i.e., the time and space bounds are at least linear in the input size. Moreover, by standard replication arguments, we have T(n, c, f) = O(log(1/f)) · T(n, c); a similar relationship holds for the other complexity measures.
Insertions and deletions in randomized data structures.In this paper we present several randomized data structures over sets of points P that support approximate nearest neighbor queries and updates.There exist several possible types of guarantees that one can require from such a data structure.In this paper we consider two models: "oblivious" and "adaptive".
Both models require the notion of a well-formed sequence. Formally, consider any sequence S of m operations on the data structure, where the ith operation is of the form (op_i, p_i), with op_i ∈ {Query, Delete, Insert}. Given two sets of points P and Q, the sequence is called well-formed with respect to P and Q if it satisfies the following conditions: (A) only the points in P can be deleted or inserted; (B) only the points in Q can be queried; (C) a point can be deleted only after it has been inserted first.
If Q (or P , resp.) is the set of all points in the metric space, we skip Q (or P , resp.) in the above definition.
We start by defining the oblivious model, where we require that for any well-formed sequence the data structure is correct with some probability. This is the default model used in this paper. The formal definition is as follows.
Definition 1.3. A randomized fully dynamic (c, r)-NN data structure Z with failure probability f is called oblivious if, for any sequence S that is well-formed, each query operation in S is answered correctly with probability at least 1 − f. Note that one can bound the probability of correctness for all operations in the sequence via the union bound.
In the adaptive model, the requirements are stronger than in the oblivious case. Specifically, once a data structure is constructed correctly (which happens with probability 1 − f), it should correctly execute any well-formed sequence of update and query operations, as long as the points come from a pre-defined set. This in particular implies that the query and update operations can depend on the results of prior operations.
Definition 1.4. Consider any two sets of points P and Q. A randomized fully dynamic (c, r)-NN data structure Z with failure probability f is called adaptive with respect to P, Q if Pr[Z is correct for any sequence S that is well-formed with respect to P, Q] ≥ 1 − f.

Miscellaneous. For any P ⊂ X, let P' and P'' be two partitions of P. We say that P' refines P'' (denoted by P' ⊑ P'') if for any S ∈ P' there is S' ∈ P'' such that S ⊂ S'.
We use D_H(p, q) to denote the Hamming distance between vectors p and q, i.e., the number of coordinates i such that p_i ≠ q_i.

Outline
In this section, we outline the reduction from the approximate nearest neighbor problem to a sequence of approximate near neighbor problems.
We start with a description of an "ideal" reduction, which assumes a solution to the exact near neighbor problem (i.e., for c = 1), and also assumes some constraints on the point set. The actual reduction removes these requirements at the price of being somewhat more complicated.
For simplicity, assume n is even. Let γ ∈ (1/n, 1/2) be a parameter to be determined later (the lower bound of 1/n ensures that log(n/γ) = O(log n), which simplifies some expressions). Let r_med be the smallest value of r such that UB_P(r) has a component of size at least n/2 + 1, and let UB_med = UB_P(r_med). For the purpose of exposition, we assume that the largest component of UB_med has size exactly n/2 + 1.
We first construct a NearNbr data structure Z_{med,2} = NearNbr(P, c, r_med/2). Given a query point q, if q is within distance r_med/2 of some point p ∈ P (which can be decided by one near neighbor query in Z_{med,2}), then we continue the search for the nearest neighbor recursively in the connected component of UB_med that contains p. Note that such a connected component is unique. Moreover, observe that the search continues recursively in a subset of P of cardinality at most n/2 + 1.
Alternatively, if D_P(q) ≥ r_top for r_top ≫ r_med (which can be decided by a single near neighbor query in NearNbr(P, c, r_top)), then q is "far away" from the points of P, and we can continue the search on a decimated subset of P. Namely, from each connected component of UB_med we extract one point of P that lies inside it. This results in a set P' ⊂ P that contains at most n/2 points. We continue the recursive search in P'. Although continuing the search in P' introduces a cumulative error into the search results, we show that the overall error introduced is at most a factor of 1 + O(γ) if we set r_top = Θ(n·r_med·log(n)/γ).
Overall, given a query point, either we find its (1 + γ)-approximate nearest neighbor using O(log M) = O(log(n/γ)) near neighbor queries, or we perform two near neighbor queries and continue the search recursively in a set having at most n/2 + 1 points of P. Thus, one can find a (1 + O(γ))-approximate nearest neighbor by performing O(log(n/γ)) NearNbr queries.
This concludes the description of the ideal algorithm. Unfortunately, it suffers from several problems: (i) it assumes a reduction to the exact near neighbor problem, (ii) it assumes that a component of size n/2 + 1 always exists, and (iii) it requires computing the value of r_med, which is expensive. Instead, we will only compute an approximation r*_med to r_med, such that r_med ≤ r*_med = O(n·r_med). This will require defining a quantity r*_bot that plays the role of r_med in the above argument, and readjusting the value of r_top.

Subroutines
Approximating r_med. Recall that r_med is the smallest value of r such that UB_P(r) has a component of size at least n/2 + 1, and UB_med = UB_P(r_med). We start by describing an algorithm that approximates the value of r_med.
Lemma 2.1. There exists a randomized algorithm that, given a set P of n points and a distance function D, returns an estimate r*_med such that r_med ≤ r*_med ≤ (n − 1)·r_med with probability at least 1/2. The algorithm runs in O(n) time.
Proof: The algorithm first selects a point p uniformly at random from P. Then r*_med is defined to be the median of the multiset of distances D(p, p') over p' ∈ P \ {p}. Note that the median of n − 1 distances is uniquely defined (recall that n is even).
To prove the lemma, define C ⊂ P to be the set of points inducing the largest connected component of UB_P(r_med). By the definition of r_med, we have |C|/|P| > 1/2. Therefore, the point p belongs to C with probability greater than 1/2. From now on we condition the analysis on this event.
First, we show that r_med ≤ r*_med. Indeed, the point p, together with the n/2 points of P closest to it, induces a connected component of UB_P(r*_med) of size n/2 + 1. Since r_med is by definition the smallest radius that induces such a component, it follows that r_med ≤ r*_med. On the other hand, all points in C are within distance (n − 1)·r_med of p. Since |C| > n/2, it follows that the median of the distances (i.e., r*_med) is at most (n − 1)·r_med.
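The estimation procedure of Lemma 2.1 is simple enough to state in code. Below is a small Python sketch, assuming the metric is given as a distance function D; the helper name estimate_r_med is ours.

    import random

    def estimate_r_med(P, D, rng=random):
        # Pick a uniformly random point p and return the median of its
        # distances to the other n-1 points (Lemma 2.1).  With probability
        # at least 1/2 the returned value r satisfies r_med <= r <= (n-1)*r_med.
        p = rng.choice(P)
        dists = sorted(D(p, q) for q in P if q is not p)
        return dists[len(dists) // 2]
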
Let c > 1 be the approximation factor of the near-neighbor data structure that we reduce to. For λ = O(log n/γ), where γ > 0 is a parameter to be determined, we define r*_bot = r*_med/(cn) and r*_top = r*_med · n · λ · c.

Approximating CC_P(r). We now describe an algorithm for approximating the connected components CC_P(r). Specifically, for given parameters r and c, we show how to compute a partitioning 𝒫 of P such that 𝒫 is a refinement of CC_P(cr) and CC_P(r) is a refinement of 𝒫. The algorithm uses a NearNbr(P, c, r, f) data structure that is adaptive with respect to the query set P, with failure probability at most f, where f is some constant. The algorithm is presented as Algorithm 2.1. The basic idea is simple: we compute the connected components of a graph with edges of the form (q, p'), where q ∈ P and p' is a point output by an approximate NearNbr query on q. This is done by implementing a standard graph search. However, some care is needed to ensure that the running time of the algorithm is low, even though the graph itself can have Ω(n²) edges. This is achieved by deleting from NearNbr the points that have been reached during the search, which ensures that each new edge found by querying NearNbr leads to a vertex that has not been reached yet.
procedure ApproximateCC(P, c, r)
    Construct a NearNbr(P, c, r, f) data structure for P.
    𝒫 ← ∅;  E' ← ∅
    while P ≠ ∅ do
        Select any p ∈ P. Delete p from P and from the associated NearNbr structure.
        S ← {p};  C ← ∅    (C is the next connected component to be computed)
        repeat
            Select any q from S.    (we will enumerate all edges going out of q)
            Add q to C. Delete q from S.
            repeat
                Let p' be the answer of the NearNbr structure on query q.
                if p' is not null then
                    Add {q, p'} to E'. Delete p' from P and from the associated NearNbr structure. Add p' to S.
                end if
            until p' is null
        until S = ∅    (connected component C has been completed)
        Add C to the partition 𝒫.
    end while
end procedure

Algorithm 2.1: Finding an approximation to CC_P(r).
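For concreteness, the following Python sketch mirrors Algorithm 2.1. It assumes a near-neighbor object returned by a (hypothetical) factory make_near_nbr, exposing query(q), which returns a stored point within distance cr of q or None, and delete(p); these interface names are ours and the points are assumed to be hashable (e.g., tuples).

    def approximate_cc(points, make_near_nbr):
        # make_near_nbr(points) is assumed to return a (c, r)-near-neighbor
        # structure over `points` supporting .query(q) -> point-or-None and .delete(p).
        nn = make_near_nbr(points)
        remaining = set(points)
        partition = []
        while remaining:
            p = remaining.pop()
            nn.delete(p)
            stack = [p]            # points whose neighborhoods are not yet explored
            component = []
            while stack:
                q = stack.pop()
                component.append(q)
                while True:        # enumerate edges going out of q
                    p2 = nn.query(q)
                    if p2 is None:
                        break
                    remaining.discard(p2)
                    nn.delete(p2)
                    stack.append(p2)
            partition.append(component)
        return partition
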
Claim 2.3. The algorithm ApproximateCC computes a partitioning 𝒫 that refines CC_P(cr) and is refined by CC_P(r), i.e., CC_P(r) ⊑ 𝒫 ⊑ CC_P(cr). Its running time is O(T(n, c, f) + n·D(n, c, f) + n·Q(n, c, f)), and its probability of failure is at most f.

Interval near neighbor. Our algorithm will also make use of a collection of NearNbr data structures which solve the approximate near neighbor problem assuming that the distance from the query to the data set belongs to a given range [a, b]. We will refer to this collection as Interval NearNbr, or INearNbr. As stated in Lemma 2.4, the collection consists of M = O((log(b/a))/γ) NearNbr data structures, each with failure probability 1/(3 log M), so that given a query point q, one can decide that either: (i) D_P(q) ≤ a, (ii) D_P(q) ≥ b, or (iii) find a point p' ∈ P such that D(q, p') ≤ c(1 + γ)D_P(q). The answer returned is correct with probability at least 2/3. If either case (i) or case (ii) occurs then only two NearNbr queries are carried out, while if case (iii) occurs then O(log(log(b/a)/γ)) NearNbr queries are performed.
Here Z_i denotes a NearNbr(P, c, r_i, 1/(3 log M)) data structure with r_i = a(1 + γ)^{i−1}, for i = 1, . . ., M, where M is the smallest integer such that r_M ≥ b. Clearly, if a query to Z_1 does not return null, we conclude that case (i) has occurred. Similarly, if a query to Z_M returns null, we conclude that case (ii) must have happened. Note that in both cases we only perform one NearNbr query.
Otherwise, it must be the case that a/c ≤ D_P(q) ≤ cb. By performing a binary search on Z_1, . . ., Z_M we can find an index i such that, on query q, Z_i reports null and Z_{i+1} reports some point p'. Clearly, we have D_P(q) > r_i and D(q, p') ≤ c·r_{i+1} = c(1 + γ)·r_i < c(1 + γ)·D_P(q). Thus, p' is a c(1 + γ)-approximate nearest neighbor of q in P.
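The search just described is easy to sketch in code. The Python fragment below assumes a list Z of near-neighbor structures built for geometrically increasing radii, each exposing query(q) that returns a point within c times its radius or None; the function name and the returned case labels are ours.

    def inearnbr_query(Z, q):
        # Z[i] is a (c, r_i)-near-neighbor structure for r_i = a*(1+gamma)**i, i = 0..M-1.
        p = Z[0].query(q)
        if p is not None:
            return ('case (i)', p)        # some point of P is within c*a of q
        p_hi = Z[-1].query(q)
        if p_hi is None:
            return ('case (ii)', None)    # no point found within the top radius
        lo, hi = 0, len(Z) - 1            # invariant: Z[lo] answered null, Z[hi] answered p_hi
        while hi - lo > 1:
            mid = (lo + hi) // 2
            p_mid = Z[mid].query(q)
            if p_mid is None:
                lo = mid
            else:
                hi, p_hi = mid, p_mid
        return ('case (iii)', p_hi)       # a c*(1+gamma)-approximate nearest neighbor
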

The reduction
Data structure construction. Let f = 1/(E log n) for some constant E > 1. We will use NearNbr data structures with probability of failure f. The preprocessing algorithm is described as Algorithm 2.2. It can be viewed as building a search tree defined by the recursive calls, with each tree node containing a subset of the points and a collection of NearNbr data structures over that subset.
We now analyze the complexity of the construction algorithm. First, observe that when the procedure terminates, we have k ≤ n/2 and k' ≤ k. The latter inequality holds since c·r*_bot ≤ r*_med, and therefore the partition computed at radius r*_bot refines the partition computed at radius r*_med. Without loss of generality assume that n ≥ 3. We have the following recurrence for the quantity B(n) defined in the proof of Lemma 2.5 (Equation 1). The proof is by induction: we apply the inductive hypothesis and substitute for B(n_i) and B(k') in Equation 1. Since k' ≤ k, we obtain a bound (Equation 2) subject to the same constraints as above.
We now relax the assumption that the n_i's are integers. Using the convexity of the negated entropy function H(p) = p log p + (1 − p) log(1 − p), applied to pairs of the n_i's, we observe that for each k the maximum is achieved when at most two values n_i (say, n_1 and n_2) are set to a value greater than 1. Since 1 · log 1 = 0, the right-hand side of Equation 2 simplifies accordingly.

Case (ii):
The search procedure continues on the set P' of representatives, with the property that for any point p ∈ P there is a point p' ∈ P' such that D(p, p') ≤ c·n·r*_med. Moreover, we know that D_P(q) ≥ r*_top = r*_med·n·λ·c ≥ λ·D(p, p'). It follows that if the recursive procedure finds an a-approximate nearest neighbor of q in P', then this point is also a (1 + 1/λ)a-approximate nearest neighbor of q in P.

Case (iii):
The search procedure determines a (1 + γ)c-approximate nearest neighbor directly by querying INearNbr.
It follows that the search procedure reports a (1 + 1/λ)^h (1 + γ)c-approximate nearest neighbor, where h = log n + O(1) is the height of the data structure tree. We set λ = Θ(log(n)/γ) so that (1 + 1/λ)^h ≤ 1 + γ, and hence the overall approximation factor is (1 + O(γ))c. The above analysis required that each of the O(log n) NearNbr data structures encountered in the process be correct. The probability that this does not happen is at most O(1/E).

The result
We now state our main result of this section. The result is stated in two forms: without preprocessing, and with preprocessing.
Assume that we are given a data structure for the (c, r)-approximate near neighbor problem using space S(n, c, f) and with query time Q(n, c, f), with failure probability f. Then there exists a data structure for answering c(1 + O(γ))-approximate nearest neighbor queries whose space is larger by a factor of O(log² n) and whose query time is larger by a factor of O(log n).

Theorem 2.10. Let P be a given set of n points in a metric space, and let c = 1 + ε > 1, f ∈ (0, 1), and γ ∈ (1/n, 1) be prescribed parameters.
Assume that we are given a data structure for the (c, r)-approximate near neighbor problem over P that can be constructed in T(n, c, f) time, uses S(n, c, f) space, has query time Q(n, c, f) and deletion time D(n, c, f), with failure probability f. Moreover, assume that the data structure is adaptive with respect to any set of size n^{O(1)}.
Then one can build, in expected time dominated by O(log n) constructions of NearNbr data structures and a near-linear number of NearNbr queries and deletions, a data structure answering c(1 + O(γ))-approximate nearest neighbor queries; relative to the near neighbor data structure, the space increases by a factor of O(log² n) and the query time by a factor of O(log n).

Approximate near neighbor
In this section, we describe data structures for the approximate near neighbor problem (i.e., (c, r)-NN) in ℓ_s^d. They can be used as subroutines in the reduction from the earlier section. Note that by scaling we can assume r = 1. We start by describing two simplifying reductions, followed by two algorithms for the approximate near neighbor problem.

Reductions
Coordinate range reduction. We start by describing a method for reducing the range of the coordinate values of points in P ∪ {q}, for the case of the ℓ_1 norm. It is essentially due to [Ber93], who used it for the purpose of designing approximate nearest neighbor algorithms with approximation factor polynomial in d. This subroutine improves the efficiency of the data structures.
Lemma 3.1. Fix a > 1, and suppose there is a data structure for (c, 1)-NN in [0, a]^d under the ℓ_1 norm, using space S(n, d, c, r, f) and query time Q(n, d, c, r, f), with failure probability f. Then there exists a data structure for the same problem over ℓ_1^d, with asymptotically the same time and space bounds, and failure probability f + 1/a.
Proof: First, we impose an orthogonal cubic grid on ℝ^d, where each grid cell has side length a. The grid is translated by a vector chosen uniformly at random from [0, a]^d. Consider any pair of points p, q ∈ ℓ_1^d such that ‖p − q‖_1 ≤ 1. The probability that both p and q fall into the same grid cell is at least ∏_{i=1}^{d} (1 − |p_i − q_i|/a) ≥ 1 − ‖p − q‖_1/a ≥ 1 − 1/a. Therefore, we can proceed as follows:
• For each grid cell C such that C ∩ P ≠ ∅, build a separate data structure for the set C ∩ P.
• In order to answer a query q ∈ C, query the data structure for C ∩ P.
In this way we reduced the (c, 1)-NN problem for points in ℓ_1^d to the case when the coordinates of the points live in a cube of side a. The query time is increased by an additive term O(d), which (as per the assumptions in Section 1.2) is subsumed by O(Q(n, d, c, r, f)). Since (as per the assumptions in Section 1.2) S(n, d, c, r, f) is at least linear in dn, the space bound of the new data structure remains O(S(n, d, c, r, f)). The probability of failure is increased by an additive term of 1/a.
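The randomly shifted grid of Lemma 3.1 is straightforward to implement. The Python sketch below only builds the partition of the points into grid cells; function names are ours, and building a near-neighbor structure per cell is left to the caller.

    import random
    from collections import defaultdict

    def randomly_shifted_grid(points, a, seed=0):
        # Partition the points by an axis-parallel grid of side length a,
        # translated by a uniformly random shift in [0, a)^d (Lemma 3.1).
        # Two points at l1-distance at most 1 land in the same cell
        # with probability at least 1 - 1/a.
        d = len(points[0])
        rng = random.Random(seed)
        shift = [rng.uniform(0.0, a) for _ in range(d)]
        cells = defaultdict(list)
        for p in points:
            cell = tuple(int((p[i] + shift[i]) // a) for i in range(d))
            cells[cell].append(p)
        return shift, cells   # build one near-neighbor structure per non-empty cell

    def cell_of(q, shift, a):
        # The cell a query point falls into, under the same random shift.
        return tuple(int((q[i] + shift[i]) // a) for i in range(len(q)))
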
Hamming cube. Next, we show that the problem can be simplified further, by assuming that the points live in a low-dimensional Hamming cube.

Lemma 3.2. Fix δ > 0, a > 1, and suppose there is a data structure for (c, r)-NN in a dM-dimensional Hamming cube, for M = O(ad/δ), using space S(n, d, c, r, f) and query time Q(n, d, c, r, f), with failure probability f. Then there exists a data structure for (c(1 + δ), 1)-NN for ℓ_1^d with the same time and space bounds, and failure probability f + 1/a.

Proof: By Lemma 3.1 we can assume that all points live in [0, a]^d. If we round all coordinates of the points and queries to the nearest multiple of δ/d, then the inter-point distances are affected only by an additive term of δ. Thus, solving a (c, 1 + δ)-NN problem on the resulting point-set and queries yields a solution to (c(1 + δ), 1)-NN over the original space ℓ_1^d. In order to solve the (c, 1 + δ)-NN problem, define a scaling factor s = d/δ. Observe that, if we multiply all coordinates of the data points and queries by s, then all coordinates become integers in the range {0, . . ., M}. In this case, we use a mapping unary : {0, . . ., M}^d → {0, 1}^{dM}, such that unary((x_1, . . ., x_d)) is the concatenation unary(x_1) · · · unary(x_d), where unary(x) is a string of x ones followed by M − x zeros. Observe that the ℓ_1 distance between any two such integer points is equal to the Hamming distance of their images [LLR94]. Thus, it now suffices to solve a (c, (1 + δ)s)-NN problem on the point-set and queries mapped by unary.
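The rounding, scaling and unary steps are easy to spell out; the following Python sketch (function names ours) illustrates them and checks on a tiny example that ℓ_1 distance between integer vectors equals the Hamming distance of their unary images.

    def round_and_scale(x, d, delta):
        # Round each coordinate of x to the nearest multiple of delta/d and
        # rescale by s = d/delta, producing non-negative integers.
        s = d / delta
        return [round(xi * s) for xi in x]

    def unary(v, M):
        # unary(x) is x ones followed by M - x zeros; concatenating the blocks
        # turns l1 distance between integer vectors into Hamming distance.
        bits = []
        for x in v:
            bits.extend([1] * x + [0] * (M - x))
        return bits

    # Example: the l1 distance between u and w equals the Hamming distance
    # between their unary images.
    u, w = [2, 0, 3], [1, 1, 3]
    M = 4
    hamming = sum(b1 != b2 for b1, b2 in zip(unary(u, M), unary(w, M)))
    l1 = sum(abs(x - y) for x, y in zip(u, w))
    assert hamming == l1 == 2
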
In the following we focus on solving the problem in the ℓ_s norms for s ∈ {1, 2}. The generalization to any s ∈ [1, 2] follows from the theorem of [JS82], which states that for any d and γ there exists a mapping f : ℓ_s^d → ℓ_1^{d'}, with d' = O(d log(1/γ)/γ²), so that for any p, q ∈ ℓ_s^d we have ‖p − q‖_s ≤ ‖f(p) − f(q)‖_1 ≤ (1 + γ)‖p − q‖_s. Note that the mapping f is defined for all points in ℓ_s^d, even if they are not known during the preprocessing. The mapping f is chosen at random from a certain distribution over linear mappings from ℓ_s^d to ℓ_1^{d'}; applying it to a point takes time O(dd'). This allows us to reduce (c(1 + γ), r)-NN in ℓ_s^d to (c, r)-NN in ℓ_1^{d'}.

Locality-sensitive hashing
In this section, we describe an algorithm based on the concept of locality-sensitive hashing (LSH).
The basic idea is to hash the data and query points in a way that the probability of collision is much higher for points that are close to each other than for those that are far apart. Formally, we require the following.

Definition 3.3. A family H of functions h : X → U is called (r_1, r_2, p_1, p_2)-sensitive for (X, D) if for any p, q ∈ X:
• if D(p, q) ≤ r_1 then Pr_H[h(q) = h(p)] ≥ p_1,
• if D(p, q) ≥ r_2 then Pr_H[h(q) = h(p)] ≤ p_2.
In order for a locality-sensitive family to be useful, it has to satisfy the inequalities p_1 > p_2 and r_1 < r_2. We can solve the (c, r)-NN problem using a locality-sensitive family as follows. We choose r_1 = r and r_2 = c·r. Given a family H of such hash functions for these parameters, we amplify the gap between the "high" probability p_1 and the "low" probability p_2 by concatenating several functions. In particular, for k specified later, define a function family G = {g : X → U^k} such that g(p) = (h_{i_1}(p), . . ., h_{i_k}(p)), where h_{i_t} ∈ H and (i_1, . . ., i_k) ∈ {1, . . ., |H|}^k. For an integer L we choose L functions g_1, . . ., g_L from G independently and uniformly at random. Let I_j denote the multi-set of indices defining g_j. During preprocessing, we store a pointer to each p ∈ P in the buckets g_1(p), . . ., g_L(p). Since the total number of buckets may be large, we retain only the non-empty buckets by resorting to "standard" hashing.
To process a query q, we carry out a brute-force search for a near neighbor of q in the buckets g_1(q), . . ., g_L(q). As it is possible that the total number of points stored in those buckets is large, we interrupt the search after finding the first 3L points (including duplicates). Let p_1, . . ., p_t be the points encountered during this process. If there is a point p_j such that D(p_j, q) ≤ r_2 then we return p_j; otherwise we return null.
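A minimal Python sketch of the preprocessing and query procedures just described is given below. For concreteness it is instantiated with the coordinate-projection family used later in Section 3.2.1 for the Hamming metric; the class and method names are ours, and deletions are omitted.

    import random
    from collections import defaultdict

    class LSHIndex:
        # Locality-sensitive hashing index for d-dimensional binary vectors
        # under the Hamming distance, using the bit-sampling family h_i(p) = p[i].
        # k and L are the parameters from Theorem 3.4.
        def __init__(self, d, k, L, seed=0):
            rng = random.Random(seed)
            # g_j concatenates k randomly chosen coordinate projections.
            self.gs = [[rng.randrange(d) for _ in range(k)] for _ in range(L)]
            self.tables = [defaultdict(list) for _ in range(L)]

        def _key(self, g, p):
            return tuple(p[i] for i in g)

        def insert(self, p):
            for g, table in zip(self.gs, self.tables):
                table[self._key(g, p)].append(p)

        def query(self, q, r2):
            # Scan the L buckets of q; stop after 3L candidates, and report
            # any candidate within distance r2 = c * r (else null/None).
            limit = 3 * len(self.gs)
            seen = 0
            for g, table in zip(self.gs, self.tables):
                for p in table[self._key(g, q)]:
                    if sum(a != b for a, b in zip(p, q)) <= r2:
                        return p
                    seen += 1
                    if seen >= limit:
                        return None
            return None
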
Correctness. The parameters k and L are chosen so as to ensure that with constant probability the following two events hold. For a point p* we define:
• E_1(q, p*) occurs iff either p* ∉ B(q, r) or g_j(p*) = g_j(q) for some j = 1, . . ., L.
• E_2(q) occurs iff the total number of collisions of q with points from P − B(q, r_2) is less than 3L, i.e., |{(p', j) : p' ∈ P − B(q, r_2), j ∈ {1, . . ., L}, g_j(p') = g_j(q)}| < 3L.
Our first result is established for the oblivious case.
Theorem 3.4. Suppose there is an (r, cr, p_1, p_2)-sensitive family H for (X, D), where p_1, p_2 ∈ (0, 1), and let ρ = log(1/p_1)/log(1/p_2). Then there exists a fully dynamic data structure for (c, r)-NN over a set P ⊂ X of at most n points, such that:
• The query procedure requires at most O(n^ρ/p_1) distance computations, evaluations of the hash functions from G (that is, O(n^ρ/p_1 · ⌈log_{1/p_2} n⌉) evaluations of the hash functions from H), and other operations; the same bounds hold for the updates.
• The data structure uses at most O(n^{1+ρ}/p_1) words of space, in addition to the space needed to store the set P.
The failure probability of the data structure is at most 1/3 + 1/e < 1.
Proof: Let q be a query point, and let P be the set of points at the time when the query is executed. Assume that there exists p* ∈ B(q, r_1) (otherwise there is nothing to prove).
Observe that if the events E_1(q, p*) and E_2(q) hold, then the algorithm is correct. Thus it suffices to ensure that E_1 holds with probability P_1 and E_2 holds with probability P_2 such that both P_1 and P_2 are strictly greater than one half.
Set k = ⌈log_{1/p_2} n⌉. Then the probability that g(p') = g(q), for a fixed g ∈ G and a point p' ∈ P − B(q, r_2), is at most p_2^k ≤ 1/n. Thus the expected number of elements from P − B(q, r_2) colliding with q under a fixed g_j is at most 1, and the expected number of such collisions over all g_j is at most L. By Markov's inequality, the probability that this number exceeds 3L is less than 1/3. Therefore the probability that event E_2(q) holds is P_2 ≥ 2/3.
Consider now the probability that g_j(p*) = g_j(q). It is bounded from below by p_1^k ≥ p_1^{log_{1/p_2} n + 1} = p_1 · n^{−ρ}. Thus the probability that such a g_j exists is at least ζ = 1 − (1 − p_1 n^{−ρ})^L. By setting L = ⌈n^ρ/p_1⌉ we get ζ > 1 − 1/e. The theorem follows.
Remark 3.5. In the conference paper [IM98a], the theorem and the proof implicitly assumed that the probabilities p_1 and p_2 are bounded away from 0. Although this assumption holds for the LSH functions described in this paper (see Section 3.2.1), it does not have to be the case in general. Therefore, in this paper we make the dependence on p_1 and p_2 explicit. See [OWZ09] for a further discussion of this point.
We now consider the adaptive case. Let m be an upper bound on the number of queries performed by the data structure, and let n be an upper bound on the number of points in the data structure at any time. Compared to the oblivious case, the main difference is that now we will need to ensure that all possible events E_1 and E_2 hold simultaneously. To this end we reduce the probability of failure by constructing I = c log(m + n) instances Z_1, . . ., Z_I of NearNbr using independent random coins, maintaining the same set P. In order to answer a query q, we query all Z_i's and return any valid answer.

Lemma 3.6. There exists an absolute constant c such that Pr[for all p ∈ P and q ∈ Q, there exists t = 1, . . ., I such that both E_1(q, p) and E_2(q) hold for Z_t] ≥ 1 − 1/n.

Proof: We have

Pr[∃ p, q ∀ t : (E_1(q, p) ∧ E_2(q)) does not hold for Z_t]
    ≤ Σ_{p,q} Pr[∀ t : (E_1(q, p) ∧ E_2(q)) does not hold for Z_t]
    = Σ_{p,q} ( Pr[(E_1(q, p) ∧ E_2(q)) does not hold for Z_1] )^I
    ≤ nm (1/e + 1/3)^I,

which is at most 1/n for an appropriate absolute constant c > 1.
Since E_1 and E_2 are monotone properties (if they hold for all points of P and Q, then they also hold for any subsets), we have the following.
Corollary 3.7. Consider any point-set Q of size m and any point-set P of size n. There exists an absolute constant c such that the data structure obtained by running c·log(n + m) copies of the algorithm from Theorem 3.4 is adaptive with respect to P and Q, with probability of failure at most 1/n.

Hamming metric
In order to use Theorem 3.4 for the Hamming metric, we need to specify a proper family of hash functions. To this end we use the family of all projections of a bit vector onto one of its coordinates.
Proposition 3.8. Let D(p, q) be the Hamming metric for p, q ∈ Σ^d, where Σ is any finite alphabet. Then for any r, ε > 0, the family H = {h_i : h_i(p) = p_i, i = 1, . . ., d} is (r, r(1 + ε), 1 − r/d, 1 − r(1 + ε)/d)-sensitive.

Remark 3.9. Note that by padding the input and query points with dummy coordinates equal to 0, we can increase d by a constant factor, and ensure that the probabilities p_1 and p_2 in the above family are bounded away from 0. This allows us to drop the dependence on these parameters in the remainder of this paper.

Proof (of Corollary 3.10): We use Proposition 3.8 and Theorem 3.4. First, we need to estimate the value of ρ = ln(1/p_1)/ln(1/p_2), where p_1 = 1 − r/d and p_2 = 1 − cr/d. We claim that p_1^c ≥ p_2; equivalently, that f(x) = 1 − tx − (1 − x)^t ≤ 0 for x = r/d and t = c. This is trivially true for x = 0. Furthermore, taking the derivative, we see that f'(x) = −t + t(1 − x)^{t−1}, which is non-positive for x ∈ [0, 1) and t ≥ 1. Therefore, f is non-increasing in the region in which we are interested, and so f(x) ≤ 0 for all values in this region.
Since p_1^c ≥ p_2, we have ρ = ln(1/p_1)/ln(1/p_2) ≤ 1/c, and the bound on the exponent follows. The running time now follows from Theorem 3.4, since ⌈log_{1/p_2} n⌉ = O(d/r · log n). Finally, the space needed to store the set P is O(dn).
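As a quick numerical sanity check of the bound ρ ≤ 1/c for the bit-sampling family, one can evaluate the exponent directly; this throwaway Python snippet is not part of any algorithm, and the particular values of d, r and c are arbitrary.

    from math import log

    d, r = 1000, 50
    for c in (1.5, 2.0, 4.0):
        p1, p2 = 1 - r / d, 1 - c * r / d
        rho = log(1 / p1) / log(1 / p2)
        print(c, rho, rho <= 1 / c)   # rho is always at most 1/c
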

Extension to normed spaces
In this section, we show how to extend the LSH algorithm to the ℓ_1 norm. The algorithm follows by composing Lemma 3.2 and Corollary 3.10.

Theorem 3.12. For any c > 1 and δ > 0, there exists a randomized data structure (with constant probability of success) for (c(1 + δ), 1)-NN over a set of at most n points in ℓ_1^d, using O(dn + n^{1+1/c}) space and with O(d/δ · n^{1/c} log n) query and update time.
Proof: The algorithm follows by composing Lemma 3.2 and Corollary 3.10. The latter is invoked with r = (1 + δ)d and the dimension of the Hamming cube bounded by O(d²/δ). The theorem follows by observing that each distance computation (as in the statement of Corollary 3.10) can be done in O(d) time.
Corollary 3.14. For any c > 1 and δ > 0, there exists a randomized data structure for (c(1 + δ), 1)-NN over a set of at most n points in ℓ_1^d, using O(dn + n^{1+1/c} log n) space and with O(d/δ · n^{1/c} log² n) query and update time. For any point-set P of size n and query set Q of size n^{O(1)}, the algorithm is adaptive with respect to P and Q, with failure probability at most 1/n.

Bucketing
In this section, we show how to solve the (c, 1)-NN problem in ℓ_s^d, for any c = 1 + ε with ε ∈ (0, 1/2). The query time is very fast, i.e., O(d). At the same time, the algorithm uses (C/ε)^d · n space for some constant C > 1. Furthermore, in the case of the ℓ_s norm for s ∈ {1, 2} we can use the Johnson-Lindenstrauss lemma [JL84] to reduce d to O(log n/ε²). This leads to a space bound polynomial in n, for fixed ε > 0.
Assume for now that s = 2. Impose a uniform grid of side length ε/√d on ℝ^d. Clearly, the distance between any two points belonging to one grid cell is at most ε. For each ball B_i = B(p_i, 1), p_i ∈ P, define B̄_i to be the set of grid cells intersecting B_i. Store all cells from ∪_i B̄_i in a hash table, together with the information about the corresponding ball(s). After preprocessing, to answer a query q it suffices to compute the cell containing q and check whether it is stored in the table.
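The grid construction is illustrated by the Python sketch below. It is a simplification of the scheme above: instead of enumerating exactly the cells intersecting each unit ball, it stores every cell whose center lies within 1 + ε of the data point (a superset), so the point returned for a query is only guaranteed to lie within 1 + O(ε); all names are ours.

    import math

    def build_bucketing(points, eps):
        # Store, for every grid cell of side eps/sqrt(d) near some unit ball
        # B(p_i, 1), one data point whose ball (approximately) covers the cell.
        d = len(points[0])
        side = eps / math.sqrt(d)
        table = {}
        for p in points:
            lo = [math.floor((p[i] - 1 - eps) / side) for i in range(d)]
            hi = [math.floor((p[i] + 1 + eps) / side) for i in range(d)]
            for cell in _cells_in_box(lo, hi):
                center = [(cell[i] + 0.5) * side for i in range(d)]
                if math.dist(center, p) <= 1 + eps:
                    table.setdefault(cell, p)
        return side, table

    def _cells_in_box(lo, hi):
        # Enumerate all integer cell indices in the axis-parallel box [lo, hi].
        if not lo:
            yield ()
            return
        for x in range(lo[0], hi[0] + 1):
            for rest in _cells_in_box(lo[1:], hi[1:]):
                yield (x,) + rest

    def bucketing_query(q, side, table):
        cell = tuple(math.floor(qi / side) for qi in q)
        # Returns a point at distance at most 1 + O(eps) from q whenever q has
        # a point of P within distance 1; may return None otherwise.
        return table.get(cell)
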

Dimensionality reduction.
If the dimension d is high, we can reduce the space bound using the Johnson-Lindenstrauss (JL) lemma [JL84]. For ℓ_2 we can apply the JL lemma directly and reduce d to d' = O(log n/ε²). This leads to n^{O(log(1/ε)/ε²)} space and preprocessing time, while the query time becomes O(dd').
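For completeness, here is one standard way to instantiate the JL projection for ℓ_2 in Python, via a random Gaussian matrix; the scaling convention and the function name are our choices, not prescribed by the paper.

    import math
    import random

    def jl_project(points, d_prime, seed=0):
        # Map each point in R^d to R^{d'} by a random Gaussian matrix scaled by
        # 1/sqrt(d'); for d' = O(log n / eps^2) all pairwise l2 distances are
        # preserved up to a factor 1 +/- eps with high probability (JL lemma).
        rng = random.Random(seed)
        d = len(points[0])
        A = [[rng.gauss(0.0, 1.0) / math.sqrt(d_prime) for _ in range(d)]
             for _ in range(d_prime)]
        return [[sum(A[j][i] * p[i] for i in range(d)) for j in range(d_prime)]
                for p in points]
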
For ℓ_1 we proceed as follows. First, we apply Lemma 3.2 and map the points to a dM-dimensional Hamming space, for M = O(d). Since for any two points p, q in the Hamming space we have D_H(p, q) = ‖p − q‖_2², we can now apply the dimensionality reduction to the points in the Hamming space. This increases the query time by an additive term of O(dMd'). The term can be further reduced to O(dd') by pre-computing and storing the dot products of unary(p_l) with the random vectors used to perform the dimensionality reduction. Since each p_l is in the range {0, . . ., M}, it follows that for each j we need to store M = O(d/δ) numbers, for a total storage of O(d²/δ).
Thus, we obtain the following: for ε ∈ (0, 1/2) and s ∈ {1, 2}, there exists a deterministic data structure for (1 + ε, 1)-NN in ℓ_s^d with O(d) query time and O(n) × O(1/ε)^d space; combined with the dimensionality reduction described above, one obtains a randomized data structure with n^{O(log(1/ε)/ε²)} space and preprocessing time, and query time polynomial in d, log n and 1/ε.

Approximate Voronoi diagram
Section 3.3 and Theorem 2.10 reduce the c-NN problem to performing O(log n) lookups in appropriate grids. It is natural to ask whether one can collapse all these grids together and obtain a natural geometric representation of the input point set. In this section we show that this is indeed the case. Specifically, we show the following theorem (Theorem 3.18): given a set P of n points in ℝ^d and a parameter ε ∈ (0, 1/2), one can build a decomposition of ℝ^d into O(n(C/ε)^{d+1} log² n) simple cells, such that for each cell one can identify a single point of P that is a (1 + ε)-approximate nearest neighbor of every query point in the cell.

Proof: We follow the reduction from the approximate nearest to the near neighbor problem in Section 2, combined with the bucketing approach to the approximate near neighbor problem given in Section 3.3. First, observe that by scaling and a minor readjustment of parameters, we can assume that the side lengths of the grids used in the reduction are powers of 2. Also, we can assume that they are centered at the origin. As a result, the grids are nested: a cell in a finer grid is fully contained in some cell of a coarser grid. Consider the instances Z_1, . . ., Z_k of NearNbr generated by the reduction. Let r_i be the radius of the instance Z_i, and let t_i be the cell side length of the grid corresponding to Z_i. For every Z_i and every point p represented by that data structure, we label every grid cell intersecting the ball B(p, r_i) with p. If a grid cell is marked by several balls, its label is the smallest such ball. At the end of this process we have a total of O(n(C/ε)^{d+1} log² n) grid cells (of different resolutions) that were labeled.
Consider any query point q, and let U be the grid cell of the smallest side length that contains q. By a straightforward but tedious argument following the recursive construction in Theorem 2.10, the label of U is a correct approximate nearest neighbor for q. The decomposition can now be obtained by superimposing all grids onto ℝ^d, where any point takes the label of the smallest grid cell containing it. This concludes the description of the approximate Voronoi decomposition construction. In the remainder of this section, we describe various ways in which the construction can be used for finding an approximate nearest neighbor.
From the perspective of this paper, the simplest approach is to apply the search algorithm from Section 2.3. An alternative (described, e.g., in [HP01, HPM04, Har11]) is to construct a so-called compressed d-dimensional quadtree for the grid cells constructed in Theorem 3.18. The data structure enables locating the smallest grid cell containing a query point in time O(d log n), which matches the time achieved by our search algorithm. However, it is possible to improve it further. In particular, consider data sets with "low" aspect ratio, i.e., a small ratio of their diameter to the closest distance between any two distinct points. For such point sets, one can perform searches in quadtrees even faster.

Since the cost of e_t is at most c·r_i and the cost of e*_t is greater than r_{i−1}, we have that the cost of e_t is at most c(1 + γ) times the cost of e*_t. The lemma follows.

Lemma 2.4. For 1/2 > γ > 0 and c ≥ 1, given a range [a, b] and a point-set P, one can construct a collection of M = O((log(b/a))/γ) NearNbr data structures with approximation parameter c and failure probability f = 1/(3 log M), so that given a query point q, one can decide that either: (i) D_P(q) ≤ a, (ii) D_P(q) ≥ b, or (iii) find a point p' ∈ P such that D(q, p') ≤ c(1 + γ)D_P(q); the answer returned is correct with probability at least 2/3.

Lemma 2.5. The number of points stored in all NearNbr structures in the constructed tree is O(Mn log n), where M = O(log(n)/γ) is the number of times each point is replicated in INearNbr.

Proof: Let B(n) be the maximum number of points stored in the INearNbr structures. We analyze B(n) and multiply the result by O(M).

Corollary 3.10. For any c > 1, there exists an algorithm for (c, r)-NN in the Hamming metric over Σ^d using O(n^{1+1/c}) space (in addition to the space needed to store the input point set). The query and update times are bounded by the time needed to perform O(n^{1/c}) distance computations (each taking at most O(d) time) and hash function evaluations (each taking O(d/r · log n) time).

Figure 1: (A) The point-set. (B) The generated set of NearNbr instances (different colors correspond to different data points). (C) The approximate Voronoi decomposition.

Theorem 4.2. A c(1 + 2γ)-approximate MST of n points in ℓ_1^d can be computed in time O(dn^{1+1/c} log³(n)/γ).

Proof: The algorithm makes O(log n) calls to ApproximateCC, each using O(n) query and delete operations on a data structure provided by Corollary 3.14. The extension to ℓ_s^d norms for s ∈ [1, 2] follows from the embedding of [JS82].
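A rough Python sketch of the Kruskal-style simulation is given below. It is only an illustration of the reduction, not the paper's exact algorithm: it assumes that the candidate radii form a geometric sequence covering the range of inter-point distances, that approximate_cc(points, radius) behaves like Algorithm 2.1, and that points are hashable; the recorded edge costs are charged at the current radius rather than measured exactly.

    def approximate_mst(points, radii, approximate_cc):
        # Process candidate radii in increasing order; at each radius, merge
        # clusters whose representatives fall into the same approximate
        # connected component, as in Kruskal's algorithm.
        parent = {p: p for p in points}

        def find(x):
            while parent[x] != x:
                parent[x] = parent[parent[x]]
                x = parent[x]
            return x

        edges = []
        for r in sorted(radii):
            reps = {find(p) for p in points}
            if len(reps) == 1:
                break
            for component in approximate_cc(list(reps), r):
                first = component[0]
                for other in component[1:]:
                    ra, rb = find(first), find(other)
                    if ra != rb:
                        parent[ra] = rb
                        edges.append((first, other, r))  # cost charged at radius r
        return edges
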
Note that r*_top/r*_bot = Θ(n² log n). Suppose that r*_med is computed correctly, i.e., that r_med ≤ r*_med ≤ (n − 1)·r_med. Then r*_bot·c < r_med, and moreover:
• Each connected component of UB_P(r*_bot·c) is induced by at most n/2 points.
• There exists at least one connected component C ∈ CC_P(r*_med) of size at least n/2 + 1, and thus there are at most n/2 connected components in CC_P(r*_med).
and Σ_{i=1}^{k} n_i = n. We show that this solves to B(n) ≤ Cn log n + 1 for some C > 1.

    repeat
        Compute a partition {P_1, . . ., P_k} of P s.t. CC_P(r*_med) ⊑ {P_1, . . ., P_k} ⊑ CC_P(c·r*_med).
    until |P'| ≤ n/2 and |P_i| ≤ n/2, for all i = 1, . . ., k
    for i = 1, . . ., k do
        Select a representative p_i from P_i.
    Construct INearNbr over P with range [r*_bot/2, r*_top] and parameters γ and c (Lemma 2.4).
    Continue recursively on P_1, . . ., P_k and P' = {p_1, . . ., p_k}.

Algorithm 2.2: Constructing the approximate nearest neighbor data structure.
where k ≥ 2 and n_i ≤ n/2. The resulting bound is Cn log n − Cn + 2n, which is at most Cn log n if n ≥ 3 and C is large enough.
To see this, observe that |B̄_i| is bounded by the volume of a d-dimensional ball of radius R = (2/ε)·√d, measuring lengths in units of the grid side length ε/√d. We use the following fact [Pis89, page 11]: the volume of the d-dimensional Euclidean ball of radius R is 2^{O(d)} R^d/d^{d/2}. Thus, |B̄_i| = 2^{O(d)} R^d/d^{d/2} ≤ (C/ε)^d. Hence, the total space required is O(n) × O(1/ε)^d. The query time is the time to compute the hash function. For general ℓ_s norms, we set the grid side length to ε/d^{1/s}; the bound on |B̄_i| applies unchanged.