Metrized Small World Approach for Nearest Neighbor Search

In different areas attempts are made to organize data into multi-linked structures that are well suited to information search, in particular nearest neighbor search, where the result data items are metrically close to a given data item. These structures often take the form of trees (M-Tree, cover tree, KD-tree, GNAT) or networks (M-Chord, VoroNet, RayNet) built over a set of data items. In this paper we present a regular approach to the construction of links between data items which provides logarithmic time complexity of nearest neighbor search in the structure. According to this approach, data items are organized into an undirected graph with Small World properties, which ensure the existence of a short path between any two data items regardless of the graph size. We propose different construction and search algorithms depending on the properties of the metric which determines the proximity of data items. The types of metric we consider are abstract metrics and ordered metrics. Further, we extend the ordered metric approach to compound data items in the form of attribute-value pair sets to enable inclusion search by an arbitrary subset of attribute-value pairs. Finally, we provide simulation results for the structure with compound data items.


Introduction
The nearest neighbor search problem is defined as follows: given a set S of n points in some metric space, build a data structure on S so that for a given query point q one can efficiently find a point of S which minimizes the distance to q. Different approaches exist for building such a structure. The works [4,5,11] suggest hierarchical tree structures constructed using information about the metric proximity of the elements. One notable shortcoming of this approach is the mandatory root node in tree-like structures, which makes building totally distributed implementations problematic.
There are also ways to build a distributed structure over the set S. The work [12] suggests a distributed hash table as the data structure, using the pivot-based metric space indexing approach.
The work [6] discusses the VoroNet distributed data structure. The elements of S are points of two-dimensional Euclidean space. Each point from S is linked to all of its neighbor points on the Voronoi diagram (Delaunay graph) plus additional distant points to give the structure Small World properties. A greedy search algorithm is used.
The subsequent work [7] by the same authors considers a structure where the elements are points in an n-dimensional Euclidean space. The main difference from the previous work is that every point is connected with only a subset of its Voronoi neighbors to avoid exponential dependence of complexity on the number of dimensions. But this reduction of the link set leads to inexact search results, i.e. the result point is not always the nearest neighbor of the query point, although the number of such results can be made insignificant. Another drawback of this approach is that it can only be applied to points of a Euclidean space with a fixed number of dimensions.
In this paper we propose a regular approach to the construction of links between data elements in the form of an undirected graph with Small World properties [9,10] to provide logarithmic complexity of the nearest neighbor search. We call the resulting structure Metrized Small World [1] (MSW).
We propose different construction and search algorithms depending on the properties of the metric which determines the proximity of data items.
The rest of the paper is structured as follows. Section 2 describes the construction of the MSW structure based on an abstract semi-metric. Section 3 describes MSW structure construction algorithms for ordered metrics. In Section 4 we extend the ordered metric approach to compound data items in the form of attribute-value pair sets to enable inclusion search by an arbitrary subset of attribute-value pairs. Finally, we provide simulation results for the structure with compound data items in Section 5.

Metrized Small World data structure
The Metrized Small World data structure on the set of data items S is expressed by a graph. Each vertex corresponds to a single element of the set S, and each edge is associated with a link between two data items from S. We take the distance from a vertex v to the query to be the distance from s to the query, where s is the data item which corresponds to the vertex v. The search for the nearest neighbor of the query point then comes down to finding the vertex with the minimal distance to the query.
In the work [1] we gave the construction and search algorithms for that structure. In the paper [2] we also suggested a distributed storage architecture based on the proposed structure. Here we restate those algorithms in the notation adopted for this paper.
We provide the algorithm which adds a vertex to the graph built over the set of previously added vertices. The parameters of the algorithm are the set of previously added vertices, the vertex being added, an arbitrarily selected vertex from the set of previously added vertices (the starting point of the search) and two integers m and n.
Algorithm:
1. Arbitrarily select an element from the set of previously added vertices.
2. Let VisitedList be the set of visited elements.
We have shown that the structure constructed using this algorithm provides the necessary condition for the existence of an effective search algorithm, because the Small World properties of the graph ensure the existence of a short path between any two vertices. But this structure requires search algorithms which are more complex than the greedy algorithm, due to the existence of local minima of the metric. An advantage of this approach is that the proximity measure M can be any function which is a general metric or even a semi-metric defined over the set S.
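The vertex addition procedure can be illustrated with a minimal Python sketch. It is not the reference implementation from [1]; it only follows the VisitedList and CandidateList steps given for this algorithm. Several points are our assumptions: CandidateList is seeded with the entry vertex, it is ordered by distance to the vertex being added, and the parameter n bounds the number of exploration steps. The names add_vertex and build are hypothetical.

```python
import math
import random


def add_vertex(graph, new, dist, entry, m, n, rng=random):
    """Add `new` to `graph` (dict: vertex -> set of neighbour vertices).

    dist  -- callable(a, b) -> float, a metric or semi-metric
    entry -- arbitrarily selected existing vertex (start of the search)
    m     -- number of links to establish for the new vertex
    n     -- ASSUMPTION: upper bound on the number of exploration steps
    """
    visited = []            # VisitedList
    candidates = {entry}    # CandidateList (ASSUMPTION: seeded with entry)
    for _ in range(n):
        unvisited = [c for c in candidates if c not in visited]
        if not unvisited:
            break
        # "First element" of the list sorted by distance to the new vertex.
        p = min(unvisited, key=lambda c: dist(new, c))
        visited.append(p)
        candidates.update(graph[p])
    graph[new] = set()
    # Mutually connect the new vertex with up to m visited vertices.
    for v in rng.sample(visited, min(m, len(visited))):
        graph[new].add(v)
        graph[v].add(new)
    return graph


def build(items, dist, m, n, seed=0):
    """Insert the items one by one, each from a random entry vertex."""
    rng = random.Random(seed)
    graph = {}
    for item in items:
        if graph:
            entry = rng.choice(list(graph))
            add_vertex(graph, item, dist, entry, m, n, rng)
        else:
            graph[item] = set()
    return graph


# Usage: 200 distinct points in the plane with the Euclidean metric.
points = [(i, (i * 37) % 101) for i in range(200)]
graph = build(points, math.dist, m=3, n=10)
```

Because the entry vertex is always visited first, every new vertex receives at least one link, so the graph built this way stays connected. Any hashable item type and any distance function can be substituted.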

Single-attribute Distributed Metrized Small World Data Structure
In the paper [3] we gave an algorithm for constructing a similar structure for a narrower class of metrics, i.e. metrics for which an order between data items is defined. If every data item is linked with its direct predecessor and successor with regard to the metric, there are no local minima. The condition that each data item is linked to its direct successor and predecessor ensures the existence of the Delaunay graph, which in turn provides for the correctness of the greedy search algorithm, which attempts to minimize the distance to the query at each step.
Algorithm: The nearest neighbor search is performed by following links from one element to another in the direction of the minimal metric.
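The greedy search over an ordered metric can be sketched as follows. This is an illustration under assumptions, not the construction algorithm of [3]: the data items are numbers, the metric is the absolute difference, and the long-range links are drawn uniformly at random. The names build_ordered and greedy_search are hypothetical. Only the predecessor and successor links are needed for correctness; the extra links merely shorten the path.

```python
import random


def build_ordered(keys, extra_links=2, seed=0):
    """Return dict: key -> set of linked keys.

    Every key is linked to its direct predecessor and successor.
    ASSUMPTION: long-range links are chosen uniformly at random.
    """
    rng = random.Random(seed)
    order = sorted(set(keys))
    links = {k: set() for k in order}
    for a, b in zip(order, order[1:]):
        links[a].add(b)
        links[b].add(a)
    for k in order:
        for other in rng.sample(order, min(extra_links, len(order))):
            if other != k:
                links[k].add(other)
                links[other].add(k)
    return links


def greedy_search(links, entry, query):
    """Follow links while the distance to the query strictly decreases."""
    current = entry
    while True:
        best = min(links[current], key=lambda k: abs(k - query),
                   default=current)
        if abs(best - query) >= abs(current - query):
            return current      # no neighbour is closer: nearest item found
        current = best


# Usage: 300 distinct integer keys (1009 is prime, so no collisions).
keys = [(i * 856) % 1009 for i in range(300)]
links = build_ordered(keys, seed=3)
nearest = greedy_search(links, entry=keys[0], query=500.2)
```

The search stops only when neither the predecessor nor the successor is closer to the query, which on an ordered set means that the current item is a nearest neighbor.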
The Small World properties of the graph ensure logarithmic search complexity for a random data set. The absence of a root element and the construction of the structure at the data item level allow a completely distributed implementation of the structure. As can be seen in Fig. 1 and 2, both the average shortest path length and the maximum vertex degree scale logarithmically with the number of vertices. Therefore the structure is suitable for storing very large amounts of data.
The nearest neighbor search is reduced to finding the minimum of the metric from the query to a data item. If the distance between the query and the found data item is lower than the query radius, then the found data item is the result; otherwise there is no result. If we must find all data items inside the query radius, we perform a sequential search in both directions from the first found data item.
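The radius query described above can be sketched in Python. As a simplifying assumption, a sorted list stands in for the predecessor and successor links, and a binary search stands in for the greedy search that locates the nearest item. The names nearest_index and range_query are hypothetical.

```python
from bisect import bisect_left


def nearest_index(sorted_keys, query):
    """Index of the key nearest to `query` (list must be non-empty)."""
    i = bisect_left(sorted_keys, query)
    candidates = [j for j in (i - 1, i) if 0 <= j < len(sorted_keys)]
    return min(candidates, key=lambda j: abs(sorted_keys[j] - query))


def range_query(sorted_keys, query, radius):
    """All keys whose distance to `query` is lower than `radius`."""
    if not sorted_keys:
        return []
    i = nearest_index(sorted_keys, query)
    if not abs(sorted_keys[i] - query) < radius:
        return []                       # nearest item is outside the radius
    lo = i
    while lo > 0 and abs(sorted_keys[lo - 1] - query) < radius:
        lo -= 1                         # sequential scan over predecessors
    hi = i
    while hi + 1 < len(sorted_keys) and abs(sorted_keys[hi + 1] - query) < radius:
        hi += 1                         # sequential scan over successors
    return sorted_keys[lo:hi + 1]


# Usage
result = range_query([1, 3, 4, 7, 10], query=4.5, radius=2)
```

The scan is correct because, on an ordered set, the items inside the query radius form one contiguous run that contains the nearest item.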
The proposed data addition algorithm is incremental, i.e. the addition of a new data item affects only a small number of existing data items.

Multi-attribute Distributed Metrized Small World Data Structure
In the two previous sections we considered the elements as atomic entities relative to the metric. Now we want to extend our approach to composite data items. We will consider composite objects which are represented by an unordered set of atomic objects, for all of which one common ordered metric is defined.
We then define the search problem as the search for at least one of all the composite objects which include the given set of atomic objects. This data model is often used for describing application domain entities with a set of tags or keywords, e.g. images, hyperlinks, musical tracks, blog posts etc. This model can also represent objects consisting of a non-fixed set of attribute-value pairs.
Therefore, for convenience, we will consider arbitrary strings (or tags) as atomic objects. Hence the composite objects will be represented as unordered sets of tags.
Our main idea is to construct the graph in such a way that the objects sharing any given subset of atomic objects constitute a subgraph (layer) consisting of a single connected component, which in turn forms the MSW structure described in the previous section. The search for an element containing a given set of tags is then performed by first finding an object from the subgraph (layer) consisting of objects containing the tag t1. After that, inside this subgraph-layer, another element is recursively searched for, this time from the subgraph-layer consisting of objects containing both tags t1 and t2. The process continues until an object is found from the subgraph-layer consisting of objects containing all the given tags. For demonstration purposes we provide an example of a network of objects almost all of which contain three tags (Figure 3). Dashed curved lines show the links between objects which contain tags which are neighbors in lexicographical order. Solid straight lines show the links between objects having a common subset of tags.
Below we give a more formal description of the construction and search algorithms for this structure. Consider the set of all possible tags, which are distinct string values.
Each data element has an associated unordered set of tags. Given a query set of tags, we must find the set of resulting data elements whose tag sets include the query set, i.e. all data elements which have all of the tags specified in the query.
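The inclusion search problem itself can be stated as a short Python reference. This is only a specification of the expected result by linear scan, not the MSW traversal. The second function mirrors the idea of nested layers, where each further tag narrows the previous layer. The function names and the sample objects are hypothetical.

```python
def inclusion_search(objects, query):
    """Names of all objects whose tag set contains every query tag.

    objects -- dict: object name -> set of tags
    query   -- iterable of tags
    """
    query = set(query)
    return {name for name, tags in objects.items() if query <= set(tags)}


def layered_search(objects, query):
    """Same result, obtained by narrowing one tag at a time.

    After processing tags t1..tk, `layer` holds exactly the objects that
    contain all of t1..tk, i.e. the members of the corresponding layer.
    """
    layer = set(objects)
    for tag in query:
        layer = {name for name in layer if tag in objects[name]}
        if not layer:
            break
    return layer


# Usage with four hypothetical objects
objects = {
    "a": {"x", "y", "z"},
    "b": {"x", "y"},
    "c": {"y", "z"},
    "d": {"x", "z"},
}
found = inclusion_search(objects, ["x", "y"])
```

Both functions scan all objects and therefore take linear time; they serve only as a baseline against which a layered MSW search can be checked.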
Let the set be the MSW structure built over a set of elements . Every element of is a link between a pair of tags in data elements (it can be the same element). If there is no element corresponding to a pair of tags, there is no link between them. Two identical tags on different items cannot have links simultaneously in one . We consider a tag to be a member of the if .
We can use the algorithm described in Section 3 of this paper to search for a given tag in MSW. Let be the MSW layer built over a set of tags . For every tag that is a member of . Let the be the operation of searching for a single element, a member of , for which . The tag (a member of ) is the entry point of the algorithm described in the second section of this paper.
Let be the operation of addition of the tag of the element to the MSW layer . The tag is used as the entry point. The time of the operation is logarithmic in the number of tags in . We consider an element a member of the MSW layer if it has been partially added to at least once. Let be the operation of complete addition of the element to the MSW layer . The operation is performed using the following algorithm: Algorithm: where is a random tag of .
Constructing links using the above approach is to a certain degree equivalent to indexing by all possible combinations of columns in a relational database. The main advantage of this approach is the possibility of quickly finding an object or a set of objects with any given set of tags, regardless of the number of objects with a certain subset of tags (atomic objects). Below we give the experimental data obtained on a prototype of the structure to confirm the theoretical assumptions regarding the advantages of our approach.

Experimental data
The experiments were set up as follows.
In the first experiment a set of N objects was generated, half of which contained the single common tag "X" and the other half the single common tag "Y", plus a single object with both the "X" and "Y" tags. The objects were added to the structure in random order. We measured the search time for the object containing both the "X" and "Y" tags. The measurement was repeated many times for different values of N, and the set of random objects was regenerated each time. See the left graph.
In the second experiment the test set contained N random objects, with equal numbers of objects containing the two common tags "X", "Y"; "Y", "Z"; and "X", "Z", plus a single object containing all three tags "X", "Y", "Z". See the right graph.
The results are shown in Figure 4. The graphs show that in both cases the object search time depends logarithmically on the number of objects in the structure, which confirms our theoretical assumptions.

Conclusion and future work
We believe that the key to building search-oriented distributed systems is the construction of multi-linked structures similar to social networks. But the metric distance between data items must be correlated with the number of links which separate them. In this paper we described methods for constructing such structures for certain data types. The necessary and sufficient condition for the correctness of the greedy search algorithm is the inclusion of the Delaunay graph in the structure graph. Failure to satisfy this condition was the obstacle to using the greedy search algorithm with the structure described in Section 2. The condition of existence of a Delaunay subgraph is satisfied in the structures described in Sections 3 and 4. But maintaining a correct Voronoi tessellation, as in [6] or in Section 4, requires a large overhead when the number of dimensions is greater than two. For this reason we intend to focus our further research on finding a compromise between search accuracy and computational overhead.

Figure 1. Average shortest path length between two vertices.

Figure 3. Example Multi-attribute Distributed Metrized Small World Data Structure. The dashed lines represent the edges in the layer. Solid straight lines show the links between objects having a common subset of tags.

Figure 4. Experimental results. Left: two common tags. Right: three common tags.
3. Let CandidateList be the set of candidate elements for link establishment sorted by value of
Select the first element p from CandidateList not contained in VisitedList. If no such element exists then break.
5.3. Add p to VisitedList.
5.4. Add the set of p neighbor elements to CandidateList.
6. Mutually connect the element with m arbitrary elements from VisitedList.