Abstract
We consider the problem of similarity search in databases with costly metric distance measures. Given limited main memory, our goal is to develop a reference-based index that reduces the number of distance computations needed to answer a query. The idea in reference-based indexing is to select a small set of reference objects that serve as surrogates for the other objects in the database. We consider novel strategies for selecting references and assigning them to database objects. For dynamic databases with frequent updates, we propose two incremental versions of the selection algorithm. Our experimental results show that our selection and assignment methods far outperform competing methods.
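The general idea of reference-based pruning can be illustrated with a minimal sketch. It is not the selection or assignment strategy proposed in the article: it assumes every database object stores its distance to every reference, uses edit distance as a stand-in for a costly metric, and takes the references as given. The names `build_index` and `range_query` are hypothetical. The sketch relies only on the triangle inequality, which gives the bounds |d(q, r) - d(o, r)| <= d(q, o) <= d(q, r) + d(o, r) for any reference r.

```python
from typing import Callable, List, Sequence, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, used here as an example of a costly metric."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution or match
        prev = curr
    return prev[-1]


def build_index(db: Sequence[str], refs: Sequence[str],
                dist: Callable[[str, str], int]) -> List[List[int]]:
    """Precompute the distance from each object to each reference (assumed layout)."""
    return [[dist(obj, r) for r in refs] for obj in db]


def range_query(query: str, radius: int, db: Sequence[str], refs: Sequence[str],
                table: List[List[int]],
                dist: Callable[[str, str], int]) -> Tuple[List[str], int]:
    """Return objects within `radius` of `query` and the number of distance calls.

    Assumes `refs` is non-empty and `dist` satisfies the triangle inequality.
    """
    q_to_ref = [dist(query, r) for r in refs]
    computed = len(refs)
    hits = []
    for obj, row in zip(db, table):
        lower = max(abs(qr - o_r) for qr, o_r in zip(q_to_ref, row))
        upper = min(qr + o_r for qr, o_r in zip(q_to_ref, row))
        if lower > radius:
            continue              # pruned without computing d(query, obj)
        if upper <= radius:
            hits.append(obj)      # accepted without computing d(query, obj)
            continue
        computed += 1             # bounds are inconclusive: compute the real distance
        if dist(query, obj) <= radius:
            hits.append(obj)
    return hits, computed
```

In this sketch the memory cost grows with the number of objects times the number of references, which is why, under limited main memory, the choice of references and of which references each object keeps matters.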
Additional information
This work is partially supported by the National Science Foundation under Grant No. 0347408.
About this article
Cite this article
Venkateswaran, J., Kahveci, T., Jermaine, C. et al. Reference-based indexing for metric spaces with costly distance measures. The VLDB Journal 17, 1231–1251 (2008). https://doi.org/10.1007/s00778-007-0062-1
DOI: https://doi.org/10.1007/s00778-007-0062-1