
Metric space similarity joins

Published: 24 June 2008

Abstract

Similarity join algorithms find pairs of objects that lie within a certain distance ϵ of each other. Algorithms adapted from spatial join techniques are designed primarily for data in a vector space and often employ some form of multidimensional index. When the data lies in a metric space, the usual solution for these algorithms is to embed the data in a vector space and then use a multidimensional index. This approach has a number of drawbacks when the data is high-dimensional, as we must eventually find the most discriminating dimensions, which is not trivial. In addition, although the maximum distance between objects increases with dimension, the ability to discriminate between objects in each dimension does not. These drawbacks are overcome by a new method called Quickjoin, which does not require a multidimensional index and instead adapts techniques from distance-based indexing in a method that is conceptually similar to the Quicksort algorithm. A formal analysis of the Quickjoin method is provided. Experiments show that the Quickjoin method significantly outperforms two existing techniques.
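The abstract gives no algorithmic detail, so the following is only a rough sketch of the general idea it describes: a similarity join that partitions objects around a pivot, Quicksort-style, using nothing but a distance function. It is not the published Quickjoin algorithm. The names `partition_join`, `brute_force`, `cross_join`, and the `small` cutoff are hypothetical, the choice of partition radius (the distance between two random objects) is an assumption, and pairs that straddle the partition boundary are joined here by a plain nested loop over the two boundary "windows" rather than by any more refined scheme.

```python
import random
from itertools import combinations, product

def brute_force(objs, idx, eps, dist):
    """All index pairs (i, j), i < j, drawn from idx with dist <= eps."""
    return {(min(i, j), max(i, j)) for i, j in combinations(idx, 2)
            if dist(objs[i], objs[j]) <= eps}

def cross_join(objs, left, right, eps, dist):
    """All index pairs with one object from each side and dist <= eps."""
    return {(min(i, j), max(i, j)) for i, j in product(left, right)
            if dist(objs[i], objs[j]) <= eps}

def partition_join(objs, eps, dist, idx=None, small=8, rng=None):
    """Illustrative pivot-partitioning similarity join (hypothetical sketch)."""
    if rng is None:
        rng = random.Random(0)
    if idx is None:
        idx = list(range(len(objs)))
    if len(idx) <= max(small, 2):
        return brute_force(objs, idx, eps, dist)

    # Pick a pivot p and a radius r (assumption: distance to a second object).
    p, q = rng.sample(idx, 2)
    r = dist(objs[p], objs[q])
    d = {i: dist(objs[i], objs[p]) for i in idx}
    inner = [i for i in idx if d[i] < r]
    outer = [i for i in idx if d[i] >= r]
    if not inner or not outer:
        # Degenerate split (e.g. duplicate objects): avoid endless recursion.
        return brute_force(objs, idx, eps, dist)

    # By the triangle inequality, a pair within eps that straddles the
    # boundary must have both objects within eps of the radius r.
    inner_win = [i for i in inner if d[i] >= r - eps]
    outer_win = [i for i in outer if d[i] <= r + eps]

    result = partition_join(objs, eps, dist, inner, small, rng)
    result |= partition_join(objs, eps, dist, outer, small, rng)
    result |= cross_join(objs, inner_win, outer_win, eps, dist)
    return result

# Usage: integers under absolute difference, which is a metric.
values = [1, 2, 10, 11, 30]
pairs = partition_join(values, 1, lambda a, b: abs(a - b), small=2)
print(sorted(pairs))  # [(0, 1), (2, 3)]
```

The sketch only ever calls `dist`, so it needs no coordinates and no multidimensional index; any function satisfying the triangle inequality will do. With floating-point distances, rounding can slightly violate the triangle inequality near the window boundaries, so exact arithmetic or a small tolerance may be needed in practice.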



        • Published in

          ACM Transactions on Database Systems, Volume 33, Issue 2
          June 2008
          309 pages
          ISSN: 0362-5915
          EISSN: 1557-4644
          DOI: 10.1145/1366102

          Copyright © 2008 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Published: 24 June 2008
          • Accepted: 1 December 2007
          • Revised: 1 April 2007
          • Received: 1 August 2006


          Qualifiers

          • research-article
          • Research
          • Refereed
