Abstract
Similarity join algorithms find pairs of objects that lie within a certain distance ϵ of each other. Algorithms adapted from spatial join techniques are designed primarily for data in a vector space and often employ some form of multidimensional index. When the data instead lies in a metric space, the usual solution for these algorithms is to embed the data in a vector space and then make use of a multidimensional index. This approach has a number of drawbacks when the data is high dimensional, because we must eventually find the most discriminating dimensions, which is not trivial. In addition, although the maximum distance between objects increases with dimension, the ability to discriminate between objects in each dimension does not. These drawbacks are overcome by a new method called Quickjoin, which does not require a multidimensional index and instead adapts techniques from distance-based indexing for use in a method that is conceptually similar to the Quicksort algorithm. A formal analysis of the Quickjoin method is provided. Experiments show that Quickjoin significantly outperforms two existing techniques.
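To make the Quicksort analogy concrete, the following Python sketch shows one way a similarity join can be computed by recursive partitioning around a pivot, using only a distance function and no multidimensional index. It is an illustration written for this summary and should not be read as the authors' Quickjoin implementation. The names `brute_force_join`, `partition_join`, and the parameter `small` are hypothetical, and the sketch assumes `dist` is a true metric (in particular, that it obeys the triangle inequality).

```python
import random
from itertools import combinations


def brute_force_join(items, eps, dist):
    """Reference definition: all pairs of (id, object) items within eps."""
    return {
        (min(i, j), max(i, j))
        for (i, a), (j, b) in combinations(items, 2)
        if dist(a, b) <= eps
    }


def partition_join(items, eps, dist, rng, small=8):
    """Illustrative Quicksort-like join (hypothetical, not the paper's code).

    items is a list of (id, object) pairs; dist is assumed to be a metric.
    """
    if len(items) <= max(small, 2):
        return brute_force_join(items, eps, dist)

    # Choose a pivot and a radius from two distinct sampled items.
    (_, pivot), (_, other) = rng.sample(items, 2)
    r = dist(pivot, other)
    if r == 0:
        # The two samples coincide, so the split would be degenerate.
        return brute_force_join(items, eps, dist)

    inner, outer = [], []          # the two partitions around the pivot
    win_inner, win_outer = [], []  # items within eps of the boundary
    for item in items:
        d = dist(item[1], pivot)
        if d < r:
            inner.append(item)
            if d >= r - eps:
                win_inner.append(item)
        else:
            outer.append(item)
            if d <= r + eps:
                win_outer.append(item)

    # Pairs inside each partition are found recursively.
    result = partition_join(inner, eps, dist, rng, small)
    result |= partition_join(outer, eps, dist, rng, small)

    # By the triangle inequality, a pair that straddles the boundary
    # can only involve items from the two windows.
    result |= {
        (min(i, j), max(i, j))
        for i, a in win_inner
        for j, b in win_outer
        if dist(a, b) <= eps
    }
    return result
```

For example, `partition_join(list(enumerate(values)), eps, lambda a, b: abs(a - b), random.Random(0))` returns the set of index pairs whose values differ by at most `eps`. For simplicity this sketch joins the two boundary windows with a nested loop; that step dominates the cost when many objects lie near the partition boundary.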
Index Terms
- Metric space similarity joins