Abstract
The Resource Description Framework (RDF) is a W3C standard for representing graph-structured data, and SPARQL is the standard query language for RDF. Recent advances in information extraction, linked data management and the Semantic Web have led to a rapid increase in both the volume and the variety of publicly available RDF data. As businesses start to capitalize on RDF data, RDF data management systems are being exposed to workloads that are far more diverse and dynamic than those they were designed to handle. Consequently, there is a growing need for workload-adaptive and self-tuning RDF data management systems. To realize this objective, we introduce a fast and efficient method for dynamically clustering records in an RDF data management system. Specifically, we assume nothing about the workload upfront, but as SPARQL queries are executed, we keep track of records that are co-accessed by the queries in the workload and physically cluster them. To decide dynamically and in constant time where a record needs to be placed in the storage system, we develop a new locality-sensitive hashing (LSH) scheme, Tunable-LSH. Using Tunable-LSH, records that are co-accessed across similar sets of queries can be hashed to the same or nearby physical pages in the storage system. What sets Tunable-LSH apart from existing LSH schemes is that it can auto-tune to achieve this clustering objective with high accuracy even when the workload changes. Experimental evaluation of Tunable-LSH in an RDF data management system as well as in a standalone hashtable shows end-to-end performance gains over existing solutions.
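To make the clustering idea concrete, the following is a minimal sketch of classic bit-sampling LSH for Hamming distance, applied to per-record query-access bit vectors. It is not Tunable-LSH itself, whose construction and auto-tuning are not described in the abstract. The function name `make_bit_sampling_hash`, the example vectors, and the interpretation of the hash value as a page identifier are assumptions made purely for illustration.

```python
import random


def make_bit_sampling_hash(num_queries, num_samples, seed=0):
    """Build a bit-sampling LSH function over access vectors of length num_queries.

    Illustrative only: a standard LSH family for Hamming distance,
    not the Tunable-LSH scheme from the paper.
    """
    rng = random.Random(seed)
    # Fix a random subset of query positions once; every record uses the same subset.
    positions = sorted(rng.sample(range(num_queries), num_samples))

    def lsh(access_vector):
        # Pack the sampled bits into one integer, used here as a hypothetical page id.
        bucket = 0
        for p in positions:
            bucket = (bucket << 1) | access_vector[p]
        return bucket

    return lsh


# Hypothetical access vectors: bit i is 1 if query i touched the record.
record_a = [1, 1, 0, 0, 1, 0, 1, 0]
record_b = [1, 1, 0, 0, 1, 0, 1, 0]  # accessed by exactly the same queries as record_a
record_c = [0, 0, 1, 1, 0, 1, 0, 1]  # accessed by the complementary set of queries

lsh = make_bit_sampling_hash(num_queries=8, num_samples=3, seed=42)
page_a, page_b, page_c = lsh(record_a), lsh(record_b), lsh(record_c)
```

Records with identical access vectors always land in the same bucket, and the probability that two records collide falls as the Hamming distance between their vectors grows. Computing the hash costs time proportional to the number of sampled positions, independent of the number of records, which is the property that makes an LSH-style placement decision attractive here.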
Notes
This uniformity condition simplifies the sensitivity analysis of Tunable-LSH, but it is not a requirement from an algorithmic point of view. Relaxing this condition is left as future work.
Groups are separated by vertical dashed lines.
In practice, this translation is not required because the system maintains positional vectors instead.
Cite this article
Aluç, G., Özsu, M.T. & Daudjee, K. Building self-clustering RDF databases using Tunable-LSH. The VLDB Journal 28, 173–195 (2019). https://doi.org/10.1007/s00778-018-0530-9
DOI: https://doi.org/10.1007/s00778-018-0530-9