ABSTRACT
Growing data volumes have greatly expanded the range of possible applications. However, ad hoc access to this data remains an unsolved problem because of the diversity of data management approaches, formats, and storage frameworks; distributed, heterogeneous data must therefore be accessed and processed effectively at scale. For years, Semantic Web techniques have addressed data integration challenges with practical knowledge representation models and ontology-based mappings. Leveraging these techniques, we provide a solution that enables uniform access to large, heterogeneous data sources without enforcing centralization, thus realizing the vision of a Semantic Data Lake. In this paper, we define the core concepts underlying this vision and the architectural requirements that systems implementing it need to fulfill. Squerall, an example of such a system, is an extensible framework built on top of state-of-the-art Big Data technologies. We focus on Squerall's distributed query execution techniques and strategies, and we empirically evaluate its performance across the sub-phases of query execution.
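The idea of uniform access through ontology-based mappings can be illustrated with a minimal sketch. It is not Squerall's API or implementation: the two in-memory sources, the `MAPPINGS` table, the property names (`ex:id`, `ex:producer`, and so on), and every function name are hypothetical, chosen only to show how attributes from different formats can be renamed to shared ontology terms and then joined without copying the data into one central store.

```python
# Hypothetical illustration only: not Squerall's actual API or internals.
import csv
import io

# Two heterogeneous "sources": CSV text and a list of JSON-like documents.
CSV_PRODUCTS = "pid,label,maker\n1,Laptop,10\n2,Phone,20\n"
DOC_PRODUCERS = [{"_id": 10, "name": "Acme"}, {"_id": 20, "name": "Globex"}]

# Mappings: source attribute -> ontology property (all names are made up).
MAPPINGS = {
    "products": {"pid": "ex:id", "label": "rdfs:label", "maker": "ex:producer"},
    "producers": {"_id": "ex:id", "name": "foaf:name"},
}


def load_products():
    """Read the CSV source and rename its attributes to ontology properties."""
    rows = csv.DictReader(io.StringIO(CSV_PRODUCTS))
    return [{MAPPINGS["products"][k]: v for k, v in row.items()} for row in rows]


def load_producers():
    """Read the document source, rename attributes, and normalize values to str."""
    return [
        {MAPPINGS["producers"][k]: str(v) for k, v in doc.items()}
        for doc in DOC_PRODUCERS
    ]


def join(left, right, left_key, right_key):
    """Hash join of two record lists on the given ontology properties."""
    index = {}
    for record in right:
        index.setdefault(record[right_key], []).append(record)
    return [(l, r) for l in left for r in index.get(l[left_key], [])]


def product_labels_with_producer_names():
    """Answer 'which producer makes each product?' over both sources."""
    pairs = join(load_products(), load_producers(), "ex:producer", "ex:id")
    return [(p["rdfs:label"], m["foaf:name"]) for p, m in pairs]
```

Calling `product_labels_with_producer_names()` joins the CSV rows with the documents on the shared property values. A real system would run such joins on a distributed engine rather than in local Python lists.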
Index Terms
- Uniform Access to Multiform Data Lakes using Semantic Technologies