Abstract
The increasing variety of Linked Data on the Web makes it challenging to determine the quality of this data and, subsequently, to make this information explicit to data consumers. Although a number of tools and frameworks exist for assessing Linked Data quality, their output is not suitable for machine consumption, so consumers can hardly compare datasets and rank them in order of fitness for use. This article describes a conceptual methodology for assessing Linked Datasets, and Luzzu, a framework for Linked Data Quality Assessment. Luzzu is based on four major components: (1) an extensible interface for defining new quality metrics; (2) an interoperable, ontology-driven back-end for representing quality metadata and quality problems that can be re-used within different semantic frameworks; (3) scalable dataset processors for data dumps, SPARQL endpoints, and big data infrastructures; and (4) a customisable ranking algorithm that takes user-defined weights into account. We show that Luzzu scales linearly with the number of triples in a dataset. We also demonstrate the applicability of the Luzzu framework by evaluating and analysing a number of statistical datasets against a variety of metrics. This article contributes towards the definition of a holistic data quality lifecycle, in terms of the co-evolution of linked datasets, with the final aim of improving their quality.
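To make the idea of ranking with user-defined weights concrete, the following is a minimal sketch and not Luzzu's actual algorithm. It assumes that each dataset has per-metric quality scores in [0, 1] and that the ranking score is a normalised weighted sum; the function name `rank_datasets`, the metric names, and the dataset names are all hypothetical.

```python
from typing import Dict, List, Tuple


def rank_datasets(
    scores: Dict[str, Dict[str, float]],
    weights: Dict[str, float],
) -> List[Tuple[str, float]]:
    """Rank datasets by a normalised weighted sum of metric scores.

    Assumption: this weighted-sum formula is illustrative only; the
    abstract does not specify how Luzzu combines weights and scores.
    """
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("at least one weight must be positive")

    ranked = []
    for dataset, metrics in scores.items():
        # A metric missing for a dataset contributes a score of 0.0.
        weighted = sum(w * metrics.get(m, 0.0) for m, w in weights.items())
        ranked.append((dataset, weighted / total_weight))

    # Highest aggregated score first.
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked


# Hypothetical metric values for two datasets.
example_scores = {
    "dataset_a": {"dereferenceability": 1.0, "licensing": 0.0},
    "dataset_b": {"dereferenceability": 0.5, "licensing": 1.0},
}
# A consumer who cares three times more about licensing.
example_weights = {"dereferenceability": 1.0, "licensing": 3.0}
ranking = rank_datasets(example_scores, example_weights)
```

With these weights the consumer's preference for licensing information places `dataset_b` ahead of `dataset_a`; changing the weights changes the order without re-assessing the datasets.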
Index Terms
- Luzzu—A Methodology and Framework for Linked Data Quality Assessment