research-article

Luzzu—A Methodology and Framework for Linked Data Quality Assessment

Published: 25 October 2016

Abstract

The increasing variety of Linked Data on the Web makes it challenging to determine the quality of this data and, subsequently, to make this information explicit to data consumers. Despite the availability of a number of tools and frameworks to assess Linked Data Quality, the output of such tools is not suitable for machine consumption, and thus consumers can hardly compare and rank datasets in order of fitness for use. This article describes a conceptual methodology for assessing Linked Datasets, and Luzzu, a framework for Linked Data Quality Assessment. Luzzu is based on four major components: (1) an extensible interface for defining new quality metrics; (2) an interoperable, ontology-driven back-end for representing quality metadata and quality problems that can be re-used within different semantic frameworks; (3) scalable dataset processors for data dumps, SPARQL endpoints, and big data infrastructures; and (4) a customisable ranking algorithm taking into account user-defined weights. We show that Luzzu scales linearly with the number of triples in a dataset. We also demonstrate the applicability of the Luzzu framework by evaluating and analysing a number of statistical datasets against a variety of metrics. This article contributes towards the definition of a holistic data quality lifecycle, in terms of the co-evolution of linked datasets, with the final aim of improving their quality.
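
To make component (4) more concrete, the following is a minimal sketch of how a ranking with user-defined weights could work. The abstract does not specify the scoring formula, so the weighted average used here is an assumption, not Luzzu's actual algorithm. The function name `rank_datasets`, the metric names, and the treatment of missing metrics as 0.0 are all hypothetical.

```python
from typing import Dict, List, Tuple


def rank_datasets(
    metric_values: Dict[str, Dict[str, float]],
    weights: Dict[str, float],
) -> List[Tuple[str, float]]:
    """Rank datasets by a weighted average of their quality metric values.

    Assumption: each metric value lies in [0, 1] and higher is better.
    This is an illustrative scheme, not the formula used by Luzzu.
    """
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("at least one weight must be positive")

    scores = []
    for dataset, values in metric_values.items():
        # A metric that was not assessed for a dataset contributes 0.0 (assumption).
        weighted_sum = sum(w * values.get(metric, 0.0) for metric, w in weights.items())
        scores.append((dataset, weighted_sum / total_weight))

    # Highest score first; sorted() is stable, so ties keep input order.
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical metric values for two datasets.
assessed = {
    "dataset-a": {"completeness": 1.0, "licensing": 0.0},
    "dataset-b": {"completeness": 0.5, "licensing": 1.0},
}
# A consumer who cares three times more about licensing than completeness.
ranking = rank_datasets(assessed, {"completeness": 1.0, "licensing": 3.0})
```

With these weights, `dataset-b` ranks first. Swapping the two weights reverses the order, which is the kind of consumer-specific "fitness for use" comparison the abstract describes.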


        • Published in

          Journal of Data and Information Quality, Volume 8, Issue 1
          Special Issue on Web Data Quality
          November 2016
          125 pages
          ISSN:1936-1955
          EISSN:1936-1963
          DOI:10.1145/3012403
          Copyright © 2016 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Published: 25 October 2016
          • Accepted: 1 August 2016
          • Revised: 1 July 2016
          • Received: 1 November 2015
          Qualifiers

          • research-article
          • Research
          • Refereed
