Toward Veracity Assessment in RDF Knowledge Bases: An Exploratory Analysis

Authors:
Diego Esteves, University of Bonn, Bonn, Germany (ORCID: 0000-0002-1676-545X)
Anisa Rula, University of Milano-Bicocca and University of Bonn, Bonn, Germany
Aniketh Janardhan Reddy, Birla Institute of Technology and Science, India
Jens Lehmann, University of Bonn and Fraunhofer IAIS, Bonn, Germany

Journal of Data and Information Quality, Volume 9, Issue 3, Article No. 16, pp. 1–26. https://doi.org/10.1145/3177873

Published: 20 February 2018

Abstract

Among the many characteristics of knowledge bases, data quality is one of the most relevant for maximizing the benefits of the information they provide. Knowledge base quality assessment poses a number of big data challenges, such as high volume, variety, velocity, and veracity. In this article, we focus on answering questions related to the assessment of the veracity of facts through Deep Fact Validation (DeFacto), a triple validation framework designed to assess facts in RDF knowledge bases. Despite current developments in the research area, the underlying framework faces many challenges. This article pinpoints and discusses these issues and conducts a thorough analysis of the framework's pipeline, aiming to reduce error propagation through its components. Furthermore, we discuss recent developments related to fact validation and describe the advantages and drawbacks of state-of-the-art models. As a result of this exploratory analysis, we give insights and directions toward a better architecture to tackle the complex task of fact-checking in knowledge bases.

Index Terms

1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
  2. Machine learning
    1. Learning paradigms
      1. Supervised learning
2. Information systems
  1. Information retrieval
    1. Retrieval tasks and goals
  2. World Wide Web
    1. Web searching and information discovery

Published in
Journal of Data and Information Quality Volume 9, Issue 3
Special Issue on Improving the Veracity and Value of Big Data
September 2017
140 pages
ISSN:1936-1955
EISSN:1936-1963
DOI:10.1145/3183573
Editor:
Tiziana Catarci
Sapienza University of Rome, Rome, Italy
Copyright © 2018 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 20 February 2018
- Accepted: 1 January 2018
- Revised: 1 December 2017
- Received: 1 April 2017
Author Tags
DeFacto
benchmark
data quality
exploratory data analysis
fact checking
linked data
trustworthiness
Qualifiers
- research-article
- Research
- Refereed
