Abstract
Among the different characteristics of knowledge bases, data quality is one of the most relevant to maximizing the benefits of the provided information. Knowledge base quality assessment poses a number of big data challenges such as high volume, variety, velocity, and veracity. In this article, we focus on answering questions related to the assessment of the veracity of facts through Deep Fact Validation (DeFacto), a triple validation framework designed to assess facts in RDF knowledge bases. Despite current developments in the research area, the underlying framework still faces many challenges. This article pinpoints and discusses these issues and conducts a thorough analysis of the framework's pipeline, aiming to reduce error propagation across its components. Furthermore, we discuss recent developments related to fact validation and describe the advantages and drawbacks of state-of-the-art models. As a result of this exploratory analysis, we give insights and directions toward a better architecture to tackle the complex task of fact-checking in knowledge bases.
Index Terms
- Toward Veracity Assessment in RDF Knowledge Bases: An Exploratory Analysis