research-article

Lightweight Lexical and Semantic Evidence for Detecting Classes Among Wikipedia Articles

Authors:
Marius Pasca, Google, Mountain View, CA, USA
Travis Wolfe, Google, Mountain View, CA, USA

WSDM '19: Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, January 2019, Pages 78–86. https://doi.org/10.1145/3289600.3291020

Published: 30 January 2019

ABSTRACT

A supervised method relies on simple, lightweight features in order to distinguish Wikipedia articles that are classes (Shield volcano) from other articles (Kilauea). The features are lexical or semantic in nature. Experimental results in multiple languages over multiple evaluation sets demonstrate the superiority of the proposed method over previous work.
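
A minimal sketch of the kind of supervised setup the abstract describes, assuming a logistic-regression classifier trained by stochastic gradient descent. The feature names, the toy first sentences, and the helper functions are hypothetical and chosen only for illustration; they are not the paper's feature set, data, or learner.

```python
import math


def extract_features(title, first_sentence):
    """Map an article to a few hypothetical lexical features (illustrative only)."""
    words = title.split()
    return {
        "bias": 1.0,
        # Assumed cue: class articles often open with an indefinite article.
        "starts_with_article": float(first_sentence.lower().startswith(("a ", "an "))),
        # Assumed cue: multi-word class titles are lowercase after the first word.
        "lowercase_tail": float(len(words) > 1 and all(w.islower() for w in words[1:])),
    }


def predict_proba(weights, feats):
    """Logistic model: probability that the article is a class."""
    z = sum(weights.get(name, 0.0) * value for name, value in feats.items())
    return 1.0 / (1.0 + math.exp(-z))


def train(examples, epochs=200, learning_rate=0.5):
    """Fit the weights with plain stochastic gradient descent on log loss."""
    weights = {}
    for _ in range(epochs):
        for feats, label in examples:
            error = label - predict_proba(weights, feats)
            for name, value in feats.items():
                weights[name] = weights.get(name, 0.0) + learning_rate * error * value
    return weights


# Invented toy training data: (title, first sentence, 1 if class else 0).
TOY_ARTICLES = [
    ("Shield volcano", "A shield volcano is a type of volcano.", 1),
    ("Kilauea", "Kilauea is an active volcano in Hawaii.", 0),
    ("Novel", "A novel is a long work of narrative fiction.", 1),
    ("Three Men in a Boat", "Three Men in a Boat is a humorous book.", 0),
]

weights = train([(extract_features(t, s), y) for t, s, y in TOY_ARTICLES])


def is_class(title, first_sentence):
    """Return True when the toy model scores the article as a class."""
    return predict_proba(weights, extract_features(title, first_sentence)) > 0.5
```

With these toy features, a call such as `is_class("Walled garden", "A walled garden is a garden enclosed by high walls.")` scores the article as a class, while an article about a specific park does not. A real system would need far richer evidence than these two cues.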

Index Terms

1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
      1. Information extraction
      2. Lexical semantics
2. Information systems
  1. Information retrieval
    1. Document representation
      1. Content analysis and feature selection

Recommendations

Finding Needles in an Encyclopedic Haystack: Detecting Classes Among Wikipedia Articles
WWW '18: Proceedings of the 2018 World Wide Web Conference

A lightweight method distinguishes articles within Wikipedia that are classes (Novel, Book) from other articles (Three Men in a Boat, Diary of a Pilgrimage). It exploits clues available within the article text and within categories associated with ...
Approximate Definitional Constructs as Lightweight Evidence for Detecting Classes Among Wikipedia Articles
CIKM '19: Proceedings of the 28th ACM International Conference on Information and Knowledge Management

A lightweight method applies a few extraction patterns to the task of distinguishing Wikipedia articles that are classes ("Walled garden", "Garden") from other articles ("High Hazels Park"). The method acquires a set of classes, based on patterns ...
The Role of Query Sessions in Interpreting Compound Noun Phrases
CIKM '15: Proceedings of the 24th ACM International on Conference on Information and Knowledge Management

The meaning of compound noun phrases can be approximated in the form of lexical interpretations extracted from text. The interpretations hint at the role that modifiers play relative to heads within the noun phrases. In a study examining the role of ...

Published in
WSDM '19: Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining
January 2019
874 pages
ISBN: 9781450359405
DOI: 10.1145/3289600
General Chairs: J. Shane Culpepper (RMIT University), Alistair Moffat (The University of Melbourne)
Program Chairs: Paul N. Bennett (Microsoft), Kristina Lerman (University of Southern California)
Copyright © 2019 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
classes
knowledge acquisition
open-domain information extraction
semantics
topic classification
Acceptance Rates
WSDM '19 Paper Acceptance Rate: 84 of 511 submissions, 16%
Overall Acceptance Rate: 498 of 2,863 submissions, 17%