Autonomously semantifying Wikipedia

Authors:
Fei Wu, University of Washington, Seattle, WA
Daniel S. Weld, University of Washington, Seattle, WA
CIKM '07: Proceedings of the sixteenth ACM Conference on Information and Knowledge Management, November 2007, Pages 41–50. https://doi.org/10.1145/1321440.1321449

Published: 06 November 2007

ABSTRACT

Berners-Lee's compelling vision of a Semantic Web is hindered by a chicken-and-egg problem, which can best be solved by a bootstrapping method: creating enough structured data to motivate the development of applications. This paper argues that autonomously "Semantifying Wikipedia" is the best way to solve the problem. We choose Wikipedia as an initial data source because it is comprehensive, not too large, high-quality, and contains enough manually-derived structure to bootstrap an autonomous, self-supervised process. We identify several types of structures which can be automatically enhanced in Wikipedia (e.g., link structure, taxonomic data, and infoboxes), and we describe a prototype implementation of a self-supervised machine learning system which realizes our vision. Preliminary experiments demonstrate the high precision of our system's extracted data, in one case equaling that of humans.

Index Terms

1. Information systems
  1. Information systems applications

Published in
CIKM '07: Proceedings of the sixteenth ACM Conference on Information and Knowledge Management
November 2007
1048 pages
ISBN: 9781595938039
DOI: 10.1145/1321440
Co-chair: Alberto H. F. Laender
Conference Chairs: André O. Falcão (Universidade de Lisboa, Portugal), Øystein Haug Olsen
General Chair: Mário J. Silva (Universidade de Lisboa, Portugal)
Program Chairs: Ricardo Baeza-Yates, Deborah L. McGuinness, Bjorn Olstad
Copyright © 2007 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
information extraction
semantic web
wikipedia
Qualifiers
- research-article
Acceptance Rates
Overall Acceptance Rate: 1,861 of 8,427 submissions, 22%