ABSTRACT
In this paper we investigate the task of Entity Ranking on the Web. Searchers looking for entities are arguably better served by a ranked list of entities than by a list of web pages with relevant but potentially redundant information about these entities. Since entities are represented by their web homepages, a naive approach to entity ranking is to use standard text retrieval. Our experimental results clearly demonstrate that text retrieval is effective at finding relevant pages, but performs poorly at finding entities. Our proposal is to use Wikipedia as a pivot for finding entities on the Web, allowing us to reduce the hard web entity ranking problem to the easier problem of Wikipedia entity ranking. Wikipedia allows us to properly identify entities and some of their characteristics, and Wikipedia's elaborate category structure allows us to get a handle on the entity's type.
Our main findings are the following. Our first finding is that, in principle, the problem of web entity ranking can be reduced to Wikipedia entity ranking. We found that the majority of entity ranking topics in our test collections can be answered using Wikipedia, and that relevant web entities corresponding to the Wikipedia entities can be found with high precision using Wikipedia's 'external links'. Our second finding is that we can exploit the structure of Wikipedia to improve entity ranking effectiveness. Entity types are valuable retrieval cues in Wikipedia. Automatically assigned entity types are effective, and almost as good as manually assigned types. Our third finding is that web entity retrieval can be significantly improved by using Wikipedia as a pivot. Both Wikipedia's external links and Wikipedia entities enriched with additional links to homepages are significantly better at finding primary web homepages than anchor text retrieval, which in turn significantly improves over standard text retrieval.
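The pivot idea summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: the `WikiPage` structure, the Jaccard category overlap, the linear score combination, and the `weight` parameter are all assumptions chosen for illustration. The sketch ranks candidate Wikipedia pages by a text retrieval score combined with a category (entity type) match, then follows each page's external links to reach a candidate web homepage.

```python
from dataclasses import dataclass, field


@dataclass
class WikiPage:
    """A Wikipedia page treated as a candidate entity (hypothetical structure)."""
    title: str
    text_score: float  # assumed to come from any standard text retrieval run
    categories: set = field(default_factory=set)
    external_links: list = field(default_factory=list)


def category_overlap(page_categories, target_categories):
    """Jaccard overlap between page and target categories (an illustrative choice)."""
    if not page_categories or not target_categories:
        return 0.0
    union = page_categories | target_categories
    return len(page_categories & target_categories) / len(union)


def rank_web_entities(pages, target_categories, weight=0.5):
    """Rank Wikipedia pages, then pivot to the Web via their external links."""
    def combined(page):
        # Linear mix of text score and type match; the mixing weight is an assumption.
        return ((1 - weight) * page.text_score
                + weight * category_overlap(page.categories, target_categories))

    ranked = sorted(pages, key=combined, reverse=True)
    # Pages without external links cannot be mapped to a web homepage here,
    # so they are skipped; the first link is taken as the homepage candidate.
    return [(p.title, p.external_links[0]) for p in ranked if p.external_links]
```

In this sketch the category match acts as the entity type cue, and the external links act as the bridge from a Wikipedia entity to its web homepage. A real system would need a more careful choice among multiple external links than simply taking the first one.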