The Grouped Author-Topic Model for Unsupervised Entity Resolution

  • Conference paper
Artificial Neural Networks and Machine Learning – ICANN 2011 (ICANN 2011)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 6791)

Abstract

This paper describes a generative approach to identity resolution in a completely unsupervised setting, with no fixed assumption about the true number of identities. Entity resolution involves associating different references to authors (in a paper’s author list, for example) with the real underlying identities. References may be written in differing forms or contain errors, and identical references may refer to different real identities. The approach uses a generative model of both a document's abstract and its author list to resolve identities across a corpus of documents. In the model, authors and topics are associated with latent groups. For each document, an abstract and an author list are generated conditioned on a given group. Results on real-world datasets are presented, and the approach outperforms the best-performing unsupervised methods.
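
The generative story in the abstract (pick a latent group, then generate an abstract and an author list conditioned on it) can be illustrated with a small toy sampler. This is only a sketch under stated assumptions, not the paper's model: the groups, topics, author names, probabilities, and the helper names `write_reference` and `generate_document` are all hypothetical, and the number of groups is fixed by hand here, whereas the paper makes no fixed assumption about the number of identities.

```python
import random

# Hypothetical toy parameters; none of these values come from the paper.
TOPICS = {
    "inference": ["sampling", "posterior", "prior", "mixture"],
    "vision": ["image", "pixel", "segmentation", "camera"],
}
GROUPS = [
    {"authors": ["Alice Smith", "Bo Chen"],
     "topic_weights": {"inference": 0.9, "vision": 0.1}},
    {"authors": ["Adam Smith", "Carla Diaz"],
     "topic_weights": {"inference": 0.2, "vision": 0.8}},
]


def write_reference(identity, rng, abbreviate_prob=0.5):
    """Render a true identity as a surface reference, sometimes abbreviated."""
    first, last = identity.split(" ", 1)
    if rng.random() < abbreviate_prob:
        return f"{first[0]}. {last}"
    return identity


def generate_document(rng, n_words=6, n_authors=2):
    """Sample one toy document conditioned on a randomly chosen latent group."""
    group_id = rng.randrange(len(GROUPS))
    group = GROUPS[group_id]

    # Abstract: each word first picks a topic from the group's topic mix,
    # then picks a word uniformly from that topic's vocabulary.
    topic_names = list(group["topic_weights"])
    weights = list(group["topic_weights"].values())
    abstract = []
    for _ in range(n_words):
        topic = rng.choices(topic_names, weights=weights)[0]
        abstract.append(rng.choice(TOPICS[topic]))

    # Author list: true identities come from the group; what the corpus
    # records is only the (possibly abbreviated) written reference.
    identities = rng.sample(group["authors"], k=n_authors)
    references = [write_reference(name, rng) for name in identities]
    return {"group": group_id, "identities": identities,
            "references": references, "abstract": abstract}


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(4):
        doc = generate_document(rng)
        print(doc["references"], "|", " ".join(doc["abstract"]))
```

In this toy setup "Alice Smith" and "Adam Smith" can both be written as "A. Smith", which is the ambiguity the abstract describes: identical references that refer to different real identities. Resolution runs the story in reverse, inferring the hidden `identities` and `group` from the observed references and abstract words; that inference step is not shown here.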


Copyright information

© 2011 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Dai, A.M., Storkey, A.J. (2011). The Grouped Author-Topic Model for Unsupervised Entity Resolution. In: Honkela, T., Duch, W., Girolami, M., Kaski, S. (eds) Artificial Neural Networks and Machine Learning – ICANN 2011. ICANN 2011. Lecture Notes in Computer Science, vol 6791. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-21735-7_30

  • DOI: https://doi.org/10.1007/978-3-642-21735-7_30

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-21734-0

  • Online ISBN: 978-3-642-21735-7

  • eBook Packages: Computer Science, Computer Science (R0)
