On the combination of domain-specific heuristics for author name disambiguation: the nearest cluster method

International Journal on Digital Libraries

Abstract

Author name disambiguation has been one of the hardest problems faced by digital libraries since their early days. Historically, supervised solutions have empirically outperformed those based on heuristics, but at the cost of relying on manually labeled training sets for the learning process. Moreover, most supervised solutions simply apply a generic machine learning technique and do not exploit specific knowledge about the problem. In this article, we follow a similar reasoning, but in the opposite direction. Instead of extending an existing supervised solution, we propose a set of carefully designed heuristics and similarity functions, and apply supervision only to optimize their parameters for each particular dataset. As our experiments show, the result is a very effective, efficient and practical author name disambiguation method that can be used in many different scenarios. In fact, we show that our method can beat state-of-the-art supervised methods in terms of effectiveness in many situations while being orders of magnitude faster. It can also run without any training information, using only default parameters, and still be very competitive with these supervised methods (beating several of them) and better than most existing unsupervised author name disambiguation solutions.

Notes

  1. http://www.ncbi.nlm.nih.gov/pubmed/.

  2. Here we work with the minimum amount of information found in a citation in order to illustrate the practical capabilities of our method in real-world scenarios, but in other contexts citations may include other attributes such as authors’ affiliations or emails.

  3. The implementations of all methods used in our experimental evaluation are available at http://www.lbd.dcc.ufmg.br/lbd/collections/disambiguation/author-name-disambiguation-methods.

  4. For this baseline, we have used the libSVM package available at http://www.csie.ntu.edu.tw/~cjlin/libsvm/.

  5. We used the LASVM package [3] available at http://leon.bottou.org/projects/lasvm and the DBSCAN version available from Weka at http://www.cs.waikato.ac.nz/ml/weka/.

  6. http://dblp.uni-trier.de.

  7. http://www.lbd.dcc.ufmg.br/bdbcomp.

  8. All collections used in our experimental evaluation are available at http://www.lbd.dcc.ufmg.br/lbd/collections/disambiguation/collections-the-nearest-cluster-method.

  9. http://arnetminer.org.

  10. http://academic.research.microsoft.com.

  11. This does not count the expansion, performed in [4], from short to full author names in some records to better resemble a more realistic situation, in which there is a more balanced mix of both cases.

  12. http://www.kisti.re.kr.

  13. http://www.csie.ntu.edu.tw/~cjlin/libsvm/.

  14. As the pF1 metric is calculated based on the number of pairs of citations in the empirical clusters, when the cluster has only one citation, no pair is formed, and the obtained value is equal to 0, as in the case of the “M. Silva” group.

  15. We consider that a cluster represents an author if most of its citations belong to this author.

  16. To put this in perspective, with the reported times, it would take more than two weeks to disambiguate a digital library with 1 million citations with SLAND. With NC, this would take, on average, about three minutes.
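
Notes 14 and 15 can be illustrated with a minimal sketch. It assumes that pF1 is the harmonic mean of pairwise precision and pairwise recall computed over pairs of citations, and that "most" in note 15 means a strict majority; the function names `pairwise_f1` and `majority_author` and the toy labels are hypothetical and do not come from the article.

```python
from collections import Counter
from itertools import combinations


def pairwise_f1(predicted, truth):
    """Pairwise F1 between empirical clusters and true authors.

    predicted: dict mapping citation id -> empirical cluster label
    truth:     dict mapping citation id -> true author label
    Returns 0.0 when no pair can be formed (e.g. a single citation),
    which matches the behaviour described in note 14.
    """
    ids = sorted(predicted)
    pred_pairs = {(a, b) for a, b in combinations(ids, 2)
                  if predicted[a] == predicted[b]}
    true_pairs = {(a, b) for a, b in combinations(ids, 2)
                  if truth[a] == truth[b]}
    if not pred_pairs or not true_pairs:
        return 0.0
    correct = len(pred_pairs & true_pairs)
    precision = correct / len(pred_pairs)
    recall = correct / len(true_pairs)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def majority_author(cluster_citations, truth):
    """Author represented by a cluster, or None if no author holds a
    strict majority of its citations (assumed reading of note 15)."""
    counts = Counter(truth[c] for c in cluster_citations)
    author, n = counts.most_common(1)[0]
    return author if n > len(cluster_citations) / 2 else None


# Toy example (made-up labels): four citations, two empirical clusters.
predicted = {"c1": 0, "c2": 0, "c3": 1, "c4": 1}
truth = {"c1": "A", "c2": "A", "c3": "A", "c4": "B"}
score = pairwise_f1(predicted, truth)
singleton_score = pairwise_f1({"c1": 0}, {"c1": "A"})
```

In the toy example, one of the two predicted pairs is correct and one of the three true pairs is recovered, so the score is 0.4, while the singleton group scores 0 because no pair exists.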

References

  1. Baeza-Yates, R.A., Ribeiro-Neto, B.: Modern information retrieval. Addison-Wesley Longman Publishing Co., Inc., Boston (1999)

  2. Bhattacharya, I., Getoor, L.: Collective entity resolution in relational data. ACM Trans Knowl Discov Data 1(1) (2007)

  3. Bordes, A., Ertekin, S., Weston, J., Bottou, L.: Fast kernel classifiers with online and active learning. J Mach Learning Res 6, 1579–1619 (2005)

  4. Cota, R.G., Ferreira, A.A., Nascimento, C., Gonçalves, M.A., Laender, A.H.F.: An unsupervised heuristic-based hierarchical method for name disambiguation in bibliographic citations. J Am Soc Inform Sci Technol 61(9), 1853–1870 (2010)

  5. Fan, X., Wang, J., Pu, X., Zhou, L., Lv, B.: On graph-based name disambiguation. J Data Inform Qual 2, 10:1–10:23 (2011)

  6. Ferreira, A.A., Veloso, A., Gonçalves, M.A., Laender, A.H.F.: Effective self-training author name disambiguation in scholarly digital libraries. In: Proceedings of the 10th Annual Joint Conference on Digital Libraries, pp. 39–48 (2010)

  7. Ferreira, A.A., Gonçalves, M.A., Laender, A.H.F.: A brief survey of automatic methods for author name disambiguation. SIGMOD Record 41(2), 15–26 (2012)

  8. Ferreira, A.A., Silva, R., Gonçalves, M.A., Veloso, A., Laender, A.H.F.: Active associative sampling for author name disambiguation. In: Proceedings of the 12th ACM/IEEE-CS Joint Conference on Digital Libraries, pp. 175–184 (2012)

  9. Ferreira, A.A., Veloso, A., Gonçalves, M.A., Laender, A.H.F.: Self-training author name disambiguation for information scarce scenarios. J Am Soc Inform Sci Technol 65(6), 1257–1278 (2014)

  10. Han, H., Giles, L., Zha, H., Li, C., Tsioutsiouliklis, K.: Two supervised learning approaches for name disambiguation in author citations. In: Proceedings of the 4th ACM/IEEE Joint Conference on Digital Libraries, pp. 296–305 (2004)

  11. Han, H., Xu, W., Zha, H., Giles, C.L.: A hierarchical naive Bayes mixture model for name disambiguation in author citations. In: Proceedings of the ACM Symposium on Applied Computing, pp. 1065–1069 (2005)

  12. Han, H., Zha, H., Giles, C.L.: Name disambiguation in author citations using a K-way spectral clustering method. In: Proceedings of JCDL, pp. 334–343 (2005)

  13. Holm, S.: A simple sequentially rejective multiple test procedure. Scand J Stat 6(2), 65–70 (1979)

  14. Huang, J., Ertekin, S., Giles, C.L.: Efficient name disambiguation for large-scale databases. In: Proceedings of European Conference on Principles and Practice of Knowl. Discovery in Databases, pp. 536–544 (2006)

  15. Kanani, P., McCallum, A., Pal, C.: Improving author coreference by resource-bounded information gathering from the web. In: Proceedings of the 20th International Joint Conference on Artificial Intelligence, pp. 429–434 (2007)

  16. Kang, I.S., Na, S.H., Lee, S., Jung, H., Kim, P., Sung, W.K., Lee, J.H.: On co-authorship for author disambiguation. Inform Process Manag 45(1), 84–97 (2009)

  17. Kang, I.S., Kim, P., Lee, S., Jung, H., You, B.J.: Construction of a large-scale test set for author disambiguation. Inform Process Manag 47(3), 452–465 (2011)

  18. Lee, D., On, B.W., Kang, J., Park, S.: Effective and scalable solutions for mixed and split citation problems in digital libraries. In: Proceedings of the 2nd International Workshop on Inf. Quality in Inf. Systems, pp. 69–76 (2005)

  19. Liu, W., Islamaj Doğan, R., Kim, S., Comeau, D.C., Kim, W., Yeganova, L., Lu, Z., Wilbur, W.J.: Author name disambiguation for PubMed. J Assoc Inform Sci Technol 65(4), 765–781 (2014)

  20. Pereira, D.A., Ribeiro-Neto, B.A., Ziviani, N., Laender, A.H.F., Gonçalves, M.A., Ferreira, A.A.: Using web information for author name disambiguation. In: Proceedings of the 9th ACM/IEEE-CS Joint Conference on Digital Libraries, pp. 49–58 (2009)

  21. Shu, L., Long, B., Meng, W.: A latent topic model for complete entity resolution. In: Proceedings of the IEEE International Conference on Data Engineering, pp. 880–891 (2009)

  22. Tang, J., Fong, A.C.M., Wang, B., Zhang, J.: A unified probabilistic framework for name disambiguation in digital library. IEEE Trans Knowl Data Eng 24(6), 975–987 (2012)

  23. Torvik, V.I., Smalheiser, N.R.: Author name disambiguation in MEDLINE. ACM Trans Knowl Discov Data 3(3), 1–29 (2009)

  24. Treeratpituk, P., Giles, C.L.: Disambiguating authors in academic publications using random forests. In: Proceedings of the 9th ACM/IEEE-CS Joint Conference on Digital Libraries, pp. 39–48 (2009)

  25. Veloso, A., Ferreira, A.A., Gonçalves, M.A., Laender, A.H.F., Meira Jr, W.: Cost-effective on-demand associative author name disambiguation. Inform Process Manag 48(4), 680–697 (2012)

  26. Wu, H., Li, B., Pei, Y., He, J.: Unsupervised author disambiguation using Dempster-Shafer theory. Scientometrics 101(3), 1955–1972 (2014)

Acknowledgments

This research is funded by INWeb (CNPq grant 57.3871/2008-6) and by the authors’ individual grants from CNPq, CAPES, and FAPEMIG.

Corresponding author

Correspondence to Marcos André Gonçalves.

About this article

Cite this article

Santana, A.F., Gonçalves, M.A., Laender, A.H.F. et al. On the combination of domain-specific heuristics for author name disambiguation: the nearest cluster method. Int J Digit Libr 16, 229–246 (2015). https://doi.org/10.1007/s00799-015-0158-y

  • DOI: https://doi.org/10.1007/s00799-015-0158-y
