Abstract
Author name disambiguation has been one of the hardest problems faced by digital libraries since their early days. Historically, supervised solutions have empirically outperformed those based on heuristics, but at the cost of relying on manually labeled training sets for the learning process. Moreover, most supervised solutions just apply some type of generic machine learning solution and do not exploit specific knowledge about the problem. In this article, we follow a similar reasoning, but in the opposite direction. Instead of extending an existing supervised solution, we propose a set of carefully designed heuristics and similarity functions, and apply supervision only to optimize their parameters for each particular dataset. As our experiments show, the result is a very effective, efficient and practical author name disambiguation method that can be used in many different scenarios. In fact, we show that our method can beat state-of-the-art supervised methods in terms of effectiveness in many situations while being orders of magnitude faster. It can also run without any training information, using only default parameters, and still be very competitive with these supervised methods (beating several of them) and better than most existing unsupervised author name disambiguation solutions.
Notes
Here we work with the minimum amount of information found in a citation in order to illustrate the practical capabilities of our method in real-world scenarios, but in other contexts citations may include other attributes such as authors’ affiliations or emails.
The implementations of all methods used in our experimental evaluation are available at http://www.lbd.dcc.ufmg.br/lbd/collections/disambiguation/author-name-disambiguation-methods.
For this baseline, we have used the libSVM package available at http://www.csie.ntu.edu.tw/~cjlin/libsvm/.
We used the LASVM package [3] available at http://leon.bottou.org/projects/lasvm and the DBSCAN version available from Weka at http://www.cs.waikato.ac.nz/ml/weka/.
All collections used in our experimental evaluation are available at http://www.lbd.dcc.ufmg.br/lbd/collections/disambiguation/collections-the-nearest-cluster-method.
This does not count the expansion, performed in [4], from short to full author names in some records to better resemble a more realistic situation, in which there is a more balanced mix of both cases.
Because the pF1 metric is calculated from the number of pairs of citations in the empirical clusters, when a cluster has only one citation, no pair is formed and the obtained value is equal to 0, as in the case of the “M. Silva” group.
We consider that a cluster represents an author if most of its citations belong to this author.
To put this in perspective, given the reported times, SLAND would take more than two weeks to disambiguate a digital library with 1 million citations. NC would take, on average, about three minutes.
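The note on the pF1 metric can be illustrated with a minimal sketch. It assumes the usual pairwise definition, in which precision and recall are computed over pairs of citations that share an empirical cluster and pairs that share a true author. The function name `pairwise_f1` and the toy labels are hypothetical and are not taken from the article's implementation.

```python
from collections import Counter
from math import comb


def pairwise_f1(predicted, truth):
    """Pairwise F1 between empirical cluster ids and true author ids.

    predicted[i] and truth[i] are the labels of citation i.
    Returns 0.0 whenever no pair can be formed (assumed convention,
    matching the behavior described in the note on pF1).
    """
    if len(predicted) != len(truth):
        raise ValueError("label lists must have the same length")

    def pairs(labels):
        # Number of unordered citation pairs that share a label.
        return sum(comb(n, 2) for n in Counter(labels).values())

    predicted_pairs = pairs(predicted)            # pairs in empirical clusters
    true_pairs = pairs(truth)                     # pairs by the same author
    correct_pairs = pairs(zip(predicted, truth))  # pairs that satisfy both

    if predicted_pairs == 0 or true_pairs == 0 or correct_pairs == 0:
        return 0.0
    precision = correct_pairs / predicted_pairs
    recall = correct_pairs / true_pairs
    return 2 * precision * recall / (precision + recall)


# A group whose only cluster holds a single citation forms no pair.
singleton_score = pairwise_f1([0], ["a"])

# Two clusters of two citations; one citation by author "a" is misplaced.
mixed_score = pairwise_f1([0, 0, 1, 1], ["a", "a", "a", "b"])
```

Counting pairs per label with `comb(n, 2)` avoids enumerating all citation pairs, so the sketch runs in time linear in the number of citations.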
References
Baeza-Yates, R.A., Ribeiro-Neto, B.: Modern information retrieval. Addison-Wesley Longman Publishing Co., Inc., Boston (1999)
Bhattacharya, I., Getoor, L.: Collective entity resolution in relational data. ACM Trans Knowl Discov Data 1(1) (2007)
Bordes, A., Ertekin, S., Weston, J., Bottou, L.: Fast kernel classifiers with online and active learning. J Mach Learning Res 6, 1579–1619 (2005)
Cota, R.G., Ferreira, A.A., Nascimento, C., Gonçalves, M.A., Laender, A.H.F.: An unsupervised heuristic-based hierarchical method for name disambiguation in bibliographic citations. J Am Soc Inform Sci Technol 61(9), 1853–1870 (2010)
Fan, X., Wang, J., Pu, X., Zhou, L., Lv, B.: On graph-based name disambiguation. J Data Inform Qual 2, 10:1–10:23 (2011)
Ferreira, A.A., Veloso, A., Gonçalves, M.A., Laender, A.H.F.: Effective self-training author name disambiguation in scholarly digital libraries. In: Proceedings of the 10th Annual Joint Conference on Digital Libraries, pp. 39–48 (2010)
Ferreira, A.A., Gonçalves, M.A., Laender, A.H.F.: A brief survey of automatic methods for author name disambiguation. SIGMOD Record 41(2), 15–26 (2012)
Ferreira, A.A., Silva, R., Gonçalves, M.A., Veloso, A., Laender, A.H.F.: Active associative sampling for author name disambiguation. In: Proceedings of the 12th ACM/IEEE-CS Joint Conference on Digital Libraries, pp. 175–184 (2012)
Ferreira, A.A., Veloso, A., Gonçalves, M.A., Laender, A.H.F.: Self-training author name disambiguation for information scarce scenarios. J Am Soc Inform Sci Technol 65(6), 1257–1278 (2014)
Han, H., Giles, L., Zha, H., Li, C., Tsioutsiouliklis, K.: Two supervised learning approaches for name disambiguation in author citations. In: Proceedings of the 4th ACM/IEEE Joint Conference on Digital Libraries, pp. 296–305 (2004)
Han, H., Xu, W., Zha, H., Giles, C.L.: A hierarchical naive Bayes mixture model for name disambiguation in author citations. In: Proceedings of the ACM Symposium on Applied Computing, pp. 1065–1069 (2005)
Han, H., Zha, H., Giles, C.L.: Name disambiguation in author citations using a K-way spectral clustering method. In: Proceedings of JCDL, pp. 334–343 (2005)
Holm, S.: A simple sequentially rejective multiple test procedure. Scand J Stat 6(2), 65–70 (1979)
Huang, J., Ertekin, S., Giles, C.L.: Efficient name disambiguation for large-scale databases. In: Proceedings of the European Conference on Principles and Practice of Knowledge Discovery in Databases, pp. 536–544 (2006)
Kanani, P., McCallum, A., Pal, C.: Improving author coreference by resource-bounded information gathering from the web. In: Proceedings of the 20th International Joint Conference on Artificial Intelligence, pp. 429–434 (2007)
Kang, I.S., Na, S.H., Lee, S., Jung, H., Kim, P., Sung, W.K., Lee, J.H.: On co-authorship for author disambiguation. Inform Process Manag 45(1), 84–97 (2009)
Kang, I.S., Kim, P., Lee, S., Jung, H., You, B.J.: Construction of a large-scale test set for author disambiguation. Inform Process Manag 47(3), 452–465 (2011)
Lee, D., On, B.W., Kang, J., Park, S.: Effective and scalable solutions for mixed and split citation problems in digital libraries. In: Proceedings of the 2nd International Workshop on Information Quality in Information Systems, pp. 69–76 (2005)
Liu, W., Islamaj Doğan, R., Kim, S., Comeau, D.C., Kim, W., Yeganova, L., Lu, Z., Wilbur, W.J.: Author name disambiguation for PubMed. J Assoc Inform Sci Technol 65(4), 765–781 (2014)
Pereira, D.A., Ribeiro-Neto, B.A., Ziviani, N., Laender, A.H.F., Gonçalves, M.A., Ferreira, A.A.: Using web information for author name disambiguation. In: Proceedings of the 9th ACM/IEEE-CS Joint Conference on Digital Libraries, pp. 49–58 (2009)
Shu, L., Long, B., Meng, W.: A latent topic model for complete entity resolution. In: Proceedings of the IEEE International Conference on Data Engineering, pp. 880–891 (2009)
Tang, J., Fong, A.C.M., Wang, B., Zhang, J.: A unified probabilistic framework for name disambiguation in digital library. IEEE Trans Knowl Data Eng 24(6), 975–987 (2012)
Torvik, V.I., Smalheiser, N.R.: Author name disambiguation in MEDLINE. ACM Trans Knowl Discov Data 3(3), 1–29 (2009)
Treeratpituk, P., Giles, C.L.: Disambiguating authors in academic publications using random forests. In: Proceedings of the 9th ACM/IEEE-CS Joint Conference on Digital Libraries, pp. 39–48 (2009)
Veloso, A., Ferreira, A.A., Gonçalves, M.A., Laender, A.H.F., Meira Jr, W.: Cost-effective on-demand associative author name disambiguation. Inform Process Manag 48(4), 680–697 (2012)
Wu, H., Li, B., Pei, Y., He, J.: Unsupervised author disambiguation using Dempster-Shafer theory. Scientometrics 101(3), 1955–1972 (2014)
Acknowledgments
This research is funded by INWeb (CNPq grant 57.3871/2008-6) and by the authors’ individual grants from CNPq, CAPES, and FAPEMIG.
About this article
Cite this article
Santana, A.F., Gonçalves, M.A., Laender, A.H.F. et al. On the combination of domain-specific heuristics for author name disambiguation: the nearest cluster method. Int J Digit Libr 16, 229–246 (2015). https://doi.org/10.1007/s00799-015-0158-y
DOI: https://doi.org/10.1007/s00799-015-0158-y