On the combination of domain-specific heuristics for author name disambiguation: the nearest cluster method

Santana, Alan Filipe; Gonçalves, Marcos André; Laender, Alberto H. F.; Ferreira, Anderson A.

doi:10.1007/s00799-015-0158-y

Published: 07 July 2015

Volume 16, pages 229–246, (2015)

International Journal on Digital Libraries

Alan Filipe Santana¹,
Marcos André Gonçalves¹,
Alberto H. F. Laender¹ &
Anderson A. Ferreira²

1717 Accesses
26 Citations

Abstract

Author name disambiguation has been one of the hardest problems faced by digital libraries since their early days. Historically, supervised solutions have empirically outperformed those based on heuristics, but at the cost of relying on manually labeled training sets for the learning process. Moreover, most supervised solutions simply apply some generic machine learning technique and do not exploit specific knowledge about the problem. In this article, we follow a similar reasoning, but in the opposite direction. Instead of extending an existing supervised solution, we propose a set of carefully designed heuristics and similarity functions, and apply supervision only to optimize their parameters for each particular dataset. As our experiments show, the result is a very effective, efficient and practical author name disambiguation method that can be used in many different scenarios. In fact, we show that our method can beat state-of-the-art supervised methods in terms of effectiveness in many situations while being orders of magnitude faster. It can also run without any training information, using only default parameters, and still be very competitive with these supervised methods (beating several of them) and better than most existing unsupervised author name disambiguation solutions.

Notes

http://www.ncbi.nlm.nih.gov/pubmed/.
Here we work with the minimum amount of information found in a citation in order to illustrate the practical capabilities of our method in real-world scenarios, but in other contexts citations may include other attributes such as authors’ affiliations or emails.
The implementations of all methods used in our experimental evaluation are available at http://www.lbd.dcc.ufmg.br/lbd/collections/disambiguation/author-name-disambiguation-methods.
For this baseline, we have used the libSVM package available at http://www.csie.ntu.edu.tw/~cjlin/libsvm/.
We used the LASVM package [3] available at http://leon.bottou.org/projects/lasvm and the DBSCAN version available from Weka at http://www.cs.waikato.ac.nz/ml/weka/.
http://dblp.uni-trier.de.
http://www.lbd.dcc.ufmg.br/bdbcomp.
All collections used in our experimental evaluation are available at http://www.lbd.dcc.ufmg.br/lbd/collections/disambiguation/collections-the-nearest-cluster-method.
http://arnetminer.org.
http://academic.research.microsoft.com.
This does not count the expansion, performed in [4], from short to full author names in some records to better resemble a realistic situation, in which there is a more balanced mix of both cases.
http://www.kisti.re.kr.
http://www.csie.ntu.edu.tw/~cjlin/libsvm/.
As the pF1 metric is calculated based on the number of pairs of citations in the empirical clusters, when the cluster has only one citation, no pair is formed, and the obtained value is equal to 0, as in the case of the “M. Silva” group.
We consider that a cluster represents an author if most of its citations belong to this author.
To put this in perspective, with the reported times, it would take more than two weeks to disambiguate a digital library with 1 million citations using SLAND. With NC, this would take, on average, about three minutes.
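
The note on pF1 above can be illustrated with a small sketch. The exact definition used in the article is not reproduced on this page, so the code below assumes the common pairwise formulation: precision and recall are computed over pairs of citations placed in the same cluster, and F1 is their harmonic mean. The function names `same_cluster_pairs` and `pairwise_f1` are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def same_cluster_pairs(labels):
    """Return the set of index pairs (i, j), i < j, whose labels are equal."""
    return {
        (i, j)
        for i, j in combinations(range(len(labels)), 2)
        if labels[i] == labels[j]
    }


def pairwise_f1(true_authors, predicted_clusters):
    """Pairwise F1 for one ambiguous group (assumed standard definition).

    Returns 0.0 when no correctly paired citations exist, which includes
    the case where the group holds a single citation and no pair is formed.
    """
    true_pairs = same_cluster_pairs(true_authors)
    predicted_pairs = same_cluster_pairs(predicted_clusters)
    correct = len(true_pairs & predicted_pairs)
    if correct == 0:
        return 0.0
    # correct > 0 implies both pair sets are non-empty, so no division by zero.
    precision = correct / len(predicted_pairs)
    recall = correct / len(true_pairs)
    return 2 * precision * recall / (precision + recall)
```

Under this assumed definition, a group with a single citation, such as `pairwise_f1(["a"], [0])`, yields 0 because no pair can be formed, regardless of whether that citation was clustered correctly.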

References

[1] Baeza-Yates, R.A., Ribeiro-Neto, B.: Modern information retrieval. Addison-Wesley Longman Publishing Co., Inc., Boston (1999)
[2] Bhattacharya, I., Getoor, L.: Collective entity resolution in relational data. ACM Trans Knowl Discov Data 1(1) (2007)
[3] Bordes, A., Ertekin, S., Weston, J., Bottou, L.: Fast kernel classifiers with online and active learning. J Mach Learning Res 6, 1579–1619 (2005)
[4] Cota, R.G., Ferreira, A.A., Nascimento, C., Gonçalves, M.A., Laender, A.H.F.: An unsupervised heuristic-based hierarchical method for name disambiguation in bibliographic citations. J Am Soc Inform Sci Technol 61(9), 1853–1870 (2010)
[5] Fan, X., Wang, J., Pu, X., Zhou, L., Lv, B.: On graph-based name disambiguation. J Data Inform Qual 2, 10:1–10:23 (2011)
[6] Ferreira, A.A., Veloso, A., Gonçalves, M.A., Laender, A.H.F.: Effective self-training author name disambiguation in scholarly digital libraries. In: Proceedings of the 10th Annual Joint Conference on Digital Libraries, pp. 39–48 (2010)
[7] Ferreira, A.A., Gonçalves, M.A., Laender, A.H.F.: A brief survey of automatic methods for author name disambiguation. SIGMOD Record 41(2), 15–26 (2012)
[8] Ferreira, A.A., Silva, R., Gonçalves, M.A., Veloso, A., Laender, A.H.F.: Active associative sampling for author name disambiguation. In: Proceedings of the 12th ACM/IEEE-CS Joint Conference on Digital Libraries, pp. 175–184 (2012)
[9] Ferreira, A.A., Veloso, A., Gonçalves, M.A., Laender, A.H.F.: Self-training author name disambiguation for information scarce scenarios. J Am Soc Inform Sci Technol 65(6), 1257–1278 (2014)
[10] Han, H., Giles, L., Zha, H., Li, C., Tsioutsiouliklis, K.: Two supervised learning approaches for name disambiguation in author citations. In: Proceedings of the 4th ACM/IEEE Joint Conference on Digital Libraries, pp. 296–305 (2004)
[11] Han, H., Xu, W., Zha, H., Giles, C.L.: A hierarchical naive Bayes mixture model for name disambiguation in author citations. In: Proceedings of the ACM Symposium on Applied Computing, pp. 1065–1069 (2005)
[12] Han, H., Zha, H., Giles, C.L.: Name disambiguation in author citations using a K-way spectral clustering method. In: Proceedings of JCDL, pp. 334–343 (2005)
[13] Holm, S.: A simple sequentially rejective multiple test procedure. Scand J Stat 6(2), 65–70 (1979)
[14] Huang, J., Ertekin, S., Giles, C.L.: Efficient name disambiguation for large-scale databases. In: Proceedings of European Conference on Principles and Practice of Knowl. Discovery in Databases, pp. 536–544 (2006)
[15] Kanani, P., McCallum, A., Pal, C.: Improving author coreference by resource-bounded information gathering from the web. In: Proceedings of the 20th International Joint Conference on Artificial Intelligence, pp. 429–434 (2007)
[16] Kang, I.S., Na, S.H., Lee, S., Jung, H., Kim, P., Sung, W.K., Lee, J.H.: On co-authorship for author disambiguation. Inform Process Manag 45(1), 84–97 (2009)
[17] Kang, I.S., Kim, P., Lee, S., Jung, H., You, B.J.: Construction of a large-scale test set for author disambiguation. Inform Process Manag 47(3), 452–465 (2011)
[18] Lee, D., On, B.W., Kang, J., Park, S.: Effective and scalable solutions for mixed and split citation problems in digital libraries. In: Proceedings of the 2nd International Workshop on Inf. Quality in Inf. Systems, pp. 69–76 (2005)
[19] Liu, W., Islamaj Doğan, R., Kim, S., Comeau, D.C., Kim, W., Yeganova, L., Lu, Z., Wilbur, W.J.: Author name disambiguation for PubMed. J Assoc Inform Sci Technol 65(4), 765–781 (2014)
[20] Pereira, D.A., Ribeiro-Neto, B.A., Ziviani, N., Laender, A.H.F., Gonçalves, M.A., Ferreira, A.A.: Using web information for author name disambiguation. In: Proceedings of the 9th ACM/IEEE-CS Joint Conference on Digital Libraries, pp. 49–58 (2009)
[21] Shu, L., Long, B., Meng, W.: A latent topic model for complete entity resolution. In: Proceedings of the IEEE International Conference on Data Engineering, pp. 880–891 (2009)
[22] Tang, J., Fong, A.C.M., Wang, B., Zhang, J.: A unified probabilistic framework for name disambiguation in digital library. IEEE Trans Knowl Data Eng 24(6), 975–987 (2012)
[23] Torvik, V.I., Smalheiser, N.R.: Author name disambiguation in MEDLINE. ACM Trans Knowl Discov Data 3(3), 1–29 (2009)
[24] Treeratpituk, P., Giles, C.L.: Disambiguating authors in academic publications using random forests. In: Proceedings of the 9th ACM/IEEE-CS Joint Conference on Digital Libraries, pp. 39–48 (2009)
[25] Veloso, A., Ferreira, A.A., Gonçalves, M.A., Laender, A.H.F., Meira Jr, W.: Cost-effective on-demand associative author name disambiguation. Inform Process Manag 48(4), 680–697 (2012)
[26] Wu, H., Li, B., Pei, Y., He, J.: Unsupervised author disambiguation using Dempster–Shafer theory. Scientometrics 101(3), 1955–1972 (2014)

Acknowledgments

This research is funded by INWeb (CNPq grant 57.3871/2008-6) and by the authors’ individual grants from CNPq, CAPES, and FAPEMIG.

Author information

Authors and Affiliations

Departamento de Ciência da Computação, Universidade Federal de Minas Gerais, 31270-901, Belo Horizonte, Brazil
Alan Filipe Santana, Marcos André Gonçalves & Alberto H. F. Laender
Departamento de Computação, Universidade Federal de Ouro Preto, 35400-000, Ouro Preto, Brazil
Anderson A. Ferreira

Corresponding author

Correspondence to Marcos André Gonçalves.

About this article

Cite this article

Santana, A.F., Gonçalves, M.A., Laender, A.H.F. et al. On the combination of domain-specific heuristics for author name disambiguation: the nearest cluster method. Int J Digit Libr 16, 229–246 (2015). https://doi.org/10.1007/s00799-015-0158-y

Received: 02 December 2014
Revised: 09 June 2015
Accepted: 22 June 2015
Published: 07 July 2015
Issue Date: September 2015
DOI: https://doi.org/10.1007/s00799-015-0158-y
