Graph-Based Hierarchical Record Clustering for Unsupervised Entity Resolution

Ebeid, Islam Akef; Talburt, John R.; Siddique, Md Abdus Salam

doi:10.1007/978-3-030-97652-1_14

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 1421)

Abstract

Here we study the problem of matched record clustering in unsupervised entity resolution. We build upon a state-of-the-art probabilistic framework named the Data Washing Machine (DWM). We introduce a graph-based, hierarchical, two-step record clustering method (GDWM). The first step identifies large connected components, which we call soft clusters, in the matched record pairs using a graph-based transitive closure. The second step breaks the discovered soft clusters down hierarchically into more precise entity profiles using an adapted graph-based modularity optimization method. Our approach provides several advantages over the original implementation of the DWM, mainly a significant speed-up, increased precision, and overall increased F1 scores. We demonstrate the efficacy of our approach using experiments on multiple synthetic datasets. Our results also provide some evidence of the utility of graph theory-based algorithms despite their scarcity in the literature on unsupervised entity resolution.
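The two steps described in the abstract can be illustrated with a minimal sketch. It is not the GDWM implementation. It assumes matched pairs arrive as tuples of record identifiers, computes the transitive closure with a plain union-find, and stands in a naive greedy agglomerative modularity maximization for the chapter's adapted modularity optimization method. All function names and the toy record IDs are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def connected_components(pairs):
    """Step 1 (sketch): transitive closure of matched pairs via union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    groups = defaultdict(set)
    for node in list(parent):
        groups[find(node)].add(node)
    return list(groups.values())  # the "soft clusters"


def modularity(edges, communities):
    """Newman modularity Q of a partition of an unweighted, undirected graph."""
    m = len(edges)
    label = {n: i for i, c in enumerate(communities) for n in c}
    internal = defaultdict(int)      # edges inside each community
    total_degree = defaultdict(int)  # sum of node degrees per community
    for a, b in edges:
        total_degree[label[a]] += 1
        total_degree[label[b]] += 1
        if label[a] == label[b]:
            internal[label[a]] += 1
    return sum(
        internal[c] / m - (total_degree[c] / (2 * m)) ** 2
        for c in range(len(communities))
    )


def split_by_modularity(edges):
    """Step 2 (sketch): greedy merging of communities while Q increases.

    Assumption: this naive agglomerative scheme replaces the chapter's
    adapted modularity optimization and is only practical for small graphs.
    """
    nodes = sorted({n for edge in edges for n in edge})
    communities = [{n} for n in nodes]
    best = modularity(edges, communities)
    while len(communities) > 1:
        candidates = []
        for i, j in combinations(range(len(communities)), 2):
            merged = [c for k, c in enumerate(communities) if k not in (i, j)]
            merged.append(communities[i] | communities[j])
            candidates.append((modularity(edges, merged), merged))
        q, merged = max(candidates, key=lambda item: item[0])
        if q <= best:
            break
        best, communities = q, merged
    return communities


def cluster_records(pairs):
    """Run both steps: soft clusters first, then refine each one."""
    clusters = []
    for component in connected_components(pairs):
        edges = [(a, b) for a, b in pairs if a in component]
        clusters.extend(split_by_modularity(edges))
    return clusters


# Hypothetical matched pairs: two tight groups joined by one weak link,
# plus an unrelated pair.
pairs = [
    ("r1", "r2"), ("r2", "r3"), ("r1", "r3"),
    ("r4", "r5"), ("r5", "r6"), ("r4", "r6"),
    ("r3", "r4"),
    ("r7", "r8"),
]
clusters = cluster_records(pairs)
```

On this toy input the transitive closure alone would place r1 through r6 in a single soft cluster because of the one link between r3 and r4. The modularity step separates that soft cluster into its two densely linked groups, which is the kind of precision gain the abstract attributes to the second step.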

Acknowledgement

This material is based upon work supported by the National Science Foundation under Award No. OIA-1946391.

Author information

Authors and Affiliations

Department of Information Science, University of Arkansas at Little Rock, Little Rock, AR, USA
Islam Akef Ebeid, John R. Talburt & Md Abdus Salam Siddique

Corresponding author

Correspondence to Islam Akef Ebeid.

Editor information

Editors and Affiliations

Department of Electrical and Computer Engineering, University of Nevada, Las Vegas, NV, USA
Shahram Latifi

About this paper

Cite this paper

Ebeid, I.A., Talburt, J.R., Siddique, M.A.S. (2022). Graph-Based Hierarchical Record Clustering for Unsupervised Entity Resolution. In: Latifi, S. (eds) ITNG 2022 19th International Conference on Information Technology-New Generations. Advances in Intelligent Systems and Computing, vol 1421. Springer, Cham. https://doi.org/10.1007/978-3-030-97652-1_14

DOI: https://doi.org/10.1007/978-3-030-97652-1_14
Published: 24 February 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-97651-4
Online ISBN: 978-3-030-97652-1
eBook Packages: Engineering, Engineering (R0)
