Abstract
We study the problem of clustering matched records in unsupervised entity resolution. We build on a state-of-the-art probabilistic framework, the Data Washing Machine (DWM). We introduce a graph-based hierarchical two-step record clustering method (GDWM). The first step identifies large connected components, which we call soft clusters, among the matched record pairs using a graph-based transitive closure. The second step breaks the discovered soft clusters down hierarchically into more precise entity profiles using an adapted graph-based modularity optimization method. Our approach offers several advantages over the original DWM implementation, mainly a significant speed-up, increased precision, and higher overall F1 scores. We demonstrate the efficacy of our approach through experiments on multiple synthetic datasets. Our results also provide some evidence of the utility of graph theory-based algorithms, despite their scarcity in the literature on unsupervised entity resolution.
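The two steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the record identifiers, the toy match pairs, and the function names `soft_clusters` and `modularity` are all hypothetical. The first function computes the transitive closure of matched pairs with a union-find structure, and the second only scores a candidate split with the standard unweighted modularity measure. The optimizer that would search for the best split, and the adaptations the paper makes to it, are not reproduced here.

```python
from collections import defaultdict


def soft_clusters(pairs):
    """Step 1 (sketch): transitive closure of matched pairs.

    Union-find groups every record reachable through a chain of matches,
    which is the same as finding connected components of the match graph.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    groups = defaultdict(set)
    for node in list(parent):
        groups[find(node)].add(node)
    return sorted((sorted(g) for g in groups.values()), key=lambda g: g[0])


def modularity(edges, partition):
    """Score a partition with unweighted modularity.

    Q = sum over communities of (intra_edges / m) - (degree_sum / (2m))^2,
    where m is the number of edges. Step 2 would search for a partition
    of each soft cluster that increases this score.
    """
    m = len(edges)
    community = {n: i for i, block in enumerate(partition) for n in block}
    inside = defaultdict(int)
    degree = defaultdict(int)
    for a, b in edges:
        degree[community[a]] += 1
        degree[community[b]] += 1
        if community[a] == community[b]:
            inside[community[a]] += 1
    return sum(inside[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)


# Hypothetical matched pairs: two tight groups joined by one weak link
# (r3, r4), plus an unrelated pair.
pairs = [
    ("r1", "r2"), ("r2", "r3"), ("r1", "r3"),
    ("r4", "r5"), ("r5", "r6"), ("r4", "r6"),
    ("r3", "r4"),
    ("r7", "r8"),
]

clusters = soft_clusters(pairs)
big = set(clusters[0])
edges = [(a, b) for a, b in pairs if a in big]

q_whole = modularity(edges, [clusters[0]])
q_split = modularity(edges, [["r1", "r2", "r3"], ["r4", "r5", "r6"]])
print(clusters)
print(q_whole, q_split)
```

On this toy input the single bridging pair merges six records into one soft cluster, and splitting that cluster at the bridge raises modularity from 0 to 5/14. That increase is the kind of signal a modularity optimizer could use to separate over-merged entity profiles.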
Acknowledgement
This material is based upon work supported by the National Science Foundation under Award No. OIA-1946391.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Ebeid, I.A., Talburt, J.R., Siddique, M.A.S. (2022). Graph-Based Hierarchical Record Clustering for Unsupervised Entity Resolution. In: Latifi, S. (eds) ITNG 2022 19th International Conference on Information Technology-New Generations. Advances in Intelligent Systems and Computing, vol 1421. Springer, Cham. https://doi.org/10.1007/978-3-030-97652-1_14
DOI: https://doi.org/10.1007/978-3-030-97652-1_14
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-97651-4
Online ISBN: 978-3-030-97652-1
eBook Packages: Engineering, Engineering (R0)