
Graph-Based Hierarchical Record Clustering for Unsupervised Entity Resolution

  • Conference paper
ITNG 2022 19th International Conference on Information Technology-New Generations

Abstract

We study the problem of clustering matched records in unsupervised entity resolution, building on a state-of-the-art probabilistic framework named the Data Washing Machine (DWM). We introduce a graph-based, hierarchical, two-step record clustering method (GDWM). It first uses a graph-based transitive closure to identify large connected components, which we call soft clusters, in the matched record pairs. It then breaks the discovered soft clusters down hierarchically into more precise entity profiles using an adapted graph-based modularity optimization method. Our approach offers several advantages over the original implementation of the DWM, mainly a significant speed-up, increased precision, and overall increased F1 scores. We demonstrate the efficacy of our approach through experiments on multiple synthetic datasets. Our results also provide some evidence of the utility of graph theory-based algorithms, despite their scarcity in the literature on unsupervised entity resolution.
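As a rough illustration of the two steps named in the abstract, the Python sketch below first groups matched record pairs into soft clusters (connected components, found here with union-find) and then splits each soft cluster by modularity. It is not the authors' GDWM implementation. The function names (`soft_clusters`, `modularity`, `refine`, `resolve`), the record IDs, and the similarity weights are all assumptions made for this example, and a simple greedy modularity merge stands in for the paper's adapted modularity optimization method.

```python
from collections import defaultdict
from itertools import combinations


def soft_clusters(pairs):
    """Step 1: transitive closure of matched pairs (connected components)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b, _ in pairs:
        parent[find(a)] = find(b)
    groups = defaultdict(set)
    for x in list(parent):
        groups[find(x)].add(x)
    return list(groups.values())


def modularity(edges, communities):
    """Weighted modularity Q of a partition of an undirected graph."""
    m = sum(w for _, _, w in edges)
    label = {n: i for i, c in enumerate(communities) for n in c}
    inside = defaultdict(float)  # edge weight inside each community
    degree = defaultdict(float)  # summed weighted degree per community
    for a, b, w in edges:
        degree[label[a]] += w
        degree[label[b]] += w
        if label[a] == label[b]:
            inside[label[a]] += w
    return sum(inside[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)


def refine(edges):
    """Step 2 (stand-in): greedily merge communities while Q improves."""
    nodes = sorted({n for a, b, _ in edges for n in (a, b)})
    comms = [{n} for n in nodes]
    best = modularity(edges, comms)
    while len(comms) > 1:
        candidates = []
        for i, j in combinations(range(len(comms)), 2):
            rest = [c for k, c in enumerate(comms) if k not in (i, j)]
            merged = rest + [comms[i] | comms[j]]
            candidates.append((modularity(edges, merged), merged))
        score, merged = max(candidates, key=lambda t: t[0])
        if score <= best:  # no merge improves modularity: stop
            break
        best, comms = score, merged
    return comms


def resolve(pairs):
    """Run both steps: soft clusters first, then refine each one."""
    clusters = []
    for soft in soft_clusters(pairs):
        inner = [(a, b, w) for a, b, w in pairs if a in soft]
        clusters.extend(refine(inner))
    return clusters


# Hypothetical matched pairs: (record_id, record_id, similarity score).
pairs = [
    ("r1", "r2", 0.9), ("r1", "r3", 0.9), ("r2", "r3", 0.9),
    ("r4", "r5", 0.9), ("r4", "r6", 0.9), ("r5", "r6", 0.9),
    ("r3", "r4", 0.6),  # a weaker link bridging two dense groups
    ("r7", "r8", 0.8),
]
clusters = resolve(pairs)
print(sorted(sorted(c) for c in clusters))
```

On this toy input, the transitive closure alone places r1 through r6 in one soft cluster because of the single r3-r4 link; the modularity step separates them into two groups of three, which is the kind of precision gain the abstract attributes to the second step. The greedy merge recomputes modularity for every candidate pair, so it is only suitable for small soft clusters.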




Acknowledgement

This material is based upon work supported by the National Science Foundation under Award No. OIA-1946391.

Author information


Corresponding author

Correspondence to Islam Akef Ebeid.


Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Ebeid, I.A., Talburt, J.R., Siddique, M.A.S. (2022). Graph-Based Hierarchical Record Clustering for Unsupervised Entity Resolution. In: Latifi, S. (eds) ITNG 2022 19th International Conference on Information Technology-New Generations. Advances in Intelligent Systems and Computing, vol 1421. Springer, Cham. https://doi.org/10.1007/978-3-030-97652-1_14


  • DOI: https://doi.org/10.1007/978-3-030-97652-1_14


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-97651-4

  • Online ISBN: 978-3-030-97652-1

  • eBook Packages: Engineering (R0)
