EnAli: entity alignment across multiple heterogeneous data sources

  • Research Article
Frontiers of Computer Science

Abstract

Entity alignment is the problem of identifying which entities in one data source refer to the same real-world entity as entities in other data sources. Identifying entities across heterogeneous data sources is essential to many research fields, such as data cleaning, data integration, information retrieval and machine learning. The alignment process is expensive for large data sources, since it involves all tuples from two or more data sources, and it must also handle heterogeneous entity attributes. In this paper, we propose an unsupervised approach, called EnAli, to match entities across two or more heterogeneous data sources. EnAli employs a generative probabilistic model that incorporates heterogeneous entity attributes through the exponential family and handles missing values, and it uses a locality-sensitive hashing scheme to reduce the number of candidate tuples and speed up the alignment process. EnAli is highly accurate and efficient even without any ground-truth tuples. We evaluate EnAli on re-identifying entities from the same data source, as well as on aligning entities across three real data sources. Our experimental results show that the proposed approach outperforms the comparable baseline.
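
To make the candidate-reduction idea concrete, the following is a minimal sketch of locality-sensitive hashing used as a blocking step. The abstract does not say which LSH family EnAli uses, so the choice of MinHash with banding over character 3-grams is an assumption made for illustration, and all names here (`shingles`, `minhash_signature`, `candidate_pairs`, `num_hashes`, `bands`) are hypothetical rather than taken from the paper.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text, k=3):
    """Return the set of character k-grams of a whitespace/case-normalized string."""
    s = " ".join(text.lower().split())
    if len(s) < k:
        return {s}
    return {s[i:i + k] for i in range(len(s) - k + 1)}


def _hash(seed, token):
    """Deterministic 64-bit hash of a token under a given seed."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(tokens, num_hashes=32):
    """MinHash signature: the minimum hash value of the token set under each seed."""
    return tuple(min(_hash(seed, t) for t in tokens) for seed in range(num_hashes))


def candidate_pairs(records, num_hashes=32, bands=16):
    """Return pairs of record ids whose signatures agree on at least one band.

    Assumption (not from the paper): each record is a single string and
    similarity is Jaccard similarity over character 3-grams.
    """
    rows = num_hashes // bands
    buckets = defaultdict(set)
    for rid, text in records.items():
        sig = minhash_signature(shingles(text), num_hashes)
        for b in range(bands):
            # Records sharing an identical band slice land in the same bucket.
            buckets[(b, sig[b * rows:(b + 1) * rows])].add(rid)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Hypothetical toy records from two sources; only colliding pairs are compared later.
records = {"a1": "Ming Gao", "b1": "ming  gao", "c1": "zzzz qqqq"}
pairs = candidate_pairs(records)
```

Only the pairs returned by `candidate_pairs` would then be scored by a matching model, which avoids comparing every tuple in one source with every tuple in the others. Using more bands with fewer rows each admits more candidates (higher recall, more comparisons), while fewer bands with more rows each does the opposite.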

Acknowledgements

This work was supported by the National Key Research and Development Program of China (2016YFB1000905) and the National Natural Science Foundation of China (Grant Nos. U1401256, 61402177, 61672234, 61402180 and 61232002). It was also supported by NSF of Shanghai (14ZR1412600).

Author information

Corresponding author

Correspondence to Ming Gao.

Additional information

Chao Kong is a PhD candidate majoring in Computer Science and Technology at East China Normal University, China. He received his Bachelor’s and Master’s degrees from Anhui Normal University, China in 2008 and 2012, respectively. His research interests include Web data management and data mining.

Ming Gao is an associate professor at the Institute for Data Science and Engineering, East China Normal University (ECNU), China. Prior to joining ECNU, he worked as a postdoctoral fellow at LARC in the School of Information Systems, Singapore Management University, Singapore. He received his PhD degree from the School of Computer Science, Fudan University, China in 2011. His research interests include uncertain data management, streaming data processing, social network analysis and data mining. His work appears in major international journals and conferences, including TKDE, DMKD, SIGIR, ICDE, ICDM, and DASFAA.

Chen Xu is a senior researcher in the Database Systems and Information Management (DIMA) Group, Technische Universität Berlin, Germany. He received his PhD degree from East China Normal University, China in 2014 and his Bachelor’s degree from Hefei University of Technology, China in 2009. His research interest is large-scale distributed data management.

Yunbin Fu is a postdoctoral researcher at the Institute for Data Science and Engineering, East China Normal University, China. He received his PhD in applied mathematics from Shanghai University, China in 2013. His research interests include data science and machine learning.

Weining Qian is currently a professor of computer science at East China Normal University, China. He received his MS and PhD degrees in computer science from Fudan University, China in 2001 and 2004, respectively. He served as the co-chair of the WISE 2012 Challenge and as a program committee member of several international conferences, including ICDE 2009/2010/2012 and KDD 2013. His research interests include Web data management and mining of massive data sets.

Aoying Zhou is a professor of computer science at East China Normal University (ECNU), China, where he heads the Institute of Massive Computing. He is a winner of the National Science Fund for Distinguished Young Scholars supported by NSFC and holds a professorship appointment under the Changjiang Scholarship Program of the Ministry of Education. Before joining ECNU in 2008, he worked in the Computer Science Department at Fudan University from 1993 to 2007, where he served as the department chair from 1999 to 2002. He was a visiting scholar under the Berkeley Scholar Program at UC Berkeley in 2005. He is now the vice-director of ACM SIGMOD China and the Technology Committee on Database of the China Computer Federation. He serves on the editorial boards of several prestigious academic journals, such as the VLDB Journal and the WWW Journal. His research interests include Web data management, data management for data-intensive computing, and in-memory data analytics.

About this article

Cite this article

Kong, C., Gao, M., Xu, C. et al. EnAli: entity alignment across multiple heterogeneous data sources. Front. Comput. Sci. 13, 157–169 (2019). https://doi.org/10.1007/s11704-017-6561-3

  • DOI: https://doi.org/10.1007/s11704-017-6561-3
