ScLink: supervised instance matching system for heterogeneous repositories

Journal of Intelligent Information Systems

Abstract

Instance matching is the task of finding co-referent instances, that is, instances that describe the same real-world object, across two different repositories. Heterogeneity, meaning differences in objects’ attributes and in repositories’ schemas, makes this problem challenging and limits the accuracy of existing solutions. To match instances of heterogeneous repositories, a matching system can follow a configuration that specifies the equivalent properties, suitable similarity metrics, and other important parameters. This configuration can be created manually or learned automatically. We present ScLink, an instance matching system that generates a configuration automatically. ScLink implements two novel supervised learning algorithms, cLearn and minBlock. cLearn applies an apriori-like heuristic to find the optimal combination of matching properties and similarity metrics. minBlock finds a blocking model that aims to optimally reduce the number of pairwise alignments of instances between the input repositories. In addition, ScLink introduces other techniques to address scalability on large repositories. Experimental results on standard and very large datasets show that minBlock and cLearn are very effective and efficient. cLearn is also significantly better than existing configuration learning algorithms. It drastically boosts the accuracy of ScLink and makes the system outperform state-of-the-art systems, even when trained on a small amount of labeled data.
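To make the notions of a configuration and of blocking concrete, the following is a minimal sketch and not ScLink's implementation. It assumes a hand-written configuration (the abstract's cLearn would learn one), a simple shared-token blocking key (not the model that minBlock learns), and two illustrative similarity metrics. All names, properties, and data (`CONFIG`, `SOURCE`, `TARGET`, `string_sim`, `token_jaccard`) are hypothetical.

```python
from difflib import SequenceMatcher


def string_sim(a, b):
    """Character-level similarity in [0, 1], case-insensitive."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def token_jaccard(a, b):
    """Jaccard overlap of whitespace-separated tokens, case-insensitive."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


# Hypothetical configuration: equivalent properties, one metric per
# property pair, a decision threshold, and the properties used for blocking.
CONFIG = {
    "mappings": [("name", "label", string_sim),
                 ("city", "location", token_jaccard)],
    "threshold": 0.8,
    "blocking_key": ("name", "label"),
}

# Toy repositories with different schemas (illustrative data only).
SOURCE = {
    "s1": {"name": "Tokyo Tower", "city": "Tokyo"},
    "s2": {"name": "Eiffel Tower", "city": "Paris"},
}
TARGET = {
    "t1": {"label": "Tokyo Tower", "location": "Tokyo"},
    "t2": {"label": "Eiffel Tower", "location": "Paris"},
    "t3": {"label": "Louvre Museum", "location": "Paris"},
}


def candidates(source, target, config):
    """Blocking: keep only pairs that share a token in the blocking key."""
    src_prop, tgt_prop = config["blocking_key"]
    index = {}  # token -> set of target identifiers containing it
    for t_id, inst in target.items():
        for tok in inst.get(tgt_prop, "").lower().split():
            index.setdefault(tok, set()).add(t_id)
    pairs = set()
    for s_id, inst in source.items():
        for tok in inst.get(src_prop, "").lower().split():
            for t_id in index.get(tok, ()):
                pairs.add((s_id, t_id))
    return pairs


def match(source, target, config):
    """Score each candidate pair as the mean similarity over the mappings."""
    results = []
    for s_id, t_id in sorted(candidates(source, target, config)):
        scores = [metric(source[s_id].get(p, ""), target[t_id].get(q, ""))
                  for p, q, metric in config["mappings"]]
        score = sum(scores) / len(scores)
        if score >= config["threshold"]:
            results.append((s_id, t_id, round(score, 3)))
    return results


if __name__ == "__main__":
    print(len(candidates(SOURCE, TARGET, CONFIG)), "candidate pairs")
    print(match(SOURCE, TARGET, CONFIG))
```

On this toy data, blocking compares four of the six possible pairs, because `t3` shares no name token with any source instance. Averaging the per-property scores is only one possible aggregation, chosen here for brevity.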

Notes

  1. A redirected instance does not contain any information other than a URI linking to another instance that actually contains descriptions.

  2. This similarity measure may first be proposed in another paper, which is currently under review. Unlike that paper, this article empirically analyzes its effectiveness.

Author information

Corresponding author

Correspondence to Khai Nguyen.

About this article

Cite this article

Nguyen, K., Ichise, R. ScLink: supervised instance matching system for heterogeneous repositories. J Intell Inf Syst 48, 519–551 (2017). https://doi.org/10.1007/s10844-016-0426-3

  • DOI: https://doi.org/10.1007/s10844-016-0426-3
