eTuner: tuning schema matching software using synthetic scenarios

  • Special Issue Paper
The VLDB Journal

Abstract

Most recent schema matching systems assemble multiple components, each employing a particular matching technique. The domain user must then tune the system: select the right components to be executed and correctly adjust their numerous “knobs” (e.g., thresholds, formula coefficients). Tuning is skill- and time-intensive, but (as we show) without it the matching accuracy is significantly inferior. We describe eTuner, an approach to automatically tune schema matching systems. Given a schema S, we match S against synthetic schemas, for which the ground truth mapping is known, and find a tuning that demonstrably improves the performance of matching S against real schemas. To efficiently search the huge space of tuning configurations, eTuner works sequentially, starting with tuning the lowest level components. To increase the applicability of eTuner, we develop methods to tune a broad range of matching components. While the tuning process is completely automatic, eTuner can also exploit user assistance (whenever available) to further improve the tuning quality. We employed eTuner to tune four recently developed matching systems on several real-world domains. The results show that the matching systems tuned by eTuner achieve higher accuracy than the same systems tuned with currently possible tuning methods.
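
The core idea in the abstract, tuning against synthetic schemas whose ground truth is known, can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration and is not taken from the paper: the perturbation rules in `perturb`, the toy string-similarity matcher in `match`, the single `threshold` knob, and all function names are hypothetical. The sketch tunes one knob of one component only; it does not model eTuner's sequential, bottom-up search over many components.

```python
import difflib
import random


def perturb(name, rng):
    """Hypothetical perturbation rules (not from the paper): rename a column."""
    choice = rng.choice(["drop_vowels", "truncate", "prefix"])
    if choice == "drop_vowels":
        # Keep the first character so the result is never empty.
        return name[0] + "".join(c for c in name[1:] if c not in "aeiou")
    if choice == "truncate":
        return name[: max(3, len(name) // 2)]
    return "tbl_" + name


def make_synthetic(schema, rng):
    """Build a synthetic schema from `schema`; the true mapping is known by construction."""
    truth = {col: perturb(col, rng) for col in schema}
    synthetic = list(truth.values())
    rng.shuffle(synthetic)
    return synthetic, truth


def match(source, target, threshold):
    """Toy matcher: map each source column to its most similar target name,
    but only if the similarity reaches the `threshold` knob."""
    result = {}
    for s in source:
        scored = [(difflib.SequenceMatcher(None, s, t).ratio(), t) for t in target]
        score, best = max(scored)
        if score >= threshold:
            result[s] = best
    return result


def f1(predicted, truth):
    """F1 of a predicted mapping against the known ground truth."""
    correct = sum(1 for s, t in predicted.items() if truth.get(s) == t)
    if correct == 0:
        return 0.0
    precision = correct / len(predicted)
    recall = correct / len(truth)
    return 2 * precision * recall / (precision + recall)


def tune_threshold(schema, candidates, n_scenarios=20, seed=0):
    """Pick the candidate threshold with the best average F1 over synthetic scenarios."""
    rng = random.Random(seed)
    scenarios = [make_synthetic(schema, rng) for _ in range(n_scenarios)]

    def average_f1(threshold):
        scores = [f1(match(schema, syn, threshold), truth) for syn, truth in scenarios]
        return sum(scores) / len(scores)

    return max(candidates, key=average_f1)


if __name__ == "__main__":
    schema = ["customer_name", "order_date", "unit_price"]
    print(tune_threshold(schema, [0.1, 0.3, 0.5, 0.7, 0.9]))
```

Because each synthetic schema is derived from S itself, the correct mapping is available without manual labeling, which is what makes an automatic search over knob settings possible in this sketch.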

Author information

Corresponding author

Correspondence to Yoonkyong Lee.

About this article

Cite this article

Lee, Y., Sayyadian, M., Doan, A. et al. eTuner: tuning schema matching software using synthetic scenarios. The VLDB Journal 16, 97–122 (2007). https://doi.org/10.1007/s00778-006-0024-z

  • DOI: https://doi.org/10.1007/s00778-006-0024-z
