Double-layered schema integration of heterogeneous XML sources

https://doi.org/10.1016/j.jss.2010.07.055

Abstract

Schema integration aims to create a mediated schema as a unified representation of existing heterogeneous sources sharing a common application domain. These sources have been increasingly written in XML due to its versatility and expressive power. Unfortunately, these sources often use different elements and structures to express the same concepts and relations, thus causing substantial semantic and structural conflicts. Such a challenge impedes the creation of high-quality mediated schemas and has not been adequately addressed by existing integration methods. In this paper, we propose a novel method, named XINTOR, for automating the integration of heterogeneous schemas. Given a set of XML sources and a set of correspondences between the source schemas, our method aims to create a complete and minimal mediated schema: it completely captures all of the concepts and relations in the sources without duplication, provided that the concepts do not overlap. Our contributions are fourfold. First, we resolve structural conflicts inherent in the source schemas. Second, we introduce a new statistics-based measure, called path cohesion, for selecting concepts and relations to be a part of the mediated schema. The path cohesion is statistically computed based on multiple path quality dimensions such as average path length and path frequency. Third, we resolve semantic conflicts by augmenting the semantics of similar concepts with context-dependent information. Finally, we propose a novel double-layered mediated schema to retain a wider range of concepts and relations than existing mediated schemas, which are at best either complete or minimal, but not both. Experiments on both real and synthetic datasets show that XINTOR outperforms existing methods with respect to (i) the mediated-schema quality, measured using precision, recall, F-measure, and schema minimality; and (ii) the execution performance, measured using execution time and scale-up performance.

Introduction

Extensible Markup Language (XML) has become a standard data format widely used in many organizations, thus leading to a growing need for exchanging and integrating multiple XML data sources across different application systems. Integrating these sources requires an integration of their schemas. Schema integration aims to create a so-called mediated schema which works as a uniform query interface to a multitude of data sources, freeing users from learning their underlying schemas. However, since these XML sources are independently developed, they are highly heterogeneous in both semantics and structure (see an example in Fig. 1). Despite a long-standing research effort in the field (Batini et al., 1986, Buneman et al., 1992, Hull, 1984, Kalinichenko, 1990, Miller et al., 1993, Lenzerini, 2002, Pottinger and Bernstein, 2002, Chiticariu et al., 2008, He and Chang, 2003, Magnani et al., 2005, Sarma et al., 2008, Saleem et al., 2008), schema heterogeneity still remains a major bottleneck in deploying large-scale integration systems (Doan et al., 2001).

Semantic and structural conflicts are two prevalent challenges that cause schema heterogeneity. Semantic conflicts arise when different sources describe the same concept using different element names, or when similar concepts in different sources have overlapping meanings. In Fig. 1, all three sources describe the same mobile-phone domain, but S1 refers to a product catalogue, S2 to user reviews, and S3 to mobile plans. The same concept of a “mobile phone” can be described using different terms: CellPhone in S1, mobile-phone in S2 and mobile in S3. Similarly, structural conflicts occur when the same relations are expressed by different XML structures. For example, two different paths, mobile-phone/nearest-store in S2 and branch/plan/mobile in S3, describe the same relation “a mobile phone being sold in a store”. Obviously, as the number of sources increases, the problem of semantic and structural heterogeneity becomes much worse.
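To make the running example concrete, the name conflicts above can be recorded as simple value correspondences that map each source-specific element name to one shared concept. The element names come from the paper's Fig. 1; the dictionary layout and the `normalize_path` helper below are our own illustrative sketch, not part of XINTOR.

```python
# Hypothetical sketch: Fig. 1's conflicting element names recorded as a
# correspondence table mapping source-specific names to one shared concept.
CORRESPONDENCES = {
    "CellPhone": "MOBILE",      # S1: product catalogue
    "mobile-phone": "MOBILE",   # S2: user reviews
    "mobile": "MOBILE",         # S3: mobile plans
}

def normalize_path(path):
    """Rewrite a source path using the shared concept names."""
    return "/".join(CORRESPONDENCES.get(step, step.upper())
                    for step in path.split("/"))

# Two structurally different paths describing the same relation
# ("a mobile phone being sold in a store"):
p2 = normalize_path("mobile-phone/nearest-store")   # from S2
p3 = normalize_path("branch/plan/mobile")           # from S3
```

Note that after name normalization the two paths still differ structurally (MOBILE appears first in one and last in the other), which is exactly the structural conflict that name correspondences alone cannot resolve.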

To resolve these conflicts, we need to create a mediated schema that is both complete and minimal. Being “complete” means that the mediated schema contains all of the relations in the sources. Note that preserving all relations implies the preservation of all concepts. Being “minimal” means that each of these relations appears only once, without redundancy. This mediated schema is desirable because it preserves all of the information—relations and concepts—from the sources, in a compact form.
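The two desiderata can be stated as small executable checks. This is a minimal sketch assuming relations are modelled as parent/child element pairs; the `is_complete` and `is_minimal` helpers and the toy data are illustrative names, not the paper's definitions.

```python
# Illustrative checks of the two desiderata, assuming each relation is
# a (parent, child) element pair; names and layout are assumptions.

def is_complete(mediated, sources):
    """Every relation from every source appears in the mediated schema."""
    required = set().union(*sources)
    return required <= set(mediated)

def is_minimal(mediated):
    """No relation appears more than once."""
    return len(mediated) == len(set(mediated))

s1 = {("MOBILE", "PRICE")}
s2 = {("MOBILE", "STORE")}
good = [("MOBILE", "PRICE"), ("MOBILE", "STORE")]                      # complete and minimal
bad  = [("MOBILE", "PRICE"), ("MOBILE", "STORE"), ("MOBILE", "STORE")]  # complete, not minimal
```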

Current methods, due to the way they automatically resolve conflicts, produce mediated schemas that are still far from optimal. Consider the two mediated schemas in Fig. 2a1 and a2. The former is complete but non-minimal. A natural integration strategy (e.g., PORSCHE (Saleem et al., 2008)) is based on pattern-growth techniques (Zaki, 2005) to expand its schema tree. Whenever the same concept appears again (e.g., STORE or NAME), that concept is duplicated and attached to the tree. At the same time, the newly created path (e.g., STORE/NAME, MOBILE/STORE) is also duplicated, making the schema non-minimal. The process ends when all of the source concepts are integrated, i.e., when the schema is complete. The latter mediated schema is minimal but incomplete because it integrates only overlapping concepts while ignoring source-specific ones. Answering user queries with these two mediated schemas either inflates the search space with duplicate concepts and relations in the former case, or fails to answer source-specific queries in the latter.

In this paper, we propose a novel method called XINTOR (short for XML schema integrator) for integrating multiple heterogeneous XML sources. We introduce a new notion of a double-layered mediated schema that effectively integrates multiple heterogeneous XML sources. We assume that there is no overlap between concepts in different sources. Our approach favors the integration of relations over that of separate concepts because the relations—together with their associated concepts—carry richer domain information. Moreover, integrating a relation naturally integrates the concepts embedded within that relation, but the reverse is not necessarily true. In this paper, the completeness measure refers to the extent to which both concepts and relations are preserved. Given a set of similarity scores between element names (also known as value correspondences (Pottinger and Bernstein, 2002)), we can create this mediated schema without human intervention.

Exploiting various structural properties, we resolve three types of structural conflict: nesting discrepancy, backward path and structural diversity. To select relations for the mediated schema, we introduce a new statistics-based measure, called path cohesion, to evaluate a relation on multiple quality dimensions: path frequency, association strength and semantic closeness. Our integration process also augments the concepts with context-dependent information, thus making them more contextually meaningful when associated with other concepts. We integrate the selected relations into our proposed double-layered mediated schema: the core layer captures the most domain-central concepts and structures while the secondary layer retains source-specific details. Our double-layered mediated schema aims to integrate all of the concepts and relations from the sources—in a minimal manner. Our experimental results show that our approach outperforms traditional methods in terms of (i) the output quality (measured by precision, recall, F-measure, and schema minimality), and (ii) the integration performance (measured by scale-up performance and execution time).

The remainder of this article is organized as follows. In the next section, we define the problem of schema integration. In Section 3, we provide an overview of our integration approach. In Section 4, we determine candidate relations with context-independent concept clustering and structural conflict resolution. In Section 5, we select relations using the statistical cohesion measure and context refinement. In Section 6, we describe how to create double-layered mediated schemas from the selected relations. In Section 7, we present experimental evaluations in terms of precision, recall, F-measure, schema minimality and performance. Finally, we present the related work in Section 8 and conclude in Section 9.

Section snippets

Problem definition

This section formally defines the problem of integrating heterogeneous schemas, and describes some technical challenges facing current methods.

Definition 1 (XML schema integration problem)

Given a set of XML sources, a set of semantic correspondences between their elements, and a set of thresholds and weights, create a mediated schema that satisfies both schema completeness and schema minimality.

To illustrate the definition above, consider the example in Fig. 1. Given three XML sources S1, S2 and S3, we encode them as schema trees (Nguyen

Overview of our approach

With the goal of creating a complete and minimal mediated schema, our integration approach proceeds through three main steps: (i) pairwise conflict resolution, (ii) statistics-based conflict resolution and (iii) creating the mediated schema (as described in Fig. 3). The pairwise resolution locally resolves conflicts between pairs of substructures, and lays a foundation for the subsequent statistics-based resolution step.
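The three steps might be organised as a small pipeline. Only the step order comes from the paper; the function names and placeholder bodies below are assumptions for illustration.

```python
# Bird's-eye sketch of the three-step flow; placeholder implementations.

def pairwise_conflict_resolution(sources, correspondences):
    # Placeholder: pool all source relations, renaming via correspondences.
    return [correspondences.get(r, r) for src in sources for r in src]

def statistics_based_selection(candidates):
    # Placeholder: keep each candidate relation once, preserving order.
    return list(dict.fromkeys(candidates))

def create_mediated_schema(selected):
    # Placeholder: split into core/secondary layers (here: all core).
    return {"core": selected, "secondary": []}

def integrate(sources, correspondences):
    candidates = pairwise_conflict_resolution(sources, correspondences)  # step (i)
    selected = statistics_based_selection(candidates)                    # step (ii)
    return create_mediated_schema(selected)                              # step (iii)

mediated = integrate([["a", "b"], ["b", "c"]], {})
```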

We start by resolving conflicts between the sources by clustering

Pairwise conflict resolution

In this section, we present in detail the first step of XINTOR, which consists of resolving conflicts between pairs of substructures, identifying candidate relations (which have great potential to be integrated into the final mediated schema), and obtaining a set of semantic mappings between the mediated schema and the sources. To achieve this, we first cluster context-independent concepts using the set of given correspondences. Then, we resolve several major structural conflicts among the
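One plausible realisation of the clustering step is to treat each given correspondence as an edge and take connected components, for example with a small union-find; this sketch is ours and is not the paper's algorithm.

```python
# Illustrative clustering of concept names from pairwise correspondences,
# using a small union-find to collect connected components.

def cluster_concepts(correspondences):
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in correspondences:
        parent[find(a)] = find(b)  # union the two components

    clusters = {}
    for name in parent:
        clusters.setdefault(find(name), set()).add(name)
    return list(clusters.values())

# The three "mobile phone" names from Fig. 1 collapse into one cluster:
clusters = cluster_concepts([("CellPhone", "mobile-phone"),
                             ("mobile-phone", "mobile")])
```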

Statistics-based conflict resolution

In this section, we present the second step of XINTOR, where the goal is to resolve semantic and structural conflicts using statistical measurements. To do so, we define a statistics-based measure, termed path cohesion, to calculate the extent to which a relation is semantically relevant to the domain. This measure combines multiple quality dimensions of a relation and will be used to select the most qualified relations for the mediated schema. Naturally, the concepts associated with the
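A hedged sketch of such a measure: the three quality dimensions are named in the paper, but this particular weighted sum, the default weights, and the toy values are our assumptions.

```python
# Illustrative cohesion score combining the three quality dimensions
# (each assumed normalised to [0, 1]) as a weighted sum.

def path_cohesion(freq, assoc, closeness, weights=(0.4, 0.3, 0.3)):
    """Combine path frequency, association strength and semantic
    closeness into a single score in [0, 1]."""
    wf, wa, wc = weights
    return wf * freq + wa * assoc + wc * closeness

# A frequent, strongly associated path should outrank a rare, weak one:
score_hi = path_cohesion(0.9, 0.8, 0.7)
score_lo = path_cohesion(0.1, 0.2, 0.3)
```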

Creating double-layered mediated schema

In this section, we concentrate on the third step of XINTOR, where the objective is to create a double-layered mediated schema based on the relations (together with their associated concepts) selected during the previous step. Our proposed mediated schema consists of two layers: the core layer, including important domain-central concepts and relations; and the secondary layer, covering non-essential source-specific information. The algorithm createMediatedSchema (in Listing 1) sketches the
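The two-layer construction can be illustrated as a partition of scored relations by a cohesion threshold; the threshold value, the function signature and the data layout here are assumptions, not a reproduction of the paper's Listing 1.

```python
# Illustrative partition of selected relations into the two layers:
# high-cohesion relations go to the core layer, the rest to the
# secondary layer. Threshold and layout are assumptions.

def create_mediated_schema(scored_relations, core_threshold=0.5):
    """scored_relations: list of (relation, cohesion) pairs."""
    schema = {"core": [], "secondary": []}
    for relation, cohesion in scored_relations:
        layer = "core" if cohesion >= core_threshold else "secondary"
        schema[layer].append(relation)
    return schema

schema = create_mediated_schema([(("MOBILE", "PRICE"), 0.8),
                                 (("MOBILE", "WARRANTY"), 0.2)])
```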

Empirical evaluation

In this section, we evaluate our method in two respects: the output quality of the mediated schema and the execution performance. The output quality measures how well the mediated schema represents the source schemas via the semantic mappings; the execution performance compares how fast the algorithms run. The mediated-schema quality is measured using precision, recall, F-measure (Do and Rahm, 2007), and schema minimality. Since our mediated schema integrates all of the concepts and relations
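The quality metrics can be computed from the sets of relations in the produced mediated schema and a reference schema. The precision/recall/F-measure definitions below are the standard ones; the simple minimality ratio (distinct relations over total relations) is our assumption and may differ from the formula used in the paper's evaluation.

```python
# Standard precision/recall/F-measure over relations, plus a simple
# duplicate-free ratio as a minimality proxy.

def evaluate(mediated, reference):
    found = set(mediated)
    truth = set(reference)
    precision = len(found & truth) / len(found)
    recall = len(found & truth) / len(truth)
    f_measure = 2 * precision * recall / (precision + recall)
    minimality = len(found) / len(mediated)  # 1.0 means no duplicates
    return precision, recall, f_measure, minimality

# Toy example: one duplicate ("b") and one missing relation ("c").
p, r, f, m = evaluate(["a", "b", "b"], ["a", "b", "c"])
```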

Related work

Schema integration is a long-standing research topic (Batini et al., 1986, Buneman et al., 1992, Chiticariu et al., 2008, Hull, 1984, Kalinichenko, 1990, He and Chang, 2003, Magnani et al., 2005, Miller et al., 1993, Pottinger and Bernstein, 2002, Sarma et al., 2008), and continues to be a difficult problem in practice. Previous work mostly focuses on studying theoretical aspects of schema integration (Batini et al., 1986, Buneman et al., 1992, Hull, 1984, Kalinichenko, 1990, Lenzerini, 2002,

Conclusion

In this paper, we presented a novel double-layered schema integration method. Our method generates a complete and minimal mediated schema while accommodating the structural heterogeneity of the source schemas, which results from nesting discrepancy, backward path and structural diversity. Our mediated schema has two layers: the core layer models concepts and relations that are central to the domain, while the secondary layer retains source-specific details. We proposed a new

Acknowledgments

We thank the anonymous reviewers for their useful comments on this paper. Early versions of this article were published in the proceedings of IEEE/WIC/ACM Web Intelligence (Nguyen et al., 2008c), IEEE/AINA WAMIS (Nguyen et al., 2008b) and OTM Conferences (Nguyen et al., 2008a). The first author is supported by La Trobe University Postgraduate Research Scholarship and the CS and CE studentship.

Hong-Quang Nguyen is currently a sessional lecturer and a PhD candidate at the Department of Computer Science and Computer Engineering, La Trobe University, Australia. He works on various advanced topics on data integration, data mining, Semantic Web and ontology, software engineering, information and knowledge management. He is an active member of the Data Engineering and Knowledge Management Laboratory at La Trobe University. He can be contacted at [email protected].

References (36)

  • M. da Conceição Moraes Batista et al. Information quality measurement in data integration schemas.
  • Do, H.-H., 2006. Schema Matching and Mapping-based Data Integration. Ph.D. thesis. Dept of Computer Science, University...
  • Doan, A., 2002. Learning to Map Between Structured Representations of Data. Ph.D. thesis. University of Washington,...
  • A. Doan et al. Reconciling schemas of disparate data sources: a machine-learning approach.
  • M. Ehrig. Ontology Alignment: Bridging the Semantic Gap. Vol. 4 of Semantic Web and Beyond: Computing for Human Experience (2007).
  • B. He et al. Statistical schema matching across web query interfaces.
  • R. Hull. Relative information capacity of simple relational database schemata.
  • L.A. Kalinichenko. Methods and tools for equivalent data model mapping construction.



    David Taniar holds Bachelor, Master, and PhD degrees - all in Computer Science, with a particular specialty in Databases. His current research is applying data management techniques to various domains, including mobile and geography information systems, parallel and grid computing, web engineering, and data mining. Every year he publishes extensively, including his recent co-authored book: High Performance Parallel Database Processing and Grid Databases (John Wiley & Sons, 2008). His list of publications can be viewed at the DBLP server (http://www.informatik.uni-trier.de/∼ley/db/indices/a-tree/t/Taniar:David.html). He is a founding editor-in-chief of three SCI-E journals: Intl. J. of Data Warehousing and Mining, Mobile Information Systems, and Intl. J. of Web and Grid Services. He is currently an Associate Professor at the Faculty of Information Technology, Monash University, Australia. He can be contacted at [email protected].

    J. Wenny Rahayu is an Associate Professor at the Department of Computer Science and Computer Engineering, LaTrobe University. Her research areas cover a wide range of advanced databases topics including XML Databases, Spatial and Temporal Databases, Data Warehousing, Semantic Web and Ontology. She is currently the Head of Data Engineering and Knowledge Management Laboratory at La Trobe University.

    Kinh Nguyen received his BSc. Hns. and MSc. with Distinction from Canterbury University, New Zealand, and PhD in Computer Science from La Trobe University, Australia. He is currently a lecturer in Computer Science at La Trobe University and is a member of the Software Engineering Research Group. His main research interest is in the area of data-intensive systems development, web information systems, semantics web, formal specification and rigorous software development processes, and model-driven engineering. He can be contacted at [email protected].
