Elsevier

Journal of Web Semantics

Volume 7, Issue 3, September 2009, Pages 189-203

Hermes: Data Web search on a pay-as-you-go integration infrastructure

https://doi.org/10.1016/j.websem.2009.07.001

Abstract

The Web as a global information space is developing from a Web of documents to a Web of data. This development opens new ways for addressing complex information needs. Search is no longer limited to matching keywords against documents, but instead complex information needs can be expressed in a structured way, with precise answers as results. In this paper, we present Hermes, an infrastructure for data Web search that addresses a number of challenges involved in realizing search on the data Web. To provide an end-user oriented interface, we support expressive user information needs by translating keywords into structured queries. We integrate heterogeneous Web data sources with automatically computed mappings. Schema-level mappings are exploited in constructing structured queries against the integrated schema. These structured queries are decomposed into queries against the local Web data sources, which are then processed in a distributed way. Finally, heterogeneous result sets are combined using an algorithm called map join, making use of data-level mappings. In evaluation experiments with real life data sets from the data Web, we show the practicability and scalability of the Hermes infrastructure.

Introduction

The Web as a global information space is no longer only a Web of documents, but a Web of data—the data Web. In recent years, the amount of structured data available on the Web has been increasing rapidly. Currently, there are billions of triples publicly available in Web data sources of different domains. These data sources become more tightly interrelated as the number of links in the form of mappings is also growing. The process of interlinking open data sources is actively pursued within the Linking Open Data (LOD) project [2].

This development of a data Web opens a new way of addressing complex information needs. An example might be: “Find articles from Turing Award winners at Stanford University”. No single LOD data source can completely satisfy this information need. Yet by integrating the data sources DBLP, Freebase and DBpedia (all of them publicly available in LOD as RDF data), an answer can in principle be obtained: DBLP contains bibliographic metadata such as authors along with their affiliations, while more information about universities and award winners can be found in Freebase and DBpedia, respectively. Still, the effective exploitation of the data Web brings about a number of challenges:

  • Usability: Searching the data Web effectively requires the use of a structured query language. Yet one cannot assume that users know which data sources are relevant for answering a query, or what the schemas of those sources look like. The burden of translating an information need into a structured query should not be imposed on end users, as it would hinder the widespread exploitation of the data Web. Simple search paradigms adequate for the lay user are needed.

  • Heterogeneity: In order to fully exploit the data Web, available data sources need to be managed in an integrated way. However, data sources cover different, possibly overlapping domains. Data contained in different sources might be redundant, complementary or conflicting. We encounter discrepancies on the schema-level as well as the data-level, i.e. differences in the way the conceptualization, the identifiers and the data values of real world entities are represented. While the LOD project alleviates some of the heterogeneity problems by promoting the creation of links between data sources, such a (manual) upfront integration effort is only a partial solution. In order to deal with the dynamic nature and scale of the data Web, it needs to be complemented with mechanisms that can interrelate and reconcile heterogeneous sources (whose relationships might not be known a priori) in a continuous and automatic manner.

  • Scalability: The amount of data on the Web is ever increasing. The LOD project alone already contains roughly two billion RDF triples in more than 20 data sources. Clearly, efficient query answering that can scale to this amount of data is essential for data Web search.
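To make the example information need concrete, the following minimal sketch writes it as three triple patterns and groups them by the source that could answer each one. The predicate names (`author`, `wonAward`, `employedBy`) and the predicate-to-source assignment are illustrative assumptions, not the vocabularies actually used by DBLP, DBpedia or Freebase, and the grouping function is a simplification rather than the procedure used in Hermes.

```python
from collections import defaultdict

# Hypothetical triple patterns for "Find articles from Turing Award
# winners at Stanford University". Variables start with "?".
QUERY = [
    ("?article", "author", "?person"),
    ("?person", "wonAward", "Turing Award"),
    ("?person", "employedBy", "Stanford University"),
]

# Assumption: the predicates each source is able to answer.
SOURCE_PREDICATES = {
    "DBLP": {"author"},
    "DBpedia": {"wonAward"},
    "Freebase": {"employedBy"},
}


def decompose(query, source_predicates):
    """Group triple patterns by the sources able to answer them."""
    parts = defaultdict(list)
    for pattern in query:
        _, predicate, _ = pattern
        for source, predicates in source_predicates.items():
            if predicate in predicates:
                parts[source].append(pattern)
    return dict(parts)


parts = decompose(QUERY, SOURCE_PREDICATES)
for source, patterns in parts.items():
    print(source, patterns)
```

Under these assumptions every source receives exactly one pattern, so none of them can answer the query alone, and the shared variable `?person` has to be reconciled across sources that may identify the same person differently.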

To address the problems of integration in open data spaces such as the data Web, the pay-as-you-go paradigm for data integration has been proposed. According to Madhavan et al. [19], the main concepts for an affordable integration of the various data sources on the Web are approximate schema mappings, keyword queries with routing and heterogeneous result ranking. Integration is regarded as a process that begins with disparate data sources and continues with incremental improvement of semantic mappings amongst them. At any point during this ongoing integration, the system should be able to process queries using the available information and mappings. It thus differs from traditional data integration systems, which require a large upfront effort to manually create complete mappings for the available data sources.

In our paper, we follow the paradigm of pay-as-you-go integration and propose an infrastructure called Hermes that addresses the challenges discussed above:

  • Expressive keyword search: In Hermes, users can formulate queries in terms of keywords. These keywords are translated to the best (top-k) structured queries representing possible interpretations of the information need. Unlike approaches in existing systems (e.g. Sindice, Watson) that simply match keywords against an index of data elements, the results obtained using Hermes do not only match the keywords but also satisfy the structured query computed for the keywords. While existing approaches to keyword translation focus on a single data source [16], [13], [28], we propose a novel technique for the computation of queries that might span multiple data sources, i.e. distributed queries.

  • Integration of Web data sources: Hermes integrates publicly available data sources such that users can ask queries against the data Web in a transparent way. In order to support this, mappings at both the schema- and data-level are precomputed and stored in an index. Existing techniques are used for the actual computation of the mappings. This computation is embedded in a procedure that implements an iterative integration of Web data sources. In particular, it crawls data sources, extracts schemas, and automatically computes mappings as needed, i.e. only those mappings are precomputed that can be used for query processing. This substantially reduces the size of the data that have to be analyzed during the computation of mappings.

  • Efficient query processing: We present techniques for an efficient translation of keywords to structured queries. Instead of searching the entire data space for possible interpretations [16], [13], we construct a query space primarily composed of schema elements. Since it is much smaller than the data space, the search for interpretations can be performed more efficiently. For an efficient processing of the distributed queries computed from the keywords, we propose a special procedure for combining results from different data sources. In particular, we propose the map join, a variant of the similarity join [17], [24]. This form of join is necessary to combine information about the same entities that have different representations in different data sources. An important part of join processing is the computation of similarities. The map join procedure can leverage the precomputed data-level mappings and thereby avoid the expensive computation of similarities during online query processing.
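The idea behind the map join in the last item can be sketched as a hash join that looks up precomputed data-level mappings instead of computing similarities at query time. This is only an illustration under stated assumptions and not the algorithm as specified in the paper: the mapping format (source identifier to a list of target identifier and confidence pairs), the `threshold` parameter and all identifiers in the sample data are hypothetical.

```python
def map_join(left, right, left_key, right_key, mappings, threshold=0.8):
    """Join two result sets on entities linked by precomputed mappings.

    `mappings` maps an identifier from the left source to a list of
    (identifier in the right source, confidence) pairs. Both the format
    and the threshold are assumptions made for this sketch.
    """
    # Index right-hand rows by entity identifier for constant-time lookup.
    right_index = {}
    for row in right:
        right_index.setdefault(row[right_key], []).append(row)

    joined = []
    for row in left:
        # Use the stored mappings rather than computing similarities online.
        for target, score in mappings.get(row[left_key], []):
            if score < threshold:
                continue
            for match in right_index.get(target, []):
                joined.append({**row, **match})
    return joined


# Hypothetical partial results from two sources.
left = [
    {"article": "dblp:article/1", "author": "dblp:author/a"},
    {"article": "dblp:article/2", "author": "dblp:author/b"},
]
right = [
    {"winner": "dbpedia:Person_A", "award": "Turing Award"},
    {"winner": "dbpedia:Person_B", "award": "Turing Award"},
]
# Hypothetical precomputed data-level mappings with confidence scores.
mappings = {
    "dblp:author/a": [("dbpedia:Person_A", 0.95)],
    "dblp:author/b": [("dbpedia:Person_B", 0.40)],
}

result = map_join(left, right, "author", "winner", mappings)
print(result)
```

Because the similarity work is done offline, the online cost in this sketch is a dictionary lookup per left-hand row; the second author is dropped only because its assumed mapping confidence falls below the threshold.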

The rest of this paper is organized as follows: In Section 2, we introduce the underlying data and query models and the architecture of Hermes. We then discuss specific aspects of data and query processing in more detail: preprocessing and indexing in Section 3, translation of keywords into structured queries in Section 4, and the distributed processing of queries in Section 5. In Section 6, we report on our evaluation experiments performed with Hermes. Finally, after a discussion of related work in Section 7, we conclude in Section 8.

Section snippets

Hermes infrastructure

In this section we introduce the conceptual architecture of our Hermes infrastructure. Before discussing the components of the infrastructure in detail, we will define the data and queries involved in our data Web search setting.

Data preprocessing

This section describes the offline process where the data graphs are preprocessed and stored in specific data structures of the internal indices.

Keyword query translation

In this section, we describe the computation of possible interpretations of the user keywords. These interpretations are presented to the user in the form of query graphs. For computing such query graphs from keywords, Ref. [28] proposes a procedure consisting of three main steps: (1) construction of the query search space, (2) top-k query graph exploration, and (3) query graph ranking. We extend this work on keyword search to the data Web scenario. Instead of a single data source, the

Distributed query processing

Query translation results in a list of top-k query graphs. Distributed query processing is the subsequent step that starts with the query graph gq selected by the user. The query graph is decomposed into parts such that each part can be evaluated against a particular data graph. Before routing, each part needs to be mapped to the query format supported by the local query engines. For optimizing performance, a planner is employed to determine an appropriate order of query execution. Finally,

Evaluation experiments

We will now discuss experiments we have performed with a system implementing the Hermes infrastructure. The goal of the experiments is to show the performance and effectiveness of our system with real life data sets available on the data Web.

Related work

There exist several dimensions of related work. We structure our discussion along the presentation of our contributions: (1) infrastructures for data Web search, (2) keyword query translation, and (3) federated query processing.

Conclusions

We have presented Hermes, an infrastructure for search on the data Web. In the realization of Hermes, we have presented a number of original contributions: We have proposed a novel technique for translating user keywords to structured queries against heterogeneous Web data sources. Further, we have designed a number of indices that are needed in order to realize efficient search over the data Web. Finally, we have elaborated on techniques for distributed query processing on the data Web,

References (29)

  • D.W. Embley et al., Automatic direct and indirect schema mapping: experiences and lessons learned, SIGMOD Rec. (2004)
  • J. Euzenat et al., Ontology Matching (2007)
  • R. Goldman et al., DataGuides: enabling query formulation and optimization in semistructured databases
  • L. Guo et al., XRANK: ranked keyword search over XML documents

1 Present address: Fluid Operations, D-69190 Walldorf, Germany.
