Hermes: Data Web search on a pay-as-you-go integration infrastructure
Introduction
The Web as a global information space is no longer only a Web of documents, but a Web of data—the data Web. In recent years, the amount of structured data available on the Web has been increasing rapidly. Currently, there are billions of triples publicly available in Web data sources of different domains. These data sources are becoming more tightly interrelated as the number of links between them, in the form of mappings, grows. The process of interlinking open data sources is actively pursued within the Linking Open Data (LOD) project [2].
This development of a data Web opens a new way for addressing complex information needs. An example might be: “Find articles from Turing Award winners at Stanford University”. No single LOD data source can completely satisfy our example information need. Yet, with the integration of the data sources DBLP, Freebase and DBpedia – all of them publicly available in LOD as RDF data – an answer in principle can be obtained: DBLP contains bibliographic metadata such as authors along with their affiliations, and more information about universities and award winners can be found in Freebase and DBpedia, respectively. Still, the effective exploitation of the data Web brings about a number of challenges:
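To make the cross-source nature of this example concrete, the following sketch joins toy in-memory stand-ins for the three sources. The data and the flat dictionary representation are our illustrative assumptions; the actual sources are RDF graphs with their own vocabularies.

```python
# Toy stand-ins for three LOD sources (hypothetical records, simplified schema).
dblp = [  # bibliographic metadata: article title + author
    {"article": "On Computable Numbers", "author": "Alan Turing"},
    {"article": "A Relational Model of Data", "author": "Edgar F. Codd"},
    {"article": "The Art of Programming Notes", "author": "Donald E. Knuth"},
]
dbpedia_turing_winners = {"Edgar F. Codd", "Donald E. Knuth", "John McCarthy"}
freebase_affiliation = {  # person -> institution
    "Donald E. Knuth": "Stanford University",
    "Edgar F. Codd": "IBM",
}

# The example information need as a join over all three sources:
# articles whose author won the Turing Award and is at Stanford University.
answers = [
    r["article"]
    for r in dblp
    if r["author"] in dbpedia_turing_winners
    and freebase_affiliation.get(r["author"]) == "Stanford University"
]
print(answers)  # ['The Art of Programming Notes']
```

No single source can answer the query alone: DBLP contributes the articles and authors, DBpedia the award winners, and Freebase the affiliations.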
Usability: Searching the data Web effectively requires the use of a structured query language. Yet one cannot assume that users know which data sources are relevant for answering a query, or what their schemas look like. The burden of translating an information need into a structured query should not be imposed on end users, as it would hinder the widespread exploitation of the data Web. Simple search paradigms adequate for lay users are needed.
Heterogeneity: In order to fully exploit the data Web, available data sources need to be managed in an integrated way. However, data sources cover different, possibly overlapping domains. Data contained in different sources might be redundant, complementary or conflicting. We encounter discrepancies on the schema level as well as the data level, i.e. differences in the way the conceptualization, the identifiers and the data values of real-world entities are represented. While the LOD project alleviates some of the heterogeneity problems by promoting the creation of links between data sources, such a (manual) upfront integration effort is only a partial solution. In order to deal with the dynamic nature and scale of the data Web, it needs to be complemented with mechanisms that can interrelate and reconcile heterogeneous sources (whose relationships might not be known a priori) in a continuous and automatic manner.
Scalability: The amount of data on the Web is ever increasing. The LOD project alone already contains roughly two billion RDF triples in more than 20 data sources. Clearly, efficient query answering that can scale to this amount of data is essential for data Web search.
To address the problems of integration in open data spaces such as the data Web, the pay-as-you-go paradigm to data integration has been proposed. According to Madhavan et al. [19], the main concepts for an affordable integration of the various data sources on the Web are approximate schema mappings, keyword queries with routing and heterogeneous result ranking. Integration is regarded as a process that begins with disparate data sources and continues with incremental improvement of semantic mappings amongst them. At any point during this ongoing integration, the system should be able to process queries using the available information and mappings. Thus it is different from traditional data integration systems that require large upfront effort to manually create complete mappings for the available data sources.
In our paper, we follow the paradigm of pay-as-you-go integration and propose an infrastructure called Hermes that addresses the challenges discussed above:
- Expressive keyword search: In Hermes, users can formulate queries in terms of keywords. These keywords are translated into the best (top-k) structured queries representing possible interpretations of the information need. Unlike approaches in existing systems (e.g. Sindice, Watson) that simply match keywords against an index of data elements, the results obtained using Hermes not only match the keywords but also satisfy the structured query computed for them. While existing approaches to keyword translation focus on a single data source [16], [13], [28], we propose a novel technique for computing queries that may span multiple data sources, i.e. distributed queries.
- Integration of Web data sources: Hermes integrates publicly available data sources so that users can query the data Web in a transparent way. To support this, mappings at both the schema and data level are precomputed and stored in an index. Existing techniques are used for the actual computation of the mappings. This computation is embedded in a procedure that implements an iterative integration of Web data sources: it crawls data sources, extracts schemas, and automatically computes mappings as needed, i.e. only those mappings that can be used for query processing are precomputed. This substantially reduces the amount of data that has to be analyzed during the computation of mappings.
- Efficient query processing: We present techniques for efficiently translating keywords into structured queries. Instead of searching the entire data space for possible interpretations [16], [13], we construct a query space primarily composed of schema elements. Since it is much smaller than the data space, the search for interpretations can be performed more efficiently. For efficient processing of the distributed queries computed from the keywords, we propose a special procedure for combining results from different data sources. In particular, we propose the map join, a variant of the similarity join [17], [24]. This form of join is necessary to combine information about the same entities that have different representations in different data sources. An important part of the join processing is the computation of similarities. The map join can leverage the precomputed data-level mappings and thereby avoid the expensive computation of similarities during online query processing.
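The map join idea can be sketched as follows. This is a minimal illustration under our own assumptions (the index layout, function names and example URIs are hypothetical, not Hermes' actual interfaces): equivalences between entity identifiers of two sources are precomputed offline with a similarity score, so the join at query time is a cheap index lookup rather than an online similarity computation.

```python
# Precomputed data-level mapping index: source-A URI -> (source-B URI, similarity).
mapping_index = {
    "dblp:Jim_Gray": ("dbpedia:Jim_Gray_(computer_scientist)", 0.93),
    "dblp:Michael_Stonebraker": ("dbpedia:Michael_Stonebraker", 0.97),
}

def map_join(left_rows, right_rows, threshold=0.9):
    """Join rows from two sources via the precomputed mapping index,
    instead of computing string similarities at query time."""
    right_by_uri = {r["uri"]: r for r in right_rows}
    for left in left_rows:
        hit = mapping_index.get(left["uri"])
        if hit is None:
            continue  # no known counterpart in the other source
        right_uri, score = hit
        if score >= threshold and right_uri in right_by_uri:
            yield {**left, **right_by_uri[right_uri], "sim": score}

left = [{"uri": "dblp:Jim_Gray", "paper": "Transaction Concept"}]
right = [{"uri": "dbpedia:Jim_Gray_(computer_scientist)", "award": "Turing Award"}]
print(list(map_join(left, right)))
```

The similarity threshold lets the join trade recall for precision without re-running any similarity measure online.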
The rest of this paper is organized as follows: In Section 2, we introduce the underlying data and query model as well as the architecture of Hermes. We then discuss specific aspects of data and query processing in more detail: preprocessing and indexing in Section 3, translation of keywords into structured queries in Section 4, and the distributed processing of queries in Section 5. In Section 6 we report on our evaluation experiments performed with Hermes. Finally, after a discussion of related work in Section 7, we conclude in Section 8.
Section snippets
Hermes infrastructure
In this section we introduce the conceptual architecture of our Hermes infrastructure. Before discussing the components of the infrastructure in detail, we will define the data and queries involved in our data Web search setting.
Data preprocessing
This section describes the offline process where the data graphs are preprocessed and stored in specific data structures of the internal indices.
Keyword query translation
In this section, we describe the computation of possible interpretations of the user keywords. These interpretations are presented to the user in the form of query graphs. For computing such query graphs from keywords, Ref. [28] proposes a procedure consisting of three main steps: (1) construction of the query search space, (2) top-k query graph exploration, and (3) query graph ranking. We extend this work on keyword search to the data Web scenario. Instead of a single data source, the …
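The three steps above can be sketched roughly as follows. The keyword-to-element index, the scoring function and the graph representation are placeholder assumptions for illustration, not the paper's exact algorithm: here an "interpretation" is simply a combination of schema elements, one per keyword, ranked by how many sources it spans.

```python
import heapq
import itertools

# (1) Query search space: keywords matched to schema-level elements
#     of the indexed data sources (toy index, hypothetical names).
keyword_elements = {
    "articles": [("dblp", "class:Article")],
    "award":    [("dbpedia", "class:TuringAward"), ("freebase", "prop:award")],
    "stanford": [("freebase", "entity:Stanford_University")],
}

def score(combo):
    # (3) Ranking: as a toy heuristic, prefer interpretations
    #     that span fewer data sources.
    return len({src for src, _ in combo})

def top_k_interpretations(keywords, k=2):
    # (2) Exploration: enumerate element combinations over the
    #     search space and keep the k best-scored ones.
    candidates = itertools.product(*(keyword_elements[w] for w in keywords))
    return heapq.nsmallest(k, candidates, key=score)

for q in top_k_interpretations(["articles", "award", "stanford"]):
    print(q)
```

In the actual system the exploration works on a graph of schema elements connected by mappings, so only combinations that form a connected query graph are considered.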
Distributed query processing
Query translation results in a list of top-k query graphs. Distributed query processing is the subsequent step that starts with the query graph selected by the user. The query graph is decomposed into parts such that each part can be evaluated against a particular data graph. Before routing, each part needs to be mapped to the query format supported by the local query engines. For optimizing performance, a planner is employed to determine an appropriate order of query execution. Finally, …
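The decomposition and planning steps can be pictured with the following sketch. The query-graph representation and the cost heuristic are our simplifications for illustration, not Hermes' actual optimizer: patterns are grouped into one subquery per source, and subqueries are ordered by a toy selectivity estimate.

```python
# A query graph annotated with the data source each triple pattern
# belongs to, plus a hypothetical cardinality estimate.
query_graph = [
    {"source": "dblp",     "pattern": "?a type Article . ?a author ?p", "est_cost": 50},
    {"source": "dbpedia",  "pattern": "?p wonAward TuringAward",        "est_cost": 5},
    {"source": "freebase", "pattern": "?p affiliation Stanford",        "est_cost": 10},
]

def decompose(graph):
    """Group triple patterns into one subquery per data source."""
    parts = {}
    for tp in graph:
        parts.setdefault(tp["source"], []).append(tp)
    return parts

def plan(parts):
    """Toy planner: evaluate the most selective subqueries first,
    so later joins work on smaller intermediate results."""
    return sorted(parts.items(), key=lambda kv: sum(tp["est_cost"] for tp in kv[1]))

for source, subquery in plan(decompose(query_graph)):
    print(source, [tp["pattern"] for tp in subquery])
```

Each subquery would then be translated into the native query format of its source (e.g. a SPARQL endpoint) before being routed, and the partial results combined with the map join.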
Evaluation experiments
We will now discuss experiments we have performed with a system implementing the Hermes infrastructure. The goal of the experiments is to show the performance and effectiveness of our system with real-life data sets available on the data Web.
Related work
There exist several dimensions of related work. We structure our discussion along the presentation of our contributions: (1) infrastructures for data Web search, (2) keyword query translation, and (3) federated query processing.
Conclusions
We have presented Hermes, an infrastructure for search on the data Web. In the realization of Hermes, we have presented a number of original contributions: We have proposed a novel technique for translating user keywords to structured queries against heterogeneous Web data sources. Further, we have designed a number of indices that are needed in order to realize efficient search over the data Web. Finally, we have elaborated on techniques for distributed query processing on the data Web, …
References (29)
- et al., The anatomy of a large-scale hypertextual web search engine, Comput. Net. (1998)
- et al., Fast similarity join for multi-dimensional data, Inform. Syst. (2007)
- et al., Efficient similarity-based operations for data integration, Data Knowl. Eng. (2004)
- et al., ObjectRank: authority-based keyword search in databases
- et al., Linked data on the web
- et al., Falcons: searching and browsing entities on the semantic web
- et al., A survey on ontology mapping, SIGMOD Rec. (2006)
- et al., Characterizing knowledge on the semantic web with Watson
- et al., Boosting semantic web data access using Swoogle
- et al., Indexing dataspaces
- Automatic direct and indirect schema mapping: experiences and lessons learned, SIGMOD Rec.
- Ontology Matching
- DataGuides: enabling query formulation and optimization in semistructured databases
- XRank: ranked keyword search over XML documents
1 Present address: Fluid Operations, D-69190 Walldorf, Germany.