
1 Problem Statement

The strong support that Web-based technologies have received from researchers, developers, and practitioners has resulted in the publication of data from almost every domain on the Web. Additionally, standards and technologies have been defined to query, search, and manage Web-accessible data sources. For example, Web access interfaces or APIs allow for querying and searching sources like Twitter, Google+, DBpedia, or Wikidata. Web data sources make both overlapping and complementary data available about entities, e.g., people or products. Although these entities may be described in terms of different vocabularies by different Web data sources, the data correspond to the same real-world entities. Thus, the distributed data needs to be integrated in order to obtain a more complete description of these entities (Fig. 1).

As an example, consider a distributed and heterogeneous search scenario in the context of crime investigation. During a crime investigation process, collecting and analyzing information from different sources is a key step performed by investigators. Although scene analysis is always required, a crime investigation process greatly benefits from searching information about people, products, and organisations on the Web. Typically, data collected from the following data sources is utilised to enhance crime analysis processes: (1) The Social Web encompasses user-generated content and personal profiles. (2) The Deep Web advertises products and services offered by organisations, e.g., the eBay e-commerce platform. (3) The Web of Data includes billions of machine-comprehensible facts, which can serve as background knowledge for collecting information about different types of entities. (4) The Dark Web refers to sites accessible only with specific software, where restricted goods are traded through so-called dark-net markets.

Fig. 1. Motivating example: pieces of data (RDF molecules) about Joaquin Chapo Guzman collected from different Web social networks.

To address this data integration scenario, in this doctoral work we propose FuhSen, a semantic integration approach that exploits the Web APIs (e.g., REST APIs) provided by Web data sources to collect and integrate molecules of data, and then enrich and summarize the information about an entity (e.g., a suspect). Using Linked Data as the core technology, the objective of the FuhSen approach is to provide a novel integration technique able to: (1) integrate heterogeneous data extracted from APIs into a unified data schema on-demand; (2) create a Knowledge Graph on-demand with the data extracted from the different data sources; and (3) enrich this Knowledge Graph using semantic algorithms, e.g., entity disambiguation, entity typing, summarization, and ranking.

2 Research Objectives

This doctoral work attempts to answer the following research questions:

Research Question 1 (RQ1)

To answer RQ1, we plan to explore and evaluate the use of RDF vocabularies to facilitate source selection and data fusion tasks.

Research Question 2 (RQ2)

To answer RQ2, we will analyse how to use both the explicit semantics, e.g., properties, relations, and the hierarchy of classes, and the implicit semantics encoded in the data.

Research Question 3 (RQ3)

To answer RQ3, we will evaluate semantic similarity approaches that can be used in the context of data integration. Proposing a new semantic similarity metric tailored to data integration is also an option to answer RQ3.

3 State-of-the-art

Traditional approaches to constructing Knowledge Graphs (KG), e.g., NOUS [3], Knowledge Vault [7], or DeepDive [12], imply materialization of the designed graph built from (un-, semi-)structured sources. Therefore, a heavy Extraction-Transformation-Loading (ETL) process needs to be executed to integrate the data. In comparison, the novelty of the FuhSen approach resides in a non-materialized Knowledge Graph and in the usage of the semantics encoded in the data. Non-materialization supports efficient knowledge delivery on-demand. Further, FuhSen creates RDF molecules that unify and encode hybrid knowledge from heterogeneous sources in an abstract entity. Moreover, the problem of integrating RDF graphs has been a research focus for many years. Knoblock et al. [10] propose KARMA, a framework for integrating structured data sources. Schultz et al. [15] describe the Linked Data Integration Framework (LDIF), which provides a set of independent tools to support the process of interlinking RDF datasets. For instance, SILK [9] identifies owl:sameAs links among the entities of two datasets, and Sieve [11] performs data fusion. Although the aforementioned approaches are fast and effective, they require domain knowledge and significant manual effort to configure the pipeline. In contrast, FuhSen is a universal black-box technique that requires only a small number of high-level parameters, while enabling users to adjust the system to the application domain.

4 Proposed Approach

Given a keyword query, FuhSen executes the query over the relevant sources, and utilizes semantic similarity measures to determine the relatedness among the entities to be integrated. FuhSen creates a Knowledge Graph with the integrated entities at query time. A Knowledge Graph is composed of a set of entities, their properties, and relations among these entities. The Semantic Web technology stack provides the pieces required to define and build a Knowledge Graph. To properly understand these concepts, we follow the notation proposed by Arenas et al. [1], Piro et al. [13], and Fernandez et al. [8] to define RDF triples, Knowledge Graphs, and RDF molecules, respectively.

Definition 1

(RDF triple [1]). Let \(\mathbf {I}\), \(\mathbf {B}\), \(\mathbf {L}\) be disjoint infinite sets of URIs, blank nodes, and literals, respectively. A tuple \((s, p, o) \in (\mathbf {I} \cup \mathbf {B}) \times \mathbf {I} \times (\mathbf {I} \cup \mathbf {B} \cup \mathbf {L})\) is denominated an RDF triple, where s is called the subject, p the predicate, and o the object.

Definition 2

(Knowledge Graph [13]). Given a set T of RDF triples, a Knowledge Graph is a pair \(G=(V, E)\), where \(V = \{s \mid (s, p, o) \in T\} \cup \{o \mid (s, p, o) \in T\}\) and \(E=\{(s, p, o) \in T\}\).
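Definition 2 can be illustrated with a minimal sketch, assuming triples are plain (s, p, o) string tuples rather than the term objects of a full RDF library:

```python
# Sketch of Definition 2: derive the Knowledge Graph G = (V, E)
# induced by a set T of RDF triples.
def knowledge_graph(triples):
    """V is the set of all subjects and objects; E is the triple set itself."""
    V = {s for (s, p, o) in triples} | {o for (s, p, o) in triples}
    E = set(triples)
    return V, E

T = {
    ("ex:Guzman", "foaf:name", '"Joaquin Guzman"'),
    ("ex:Guzman", "ex:memberOf", "ex:Org1"),
}
V, E = knowledge_graph(T)
# V holds ex:Guzman, ex:Org1, and the name literal; E equals T
```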

Definition 3

(RDF Subject Molecule [8]). Given an RDF graph G, an RDF subject-molecule \(M \subseteq G\) is a set of triples \(\{t_1, t_2, \dots , t_n\}\) in which \( subject (t_1) = subject (t_2) = \dots = subject (t_n)\).
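Under the same (s, p, o) tuple encoding as above, partitioning a graph into its subject molecules is a simple grouping of triples by subject; a minimal sketch:

```python
from collections import defaultdict

# Sketch of Definition 3: group the triples of an RDF graph by subject,
# yielding one molecule (a set of triples) per subject.
def subject_molecules(graph):
    mols = defaultdict(set)
    for (s, p, o) in graph:
        mols[s].add((s, p, o))
    return dict(mols)

G = {
    ("ex:Guzman", "foaf:name", '"Joaquin Guzman"'),
    ("ex:Guzman", "foaf:based_near", "ex:Sinaloa"),
    ("ex:Org1", "rdfs:label", '"Sinaloa Cartel"'),
}
mols = subject_molecules(G)
# Two molecules: ex:Guzman (two triples) and ex:Org1 (one triple)
```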

Fig. 2. The FuhSen architecture. FuhSen receives a keyword query Q and a threshold T, and produces a Knowledge Graph G populated with the entities associated with the keywords.

FuhSen is a two-fold approach; its architecture is shown in Fig. 2. The first step is the creation of RDF molecules from the heterogeneous Web data sources, whose data is accessible on-demand via Web services, e.g., REST services. The second step is the integration of these molecules: RDF molecules describing the same entity have to be recognized and integrated in order to build complete Knowledge Graphs.

4.1 Creation of RDF Molecules

As input, FuhSen receives a keyword query Q, e.g., Joaquin Chapo Guzman, and a similarity threshold value T, e.g., 0.7. The input values are processed by the Query Rewriting module, which formulates a well-formed query to be sent to the Search Engine module. The Search Engine queries several wrappers and transforms their output into RDF molecules. Intermediate results are enriched with additional knowledge in the RDF Molecules Enrichment module.
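The flow through these modules can be sketched as follows; the function names and the toy wrapper are illustrative stand-ins, not FuhSen's actual API:

```python
# Hypothetical sketch of the molecule-creation pipeline; every name below
# is an illustrative stand-in for the corresponding FuhSen module.
def rewrite_query(keywords):
    # Query Rewriting: normalise the raw keyword input
    return " ".join(keywords.split())

def search(query, wrappers):
    # Search Engine: fan the query out to the source wrappers,
    # each of which returns RDF molecules (sets of triples)
    molecules = []
    for wrapper in wrappers:
        molecules.extend(wrapper(query))
    return molecules

def enrich(molecule):
    # RDF Molecules Enrichment: placeholder for adding background knowledge
    return molecule

def create_molecules(keywords, wrappers):
    return [enrich(m) for m in search(rewrite_query(keywords), wrappers)]

# Toy wrapper standing in for one Web data source
toy = lambda q: [{("ex:s1", "foaf:name", '"%s"' % q)}]
result = create_molecules("Joaquin  Chapo  Guzman", [toy])
```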

4.2 Integration of RDF Molecules

This module constructs a Knowledge Graph out of the enriched molecules. The input is a set of RDF molecules, and the output is an integrated RDF graph. The module consists of three sub-modules:

  • Computing Similarity of RDF Molecules. Similar RDF molecules should be integrated in order to create a fused, universal representation of a certain entity. In contrast with triple-based linking engines like SILK [9], we employ an RDF molecule-based approach that operates at a higher level of granularity and considers the semantics of molecules. That is, we do not work with independent triples, but rather with the set of triples belonging to a certain subject. The RDF molecule-based approach allows for a natural clustering of a Knowledge Graph, reducing the complexity of the linking algorithm.
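A triple-based baseline for this step is the Jaccard similarity used later in the evaluation (Sect. 6); a minimal sketch, assuming molecules are sets of (s, p, o) tuples and ignoring the subjects, which differ across sources:

```python
# Jaccard similarity over the property-value pairs of two molecules.
def jaccard(mol_a, mol_b):
    a = {(p, o) for (s, p, o) in mol_a}
    b = {(p, o) for (s, p, o) in mol_b}
    return len(a & b) / len(a | b) if a | b else 0.0

m1 = {("db:Guzman", "foaf:name", '"Joaquin Guzman"'),
      ("db:Guzman", "dbo:birthPlace", "db:Sinaloa")}
m2 = {("tw:guzman", "foaf:name", '"Joaquin Guzman"'),
      ("tw:guzman", "foaf:nick", '"El Chapo"')}
# One shared pair out of three distinct pairs: similarity 1/3
```

A semantic measure such as GADES would additionally exploit the class hierarchy and relations, which plain set overlap cannot see.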

  • 1-1 Weighted Perfect Matching. Given a weighted bipartite graph BG of RDF molecules, where weights correspond to values of semantic similarity between the RDF molecules in BG, a matching of BG corresponds to a set of edges that do not share an RDF molecule, and where each RDF molecule of BG is incident to exactly one edge of the matching. The 1-1 weighted perfect matching problem of BG corresponds to finding a matching where the sum of the edge weights has a maximal value.
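For intuition, the matching problem can be solved by brute force on tiny instances; a sketch under the assumption of an n x n similarity matrix (a real implementation would use the Hungarian algorithm, e.g., scipy.optimize.linear_sum_assignment):

```python
from itertools import permutations

# Brute-force 1-1 weighted perfect matching: weights[i][j] is the
# similarity between molecule i of one dataset and molecule j of the other.
def perfect_matching(weights):
    n = len(weights)
    best, best_w = None, float("-inf")
    for perm in permutations(range(n)):           # every perfect matching
        w = sum(weights[i][perm[i]] for i in range(n))
        if w > best_w:
            best, best_w = list(perm), w
    return best, best_w

sims = [[0.9, 0.2],
        [0.3, 0.8]]
assignment, total = perfect_matching(sims)
# Molecule 0 pairs with 0 and molecule 1 with 1; total weight 1.7
```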

  • Integration Functions. Once similar molecules are identified under the desired conditions, the last step of the pipeline is to integrate them into an RDF Knowledge Graph. The resulting Knowledge Graph contains all the unique facts of the analyzed set of RDF molecules. The integration function implemented in FuhSen is the union, i.e., the logical disjunction, of the molecules identified as similar during the previous steps.
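The union function can be sketched as follows; rewriting both molecules onto one canonical subject is an assumption added here so that shared facts actually coincide, not a detail stated in the text:

```python
# Integration as union: merge two similar molecules onto a canonical
# subject so that every unique fact appears exactly once.
def integrate(mol_a, mol_b, canonical):
    return {(canonical, p, o) for (s, p, o) in mol_a | mol_b}

m1 = {("db:Guzman", "foaf:name", '"Joaquin Guzman"'),
      ("db:Guzman", "dbo:birthPlace", "db:Sinaloa")}
m2 = {("tw:guzman", "foaf:name", '"Joaquin Guzman"'),
      ("tw:guzman", "foaf:nick", '"El Chapo"')}
fused = integrate(m1, m2, "ex:Guzman")
# Three unique facts: the shared foaf:name triple is kept only once
```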

5 Research Methodology and Research Design

The research methodology of this doctoral work includes the following steps:

  1. Review the literature to evaluate state-of-the-art approaches relevant to the problem of integrating heterogeneous Web data sources on-demand.

  2. Formalise an on-demand semantic integration approach named FuhSen.

  3. Empirically evaluate different properties of the approach, e.g., effectiveness and performance. Evaluate different components of the architecture and propose new algorithms and operators to realize the vision of this work.

6 Results and Contributions

So far, we have evaluated the architecture and the effectiveness of FuhSen. In [4, 6], we proposed and implemented an RDF vocabulary-based mediator-wrapper architecture and conducted an evaluation study to answer RQ1.

Lessons Learned: The architecture implemented in FuhSen is able to query heterogeneous Web data sources and create RDF molecules of data in a federated manner. The results of our evaluations show that the vocabulary approach defined in the FuhSen architecture allows for handling the heterogeneity of data in an effective way. At the same time, Web data sources can easily be plugged in and out; as a consequence, the integration effort is reduced. However, the experiments reveal scalability problems: the more Web data sources are included, the slower the integration process becomes. Thus, a better source selection approach should be investigated to fully answer research question RQ1.

In [5], we propose a two-fold approach to integrate RDF molecules from different data sources; with this approach and its evaluation, we address research questions RQ2 and RQ3. We evaluate the effectiveness of the integration on different datasets. We also experiment with two similarity metrics, Jaccard and GADES [14], with the goal of determining the impact of the similarity function on the integration approach: the triple-based Jaccard metric is compared against the semantic similarity function GADES.

We created a Gold Standard (GS) of entities of type Person extracted from DBpedia, which comprises 829,184 triples. Two test datasets (TS) were created from the Gold Standard by randomly splitting the properties and values of each entity: each triple is randomly assigned to one or both of the test datasets. We measure the behavior of our integration approach FuhSen [5] in terms of the following metrics: Precision, Recall, and F-measure. Precision is the fraction of the RDF molecules identified and integrated by the approach (M) that also appear in the Gold Standard (GS), i.e., \( Precision = \frac{|M \cap GS|}{|M|}\). Recall corresponds to the fraction of the similar molecules in the Gold Standard that are identified by the approach, i.e., \( Recall = \frac{|M \cap GS|}{|GS|} \).
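These metrics can be computed directly from the two sets; a minimal sketch, assuming M and GS are sets of molecule pairs judged (respectively known) to describe the same entity:

```python
# Precision, recall, and F-measure of proposed integrations M against
# a gold standard GS, both given as sets of molecule pairs.
def precision_recall_f1(M, GS):
    tp = len(M & GS)
    precision = tp / len(M) if M else 0.0
    recall = tp / len(GS) if GS else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

M  = {("a", "a'"), ("b", "b'"), ("c", "x")}   # pairs proposed by the approach
GS = {("a", "a'"), ("b", "b'"), ("d", "d'")}  # gold-standard pairs
p, r, f = precision_recall_f1(M, GS)          # p = r = f = 2/3
```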

Lessons Learned: Table 1 shows the effectiveness of FuhSen on the integration task over 20,000 molecules. Jaccard demonstrates lower performance on this dataset because it relies only on the particular properties of an RDF molecule and does not utilize the semantics encoded in the Knowledge Graph. On the other hand, GADES exhibits good performance and can be used as a black box in the FuhSen approach.

The performance of the integration depends on the threshold parameter. As a simple set-based approach, the Jaccard similarity quickly loses precision, recall, and F-measure at higher thresholds, while GADES remains stable. These insights suggest a positive answer to research questions RQ2 and RQ3. However, the performance of GADES is affected by the quality of the schema, e.g., a well-designed hierarchy of classes, properties, and relations. Thus, enrichment of the molecules and a tuning process are prerequisites for GADES, which impacts the automatic nature of the problem we are trying to solve. Therefore, a pre-trained and automatic similarity function to compare RDF molecules is required to fully answer research questions RQ2 and RQ3.

Table 1. Effectiveness of FuhSen on 20,000 RDF molecules. Jaccard vs GADES approach using different thresholds (T). Highest values of Recall and F-measure are highlighted in bold.

Although significant progress has been made in the context of this doctoral work, more empirical results are needed to fully answer research questions RQ1, RQ2, and RQ3. The next section describes the plan for the next year.

7 Work Plan

This doctoral work is entering its final stage (third year). To completely answer the defined research questions, the following research tasks remain:

  1. Propose a novel RDF fusion operator able to determine the relatedness between two RDF molecules and to integrate them. This task is related to RQ2, and the target publication is a research paper.

  2. Present a semantic similarity measure based on TransE [2], which utilizes gradient descent optimization to learn feature representations of RDF entities automatically. This task is related to RQ3, and the target publication is a research paper.

  3. Present a scalable and efficient source selection approach based on the semantic descriptions of the Web sources and the keyword query. This task is related to RQ1, and we plan to publish a research paper.

8 Conclusions

In this doctoral work, we address the problem of integrating data about the same entity that is spread across different Web data sources. We propose FuhSen, a semantic integration approach that creates Knowledge Graphs on-demand by integrating data collected from a federation of heterogeneous data sources using an RDF molecule integration approach. We have explained the creation of RDF molecules using Linked Data wrappers; we have also presented how semantic similarity measures can be used to determine the relatedness of two resources in terms of the relatedness of their RDF molecules. The results of the empirical evaluation suggest that FuhSen is able to effectively integrate pieces of information spread across different data sources, and that the molecule-based integration technique implemented in FuhSen integrates data into a Knowledge Graph more accurately than existing integration techniques.