Conceptually-grounded Mapping Patterns for Virtual Knowledge Graphs

Knowledge Graphs (KGs) have been gaining momentum recently in both academia and industry, due to the flexibility of their data model, which allows one to access and integrate collections of data of different forms. Virtual Knowledge Graphs (VKGs), a variant of KGs originating from the field of Ontology-based Data Access (OBDA), are a promising paradigm for integrating and accessing legacy data sources. The main idea of VKGs is that the KG remains virtual: the end-user interacts with a KG, but queries are reformulated on-the-fly as queries over the data source(s). To enable the paradigm, one needs to define declarative mappings specifying the link between the data sources and the elements in the VKG. In this work, we investigate common patterns that arise when specifying such mappings, building on well-established methodologies from the area of conceptual modeling and database design.


Introduction
Data integration and access to legacy data sources are key challenges for contemporary organizations. In the whole spectrum of data integration and access solutions, the approach based on Virtual Knowledge Graphs (VKGs) is gaining momentum, especially when the underlying data sources to be integrated come in the form of relational databases (DBs) [1]. VKGs replace the rigid structure of tables with the flexibility of a graph that incorporates domain knowledge and is kept virtual, eliminating redundancies. A VKG specification consists of three main components: (i) data sources (in the context of this paper, constituted by relational DBs), where the actual data are stored; (ii) a domain ontology, capturing the relevant concepts, relations, and constraints of the domain of interest; and (iii) a set of mappings, linking the data sources to the ontology. A critical bottleneck in this setting lies in the definition and management of mappings. In this work, we focus on this issue by proposing a comprehensive catalog of mapping patterns that emerge when linking data to ontologies. Our catalog is based on the (somewhat reasonable) assumption that both the ontology and the DB schema are derived from a conceptual analysis of the domain of interest. The resulting knowledge may stay implicit, or may lead to an explicit representation in the form of a structural conceptual model, which can be represented using well-established notations such as UML, ORM, or E-R. On the one hand, this conceptual model provides the basis for creating a corresponding domain ontology through a series of semantics-preserving transformation steps. On the other hand, it can trigger the design process that finally leads to the deployment of an actual DB. The whole view is depicted in Figure 1.

SEBD 2022: The 30th Italian Symposium on Advanced Database Systems, June 19-22, 2022, Tirrenia (PI), Italy. calvanese@inf.unibz.it (D. Calvanese); avigal@technion.ac.il (A. Gal); lanti@unibz.it (D. Lanti); montali@unibz.it (M. Montali); mosca@unibz.it (A. Mosca); r.shraga@northeastern.edu (R. Shraga). ORCID: 0000-0001-5174-9693 (D. Calvanese); 0000-0002-7028-661X (A. Gal); 0000-0003-1097-2965 (D. Lanti); 0000-0002-8021-3430 (M. Montali); 0000-0003-2323-3344 (A. Mosca); 0000-0001-8803-8481 (R. Shraga).
Our catalog is built on well-established methodologies and patterns studied in data management (e.g., W3C direct mappings (W3C-DM) and extensions), data analysis (e.g., algorithms for discovering dependencies), and conceptual modeling (e.g., relational mapping techniques).
The idea of mapping patterns is not new. For instance, the work in [2] is closely related to ours, as it also introduces a catalog of mapping patterns. However, there are some key differences with our approach. One difference is that we consider KGs (with ontologies), whereas that work focuses on property graphs without an ontology. More importantly, in [2] and in the related literature, patterns are not formalized or grounded in a specific conceptual representation, but are rather informally specified and discussed in a "by-example" fashion. In contrast, each of our patterns explicitly and unambiguously specifies the link between the conceptualization and the DB instance, which is the one arising from applying well-known semantics-preserving transformations studied in the area of DB design.
We argue that this foundational grounding paves the way for a variety of VKG design scenarios, depending on which information artifacts are available, and which ones must be produced. For example, our patterns could be used to validate existing mappings, or to automatically generate (i.e., bootstrap) ontology and mappings when only the DB is available. In fact, specific patterns have also been proposed in relation to ontology and mapping bootstrapping, for which a variety of tools and approaches have been developed in the last two decades [3,4,5,6,7]. The approaches in the literature differ in terms of the overall purposes of bootstrapping (e.g., OBDA, data integration, ontology learning, checking of DB schema constraints using ontology reasoning), the adopted ontology and mapping languages (e.g., OWL 2 profiles or RDFS as ontology languages, and R2RML or custom languages for the specification of mappings), the different focus on direct and/or complex mappings, and the assumed level of automation. The majority of the most recent approaches closely follow W3C-DM, deriving ontologies that mirror the structure of the input DB.

Table 1
Semantics of the DL-Lite ℛ constructs that involve datatypes.

Construct Syntax Element Example Semantics
Top domain

The remainder of the paper is structured as follows: Section 2 introduces the notation and basic notions on VKGs, Section 3 presents (an extract of) our catalog of mapping patterns, and Section 4 concludes the paper.

Preliminaries
We use bold font to denote tuples, e.g., $\mathbf{x}$ and $\mathbf{y}$ are tuples. When convenient and unambiguous, we treat tuples as sets and use set operators on them. We assume familiarity with standard notions and languages from DBs [8], such as SQL or E-R diagrams.
A VKG specification is a triple $\langle \mathcal{T}, \mathcal{M}, \mathcal{S} \rangle$, where $\mathcal{T}$ is an ontology (or TBox), $\mathcal{M}$ a set of mappings, and $\mathcal{S}$ the schema of a DB (with constraints, e.g., primary and foreign keys). In VKGs, the ontology is formulated in OWL 2 QL, but for conciseness we use its Description Logic (DL) counterpart, DL-Lite$_{\mathcal{R}}$ [9], here slightly enriched to handle datatypes.
We fix the following enumerable, pairwise-disjoint sets: NI of individuals, NL of literal values, NC of class names, NP of object property names, and ND of data property names.
An OWL 2 QL TBox $\mathcal{T}$ is a finite set of inclusion axioms of the form $B \sqsubseteq C$, $q \sqsubseteq r$, $\rho(d) \sqsubseteq F$, or $d \sqsubseteq v$, where $B$, $C$ are classes, $q$, $r$ are object properties, $d$ is a data property, $F$ is a datatype expression, $\rho(d)$ is a data property range expression, and $v$ is a data property expression. These are defined by a grammar in which $A \in$ NC, $d \in$ ND, $p \in$ NP, $\delta(d)$ is a data property domain expression, and $T_1, \ldots, T_n$ are the RDF datatypes. In the rules of this grammar, $\top_C$ denotes the "top" element for concepts and $\top_D$ the one for data values (called literals in the RDF terminology). An OWL 2 QL ABox $\mathcal{A}$ is a finite set of assertions of the form $A(a)$, $p(a, b)$, or $d(a, \ell)$, where $A \in$ NC, $p \in$ NP, $d \in$ ND, $a$ and $b$ are individuals in NI, and $\ell \in$ NL. We call the pair $\mathcal{K} = \langle \mathcal{T}, \mathcal{A} \rangle$ an OWL 2 QL Knowledge Graph (KG).
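For orientation, the following is a sketch of how a DL-Lite$_{\mathcal{R}}$ grammar with datatypes is commonly written in the literature. The nonterminal names ($B$, $C$, $q$, $r$, $F$, $v$) and the exact set of productions are assumptions made for illustration, chosen only to agree with the description in the text; they should not be read as the authoritative grammar of this paper.

```latex
\begin{align*}
B &\longrightarrow A \mid \exists q \mid \delta(d)
  && \text{(basic classes)}\\
C &\longrightarrow \top_C \mid B \mid \neg B \mid \exists q.A
  && \text{(class expressions)}\\
q &\longrightarrow p \mid p^{-}
  && \text{(basic object properties)}\\
r &\longrightarrow q \mid \neg q
  && \text{(object property expressions)}\\
F &\longrightarrow \top_D \mid T_1 \mid \cdots \mid T_n
  && \text{(datatype expressions)}\\
v &\longrightarrow d \mid \neg d
  && \text{(data property expressions)}
\end{align*}
```

Under this reading, an axiom such as $\delta(d) \sqsubseteq A$ states that the domain of the data property $d$ is the class $A$, while $\rho(d) \sqsubseteq T_i$ restricts the values of $d$ to the datatype $T_i$.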
Similarly to first-order logic, the semantics of DL-Lite$_{\mathcal{R}}$ KGs is given through Tarski-style interpretations. As usual [10], we say that an interpretation $\mathcal{I}$ satisfies a KG $\mathcal{K}$, denoted by $\mathcal{I} \models \mathcal{K}$, if $\mathcal{I}$ satisfies the ABox assertions and the inclusion axioms in $\mathcal{K}$.
Mappings. Mappings specify how to populate classes and properties of the ontology with individuals and values constructed from the data in the underlying DB. In other words, mappings provide the ABox that, together with a given TBox, realizes a KG. In VKGs, the adopted language for mappings in real-world systems is R2RML, but for conciseness we use here a more convenient abstract notation inspired by the literature [11]: a mapping $m$ is a pair of the form $\langle s\colon Q(\mathbf{x}),\ t\colon L(\mathbf{t}(\mathbf{x})) \rangle$, where $Q(\mathbf{x})$ is a SQL query with answer variables $\mathbf{x}$ over the DB schema $\mathcal{S}$, called source query, and $L(\mathbf{t}(\mathbf{x}))$ is a list of target atoms of the form $C(\mathbf{t}_1(\mathbf{x}_1))$, $p(\mathbf{t}_1(\mathbf{x}_1), \mathbf{t}_2(\mathbf{x}_2))$, or $d(\mathbf{t}_1(\mathbf{x}_1), \mathbf{t}_2(\mathbf{x}_2))$, where $C \in$ NC, $p \in$ NP, $d \in$ ND, and $\mathbf{t}_1(\mathbf{x}_1)$ and $\mathbf{t}_2(\mathbf{x}_2)$ are terms that we call templates. We express source queries in relational algebra, omitting answer variables under the assumption that they coincide with the variables used in the target atoms. Intuitively, a template $\mathbf{t}(\mathbf{x})$ in the target atom of a mapping corresponds to an R2RML string template, and is used to generate an IRI (hence, an object identifier) or an RDF literal, starting from DB values retrieved by the source query in that mapping. For the examples, we use the concrete syntax from the Ontop VKG system [6], in which the source query is expressed in SQL and each target atom is expressed as an RDF triple pattern with templates. The answer variables of the source query occurring in the target atoms are distinguished by enclosing them in curly brackets {...}. The following is an example mapping expressed in such syntax:

    source  SELECT ssn FROM person
    target  ex:pers/{ssn} a ex:Person .
In the mapping above, the string ex: denotes a URI prefix, e.g., ex:Person is an abbreviation for the URI http://www.example.com/Person. Such a mapping, when applied to a DB instance $\mathcal{D}$ of $\mathcal{S}$, populates the class ex:Person with IRIs constructed by replacing the answer variable ssn occurring in the target atom with the corresponding values assigned to that variable by the answers to the SQL source query evaluated over $\mathcal{D}$. For instance, if the source query returns two answers that assign to the answer variable ssn the values 1234 and 5678, respectively, then the mapping above produces the following RDF graph (expressed in the Turtle syntax), stating that individuals ex:pers/1234 and ex:pers/5678 are both instances of class ex:Person:

    ex:pers/1234 a ex:Person .
    ex:pers/5678 a ex:Person .
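The way a mapping populates a class can be illustrated with a small Python sketch. It materializes the triples only for the sake of the example (a VKG system would keep them virtual), and the helper `apply_mapping` and the in-memory SQLite table are assumptions introduced here, not part of Ontop or of the paper.

```python
import sqlite3


def apply_mapping(conn, source_sql, target_template):
    """Instantiate the target triple template once per answer of the source query.

    Placeholders such as {ssn} are replaced by the value that the answer
    assigns to the answer variable (column) of the same name.
    """
    cursor = conn.execute(source_sql)
    columns = [col[0] for col in cursor.description]
    triples = []
    for row in cursor:
        triple = target_template
        for name, value in zip(columns, row):
            triple = triple.replace("{" + name + "}", str(value))
        triples.append(triple)
    return triples


# Toy DB instance with the two SSNs used in the running example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (ssn INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO person VALUES (?)", [(1234,), (5678,)])

triples = apply_mapping(
    conn,
    "SELECT ssn FROM person ORDER BY ssn",
    "ex:pers/{ssn} a ex:Person .",
)
for triple in triples:
    print(triple)
```

The sketch uses plain string replacement rather than a real R2RML template engine, so it ignores IRI-safe encoding of the substituted values.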
We denote by $\mathcal{A}_{\mathcal{M}(\mathcal{D})}$ the virtual ABox constructed through mappings $\mathcal{M}$ from a DB $\mathcal{D}$. Given a VKG specification $\langle \mathcal{T}, \mathcal{M}, \mathcal{S} \rangle$ and a database instance $\mathcal{D}$ of $\mathcal{S}$, the KG $\mathcal{K} = \langle \mathcal{T}, \mathcal{A}_{\mathcal{M}(\mathcal{D})} \rangle$ is called the Virtual Knowledge Graph of $\langle \mathcal{T}, \mathcal{M}, \mathcal{S} \rangle$ through $\mathcal{D}$. The qualifier "virtual" derives from the fact that, in a VKG setting, the virtual ABox $\mathcal{A}_{\mathcal{M}(\mathcal{D})}$ is not materialized and stored. Query answering in VKGs is instead carried out through query rewriting and query unfolding techniques [11,6]: user queries, expressed in SPARQL, get translated on-the-fly into equivalent SQL queries, which are then directly evaluated against the DB.

Mapping Patterns
In its basic form, a mapping pattern is a quadruple $\langle \mathcal{C}, \mathcal{S}, \mathcal{M}, \mathcal{O} \rangle$, where $\mathcal{C}$ is a conceptual model, $\mathcal{S}$ a database schema, $\mathcal{M}$ a set of mappings, and $\mathcal{O}$ an (OWL 2 QL) ontology. In such a pattern, the pair $\langle \mathcal{C}, \mathcal{S} \rangle$ puts into correspondence a conceptual representation with one of its (many) admissible (i.e., formally sound [12,13]) database schemata, like those prescribed by well-established database modeling methodologies. The pair $\langle \mathcal{M}, \mathcal{O} \rangle$, instead, is formed by the DB ontology $\mathcal{O}$, which is the OWL 2 QL encoding of the conceptual model $\mathcal{C}$, and the set $\mathcal{M}$ of mappings, providing the link between $\mathcal{S}$ and $\mathcal{O}$. The term "DB ontology" refers to an ontology whose concepts and properties reflect the constructs of the conceptual model, mirroring the structure of the relational database, as displayed in Figure 1.
Some of the more advanced patterns have a more complex structure, where pairs of conceptual models and/or pairs of database schemata are used in place of  and , respectively (e.g., the pattern "SHa" falls in this category).These patterns prescribe specific transformations to be applied to an input conceptual (resp., DB) schema, in order to obtain an output conceptual (resp., DB) schema.These output artifacts make explicit the presence of specific structures that are revealed through the application of the pattern itself.These structures can in turn enable further applications of patterns.
Presentation Conventions. We show the fragment of the conceptual model that is affected by the pattern in E-R notation (adopting the original notation by Chen [14]). To compactly represent sets of attributes, we use a small diamond in place of the small circle used for single attributes in Chen's notation. For cardinality constraints we follow the "look-here" convention, that is, the cardinality constraint for a role is placed next to the entity participating in that role. In the DB schema, we use $T(\mathbf{K}, \mathbf{A})$ to denote a table with name $T$, primary key consisting of the attributes $\mathbf{K}$, and additional attributes $\mathbf{A}$. Given a set $\mathbf{U}$ of attributes in $T$, we denote by $key_T(\mathbf{U})$ the fact that $\mathbf{U}$ forms a key for $T$. Referential integrity constraints (e.g., foreign keys) are denoted with arcs, pointing from the referencing attribute(s) to the referenced one(s). For conciseness, we denote sets of the form $\{o \mid \text{condition}\}$ as $\{o\}_{\text{condition}}$. In order to express datatypes for data properties, we introduce two auxiliary functions: a function $\tau$ that, given a DB attribute $a$, returns the DB datatype of $a$, and a function $\mu$ that associates, to each DB datatype, a corresponding RDF datatype. For the definition of $\mu$, we re-use the Natural Mapping correspondence provided by the R2RML recommendation. As a final note, following the convention of E-R diagrams, we assume a default (1, 1) cardinality on attributes. For this reason, in the DB schema we assume all attributes to be non-nullable by default (using the SQL convention, declared as "NOT NULL"). An optional attribute $a$ is instead denoted by adding opt($a$) to the DB schema. Such notation extends in the natural way to a set $\mathbf{A}$ of attributes.
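The two auxiliary functions for datatypes can be sketched in Python as two lookups. The attribute catalog `SQL_TYPE_OF` is a hypothetical example, and `NATURAL_MAPPING` is only an excerpt of the R2RML natural mapping; treating character strings as `xsd:string` is a simplification assumed here (R2RML maps them to plain literals).

```python
# Hypothetical schema catalog: attribute name -> declared SQL datatype.
# It plays the role of the function that returns the DB datatype of an attribute.
SQL_TYPE_OF = {
    "ssn": "INTEGER",
    "name": "VARCHAR",
    "birth": "DATE",
    "height": "DOUBLE PRECISION",
}

# Excerpt of the R2RML natural mapping from SQL datatypes to RDF datatypes.
NATURAL_MAPPING = {
    "SMALLINT": "xsd:integer",
    "INTEGER": "xsd:integer",
    "BIGINT": "xsd:integer",
    "NUMERIC": "xsd:decimal",
    "DECIMAL": "xsd:decimal",
    "FLOAT": "xsd:double",
    "REAL": "xsd:double",
    "DOUBLE PRECISION": "xsd:double",
    "BOOLEAN": "xsd:boolean",
    "DATE": "xsd:date",
    "TIME": "xsd:time",
    "TIMESTAMP": "xsd:dateTime",
    "BINARY": "xsd:hexBinary",
}


def tau(attribute):
    """Return the DB datatype declared for a DB attribute."""
    return SQL_TYPE_OF[attribute]


def mu(sql_type):
    """Return the RDF datatype associated with a DB datatype.

    Assumption: types missing from the excerpt (e.g. character strings)
    fall back to xsd:string.
    """
    return NATURAL_MAPPING.get(sql_type, "xsd:string")


# The range of the data property for an attribute is then mu(tau(attribute)).
for attribute in SQL_TYPE_OF:
    print(attribute, "->", mu(tau(attribute)))
```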
• In case of cardinality (_, 1) on role $R_E$ (resp., $R_F$), the primary key of $T_R$ is restricted to the attributes $\mathbf{K}_{RE}$ (resp., $\mathbf{K}_{RF}$). In case both roles have cardinality (_, 1), either choice for the primary key is made, and the remaining attributes form a non-primary key in the logical schema.
• In case of cardinality (1, _) on role $R_E$ (resp., $R_F$), the inclusion dependency $\mathbf{K}_E \subseteq \mathbf{K}_{RE}$ (resp., $\mathbf{K}_F \subseteq \mathbf{K}_{RF}$) holds in the schema, and the first (resp., second) inclusion axiom in the ontology holds in both directions. Note that when the maximum cardinality on role $R_E$ (resp., $R_F$) is 1, the corresponding inclusion dependency is actually a foreign key.
Schema Hierarchy with Identifier Alignment (SHa). In this pattern, the "alignment" is meant to align the primary identifier used in the child entity with the primary identifier used in the parent entity. The other two possibilities for applying the pattern are:
• the foreign key in the child entity is the primary key of that entity, and references a non-primary key of the parent entity;
• the foreign key in the child entity is a non-primary key of that entity, and references a non-primary key of the parent entity.
We depict here the most common scenario, where the foreign key points to the primary key of the parent entity.
Observe that this pattern requires a change in the conceptual model (essentially keeping track of the attributes used for identifying the objects of the subclass).
Pattern Catalog. Table 2 shows an excerpt of our patterns, which we discuss in detail here.

Schema Entity (SE)
This fundamental pattern describes the correspondence between an entity with a primary identifier and attributes in the DB schema, and a class and data properties in the ontology. The entity is expressed in the DB schema through a single table $T_E$ with primary key $\mathbf{K}$ and other attributes $\mathbf{A}$, as is the norm in sound DB design practices. The mappings column explains how $T_E$ is mapped into a corresponding class $C_E$. The primary key of $T_E$ is employed to construct the IRIs of the objects that are instances of $C_E$, using a template $\mathbf{t}_E$ specific for that entity. Each relevant attribute of $T_E$ is mapped to a data property of $C_E$, with suitable domain and range axioms. A mandatory participation constraint is added to each data property corresponding to a mandatory attribute.
Example: A client registry table containing SSNs of clients, together with their name as an additional attribute, is mapped to a Client class using the SSN to construct its objects.In addition, the SSN and name are mapped to two corresponding data properties.
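As a rough illustration of pattern SE on the client registry, the Python sketch below generates Ontop-style mappings and DL-style axioms from a table description. The function `schema_entity`, the naming scheme for classes and properties, and the table and column names are all hypothetical choices made for this example.

```python
def schema_entity(table, key, attrs, optional=(), prefix="ex"):
    """Sketch of pattern SE: one class for the table, one data property per attribute.

    table    -- table name (assumed to be lowercase)
    key      -- list of primary-key columns, used to build the IRI template
    attrs    -- dict: attribute name -> RDF datatype of the data property
    optional -- attributes declared opt(...) in the DB schema
    Returns (mappings, axioms); each mapping is a (source SQL, target triple) pair.
    """
    cls = f"{prefix}:{table.capitalize()}"
    subject = f"{prefix}:{table}/" + "/".join("{" + k + "}" for k in key)
    mappings = [(f"SELECT {', '.join(key)} FROM {table}", f"{subject} a {cls} .")]
    axioms = []
    for attr, datatype in attrs.items():
        prop = f"{prefix}:{attr}"
        # Key columns plus the attribute itself, without duplicates.
        cols = ", ".join(dict.fromkeys([*key, attr]))
        mappings.append(
            (f"SELECT {cols} FROM {table}",
             f"{subject} {prop} {{{attr}}}^^{datatype} .")
        )
        axioms.append(f"δ({prop}) ⊑ {cls}")        # domain axiom
        axioms.append(f"ρ({prop}) ⊑ {datatype}")   # range axiom
        if attr not in optional:
            axioms.append(f"{cls} ⊑ δ({prop})")    # mandatory participation
    return mappings, axioms


mappings, axioms = schema_entity(
    "client", ["ssn"], {"ssn": "xsd:integer", "name": "xsd:string"}
)
for source, target in mappings:
    print("source", source)
    print("target", target)
for axiom in axioms:
    print(axiom)
```

Passing `optional={"name"}` would drop the mandatory participation axiom for the name property, mirroring the treatment of optional attributes.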

Schema Relationship (SR).
This pattern describes the correspondence between a binary relationship without attributes and an OWL 2 QL object property, for the case where such a relationship is represented in the DB as a separate (usually, "many-to-many") table. This pattern considers three tables $T_R$, $T_E$, and $T_F$, for which the set of columns in $T_R$ is partitioned into two parts $\mathbf{K}_{RE}$ and $\mathbf{K}_{RF}$ that are foreign keys to $T_E$ and $T_F$, respectively. The identifier of $T_R$ depends on the role cardinalities in the E-R model. The pattern captures how $T_R$ is mapped to an object property $p_R$, using the two parts $\mathbf{K}_{RE}$ and $\mathbf{K}_{RF}$ of the partition to construct respectively the subject and the object of the triples in $p_R$. The templates $\mathbf{t}_E$ and $\mathbf{t}_F$ must be those respectively used for building instances of the class $C_E$ corresponding to $T_E$ and of the class $C_F$ corresponding to $T_F$.
Example: An additional table in the client registry stores the addresses of each client, and has a foreign key to a table with locations.The former table is mapped to an address object property, for which the ontology asserts that the domain is the class Person and the range an additional class Location, which corresponds to the latter table.
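Pattern SR on this example can be sketched in Python as follows. The helper names, the positional `{}` convention for IRI templates, and the table and column names (`client_address`, `client_ssn`, `location_id`) are assumptions made for illustration.

```python
def instantiate(template, columns):
    """Plug column placeholders such as {client_ssn} into a positional IRI template."""
    return template.format(*("{" + c + "}" for c in columns))


def schema_relationship(rel_table, prop, t_e, k_re, t_f, k_rf, cls_e, cls_f):
    """Sketch of pattern SR: map a relationship table to an object property.

    t_e, t_f   -- IRI templates already used for the two participating classes
    k_re, k_rf -- the two foreign-key parts that partition the columns of rel_table
    Returns ((source SQL, target triple), [domain axiom, range axiom]).
    """
    source = f"SELECT {', '.join([*k_re, *k_rf])} FROM {rel_table}"
    target = f"{instantiate(t_e, k_re)} {prop} {instantiate(t_f, k_rf)} ."
    axioms = [f"∃{prop} ⊑ {cls_e}", f"∃{prop}⁻ ⊑ {cls_f}"]
    return (source, target), axioms


mapping, axioms = schema_relationship(
    "client_address", "ex:address",
    "ex:pers/{}", ["client_ssn"],
    "ex:location/{}", ["location_id"],
    "ex:Person", "ex:Location",
)
print("source", mapping[0])
print("target", mapping[1])
print(axioms)
```

Reusing the templates of the participating classes is what guarantees that subjects and objects of the generated triples coincide with the IRIs produced by pattern SE.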
Schema Relationship with Identifier Alignment (SRa). This pattern is similar to pattern SR, but it comes with a modifier a, indicating that the pattern can be applied after the identifiers involved in the relationship have been aligned. The alignment is necessary because the foreign key in $T_R$ does not refer to the primary key $\mathbf{K}_F$ of $T_F$, but to an alternative key $\mathbf{U}_F$. Since the instances of the class $C_F$ corresponding to $T_F$ are constructed using the primary key $\mathbf{K}_F$ of $T_F$ (cf. pattern SE), the pairs that populate $p_R$ should also refer in their object position to that primary key, which can only be retrieved via a join between $T_R$ and $T_F$ on the key $\mathbf{U}_F$.
Example: The primary key of the table with locations is not given by the city and street, which are used in the table that relates clients to their addresses, but is given by the latitude and longitude of locations.
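The identifier alignment of pattern SRa can be tried out on a toy SQLite database. In this sketch, the table layout, the sample row, and the IRI templates are assumptions; the point is only that the source query joins on the alternative key (city, street) in order to retrieve the primary key (lat, lon) used in location IRIs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE location (
    lat REAL NOT NULL,
    lon REAL NOT NULL,
    city TEXT NOT NULL,
    street TEXT NOT NULL,
    PRIMARY KEY (lat, lon),
    UNIQUE (city, street)              -- alternative key referenced by address
);
CREATE TABLE address (
    ssn INTEGER NOT NULL,
    city TEXT NOT NULL,
    street TEXT NOT NULL,
    PRIMARY KEY (ssn, city, street),
    FOREIGN KEY (city, street) REFERENCES location (city, street)
);
INSERT INTO location VALUES (46.5, 11.25, 'Bolzano', 'Via Roma');
INSERT INTO address VALUES (1234, 'Bolzano', 'Via Roma');
""")

# Source query of the aligned mapping: the join on the alternative key
# recovers the primary key of location.
SOURCE = """
SELECT a.ssn, l.lat, l.lon
FROM address AS a
JOIN location AS l ON a.city = l.city AND a.street = l.street
"""
TARGET = "ex:pers/{ssn} ex:address ex:loc/{lat}/{lon} ."

triples = [
    TARGET.format(ssn=ssn, lat=lat, lon=lon)
    for ssn, lat, lon in conn.execute(SOURCE)
]
for triple in triples:
    print(triple)
```

Without the join, the object IRIs would be built from city and street and would not match the location IRIs generated by pattern SE.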
Schema Hierarchy with Identifier Alignment (SHa). This pattern handles the case where a hierarchy is specified and the child entity uses a primary identifier different from the one in the parent entity. In this situation, the foreign-key constraint can come in three different variants. In the depicted one, the foreign key in $T_F$ is over a non-primary key $\mathbf{K}_{FE}$. The objects for $C_F$ have to be built out of $\mathbf{K}_{FE}$, rather than out of the primary key of $T_F$. For this purpose, the pattern creates a view $V_F$ identical to $T_F$, except that $\mathbf{K}_{FE}$ is the primary key. The foreign key relations are also preserved. Such a view might enable further applications of patterns.
Example: An ISA relation between entities Student and Person.Students are identified by their matriculation number, whereas persons are identified by their SSN.
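A minimal Python sketch of pattern SHa on the Student/Person example is given below. Table names, sample data, and the view name `v_student` are assumptions; moreover, SQLite views cannot declare keys, so the re-keying of the view is only indicated in a comment.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person (
    ssn INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE student (
    matric TEXT PRIMARY KEY,                               -- identifier of the child entity
    ssn INTEGER NOT NULL UNIQUE REFERENCES person (ssn)    -- non-primary key, foreign key to the parent
);
INSERT INTO person VALUES (1234, 'Ada'), (5678, 'Bob');
INSERT INTO student VALUES ('M-001', 1234);

-- View with the same content as student; conceptually, ssn is its primary key.
CREATE VIEW v_student AS SELECT ssn, matric FROM student;
""")

# Both classes are populated with the SAME IRI template, built from ssn.
persons = {f"ex:pers/{ssn}" for (ssn,) in conn.execute("SELECT ssn FROM person")}
students = {f"ex:pers/{ssn}" for (ssn,) in conn.execute("SELECT ssn FROM v_student")}

print(sorted(students))
print(students <= persons)  # every student IRI is also a person IRI
```

Building student IRIs from the matriculation number instead would yield individuals that are disjoint from the person IRIs, so the ISA relation would not be reflected in the data.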

Conclusions and Future Work
In this work, we have identified and formally specified a number of mapping patterns emerging when linking DBs to ontologies in a typical VKG setting. Our patterns are grounded in well-established practices of DB design, and render explicit the connection between the conceptual model, the DB schema, and the ontology. We envision that the organization in patterns can enable a number of relevant tasks, notably mapping bootstrapping for incomplete VKGs.
This work is only a first step, with respect to both the categorization of patterns and their actual use. Regarding the former, we are currently extending this initial catalog with more advanced "data-driven" patterns, that is, patterns where the data component needs to be taken into account. Regarding the latter, we are investigating solutions to specific problems that need to be addressed when setting up a VKG scenario, such as the problem of mapping bootstrapping.

Figure 1: The database and the ontology both stem from common domain knowledge.
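An interpretation is a triple $\mathcal{I} = (\Delta^{\mathcal{I}}_O, \Delta^{\mathcal{I}}_V, \cdot^{\mathcal{I}})$, where $\Delta^{\mathcal{I}}_O$ is a non-empty domain of objects, $\Delta^{\mathcal{I}}_V$ is a non-empty domain of values, and $\cdot^{\mathcal{I}}$ is an interpretation function. Table 1 reports the semantics for the constructs involving datatypes. The other constructs are defined as in standard DL-Lite$_{\mathcal{R}}$ [9].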

Table 2
An extract of our catalog of mapping patterns.
In case of optional attributes, for each optional attribute $A'$ of $E$, add an opt($A'$) constraint to the DB schema and drop the corresponding inclusion axiom $C_E \sqsubseteq \delta(d_{A'})$ from the ontology.
$s\colon T_R$   $t\colon p_R(\mathbf{t}_E(\mathbf{K}_{RE}), \mathbf{t}_F(\mathbf{K}_{RF}))$