DBRepo: A Semantic Digital Repository for Relational Databases

Data curation is a complex, multi-faceted task. While dedicated data stewards are starting to take care of these activities in close collaboration with researchers for many types of (usually file-based) data in many institutions, this is rarely yet the case for data held in relational databases. Beyond large-scale infrastructures hosting, e.g., climate or genome data, researchers usually have to create and maintain their databases, take care of security patches, and feed data into them in order to use them in their research. Data curation, if it happens at all, usually takes place after a project is finished, when data may be exported for digital preservation into file repository systems. We present DBRepo, a semantic digital repository for relational databases in a private-cloud setting designed to (1) host research data stored in relational databases right from the beginning of a research project, (2) provide separation of concerns, allowing researchers to focus on the domain aspects of the data and their work while bringing in experts to handle classic data management tasks, (3) improve findability, accessibility and reusability by offering semantic mapping of metadata attributes, and (4) focus on reproducibility in dynamically evolving data by supporting versioning and precise identification/cite-ability of arbitrary subsets of data.


Introduction
In many institutional settings, databases are set up locally at the level of labs or individual research teams and maintained by some IT-savvy researcher who gets the privilege of having to take care of data management tasks on top of (or rather, before being able to start) performing the actual research. This leads to several challenges: (1) research may only be enabled late, once the relational databases are set up, (2) concerns are concentrated on the one researcher who takes care of all data management tasks, (3) data stewards do not have the semantic context to ensure findability, accessibility, interoperability and reusability (FAIR) (Wilkinson et al., 2016) of the data in the database, and (4) reproducibility at arbitrary points in time may be sacrificed in favour of providing data dumps and citing them.
Since researchers understand their data best, the semantic knowledge should be collected from them during the project lifetime with the assistance of the repository system. We offer an open-source prototype implementation that any organization can use to complement its existing digital repository system, providing semantic context to data stored in databases to ensure their fitness for the organization. In this paper, we consider this fitness for the organization to be given when challenges (1) to (4) are solved.

Related Work
The way a digital repository is described in literature is manifold and ranges from understanding and preserving cultural heritage as digital heritage (Barry et al., 2017) and digital scholarship (Zhou, 2020) to archiving research and publication data in a central place as research libraries (Hienert et al., 2019).
The University of Edinburgh developed DataShare (Rice and Haywood, 2011), a digital repository for people affiliated with the university (or verified guests) to deposit data collected from various research domains, although data from the biological sciences is predominant. It allows researchers to deposit their research output and make it findable, accessible and reusable. Despite the claim that "interactive resources" can be deposited, we were not able to interact with these resources in the repository apart from file downloads. Our database repository allows users to interact with the data of all databases (currently we only support open data), e.g., by executing view-only queries.
The European Organization for Nuclear Research developed Zenodo, a digital repository that encourages researchers to provide their work along with all supplementary files. Although useful for early data-driven research results, this repository only captures a static data set (i.e., a snapshot), which can easily be cited, but does not support the key features for which databases were designed in the first place, i.e., to flexibly and efficiently store dynamic data, allow its recombination, the definition of different views, the selection of subsets to answer specific needs, etc. DBRepo provides these features by offering view access to the underlying relational database to anyone with Internet access after a short, automated registration process.
The United States Geological Survey developed ScienceBase (Hutchison et al., 2021), a digital repository to make federally funded data publicly available. Researchers within this organization are allowed to deposit their data and make it available through machine-actionable interfaces, such as an application programming interface (API). Their repository also collects datasets with metadata such as the contact information of the responsible person and provides semantic context for a physical location by allowing the annotation of spatial information. We extend this specific use case in DBRepo and allow a more generalized semantic assignment to table columns by providing a reference to a semantic concept from which, e.g., relationships can be extracted. For example, by assigning <Kelvin> to a column containing temperature values (and linked to the concept <Temperature_Values>), a relation with <Degrees_Celsius> can be extracted and the values can be adjusted by subtracting 273.15.
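The unit conversion described above can be illustrated with a minimal sketch. The conversion table and function names are illustrative assumptions; DBRepo's actual lookup is driven by the ontology, not a hardcoded dictionary.

```python
# Minimal sketch of unit conversion via semantic annotations.
# The concept/unit names and the conversion table are illustrative
# assumptions, not DBRepo's actual ontology machinery.

CONVERSIONS = {
    # (from_unit, to_unit) -> function applied to the raw value
    ("Kelvin", "Degrees_Celsius"): lambda v: v - 273.15,
    ("Degrees_Celsius", "Kelvin"): lambda v: v + 273.15,
}

def convert(value, from_unit, to_unit):
    """Convert a value between two units linked to the same concept."""
    if from_unit == to_unit:
        return value
    return CONVERSIONS[(from_unit, to_unit)](value)

print(convert(300.0, "Kelvin", "Degrees_Celsius"))  # roughly 26.85
```

Because both columns are linked to the same semantic concept, the repository can decide at query time whether such a conversion is applicable at all.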

Database Repository
Relational databases constitute an important resource in many research and industry projects.

Motivation
The machine-actionable interfaces of relational databases allow for efficient reuse of data in services. However, a standard database system lacks many of the features offered by a repository that are essential for data used in scientific studies, which need to ensure (i) proper mapping of metadata to increase findability; (ii) support for versioning of evolving data as well as data citation to ensure reproducibility; (iii) licensing information to enable reuse; and many other features covered by repository systems.
Therefore, we need to devise a repository system for databases that offers these features while maintaining the key services of a database and providing a separation of concerns, i.e. allowing each stakeholder to concentrate on their core expertise: infrastructure operations including resilience, back-up and security on the IT level; metadata mapping, curation activities and findability, accessibility, interoperability and reusability on the data stewardship side; licensing and GDPR-compliance aspects on the legal and ethical review board side; and, ultimately, content and research for the researchers. In DBRepo, the databases are deployed as Docker containers that are managed by the repository itself through a hypertext transfer protocol (HTTP) API (cf. Figure 1) that follows representational state transfer (REST) principles (Barry and Sifton, 2008).
This helps ensure that the data is FAIR right from the onset of a data collection or research initiative, rather than only after a project ends and the data is dumped into a digital repository. Our database repository DBRepo addresses these problems and provides a prototypic solution for digital repositories to make relational databases available in their collections. The system is available as open-source software on GitLab.

Increasing Findability via Semantic Mappings
In relational databases, understanding the data stored in tables sometimes requires more information than just the raw tuples holding the information. DBRepo allows a researcher to add a semantic concept and a unit of measurement from an ontology of units of measure (Rijgersberg et al., 2013) at table-column level, e.g., the concept "temperature" with the unit "Kelvin". With this semantic information, a unit-independent search across databases becomes possible, e.g., retrieving all databases that contain temperature information whose statistical properties (minimum, maximum, mean, etc.) match given criteria. The FAIR Portal, a GUI built with Nuxt that offers a query builder (based on Knex.js) for inexperienced users, allows the assignment of semantic information to a table column (cf. Figure 2) when using file-based upload. The same can be achieved using the HTTP API.

Data Versioning for Reproducible Queries
Data citation on evolving data (Rauber et al., 2015) is needed when citing data used in experiments and studies. Databases in DBRepo implement these recommendations by adding an invisible time period to each tuple whenever a modifying operation occurs, marking the validity of each tuple in the database as a time window (valid from, valid to). As an example of data versioning, consider correcting the temperature of the tuple with ID=1 at timestamp t3: this adds another tuple that is valid from t3 with the corrected value and marks the original tuple as valid from t1 to t3. The FAIR Portal displays these timestamps (Valid From, Valid To) to indicate the periods in which the data was modified, giving researchers who want to browse the dataset an indication of the relevant points in time.
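The versioning mechanism described above can be sketched as follows, simulating the validity windows with explicit columns. The table layout and the use of SQLite are illustrative assumptions; DBRepo relies on the versioning features of the underlying database system.

```python
import sqlite3

# Sketch of tuple versioning with validity windows, assuming a simple
# (id, temperature, valid_from, valid_to) layout. NULL in valid_to
# marks the currently valid version of a tuple.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE readings (
    id INTEGER, temperature REAL, valid_from INTEGER, valid_to INTEGER)""")

def insert(conn, id_, temp, now):
    conn.execute("INSERT INTO readings VALUES (?, ?, ?, NULL)", (id_, temp, now))

def correct(conn, id_, temp, now):
    # Close the validity window of the current version of the tuple ...
    conn.execute(
        "UPDATE readings SET valid_to = ? WHERE id = ? AND valid_to IS NULL",
        (now, id_))
    # ... and insert the corrected tuple, valid from now on.
    conn.execute("INSERT INTO readings VALUES (?, ?, ?, NULL)", (id_, temp, now))

insert(conn, 1, 300.0, 1)    # original value, valid from t1
correct(conn, 1, 26.85, 3)   # correction at t3
rows = conn.execute(
    "SELECT temperature, valid_from, valid_to FROM readings ORDER BY valid_from"
).fetchall()
print(rows)  # [(300.0, 1, 3), (26.85, 3, None)]
```

A query "as of t2" would then simply filter on `valid_from <= t2 AND (valid_to IS NULL OR valid_to > t2)`, which is what makes re-executing historical queries reproducible.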

Repository Workflow
Researchers have different levels of expertise when it comes to database management. To support each level, the database repository assists researchers in depositing data, interacting with the data, and persistently identifying data. Researchers can obtain an account that allows them to create databases in individual Docker containers, deposit data, and create and persist queries.

Deposit of Data
Time-series data such as regular sensor measurement polls, stock exchange closing values, etc. can be ingested through the HTTP API, JDBC API or AMQP API. The collected data can be inspected as the current dataset or at a specific point in time using either the HTTP API or the FAIR Portal. For applications that cannot use HTTP connections, e.g., continuous data feeds, the repository offers a RabbitMQ endpoint that accepts advanced message queuing protocol (AMQP) tuples. For each table that is created, the schema is stored in the metadata database and a table consumer is started that consumes the AMQP tuples in the queue. The API also allows the import of static comma- or tab-separated values datasets.
The overall approach harnesses the semantic information available for the table in the metadata database to map the AMQP tuple to a SQL query in the Query Service that inserts the data via the Java Database Connectivity (JDBC) API directly into the database. Further, it abstracts the internal representation of the table in the database from the data source. Researchers who want to deposit data through a GUI can add, update and delete tuples through the FAIR Portal which communicates directly with the API.
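The mapping from an AMQP tuple to an INSERT statement can be sketched as below. The schema structure, column names and function name are illustrative assumptions; the actual Query Service derives them from the metadata database.

```python
import json

# Sketch of mapping an incoming AMQP message (JSON text) onto a
# parameterized INSERT statement using the stored table schema.
SCHEMA = {"table": "airquality",
          "columns": ["timestamp", "station", "parameter", "value"]}

def tuple_to_insert(message: bytes, schema: dict):
    """Build (sql, parameters) for one AMQP tuple, keeping schema order."""
    payload = json.loads(message)
    cols = [c for c in schema["columns"] if c in payload]
    placeholders = ", ".join("?" for _ in cols)
    sql = (f"INSERT INTO {schema['table']} "
           f"({', '.join(cols)}) VALUES ({placeholders})")
    return sql, [payload[c] for c in cols]

sql, params = tuple_to_insert(
    b'{"timestamp": "2021-01-01T00:00:00", "station": "Stampfenbachstrasse",'
    b' "parameter": "O3", "value": 42.1}', SCHEMA)
print(sql)
print(params)
```

Using placeholders rather than string concatenation keeps the ingest path safe against injection, and the schema lookup is what decouples the data source from the internal table representation.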

Interaction with the Data
Static datasets can be viewed in the FAIR Portal, and view-only query statements can be issued to the API to interact with the dataset. DBRepo offers support to create, update and delete single tuples in the database, e.g., to correct a wrong sensor calibration. Any query issued is stored and can be persistently identified by adding metadata following the DataCite schema (2021). The query result is returned and can also be viewed or exported in the FAIR Portal. Whenever a sensor provides wrong measurements, the value can be corrected via the HTTP API or with the help of the FAIR Portal. The API identifies a single tuple by its primary key (the repository supports primary keys that span multiple columns).

Persistent Identification of Arbitrary Subsets
Data can be persistently identified by attaching metadata to a query statement. Internally, this operation mirrors the query information from the Query Store to the Metadata Database and makes it findable by assigning a PID (cf. Figure 3) via the DataCite Metadata Schema (2021). The repository supports the mandatory fields identifier, creator, title, publisher, publication year and resource type:

- Identifier: currently we do not mint DOIs, but a URI that points to the record. (During the current pilot deployment stage, permanent availability of the databases is not foreseen; DOIs will be minted via the TU Wien DOI provider as soon as the repository is put into production mode.)
- Creator: taking a list of creators involved in the subset generation, the repository supports the creatorName attribute and the non-mandatory nameIdentifier (and nameIdentifierScheme) to allow the identification of researchers via the ORCID system. The non-mandatory affiliation attribute is also supported.
- Title, Publisher, Publication year: assignable via the HTTP API.
- Resource type: the repository persistently identifies a query that generates an arbitrary subset. The value is hardcoded to Dataset.
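The fields listed above can be sketched as a minimal record. The dictionary layout and all field values are illustrative; DBRepo serializes records according to the actual DataCite Metadata Schema.

```python
# Sketch of the mandatory DataCite fields attached to a persisted query.
# All values shown are illustrative assumptions.
def datacite_record(title, creators, publisher, year):
    return {
        "identifier": {"identifierType": "URL", "value": None},  # minted by the repository
        "creators": [{"creatorName": c} for c in creators],
        "titles": [{"title": title}],
        "publisher": publisher,
        "publicationYear": year,
        # Hardcoded, since every persisted query generates a data subset.
        "resourceType": {"resourceTypeGeneral": "Dataset"},
    }

record = datacite_record("Ozone subset, Stampfenbachstrasse",
                         ["Doe, Jane"], "TU Wien", 2022)
print(record["resourceType"]["resourceTypeGeneral"])  # Dataset
```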
The FAIR Portal offers a Query Builder for simple queries that assists researchers without SQL knowledge. The researcher obtains a result that can be exported as .csv through the HTTP API or persisted. Persisting means the researcher assigns the mandatory metadata required by the DataCite schema, so that the metadata remains available even when the subset itself is no longer available, along with the execution timestamp, query hash, result hash, result count, etc. In this phase of development, the repository does not mint DOIs for the queries but a persistent URI that redirects to the query page displaying the metadata; this URI can be included in publications.
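The query hash and result hash mentioned above can be sketched as follows. The normalization shown (whitespace collapsing and lowercasing) is a simplifying assumption; it illustrates the idea that a re-execution of a persisted query can be checked against the stored fingerprints.

```python
import hashlib

# Sketch of fingerprinting a persisted query: hash the normalized query
# text and the result set, so a later re-execution can be verified
# against the stored hashes.
def query_hash(sql: str) -> str:
    normalized = " ".join(sql.lower().split())
    return hashlib.sha256(normalized.encode()).hexdigest()

def result_hash(rows) -> str:
    h = hashlib.sha256()
    for row in rows:
        h.update(repr(row).encode())
    return h.hexdigest()

q1 = query_hash("SELECT * FROM airquality  WHERE parameter = 'O3'")
q2 = query_hash("select * from   airquality where parameter = 'O3'")
print(q1 == q2)  # True: whitespace/case differences are normalized away
```

If a re-executed query yields the same result hash, the cited subset is byte-for-byte reproducible; if not, the validity windows tell the reader when the underlying data changed.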

Import of Static Datasets
A researcher can import static datasets as .csv files into databases managed by DBRepo in two ways: (1) import a .csv file and create a table from it, or (2) import a .csv file into an existing table. The latter also allows the Query Service to replace tuples in a transaction when the primary key is already present in the table, as follows:

- internally, the Query Service creates a temporary table for this import process, named like the original table but with the suffix "_temporary";
- the service loads the .csv contents into this temporary table and replaces conflicting tuples in the original table within one transaction.

For both (1) and (2), the researcher needs to provide the table name, table description, separating character, optional value quote character, and optional encodings for Boolean values (true/false) or the null encoding. The FAIR Portal assists the researcher by suggesting a data type for each column, determined by the Analyse Service on the basis of the .csv file. Some timestamp formats still have to be selected manually if not detected automatically (cf. Figure 4). For this, the Analyse Service uses the Python library messytables (https://github.com/okfn/messytables) to determine numerical (integer, decimal), categorical (enumeration), temporal (date, timestamp) and textual data types.
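The two-step import described above can be sketched as follows. The table layout, the use of SQLite and its `INSERT OR REPLACE` conflict handling are illustrative assumptions; DBRepo's Query Service performs the equivalent operation on the managed database.

```python
import csv, io, sqlite3

# Sketch of importing a .csv into an existing table via a temporary
# "_temporary" table, replacing tuples whose primary key already exists.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (id INTEGER PRIMARY KEY, value REAL)")
conn.execute("INSERT INTO readings VALUES (1, 10.0)")

def import_csv(conn, table, csv_text):
    tmp = f"{table}_temporary"
    conn.execute(f"CREATE TEMP TABLE {tmp} (id INTEGER, value REAL)")
    rows = [(int(r["id"]), float(r["value"]))
            for r in csv.DictReader(io.StringIO(csv_text))]
    conn.executemany(f"INSERT INTO {tmp} VALUES (?, ?)", rows)
    # REPLACE overwrites rows with a conflicting primary key in one statement.
    conn.execute(f"INSERT OR REPLACE INTO {table} SELECT * FROM {tmp}")
    conn.execute(f"DROP TABLE {tmp}")

import_csv(conn, "readings", "id,value\n1,11.5\n2,20.0\n")
print(conn.execute("SELECT * FROM readings ORDER BY id").fetchall())
# [(1, 11.5), (2, 20.0)]
```

Staging the data in a temporary table means a malformed .csv fails before the original table is touched, which is the reason for the "_temporary" indirection.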

Evaluation
In this section, we evaluate to what extent the Broker Service and the Query Service are able to put tuples through instantly, without having to store them in the queue, i.e., we want to find the maximum throughput at which not a single tuple needs to be held back from immediate consumption by the Query Service (i.e., storage in the database). Next to the HTTP API and JDBC API, DBRepo also offers a message queue endpoint at the Broker Service that processes AMQP tuples, i.e., data objects represented as text in JavaScript Object Notation (JSON).

Data Set
The Open Data Catalogue publishes non-sensitive data collected from the public administration of the city of Zürich, Switzerland and makes it available online under an open-source license for anyone to use. The hourly updated air quality measurements dataset (Open Data Zürich, 2021) contains mean measurements from the Stampfenbachstrasse, Schimmelstrasse, Rosengartenstrasse and Heubeeribüel stations for ozone (O3), nitrogen oxides (NOx), nitrogen monoxide (NO), nitrogen dioxide (NO2), particulate matter (PM10 and PM2.5), carbon monoxide (CO) and sulfur dioxide (SO2). The dataset contains one table with 210,192 tuples with seven columns (timestamp, station, parameter, interval, unit, value, state) and has a file size of about 17 megabytes. We selected this dataset because it is public and contains sensor readings, allowing us to also showcase the AMQP API; we refer to it as the Airquality Dataset throughout the remainder of this paper.

Experiment Setup
We deployed DBRepo on a Lenovo T14s (16 CPU threads, 32 GB RAM, 1 TB SSD storage) running Rocky Linux 8.6 with the latest software updates installed. Each experiment measurement instance is run in an isolated deployment such that only one user is present in the system, along with one database consisting of one table with the imported Airquality Dataset. In order to evaluate the maximum throughput of tuples per second depending on the tuple size, we wrote a Python script that is capable of producing synthetic tuple data of a defined size and submitting it to the Broker Service at a defined interval. The smallest possible synthetic tuple size is 0.2 kilobytes. We also approximated the throughput for tuples of 0.5, 1, 2, 3, 4 and 5 kilobytes in size. For each configuration (size, throughput), the script checks whether the queue remains empty while publishing AMQP tuples of fixed size at a fixed throughput for a duration of 60 seconds. We iteratively increased the throughput to approximate the maximum throughput of tuples without letting the Broker Service use the queue for storage.

Figure 5. Ingest time-series data into a researcher database
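The synthetic tuple generation used in the experiment can be sketched as below. The field names and the padding approach are illustrative assumptions about our script; only the target sizes match the configurations described above.

```python
import json

# Sketch of generating a synthetic AMQP payload of a defined byte size
# by padding a fixed JSON skeleton. Field names are illustrative.
def synthetic_tuple(size_bytes: int) -> bytes:
    base = {"timestamp": "2021-01-01T00:00:00", "station": "S1",
            "parameter": "O3", "value": 1.0, "padding": ""}
    overhead = len(json.dumps(base).encode())
    # Each padding character adds exactly one byte to the serialization.
    base["padding"] = "x" * max(0, size_bytes - overhead)
    return json.dumps(base).encode()

payload = synthetic_tuple(1024)  # a 1-kilobyte tuple
print(len(payload))  # 1024
```

Varying `size_bytes` over the seven configurations while ramping up the publish rate is then all the driver script needs to locate the point where the broker starts queueing.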

Result
A total of seven approximations were performed to evaluate the maximum throughput of the AMQP API at which tuples are still processed immediately within DBRepo. We observe that the tuple size has a strong influence on the maximum throughput, which decreases rapidly as the tuple size increases (cf. Figure 6).

Conclusion & Future Work
We presented DBRepo, a semantic digital repository for relational databases in a private-cloud setting, as an implementation that ensures that (1) research data is stored in relational databases right from the onset of a research project, (2) separation of concerns is maintained, (3) findability, accessibility and reusability are increased by offering semantic mapping of metadata attributes, and (4) reproducibility on evolving data and data versioning are offered. The database repository already has its first cooperating institutions and external partners in Austria. We are currently working with an increasing number of initial pilot users to test the deployment of DBRepo in a range of different domains.
In the future, we want to ensure that instances of our database repository, locally deployed at an organization, can be found in registries of repositories by adding metadata to the repository, e.g., via the re3data schema (Strecker et al., 2021).