The Forgotten Document-Oriented Database Management Systems: An Overview and Benchmark of Native XML DODBMSes in Comparison with JSON DODBMSes

In the current context of Big Data, a multitude of new NoSQL solutions for storing, managing, and extracting information and patterns from semi-structured data have been proposed and implemented. These solutions were developed to relieve the issue of rigid data structures present in relational databases by introducing semi-structured and flexible schema design. As current data generated by different sources and devices, especially IoT sensors and actuators, use either the XML or the JSON format, depending on the application, database technologies that store and query semi-structured data in XML format are needed. Thus, Native XML Databases, which were initially designed to manipulate XML data using standardized query languages, i.e., XQuery and XPath, were rebranded as NoSQL Document-Oriented Database Systems. Currently, the majority of these solutions have been replaced with the more modern JSON-based Database Management Systems. However, we believe that XML-based solutions can still deliver performance in executing complex queries on heterogeneous collections. Unfortunately, current research lacks a clear comparison of the scalability and performance of database technologies that store and query documents in XML versus the more modern JSON format. Moreover, to the best of our knowledge, there are no Big Data-compliant benchmarks for such database technologies. In this paper, we present a comparison of selected Document-Oriented Database Systems that use either the XML format to encode documents, i.e., BaseX, eXist-db, and Sedna, or the JSON format, i.e., MongoDB, CouchDB, and Couchbase. To underline the performance differences, we also propose a benchmark that uses a heterogeneous complex schema on a large DBLP corpus.


Introduction
With the emergence of Big Data and the Internet of Things (IoT) and the increasing amount of semi-structured information generated daily, new technologies have arisen for storing, managing, and extracting information and patterns from such data. The new technologies for storing data have been labeled with the name NoSQL and were initially developed to solve very specific problems.
Currently, they provide different trade-offs and functionality (e.g., choosing high-availability over consistency) to be as generic as their counterparts Relational Database Management Systems (RDBMSes). Due to the semi-structured nature of data, NoSQL Database Management Systems (DBMSes) have been classified based on the data model used for storing information [1], i.e., key-value, document-oriented, wide column, and graph databases.
In this paper, we study NoSQL Document-Oriented Database Management Systems (DODBMSes) that encode data using the XML or JSON formats. We focus on two subcategories of DODBMSes, distinguished by the data model used to encode documents: i) Native XML Database Management Systems (XDBMSes), which encode data using the XML format, and ii) JSON Database Management Systems (JDBMSes), which encode data using the JSON format.
The NoSQL DBMSes became very popular with the increasing need for data storage, management, and analysis systems that scale with the volume. To address these needs, many NoSQL DBMSes compromise consistency to offer high-availability, partition tolerance, improved analytics, and high-throughput. These features are also a requirement for real-time web applications and Big Data processing and analysis and are available in JDBMSes as well.
XDBMSes started to emerge after the eXtensible Markup Language (XML) was standardized and became the common format for exchanging data between different applications running on the Web. Their primary use was to facilitate secure storage and fast querying of XML documents. Besides their primary use, XDBMSes prove useful for OLAP (Online Analytical Processing) style analysis and decision support systems that incorporate a time dimension and encode data in the XML format [2], thus removing the need for ETL (Extract Transform Load) processes that transform XML documents into a relational model. XML query languages and technologies, including XDBMSes, had been around before the NoSQL trend and were forgotten during the Big Data hype. In the field of relational databases, XML is used as a data type, e.g., in Oracle, DB2, and PostgreSQL. Currently, with the rise of the NoSQL movement, XDBMSes have become a subcategory of DODBMSes. With the emergence of processing platforms that use Big Data or IoT technologies, where data are transferred over computer networks in formats such as XML and JSON, XDBMSes can be seen as a viable solution for storing and manipulating computer-generated semi-structured data.
We hypothesize that the more classical XDBMSes may still be useful in the Big Data era. Thus, in this study we address, and use as guidelines, the following research questions:
As a result of our research and as a response to Q1, we claim that the more classical XML-based DODBMSes may still be useful in the Big Data era. To demonstrate this and answer Q2, we propose a new benchmark for comprehensive DODBMS analysis using a large dataset. We thereby present a qualitative and quantitative performance comparison between XDBMSes and the more modern JDBMSes to answer Q3.
This paper is structured as follows. Section 2 presents an overview of different NoSQL DBMS models, surveys, and benchmarks. Section 3 offers an in-depth overview and comparison of DODBMSes, focusing on the XDBMS and JDBMS subcategories. Section 4 introduces the proposed benchmark specification and discusses the data and workload models, while Section 5 discusses the physical database implementation and describes the benchmark's queries. Section 6 details the experiments performed on the selected DODBMSes using our benchmark and discusses the results. Finally, Section 7 concludes the paper, summarizes the results, and provides future research perspectives.

Related Works
The NoSQL Database Management Systems (DBMSes) emerged as an alternative to Relational Database Management Systems (RDBMSes) in order to store and process huge amounts of heterogeneous data. However, NoSQL DBMSes did not appear as a replacement for RDBMSes, but as a solution to specific problems that require additional features (e.g., replication, high-availability, etc.) that are not handled well by traditional means [3]. The reasons commonly given to develop and use NoSQL DBMSes are summarized as follows [4]: avoidance of unneeded complexity, high throughput, horizontal scalability, running on commodity hardware, avoidance of expensive object-relational mapping, lowering the complexity and the cost of setting up a cluster, compromising reliability for better performance, and adapting to the requirements of cloud computing.
NoSQL DBMSes are usually classified by taking into account either the persistence model or the data and query model. Using the persistence model, NoSQL DBMSes are classified as follows [4]:
i) In-Memory Databases [5] are very fast because the currently used data are stored in memory, with optional subsequent disk flushes triggered at given periods or when the in-memory data are not used. The size of the in-use data that can be stored is therefore limited by the amount of memory. This problem can be resolved by vertical scaling only to some degree, as there is a limit to the amount of memory a system can hold. Moreover, durability may become a problem if data are lost between subsequent disk flushes or if data persistence is disabled. A solution to this problem is data replication.
ii) Memtables and SSTables Databases [6] buffer operations in memory using a Memtable after they have been written to an append-only commit log to ensure durability. After a certain number of writes, the Memtable is flushed to disk as a whole into an SSTable. These DBMSes have performance characteristics comparable to those of In-Memory Databases but solve the durability problem.
iii) B-trees Databases [7] use the B-tree self-balancing tree data structure that keeps data sorted and allows searches, sequential access, insertions, and deletions in logarithmic time [8].
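The write path described for Memtable and SSTable databases can be illustrated with a minimal sketch. It assumes a toy in-process store; the class name `TinyLSMStore`, its methods, and the `flush_threshold` parameter are hypothetical and do not correspond to any of the surveyed systems. Python lists stand in for the on-disk commit log and SSTable files.

```python
class TinyLSMStore:
    """Toy Memtable/SSTable store (illustrative names, not a real DBMS API)."""

    def __init__(self, flush_threshold=3):
        self.commit_log = []   # stands in for the append-only commit log on disk
        self.memtable = {}     # in-memory buffer of recent writes
        self.sstables = []     # immutable sorted runs, oldest first
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        # 1. Append to the commit log first, so the write survives a crash.
        self.commit_log.append((key, value))
        # 2. Buffer the write in the Memtable.
        self.memtable[key] = value
        # 3. After enough writes, flush the whole Memtable as one SSTable.
        if len(self.memtable) >= self.flush_threshold:
            self._flush()

    def _flush(self):
        # An SSTable is written once, sorted by key, and never modified.
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}

    def get(self, key):
        # The newest data wins: check the Memtable, then SSTables newest first.
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):
            for k, v in table:
                if k == key:
                    return v
        return None


store = TinyLSMStore(flush_threshold=2)
store.put("a", 1)
store.put("b", 2)   # second write triggers a flush to an SSTable
store.put("a", 3)   # newer value for "a" lives in the Memtable
```

Reads consult the Memtable before the flushed runs, which is why the later write to `"a"` shadows the flushed one. Real systems add compaction and per-table indexes, which this sketch omits.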
NoSQL DBMSes are also classified by using the data and query model as follows [1,9]:
i) Wide Column Databases are used to store, retrieve, and manage data using column families. Each record can have different numbers of cells and columns, making a row sparse without storing NULLs.
ii) Graph Databases are used to store, retrieve, and manage information using a graph. Therefore, an object is modeled as a node and the edges between nodes become the relationships between the objects.
iii) Key-Value Databases (KVDBMSes) are data storage systems designed for storing, retrieving, and managing associative arrays, i.e., dictionaries or hash tables.
iv) Document-Oriented Databases (DODBMSes) have evolved from KVDBMSes and are used to store, retrieve, and manage semi-structured data, i.e., documents, encoded using JSON, BSON, XML, or YAML formats.
There are multiple surveys on NoSQL DBMSes; in the following, we present the most relevant ones for our analysis. Article [10] compares the performance and flexibility of KVDBMSes and DODBMSes with those of RDBMSes. The NoSQL DBMSes prove to be a better choice for high-throughput applications that require data modeling flexibility and horizontal scaling. The authors of [1] offer a classification of NoSQL DBMSes by data model and present the current most popular solutions. In [11], the authors compare and survey NoSQL data models, query types, concurrency controls, partitioning, and replication.
Article [12] presents a top-down overview of the NoSQL database field and proposes a comparative classification model that relates functional and non-functional requirements to the techniques and algorithms employed in these systems. The authors of [13] provide an overview of XML data manipulation techniques employed in conventional and temporal XDBMSes and study the support of such functionality in mainstream commercial DBMSes. Unfortunately, the paper presents only a general discussion about XDBMSes and other DBMSes with XML manipulation capabilities, and no evaluation is provided. Thus, we can conclude that none of these surveys presents an in-depth discussion and comparison of different subcategories of DODBMSes.
In the literature there are many data-centric benchmarks for Big Data distributed systems and NoSQL DBMSes that focus either on structured data or on specific applications, such as MapReduce-based applications, rather than on unstructured data or data variety. In [14], the authors present a comprehensive survey and analysis of benchmarks for different types of Big Data systems, including NoSQL systems. The authors of [15] present a new benchmark for textual data for distributed systems, including MongoDB. None of the current literature presents benchmarks for modern native XDBMSes.
XDBMSes benchmarks are application-oriented and domain-specific, e.g., OpenEHR XML medical records [16], XMark which contains documents extracted from electronic commerce sites and content providers [17] or Transaction Processing over XML (TPoX) [18] which simulates a financial multi-user workload with XML data conforming to the FIXML standard. These benchmarks are used for testing the performance of DBMSes that are capable of storing, searching, modifying and retrieving XML data. Unfortunately, the majority of these benchmarks use rather small collections. And even for the benchmarks where the XML or JSON document size is up to the order of Gigabytes (GBs), the contained information is mostly homogeneous. Our proposed benchmark solution uses large heterogeneous collections with over 6 million records to test the scalability, filtering, and aggregation performance of complex queries for the current native XDBMSes.
Given the lack of current literature regarding XDBMSes, in this paper we analyze the performance and functionality of DODBMS solutions, focusing on two distinct subclasses that use the JSON or XML formats to encode data.

Document-Oriented Databases
Document-Oriented Database Management Systems (DODBMSes) have evolved from Key-Value Databases [1]. DODBMSes are used for storing, retrieving, and managing semi-structured data. They have a schema-less data representation, thus providing more flexibility for data modeling [19]. DODBMSes store data as documents encoded in formats such as XML or JSON. The flexibility provided by XML and JSON makes it easier to manipulate the information than it is for tables in Relational Database Management Systems (RDBMSes). Usually, documents are stored in collections. A Native XML Database Management System (XDBMS) uses the XML (eXtensible Markup Language) data structure to encode documents and defines a hierarchical logical model based on the elements of this markup language [20,21]. A JSON Database Management System (JDBMS) uses the JSON structure for modeling documents and storing them in collections.
In DODBMSes, labels are used in storing the information. These labels describe the data and values in a record. New information can be added directly to a record without the need to modify the entire schema, as is the case for RDBMSes.
One of the benefits of using a DODBMS solution is the flexibility of modeling the data [22].
Data from the web, mobile, social, and IoT devices change the nature of the application's data model. In an RDBMS, these changes impose the modification of the schema by altering tables and adding or removing columns. Whereas, the flexibility of DODBMSes eliminates the need to force-fit the data into predefined attributes and tables.
Another benefit of a DODBMS is the fast write performance. Some DODBMSes prioritize high availability over strict data consistency. This ensures that both read and write operations will always be executed even if there is a hardware or network failure. In case of failure, the replication and eventual consistency mechanisms ensure that the environment will function.
Fast query performance is another benefit of a DODBMS. Most DODBMSes provide powerful query engines for CRUD (Create, Read, Update and Delete) operations and use indices and secondary indices to improve data retrieval. Additionally, the majority of DODBMS solutions support aggregation frameworks, either native or using MapReduce, for Data Analysis and Business Intelligence.

XDBMSes
In this subsection, we present several examples of XDBMSes that use standardized XPath and XQuery. Although there are multiple DBMSes that incorporate XML as a data type (e.g., Oracle, PostgreSQL, DB2, and MS SQL, to name just a few), the majority of them fall outside the NoSQL movement. Furthermore, some have licenses that explicitly forbid benchmarking, e.g., commercial XDBMSes such as MarkLogic Server and Oracle Berkeley DB XML. Thus, for our comparison and benchmark, we chose the following three XDBMSes: BaseX, eXist-db, and Sedna.

BaseX
BaseX is an XDBMS written in Java that stores the data using a schema-free hierarchical model. Transactions in BaseX respect the ACID (Atomicity, Consistency, Isolation, and Durability) properties, enabling the concurrent access of multiple readers and writers [23]. Documents are stored either persistently on disk or in the main memory. BaseX uses a single instance environment; replication and data partitioning are not available.
BaseX provides CRUD operations and ad-hoc queries, including aggregation, using XQuery 3.1 and XPath 3.1 [24]. Although it works with various APIs such as XML DB or JAX-RX, it was not designed to work with a MapReduce framework.
BaseX supports multiple structural and value indices [23]. Structural indices are automatically created and include: i) name indices to reference the names of all elements and attributes, ii) path indices to store distinct paths of the documents in the database, and iii) document indices to reference all document nodes. Value indices are user-defined. They include: i) text indices on the documents' text nodes to improve the performance of exact and range queries, ii) attribute indices to speed up comparisons on attribute values, iii) token indices to speed up lookups on multi-token attribute values, and iv) full-text indices on the normalized tokens of text nodes to speed up queries that contain full-text expressions.

eXist-db
eXist-db [25] is an XDBMS implemented in Java that stores documents in the XML format. It stores data in-memory using Document Object Model (DOM) trees.
Although eXist-db does not support database-level transaction control, it uses transactions internally, transparently to the user, and keeps a persistent journal that ensures the durability and consistency of the stored data. Database consistency is checked automatically or by using a sanity checker that detects inconsistencies or damage in the core database files [26]. eXist-db supports primary-secondary data replication, thus allowing applications to be distributed over multiple servers through the use of the Java Message Service (JMS) API. Although replication is available, data partitioning (sharding) and the distribution of queries across multiple servers are not.
eXist-db provides CRUD operations and ad-hoc queries for filtering and aggregation using XQuery 3.1 and XPath 3.1 [24]. Unfortunately, it does not have the MapReduce functionality, which would offer more flexibility to the aggregation queries.
eXist-db supports four types of indices [27]: i) range indices that provide range and field-based searches, ii) text indices for full-text search, iii) n-gram indices for improving the performance of n-gram search, and iv) spatial indices for querying data using geometric characteristics, although this feature is currently experimental.

Sedna
Sedna is an XDBMS written in C that stores documents in the XML format [28]. Sedna provides ACID transactions, indexing, and persistent storage [29]. It uses the main memory to improve query performance [30]. Replication and partitioning are not implemented in Sedna.
Like the other XDBMSes, Sedna provides CRUD operations and ad-hoc queries for filtering and aggregation using XQuery 1.1 and XPath 2.0. However, it does not provide MapReduce functionality in working with these queries.
Value indices are used to index elements' content and attributes. Full-text indices can be created in Sedna to facilitate full-text search using XQuery.

JDBMSes
DODBMSes are designed for storing, retrieving, managing, and processing semi-structured data in the form of documents. With the rise of the NoSQL movement, multiple DODBMS solutions,

MongoDB
MongoDB is a DODBMS developed in C++ that focuses on combining the critical capabilities of RDBMSes with the innovations of NoSQL DBMSes. MongoDB uses a flexible, dynamic schema to store data. A record is stored in a document and multiple documents are stored in a collection.
Documents in a collection do not necessarily have the same structure and so the number of attributes and their data type can differ from one record to another. In practice documents usually model objects from a high-level programming language. Although the database allows documents with a different number of attributes and different data types for the same attributes, records have almost the same structure in a collection [31].
MongoDB stores the data in BSON documents. A BSON is a binary-encoded serialization of JSON-like documents. This format is easily parsed and lightweight with respect to the overhead needed to store data.
Transactions in MongoDB respect the BASE (Basically Available, Soft state, Eventual consistency) transaction model which ensures that all the modification operations will propagate on all the nodes in an asynchronous way. MongoDB uses Causal Consistency that enables operations to logically depend on preceding operations [32] and in-memory functionalities to improve the query execution time. Furthermore, this JDBMS supports multi-document transactions with ACID data integrity guarantees.
To achieve redundancy and data availability, MongoDB uses Replica Sets for primary-secondary replication. A replica set is a group of MongoDB instances that store the same dataset. To partition the data and distribute it across multiple machines, MongoDB uses Sharding. Sharding is a horizontal scaling mechanism that partitions and balances the data on multiple nodes or replica sets.
MongoDB supports CRUD operations and ad-hoc querying through the use of a JavaScript API available in the MongoDB client. The Aggregation Pipeline framework is a multi-stage pipeline that transforms documents into aggregated results using the concepts of data processing pipelines.
Aggregation can also be achieved using the MapReduce framework.
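The idea of a multi-stage pipeline that turns documents into aggregated results can be sketched in plain Python. This is only a conceptual illustration under stated assumptions: the helper names `match`, `group_count`, and `run_pipeline` are hypothetical and are not MongoDB's API, and the sample documents are invented.

```python
from collections import Counter

# Invented sample documents; field names are assumptions for illustration.
docs = [
    {"title": "XML indexing", "year": 2019},
    {"title": "JSON stores", "year": 2019},
    {"title": "XML benchmarks", "year": 2020},
]


def match(predicate):
    """Stage that keeps only the documents satisfying the predicate."""
    return lambda stream: (d for d in stream if predicate(d))


def group_count(key):
    """Stage that groups documents by a field and counts each group."""
    def stage(stream):
        counts = Counter(d[key] for d in stream)
        return ({"_id": k, "count": n} for k, n in sorted(counts.items()))
    return stage


def run_pipeline(stream, stages):
    """Feed the output of each stage into the next one, in order."""
    for stage in stages:
        stream = stage(stream)
    return list(stream)


result = run_pipeline(
    docs,
    [match(lambda d: "XML" in d["title"]), group_count("year")],
)
```

Each stage consumes the stream produced by the previous one, so filtering early reduces the work done by the grouping stage.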
MongoDB supports primary and secondary indexing. These indices can be a single field, compound (multikey), geospatial, hashed, and text. Text indices enable full-text search.

CouchDB
CouchDB is an open-source DODBMS developed in Erlang that provides a schema-free model for storing self-contained data using the JSON format [33].

Transactions in CouchDB respect document-level ACID properties with Multi-Version Concurrency Control (MVCC) [34]. CouchDB relies on Eventual Consistency together with incremental replication to maintain data consistency. CouchDB does not provide in-memory capabilities. CouchDB provides primary-primary and primary-secondary asynchronous replication. Sharding is used to distribute the copies of each replica horizontally across a cluster [35]. To resolve inconsistencies, CouchDB uses a conflict-flagging mechanism.
CouchDB supports CRUD operations and ad-hoc querying using Mango, a declarative JSON-based query language. For aggregation, CouchDB provides Views and MapReduce functionalities [36]. Indexing in CouchDB is achieved through the use of views. CouchDB provides two types of indices: JSON indices and text indices, the latter for full-text search support.

Couchbase
Couchbase is a highly-scalable DODBMS that stores documents using the JSON encoding. It offers high availability, horizontal scaling, and high transaction throughput [37].
Transactions in Couchbase respect the ACID properties and rely on Eventual Consistency and Immediate Consistency. Couchbase has in-memory capabilities and keeps records in buckets. The buckets are of the following types: i) Couchbase buckets, used to store data persistently and in-memory, ii) Ephemeral buckets, used when persistence is not required, and iii) Memcached buckets, used to cache frequently used data and minimize the number of queries a database server must perform.
Couchbase uses a shared-nothing architecture and provides primary-primary and primary-secondary replication as well as partitioning through the use of sharding. Couchbase scales horizontally in a cluster.
Ad-hoc data querying is achieved using a JavaScript API or a SQL-like language, i.e., N1QL (Non-1NF Query Language) [38]. These languages enable Couchbase to have OLTP (Online Transaction Processing) CRUD operations and ETL (Extract Transform Load) capabilities [39].
JavaScript MapReduce Views can be developed and stored on the server-side to specify complex indexing and aggregation queries [40].
Couchbase provides multiple types of indices [40]: i) composite indices to index multiple attributes, ii) covering indices to index the information needed for querying without accessing the data, iii) filtered (partial) indices to index a subset of the data used by the WHERE clause, iv) function-based indices that compute the value of an expression over a range of documents, v) sub-document indices to index embedded structures, vi) incremental MapReduce views to index the results of complex queries that perform sorting and aggregation to support real-time analytics over very large datasets, vii) spatial views using Spatial MapReduce to index multi-dimensional numeric data, and viii) full-text indices used for full-text search capabilities.
partitioning mechanisms, except for eXist-db which offers primary-secondary replication. An advantage of XDBMSes is the use of XQuery and XPath for querying the data, which makes ad-hoc querying an easy task. Although XDBMSes support aggregation queries, they do not provide MapReduce frameworks as a result of the lack of distribution capabilities. Another advantage of XDBMSes is that they offer different types of indices, including text indices for full-text search. As can be seen from Table 1, the chosen JDBMS solutions also offer different types of indices; however, unlike the indices of JDBMSes, those used in XDBMSes can also be defined on properties and paths, not only on keys and values.

Data Model
For our benchmark, we proposed a heterogeneous entity-relationship schema that can be easily expanded with more complex relationships and new entities. Figure 1 presents the proposed schema. The model's entities are described below.
• Authors is the entity that stores information about authors. Besides the unique identifier for each author AuthorID, the attribute Name is used for storing the name of each author.
• Records contains information about the published work of one or more authors. It stores the Title, the URL for quick access on the web, and the publishing Year. The many-to-many relationship WrittenBy correlates each record with its authors. A record can be published either as a book (or book chapter) or as an article (conference or journal). The relationship IsA is used for denoting the sub-type of a record.
• Books is the first sub-type of a record. This entity stores the following information: i) the unique book identifier ISBN, ii) the pages of a record using the attribute Pages, iii) the book editors using the multi-valued attribute Editors, and iv) the type of a record of this sub-type, i.e., book or book chapter, using the attribute Type. The one-to-many relationship PublishedBy is used to correlate each record of sub-type Book to a Publisher.
• Articles is the second sub-type of a record. Besides the unique identifier of a record in this sub-type, the entity Articles stores information about i) the pages of a record using the attribute Pages, and ii) the type of a record of this sub-type, i.e., conference or journal article, using the attribute Type. The one-to-many relationship PublishedIn is used to correlate each article to a journal.
• Journals entity stores information about an article publication venue. The attributes are: i) ISSN used as the unique identifier, ii) Type used to determine if the publication is a journal, proceedings, or special issue, iii) Title used for keeping the title of the journal or the conference name, iv) Volume used to store the number of years since the first publication, and v) Issue used to store how many times the journal has been published during a year. The one-to-many relationship PublishedBy is used to correlate each record of sub-type Journal to a Publisher.
• Publishers is the entity that stores a unique identifier and the Name of a publishing house.
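To make the two document encodings concrete, the sketch below serializes one record both as JSON and as XML using only the Python standard library. It is an assumption-laden illustration: the denormalized layout (authors embedded in the record), the element names, and the sample values are hypothetical and are not the physical design used in the benchmark.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical record with invented values; the attribute names follow the
# entities above (RecordID, Title, Year), the embedding of authors is assumed.
record = {
    "RecordID": "r1",
    "Title": "A Sample Title",
    "Year": 2019,
    "Authors": ["Author One", "Author Two"],
}


def to_json(rec):
    """Encode the record as a JSON document, as a JDBMS would store it."""
    return json.dumps(rec, sort_keys=True)


def to_xml(rec):
    """Encode the same record as an XML document, as an XDBMS would store it."""
    root = ET.Element("Record", RecordID=rec["RecordID"])
    ET.SubElement(root, "Title").text = rec["Title"]
    ET.SubElement(root, "Year").text = str(rec["Year"])
    authors = ET.SubElement(root, "Authors")
    for name in rec["Authors"]:
        ET.SubElement(authors, "Author").text = name
    return ET.tostring(root, encoding="unicode")


json_doc = to_json(record)
xml_doc = to_xml(record)
```

Embedding the authors avoids a join at query time at the cost of duplicating author names across records; a normalized layout with separate Authors documents is an equally plausible choice.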

Workload Model
The workload model follows two analysis directions: i) selection queries for filtering the corpus and extracting subsamples, and ii) aggregation queries for creating reports.
For the selection queries, a constraint $c_1^i = contains(Records.Title, t_i)$ is used to extract the records whose title contains a given term $t_i$ from a set of search terms. The constraint $c_1^i$ relies on the $contains(\cdot, \cdot)$ function, which verifies whether a substring $t_i \in \{t \mid t \in vocabulary\}$ occurs in a string. In this case, the vocabulary is the set of terms extracted from each title using tokenization.
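The selection constraint and the vocabulary it draws terms from can be sketched as follows. The tokenization rule (lowercased alphanumeric runs) and the case-insensitive matching are assumptions made for this illustration; the function names and sample titles are hypothetical.

```python
import re


def tokenize(title):
    """Split a title into lowercase alphanumeric terms (assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", title.lower())


def build_vocabulary(titles):
    """Collect the set of distinct terms that occur in any title."""
    vocabulary = set()
    for title in titles:
        vocabulary.update(tokenize(title))
    return vocabulary


def contains(text, term):
    """Substring test; case-insensitivity is an assumption of this sketch."""
    return term.lower() in text.lower()


def select_titles(titles, term):
    """Keep the titles that satisfy the constraint contains(title, term)."""
    return [t for t in titles if contains(t, term)]


# Invented sample titles.
titles = [
    "Native XML Databases",
    "JSON Document Stores",
    "Benchmarking XML and JSON",
]
vocabulary = build_vocabulary(titles)
xml_titles = select_titles(titles, "xml")
```

Because the test is a substring match rather than a token match, a term such as "base" would also select "Native XML Databases".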
Aggregation queries are used to create reports about the publishing activity of each author.

Database Design
The conceptual entity-relational diagram described in Section 4 must be translated into the

Query Description
The proposed benchmark features nine queries with different complexity and selectivity, i.e., $Q_1$ to $Q_9$. The first five queries are used to filter the dataset based on different constraints, whereas the last four queries are used to filter and group the data in order to obtain aggregated results.

Selection Queries
The first set of queries selects the records that respect a given constraint.
The first query ($Q_1^i$) uses the constraint $c_1^i$ to extract the documents whose title contains a given term $t_i$ (Equation (1)). The projection for the query, which specifies the set of attributes returned by the query, is $\Pi_1 = \{Records.Title\}$.
The second query ($Q_2^{ij}$) extracts the records that contain two terms in their title (Equation (2)). It uses the constraints $c_1^s$, $s \in \{i, j\}$ with $i \neq j$. The query is written using the INTERSECTION operator between the results returned by $Q_1^i$ for term $t_i$ and $Q_1^j$ for term $t_j$. Because both conditions apply to the same attribute, they can be combined into a single conditional expression using the logical and operator ($\wedge$), i.e., $c_1^i \wedge c_1^j$. As in the case of the first query, the projection remains $\Pi_1$.
$Q_3^{ij}$ extracts the records that contain in their title at least one of the terms given through the $c_1^i$ or $c_1^j$ constraints, with $i \neq j$ (Equation (3)). The query is written using the UNION operator between the results returned by $Q_1^i$ for term $t_i$ and $Q_1^j$ for term $t_j$. The projection remains $\Pi_1$. As for query $Q_2^{ij}$, the conditions can be combined into a single conditional expression using the logical or operator ($\vee$), i.e., $c_1^i \vee c_1^j$.
The fourth query ($Q_4$) filters the Records entity and extracts the documents that contain in their title the terms $t_i$, $t_j$, and $t_k$ (Equation (4)). As for the previous queries, the projection attributes are given by $\Pi_1$. The query is written using the INTERSECTION operator between the results obtained by $Q_1^i$, $Q_1^j$, and $Q_1^k$ for terms $t_i$, $t_j$, and $t_k$, respectively. As before, the filtering conditions can be combined into one constraint $c_1^i \wedge c_1^j \wedge c_1^k$.
The last selection query ($Q_5$) extracts the documents that contain in their title one or more of the searched terms $t_s$, $s \in \{i, j, k\}$ with $i \neq j \wedge i \neq k \wedge j \neq k$. The query is written using the UNION operator between the results obtained by each $Q_1^s$ for the terms $t_s$. The filtering constraints permit the query to be written using one constraint $c_1^i \vee c_1^j \vee c_1^k$ and the projection $\Pi_1$ (Equation (5)).
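The five selection queries can be summarized in relational-algebra notation. The block below is a reconstruction from the textual descriptions only, not a copy of the paper's Equations (1) to (5); in particular, the selection operator $\sigma$ is an assumed notation, and the symbols for the constraints, projection, and queries follow the descriptions of the selection queries.

```latex
\begin{align*}
Q_1^{i}  &= \Pi_1\big(\sigma_{c_1^{i}}(Records)\big) \\
Q_2^{ij} &= Q_1^{i} \cap Q_1^{j}
          = \Pi_1\big(\sigma_{c_1^{i} \wedge c_1^{j}}(Records)\big) \\
Q_3^{ij} &= Q_1^{i} \cup Q_1^{j}
          = \Pi_1\big(\sigma_{c_1^{i} \vee c_1^{j}}(Records)\big) \\
Q_4      &= Q_1^{i} \cap Q_1^{j} \cap Q_1^{k}
          = \Pi_1\big(\sigma_{c_1^{i} \wedge c_1^{j} \wedge c_1^{k}}(Records)\big) \\
Q_5      &= Q_1^{i} \cup Q_1^{j} \cup Q_1^{k}
          = \Pi_1\big(\sigma_{c_1^{i} \vee c_1^{j} \vee c_1^{k}}(Records)\big)
\end{align*}
```

The second equality in each line holds because every constraint tests only $Records.Title$, which is also the sole projected attribute, so intersecting or uniting the projected results is equivalent to combining the predicates before projecting.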

Aggregation Queries
The last four queries use aggregation to count the number of articles using different filtering constraints and attributes in the GROUP BY clause.
The sixth query (Q 6 ) uses aggregation to determine the number of articles written by each author (Equation (6)). It uses a JOIN operation between the Records and Authors entities. Because there is a many-to-many relationship between the two entities, the JOIN also traverses WrittenBy.
The projection attributes are Π 6 = {Author.N ame, count}. To determine the number of articles for each author, we use the aggregation operator γ L6 , where L 6 = (F 6 , G 6 ). The list of aggregation functions is given by F 6 , while the set of attributes in the GROUP BY clause is given by G 6 . The list of aggregation functions is F 6 = {count(Records.RecordID)}, where the count is the counting aggregation function. The set of attributes in the GROUP BY clause is G 6 = {Authors.N ame}.
The seventh query (Q 7 ) counts the number of articles published by an author for each year (Equation (7)). The query makes use of a JOIN operation between the Records and Authors entities, as in the case of query Q 6 . The projection uses the following attributes The eighth query (Q 8 ) extracts the documents that contain in their title all of the searched terms, and then it counts the number of articles grouped by author and year. As in the case of Q 6 , the JOIN operation is between the Records and Authors entities. The query is written using the INTERSECTION operator. The filtering is done using the constraints c i 1 , c j 1 , c k 1 which ensures that the title contains all terms t i , t j , and t k with i = j ∧ i = k ∧ j = k. The projection attributes and the aggregation operator remains the same as in the case of Q 7 , i.e., Π 7 and γ L7 . Due to the nature of the filtering conditions, the query can be rewritten using only one constraint c i 1 ∧ c j 1 ∧ c k 1 .
The last query ($Q_9$) extracts the documents whose title contains one or more of the searched terms $t_s$, $s \in \{i, j, k\}$ and $i \neq j \wedge i \neq k \wedge j \neq k$, by filtering with the constraint $c_1^s$. The JOIN operator is used once again between the Records and Authors entities, as in the case of $Q_6$. The projection attributes and the aggregation operator remain the same as in the case of $Q_7$, i.e., $\Pi_7$ and $\gamma_{L_7}$. The filtering constraints $c_1^i$, $c_1^j$, and $c_1^k$ are applied on the Records entity. The query uses the UNION operator between the relations obtained after filtering. Due to the nature of the filtering, the query can be rewritten using one constraint $c_1^i \vee c_1^j \vee c_1^k$.

Experimental Conditions
All tests were run on an IBM System x3550 M4 with 64 GB of RAM and an Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz. The XDBMSes used for benchmarking are BaseX, eXist-db, and Sedna. For comparison, we also use three JDBMSes: MongoDB, CouchDB, and Couchbase.
We chose these DODBMSes because they are free to use and because their licenses do not forbid benchmarking.
The versions of the deployed DODBMSes are listed in Table 2. The proposed benchmark, the results, and the dataset used are publicly available online (footnote 4).
As the chosen XDBMS solutions do not support partitioning, we could not distribute them. Therefore, we deployed and tested them in a single-instance environment. Moreover, for [...]
The query parameterization is presented in Table 3. Each term $t_i$, $i \in \{1, 2, 3\}$, is used for filtering the records through the constraint $c_1^i$. Thus, for the first set of queries, i.e., $Q_1^i$, $Q_2^{ij}$, and $Q_3^{ijk}$, the indices $i$, $j$, and $k$ ($i \neq j \wedge i \neq k \wedge j \neq k$) each take a value in $\{1, 2, 3\}$ and identify the term $t_i$ used for filtering.
Table 4 presents the size of the 4 subsets, both as raw data and as the resulting DODBMS collection. For all the XDBMSes, as well as for CouchDB and Couchbase, the database size is larger than the raw dataset. This increase is a direct result of the overhead required by the DODBMSes to manage and store the data. MongoDB uses compression mechanisms, which decrease the database size by minimizing this overhead.

Query Implementation
Data are stored within each DODBMS using a denormalized schema; thus, one-to-many and many-to-many relationships are encapsulated inside the same document. To achieve denormalization, JDBMSes employ nested documents, lists, and lists of nested documents, while XDBMSes use the hierarchical structure of the XML format. To normalize the information and apply filtering and aggregation operations and functions, we use the native syntax, operators, query language clauses, and frameworks provided by each DODBMS. Table 5 presents the implementation language and operators.
The aggregation queries in MongoDB are implemented using its Aggregation Pipeline framework. To deal with nested documents, the unwind operator is used to flatten an array field of nested documents. This operator is useful when normalizing the one-to-many and many-to-many relationships, which through denormalization are stored in the JSON format as nested documents or lists of nested documents. We used the native Command Line Interface to run these queries.
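To make the flattening step concrete, the sketch below emulates an unwind stage followed by a grouping stage in plain Python, without a MongoDB server. The document layout and the field names `authors` and `name` are assumptions, not the benchmark's actual schema; the `pipeline` list only shows how such stages would look in MongoDB's aggregation syntax and is not executed here.

```python
from collections import Counter

# For reference only: the shape of an equivalent MongoDB pipeline
# (field names are hypothetical; this list is not executed in the sketch).
pipeline = [
    {"$unwind": "$authors"},
    {"$group": {"_id": "$authors.name", "count": {"$sum": 1}}},
]

def unwind(docs, field):
    """Emit one copy of each document per element of its list field.

    Assumes the field is a list or missing; documents with a missing
    or empty list produce no output.
    """
    for doc in docs:
        for item in doc.get(field) or []:
            yield {**doc, field: item}

def group_count(docs, key):
    """Count documents per grouping key."""
    return dict(Counter(key(doc) for doc in docs))

docs = [
    {"_id": 1, "authors": [{"name": "A. Author"}, {"name": "B. Writer"}]},
    {"_id": 2, "authors": [{"name": "A. Author"}]},
    {"_id": 3, "authors": []},
]

flat = list(unwind(docs, "authors"))
print(group_count(flat, key=lambda d: d["authors"]["name"]))
```

After unwinding, each output document holds exactly one author, so the many-to-many relationship can be grouped as if it were stored in normalized form.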
CouchDB uses Materialized Views for aggregation and to deal with nested documents and lists of nested documents. These views are implemented using CouchDB's MapReduce framework. The mapper function is used to flatten nested documents and filter the field. The reducer function is used for applying an aggregation function and returning the final result. We used cURL to run these queries.
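The division of labor between mapper and reducer can be sketched as follows. This is a plain-Python emulation, not CouchDB code (CouchDB view functions are normally written in JavaScript), and the document fields `title`, `year`, and `authors` as well as the substring filter are illustrative assumptions.

```python
from itertools import groupby

def map_doc(doc, term):
    """Mapper: filter on the title, flatten the authors, emit (key, value) pairs."""
    if term.lower() in doc["title"].lower():
        for author in doc["authors"]:
            yield (author["name"], doc["year"]), 1

def reduce_count(values):
    """Reducer: aggregate the values emitted for one key."""
    return sum(values)

def run_view(docs, term):
    """Group the emitted pairs by key and reduce each group."""
    emitted = sorted(pair for doc in docs for pair in map_doc(doc, term))
    return {
        key: reduce_count([value for _, value in group])
        for key, group in groupby(emitted, key=lambda pair: pair[0])
    }

docs = [
    {"title": "XML stores", "year": 2019,
     "authors": [{"name": "A. Author"}, {"name": "B. Writer"}]},
    {"title": "XML benchmarks", "year": 2019, "authors": [{"name": "A. Author"}]},
    {"title": "JSON stores", "year": 2020, "authors": [{"name": "B. Writer"}]},
]

print(run_view(docs, "xml"))
```

In this emulation the view is recomputed on every call, whereas a materialized view keeps its result stored between queries.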
To manipulate nested arrays in Couchbase, N1QL offers developers the UNNEST clause. This clause is used to flatten the arrays in the parent document. Thus, the UNNEST clause conceptually performs a JOIN operation between the nested arrays and the parent document. As data are stored using the JSON format, the JOIN operation increases the runtime and decreases the overall retrieval performance. For Couchbase, we used the native Command Line Interface to run these queries.

Query Selectivity
Selectivity, i.e., the amount of retrieved data ($n(Q)$) w.r.t. the total amount of available data ($N$), depends on the number of attributes in the WHERE and GROUP BY clauses. The selectivity formula used for a query $Q$ is $S(Q) = 1 - \frac{n(Q)}{N}$. For the selection queries, we set $N$ equal to the cardinality of the Records entity, i.e., $N = \|Records\|$.
Table 6 presents the filtering queries' selectivity w.r.t. the $SF$. Queries with more restrictive conditions return fewer records and have a higher selectivity, e.g., $Q_2^{ij}$. Queries with more inclusive conditions return more records and have a lower selectivity, e.g., $Q_3^{ij}$.
Table 7 shows the aggregation queries' selectivity w.r.t. the $SF$. Query $Q_8$ is the most restrictive query. Because of the filtering and grouping conditions, $Q_8$ returns a small number of records, and its selectivity is almost equal to 1. The most inclusive query is $Q_7$, and it has a low selectivity w.r.t. $SF$. Because of the less restrictive filtering and grouping conditions, the selectivity of this query is less than 0.45. The selectivity of $Q_6$ increases w.r.t. $SF$, meaning that the number of records returned by the query grows more slowly than the size of the corpus.
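The selectivity formula $S(Q) = 1 - n(Q)/N$ can be written as a small helper function. The record counts in the usage lines are made-up numbers chosen for easy arithmetic and are not values from Table 6 or Table 7.

```python
def selectivity(n_q: int, n_total: int) -> float:
    """Return S(Q) = 1 - n(Q) / N.

    n_q     -- number of records returned by the query, n(Q)
    n_total -- total number of available records, N
    """
    if n_total <= 0:
        raise ValueError("N must be positive")
    if n_q < 0:
        raise ValueError("n(Q) must be non-negative")
    return 1.0 - n_q / n_total

# Hypothetical counts: a restrictive query vs. an inclusive query.
restrictive = selectivity(250, 1000)  # few records returned, high selectivity
inclusive = selectivity(500, 1000)    # many records returned, low selectivity
print(restrictive, inclusive)
```

Under this definition a query that returns nothing has selectivity 1 and a query that returns every record has selectivity 0, which matches the reading that more restrictive queries have higher selectivity.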

Performance Metrics and Execution Protocol
We use the query response time as the only metric for the benchmark. It is denoted for each query by $t(Q_i^*)$, $\forall i \in [1, 9]$. All queries are executed 10 times, which is sufficient according to the central limit theorem. Additionally, all executions are warm runs: either the caching mechanisms are deactivated, or each query is first executed once as a cold run, which is not taken into account in the benchmark's results, to fill the cache. Queries must be written in the native scripting language of the target DODBMS and executed directly inside the specified system using the command line interpreter. Lastly, the average response time and standard deviation are computed for each $t(Q_i^*)$.
Figure 3 presents the results of $Q_1^i$, where $i \in \{1, 2, 3\}$ denotes the keyword $t_i$. MongoDB and BaseX offer the fastest response times among the DODBMSes that encode documents using JSON and XML, respectively, regardless of the keyword and the $SF$. For the $Q_1^2$ query, which has the lowest selectivity of the three $Q_1^i$ queries, CouchDB is ∼2x faster than eXist-db w.r.t. $SF$. The time performance of CouchDB and eXist-db for $Q_1^1$ and $Q_1^3$ tends to converge as $SF$ grows, i.e., the performance difference factor between CouchDB and eXist-db is ∼0.8x at $SF = 0.125$ and increases to ∼0.9x at $SF = 1$.
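The execution protocol of this section (one discarded run to fill the cache, ten measured runs, then mean and standard deviation) can be sketched as a small timing harness. The `run_query` callable is a hypothetical stand-in for invoking a query through a system's command line interpreter, and the use of the sample standard deviation is an assumption, since the text does not state which estimator was used.

```python
import statistics
import time

def benchmark(run_query, runs=10):
    """Time a query: one discarded cache-filling run, then `runs` measured runs.

    Returns (mean, standard deviation) of the response times in seconds.
    """
    if runs < 2:
        raise ValueError("at least two runs are needed for a standard deviation")
    run_query()  # cold run, fills the cache and is excluded from the results
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        run_query()
        timings.append(time.perf_counter() - start)
    return statistics.mean(timings), statistics.stdev(timings)

# Usage with a dummy workload in place of a real query.
mean_s, std_s = benchmark(lambda: sum(range(10_000)))
print(f"mean = {mean_s:.6f} s, std = {std_s:.6f} s")
```

`time.perf_counter` is used because it is a monotonic, high-resolution clock intended for measuring short durations.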

Results
CouchDB is ∼1.1x faster than Couchbase for all the $Q_1^i$ queries regardless of $SF$. Couchbase and eXist-db have similar performance for query $Q_1^3$ and $SF = 1$. Sedna's performance is almost constant regardless of query selectivity w.r.t. $SF$. The overall best performance is achieved by MongoDB.
Figure 4 presents the results of the $Q_2^{ij}$ and $Q_3^{ij}$ queries, where $i$ and $j$ indicate the $t_i$ and $t_j$ keywords used for filtering (Table 3), with $i, j \in \{1, 2, 3\}$ and $i \neq j$. For this set of queries, MongoDB has the best overall time performance regardless of the $SF$. BaseX achieves the second overall best performance and the best performance among the tested XDBMSes, regardless of the $SF$.
For the $Q_2^{ij}$ set of queries, MongoDB is between ∼3.2x and ∼3.6x faster than BaseX w.r.t. $SF$. For the $Q_3^{ij}$ set of queries, MongoDB is between ∼1.8x and ∼2.2x faster than BaseX w.r.t. $SF$.
Couchbase presents the highest execution time for the $Q_2^{ij}$ queries regardless of $SF$, followed by CouchDB. CouchDB is ∼1.2x and ∼1.1x faster than Couchbase for the $Q_2^{ij}$ and $Q_3^{ij}$ queries, respectively, regardless of $SF$. The eXist-db XDBMS has the worst performance for the $Q_3^{ij}$ set of queries regardless of the $SF$. For the $Q_2^{ij}$ set of queries, Sedna is ∼2x faster than CouchDB and 2x slower than eXist-db. For the $Q_3^{ij}$ set of queries, Sedna's query execution time is ∼1.5x better than CouchDB's and ∼5x worse than BaseX's.
Figure 5 presents the time performance of the $Q_4$ and $Q_5$ queries for each DODBMS w.r.t. $SF$. The time performance trends for $Q_4$ and $Q_5$ remain similar to those for $Q_2^{ij}$ and $Q_3^{ij}$, respectively. CouchDB is ∼1.3x and ∼1.2x faster than Couchbase for the $Q_2^{ij}$ and $Q_3^{ij}$ queries, respectively, regardless of $SF$. MongoDB achieves the overall best time performance for both queries. BaseX has the second-best time performance among the tested DODBMSes and the best performance among the XDBMSes.
Figure 6 shows the results for the aggregation queries, i.e., $Q_6$ to $Q_9$. For the queries $Q_6$, $Q_7$, and $Q_9$, BaseX has the best time performance and significantly outperforms MongoDB and CouchDB, by a factor of ∼2x, regardless of the $SF$. For the $Q_8$ query, CouchDB achieved the best query execution time, while Couchbase achieved the worst. MongoDB has the second-best query response time among the studied DODBMSes for $Q_6$, $Q_7$, and $Q_9$. MongoDB's response time for these queries is almost on par with the response time of CouchDB w.r.t. $SF$, although MongoDB executes the aggregation functions at runtime. For $Q_7$, Couchbase has a large standard deviation. During testing, this query finished with the error "Index scan timed out" in some runs, and the runs that finished with the status "success" returned fluctuating response times. This abnormal behavior of the Couchbase system can sometimes be observed for complex queries on large collections.
For $Q_8$, which has the highest selectivity, CouchDB holds the best time performance. We attribute this result to the mechanism used by CouchDB to store aggregation functions. [...] to filter and group the information, increases the runtime significantly while decreasing the overall query performance.
The aggregation queries did not work on Sedna. When executing these queries, the XDBMS remained unresponsive for days, and we had to manually stop the system, the related services, and the background processes. We note that Sedna also executes aggregation functions at runtime. We suspect that one reason for Sedna's failure to execute the aggregation queries is also the outdated XQuery 1.0 query language.
Figure 5: Response time for $Q_4$ and $Q_5$.
The eXist-db XDBMS has the highest query time for the $Q_6$, $Q_7$, and $Q_9$ queries.

Discussions on the Experimental Design Choices
In this study, we present our findings regarding the performance of filtering and aggregation queries on a large dataset for XDBMSes and JDBMSes w.r.t. different scale factors. We observe that the XDBMSes perform as well as JDBMSes for specific use cases, with BaseX even outperforming the more popular JDBMSes on three out of the four aggregation queries. Among the JDBMSes, MongoDB has the overall best performance.
For our comparison, we do not take into account horizontal scalability through sharding and replication, as not all of the analyzed DBMSes provide such functionality. Furthermore, it is essential to first understand single-node performance before considering horizontal scaling. Thus, the aim of the paper is to examine single-instance deployments.
There are many real-world scenarios where such single-instance deployment is preferred. As a first example, XDBMSes can be used for fast application development, analyzing and querying log data, or storing and retrieving IoT sensor data. XDBMSes are good candidates for storing large documents, managing long-running transactions, and querying hierarchical data structures in environments that require rapidly evolving schemas. Furthermore, these DBMSes are lightweight and do not require dedicated hardware, software, or a lot of resources. They thus lower resource costs at the data center site and enable on-site data analysis and decision making. Therefore, they can be utilized in Edge and Fog Computing with ease.
The creation of network islands due to faulty nodes is very common in the Edge/Fog environment. Even in the presence of well-defined recovery mechanisms, the formation of temporal network islands is unfavorable for sharding, as the overall latency increases if nodes go down and then up again. Hence, single-instance deployments are favored in these environments.
Another real-world scenario where such single-instance deployment can be used is small to medium scale document management systems. These management systems are useful to smaller enterprises, where data are kept in the company due to GDPR (the European Union's General Data Protection Regulation). Moreover, as in many cases most of the data are in semi-structured formats, such as XML and JSON, single-instance DODBMSes are good candidates for storing and managing such documents. This removes the maintenance of a data center from the company's costs.
It is also important to mention that the focus of our benchmark is on data retrieval and not on write operations because, in real-world applications, multiple techniques can be put in place to balance the write operations and minimize the workload. Moreover, data persistence can be achieved much later within a DBMS, depending on the workload and the system's write and logging mechanisms.
Furthermore, we loaded the data into the databases using different methods. Because not all of the tested DODBMSes have their own data loading tools, we developed our own data loading programs. By using our own programs instead of the native DBMS loading functionality, we added a new layer of complexity, which decreases write performance. This makes the loading process dependent on external database connector (DBC) implementations and not on the DODBMS's internal functionality.

Conclusion
In this paper, we present an overview and comparison of DODBMSes that encode information using the XML and JSON formats and propose a benchmark using filtering and aggregation queries on a heterogeneous dataset. For our experiments, we chose three XDBMSes, i.e., BaseX, eXist-db, and Sedna, and three JDBMSes, i.e., MongoDB, CouchDB, and Couchbase. These DODBMSes are open-source and free-to-use systems whose licenses do not forbid benchmarking.
Our comparison focuses on key functionalities required by Big Data and IoT systems for storing and extracting information from large volumes of data. For this comparison, we also consider the transaction properties of each DODBMS, their in-memory capabilities, and how these systems deal with atomicity, consistency, isolation, and durability with regard to operations such as accessing, modifying, and saving documents. We also present for each DODBMS its support for replication and partitioning of data and how it manages these Big Data requirements. Furthermore, we present the querying languages used for extracting information as well as the different types of indices provided by each DODBMS to improve retrieval response time.
The proposed benchmark uses different queries to emphasize the time performance of DODBMSes and highlights the capabilities of XDBMSes and JDBMSes. Furthermore, our solution proves its portability, scalability, and relevance by its design. The benchmark is portable, as it works on multiple systems. For this purpose, we compare the performance of several DODBMSes, i.e., BaseX, eXist-db, Sedna, MongoDB, CouchDB, and Couchbase. To demonstrate the scalability of our solution, we introduced $SF$, the scaling factor that generates an incremental growth in the data volume for our experiments. By increasing the queries' complexity together with the $SF$, we analyze the behavior of the systems from the scaling perspective. We observe that all the DODBMSes show a linear increase in runtime. Furthermore, BaseX proves to be a good choice when dealing with aggregations. Finally, our experimental results show that our benchmark is indeed relevant for comparing the runtime performance of different DODBMSes.
The performance tests provide some interesting and unexpected results. Among the XDBMSes, BaseX has the best overall performance. BaseX even outperforms the JDBMSes selected for this benchmark, i.e., MongoDB, CouchDB, and Couchbase, for three out of the four aggregation queries proposed. We observe that Couchbase has the overall worst performance among the JDBMSes.
Sedna outperforms CouchDB and Couchbase when dealing with filtering queries, but does not work for the aggregation queries. MongoDB has the overall best time performance for the filtering queries, and it outperforms BaseX only for the aggregation query $Q_8$. eXist-db exhibits some strange behavior when dealing with both filtering and aggregation queries. Also, it is highly dependent on the JVM, which needs to be tuned for each query, making this XDBMS hard to work with.
However, we can assume that eXist-db works well on a query-by-query basis.
Following the results obtained by the benchmark, we can answer the three research questions and conclude that XDBMSes are still useful: their performance is as good as that of JDBMSes, and they are good candidates for Big Data management. Furthermore, XDBMSes are well-suited for several current real-world scenarios. Firstly, XDBMSes are reliable systems for storing large documents, managing long-running transactions, and querying hierarchical data structures in Edge/Fog environments (e.g., smart agriculture, healthcare wearables, etc.), as these types of DODBMSes are lightweight and do not require dedicated hardware, software, or a lot of resources.
Secondly, XDBMSes can be used as small to medium scale document management systems in smaller enterprises, where data are kept in the company due to GDPR. Thirdly, in the case of Big Data analysis, they prove to be well-suited when the documents are in XML format, by removing the ETL (Extract, Transform, Load) processes from the storing, managing, and analysis pipeline.
As future work, we plan to improve the support for OLAP queries [41] on XML data and XML data in combination with other data [42,43] both in terms of performance and functionality.
This includes designing new sampling strategies and supporting more aggregation queries [42]. The sampling methods will include constraints on other labels and values contained in the records. Also, we aim to add more dimensions for grouping [42], to boost the performance by lowering the query selectivity and performing query rewriting [43], and to add further grouping functionality [42].