Database management system performance comparisons: A systematic literature review

Efficiency has been a pivotal aspect of the software industry since its inception, as a system that serves the end-user quickly and the service provider cost-efficiently benefits all parties. A database management system (DBMS) is an integral part of effectively all software systems, and therefore it is logical that different studies have compared the performance of different DBMSs in hopes of finding the most efficient one. This study systematically synthesizes the results and approaches of studies that compare DBMS performance and provides recommendations for industry and research. The results show that performance is usually tested in a way that does not reflect real-world use cases, and that tests are typically reported in insufficient detail for replication or for drawing conclusions from the stated results.


Introduction
Efficiency is important in effectively all software systems, whether efficiency is measured by response times, how many concurrent users the system can serve, or how energy-efficient the system is [79]. Despite its importance, many software systems suffer from efficiency problems [49], as optimization has been largely recognized as a complex task [79,27]. The more a system holds and handles data, the more the system's performance depends on the database, and the database is often one of the first suspects when a performance issue is detected. The domain of database management systems (DBMS) saw rapid advancements in performance especially in the 1980s and 1990s, as benchmarking competitions between DBMS and hardware vendors led to innovations in DBMS technology that significantly improved DBMS performance [24]. Performance improvements are related to DBMS aspects such as different supporting data structures [82], and algorithms for sorting [33,29] and joining [72,61]. Given that DBMSs are annually a multi-billion dollar industry, the performance of a DBMS is one of the most crucial aspects when a company chooses a DBMS for their product or service [26]. As different DBMS performance comparison studies and DBMS vendor white papers highlight the performance gains of one DBMS over another, it may seem tempting to either choose the fastest DBMS for a business domain or to migrate from one DBMS to another for performance gains. However, as we show and argue in this study, performance is typically tested in very specific contexts which are not necessarily generalizable, and there are other aspects besides performance to consider.
This study was inspired by a study by Raasveldt et al. [69], which claimed that "[...] we will explore the common pitfalls in database performance benchmarking that are present in a large number of scientific works [...]" while consciously refraining from citing example studies. While we agree with their claim based on our personal experiences, we wanted to systematically explore whether this phenomenon is common among performance benchmarks, and whether such studies show performance gains of one DBMS over another in a setting that can be replicated. This study is not an attempt to criticize studies comparing DBMS performance, as no scientific study (ours included) is without threats to validity. Rather, based on the survey of the literature, the primary goals of our study are to (i) propagate information on how DBMS performance has been tested, (ii) how performance has been recommended to be tested, (iii) how the performance comparison results should be interpreted, (iv) what other aspects besides performance should be considered, and (v) what other avenues might be fruitful for DBMS performance testing. Additionally, we provide (vi) a relatively accessible background on database system performance, followed by (vii) a systematic review of literature on DBMS performance comparisons, (viii) describing which DBMSs and which types of DBMSs have been compared with each other, (ix) the outcomes of the performance comparisons, and (x) by which benchmarks the DBMSs have been compared.
The rest of this study is structured as follows. In Sections 2 and 3, we provide theoretical background for understanding the results and discussion provided by this study. These background sections deliberately refrain from using unnecessary information technology-related terms, acronyms, algorithms, or mathematics, to cater to the needs of readers from various backgrounds. For readers more technically inclined or interested, we have provided further reading at the end of Sections 2 and 3. Section 4 details how we searched, selected, and categorized the DBMS performance comparison studies, and Section 5 presents a high-level overview of the results, which is complemented by Appendix A detailing the performance comparison outcomes. In Section 6, we discuss what these findings mean, how they are applicable in industry, and present our recommendations for industry and research based on the findings. Section 7 concludes the study.

Database System Overview
A database is a collection of interrelated data, typically stored according to a data model. Typically, the data is used by one or several software applications via a DBMS. Collectively, the database, the DBMS, and the software application are referred to as a database system [31, p.7][17, p.65]. Separating the database from the DBMS, especially in the realm of relational databases, is typically impossible without exporting the database into another format. In these situations, the database is often unusable by the DBMS unless it is imported back into a format understood by the DBMS. Possibly due to this inseparability, both the DBMS and the underlying database are often colloquially referred to simply as the database. It is worth noting, though, that the former is a piece of software that does, while the latter is a collection of data that is.
Fig. 1 shows a simplified example of a system where the components crucial for a database system and the scope of this study are emphasized. We refer to the components in the figure throughout this study. Several things are worth noting when considering the figure, as we have traded technical precision and comprehensiveness for ease of presentation by depicting only a single end-user, a single software application (some parts typically reside on the end-user's device, while others reside on a separate server), a single DBMS, single hardware components, and a single database. Furthermore, we have not illustrated other DBMS components such as access control, data structures such as metadata, or outputs such as query execution plans. The figure also adopts the view that the database resides in persistent storage; this is not always the case. Additionally, we have depicted merely a centralized database system in which neither the DBMS nor the database is distributed across multiple nodes. These are willful omissions given the scope of this study.

Data Models
Databases follow one or several data models, i.e., definitions of how and what data can be stored, and sometimes, what operations are available for data retrieval and manipulation. Data models may be conceptual, logical, or physical. Conceptual models such as the Entity-Relationship model [11] do not dictate how data should be stored, but are rather used to describe the interrelations and characteristics of the data. Logical data models such as the relational model [14] are related to how data is stored and presented, but often without describing how the data is physically stored, e.g., which computing node is responsible for storing the data, where the data is located on a disk, and what types of indices (i.e., redundant data structures which facilitate query performance) and physical data retrieval operations are available. One DBMS is not limited to using a single data model [34].
There are several popular logical data models, some of which are inseparable from their underlying physical data models. One of the most prominent logical data models is the relational data model rooted in set theory [14]. Relational DBMSs (RDBMSs) follow many of the concepts introduced in the relational model. Many of the popular RDBMSs such as PostgreSQL and Oracle Database have adopted data structures from other logical data models as well [57]. What is common to effectively all modern RDBMSs is that they utilize the Structured Query Language (SQL) [46,47] to define data structures and to retrieve and manipulate data.
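To make the above concrete, the following sketch shows a minimal SQL round trip of data definition, manipulation, and retrieval. We use SQLite through Python's standard library purely for illustration (SQLite is not one of the systems discussed above); the SQL itself is close to the standard and portable to most RDBMSs.

```python
import sqlite3

# Create an in-memory relational database for illustration.
conn = sqlite3.connect(":memory:")

# Data definition: a table with a primary key constraint.
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")

# Data manipulation: insert two rows.
conn.execute("INSERT INTO customer (id, name) VALUES (1, 'Ada'), (2, 'Grace')")

# Data retrieval: a declarative query; the DBMS decides how to execute it.
rows = conn.execute("SELECT name FROM customer ORDER BY id").fetchall()
print(rows)  # [('Ada',), ('Grace',)]
```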

Figure 1: A simplified view of a database system and the end-user with the emphasis on components relevant to this study; the arrows represent the flow of information from the end-user's device to the database residing in persistent storage; the flow of information back to the software application is not illustrated here; gray rectangles represent boundaries of physical devices

Typically, RDBMSs also implement a strong data consistency model which dictates or allows that database operations grouped into a transaction must all succeed or all fail, that data must follow defined business logic, that successful transactions persist in storage, and that concurrent transactions [cf. 5] must result in the same data as if the transactions were serial. At least the last rule can often be loosened to various degrees in modern implementations. These constraints are collectively referred to as the ACID consistency model [42].
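The "all succeed or all fail" rule can be sketched in a few lines. The example below, again using SQLite only as an illustrative stand-in for an ACID-compliant RDBMS, attempts a transfer in which the second statement violates a business-logic constraint; atomicity then undoes the first statement as well.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Business logic encoded as a constraint: balances may not go negative.
conn.execute("CREATE TABLE account "
             "(id INTEGER PRIMARY KEY, balance INTEGER NOT NULL CHECK (balance >= 0))")
conn.execute("INSERT INTO account VALUES (1, 100), (2, 100)")
conn.commit()

# Transfer 150 from account 1 to account 2: the second UPDATE violates the
# CHECK constraint, so the whole transaction is rolled back (atomicity).
try:
    with conn:  # opens a transaction; commits on success, rolls back on error
        conn.execute("UPDATE account SET balance = balance + 150 WHERE id = 2")
        conn.execute("UPDATE account SET balance = balance - 150 WHERE id = 1")
except sqlite3.IntegrityError:
    pass  # the DBMS rejected the transaction as a whole

balances = [r[0] for r in conn.execute("SELECT balance FROM account ORDER BY id")]
print(balances)  # [100, 100] -- neither update persisted
```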
NoSQL is an umbrella term for several data models, typically developed or popularized in the first decade of the 2000s [38]. Contrary to the relational model, the data models within NoSQL typically have no formal definitions, and different NoSQL DBMSs implement different data models such as key-value (e.g., Redis), document (e.g., MongoDB), wide-column (e.g., Cassandra), and graph (e.g., Neo4J) [22,71]. Furthermore, these DBMSs often have a distinct query language developed to cater to the particular data structures available in the DBMS's implementation of a data model. While RDBMSs have favored data consistency [9] by eliminating redundant data through logical database design, and through a strong consistency model, NoSQL DBMSs have generally adopted the opposite approach. In several NoSQL data models such as key-value pairs and documents, redundant data is stored at the cost of storage space [43]. This approach enables query languages to be simple [25], avoiding complex and potentially slow queries. Furthermore, consistency models are typically less strict than in RDBMSs [73], which facilitates the higher performance demanded by, e.g., web applications with a large number of concurrent users [70].
Although NoSQL DBMSs popularized several database-related approaches such as non-strict database structures, data availability over data consistency, and relatively effortless database replication (i.e., data is copied over computing nodes) and sharding (i.e., data is divided between computing nodes) [38], some industry leaders such as Google deemed a strong consistency model and an expressive query language important enough to design a DBMS which incorporates features from both RDBMSs and NoSQL DBMSs [19]. These so-called NewSQL DBMSs use the relational model, often with extensions, SQL as their primary query language, and a distributed database architecture [64]. In addition to these three main categories of RDBMS, NoSQL, and NewSQL data models, others such as object stores [148] and GPU-intensive systems [190] are used in specific contexts.

Query Execution
The word query typically refers to query language statements that retrieve some data from the database. However, in this study, we use the word query to refer to any data retrieval or manipulation statement for brevity. When it is necessary to differentiate between data retrieval and manipulation, we use appropriate terms such as read operations for data retrieval, and write operations for data insertion, updates, and deletes.
In this subsection, we describe how queries are executed, using mainly general (i.e., not specific to a single DBMS) literature from the domain of RDBMS query execution.
When a user, be it a human actor directly using a terminal, a transaction processing software application, or database benchmark software, submits a query to a DBMS, a multitude of events must take place before the user receives feedback. As illustrated in a general fashion in Fig. 1, the query parser checks, among other things, that the query is syntactically valid [44]. If the query passes these (and other) checks, the query is translated into a lower-level presentation and passed to the query optimizer. The optimizer generates one or several query execution plans. These plans consist of physical operators that implement, e.g., which physical data structures will be utilized in executing the query, and in RDBMSs in particular, how tables are joined together [36]. If several plans are generated, the optimizer evaluates which of these plans is the most effective in regards to, e.g., query execution time [44]. The accuracy of the optimizer relies on aspects such as database metadata [12], statistics of previous query executions, and the indices available [10]. Generating effective query execution plans is a complex effort and takes time [36,10], but once formulated, the plans can be re-used to a degree.
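The parse-then-plan pipeline described above can be observed from the outside. In the sketch below, SQLite (used only as a convenient illustration) first rejects a syntactically invalid statement at the parser stage, and then exposes the optimizer's chosen physical operation for a valid query via EXPLAIN QUERY PLAN; most DBMSs offer a comparable facility, e.g., EXPLAIN in PostgreSQL and MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, v TEXT)")

# 1) The parser rejects syntactically invalid statements before any
#    execution plan is generated.
try:
    conn.execute("SELEC v FROM t")  # deliberate typo
except sqlite3.OperationalError as e:
    parse_error = str(e)

# 2) For a valid query, the plan chosen by the optimizer can be inspected.
plan = conn.execute("EXPLAIN QUERY PLAN SELECT v FROM t WHERE id = 1").fetchall()
print(parse_error)  # the parser's syntax error message
print(plan[0][3])   # the physical operation chosen: a primary-key search
```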
Next, the query execution engine implements the query execution plan, using the physical operators therein. Simplified, the data objects required by the query are typically first searched from a memory area called the buffer pool, which is allocated and maintained by the DBMS. If some or all of the data is not found, the data is requested from disk. Before accessing the disk, many systems may additionally utilize other areas of memory to avoid disk access [86].
Effectively all database systems function in an environment where multiple concurrent end-users use the database. This concurrency presents challenges particularly when the users execute write operations on the same database, e.g., when two or more users withdraw money from the same bank account, concurrently updating the balance [5]. To guarantee that the write operations do not interfere with each other in a way that would cause the data to not represent the real world, DBMSs typically implement concurrency control through locking or versioning data. Effectively, the simpler implementations of locking restrict data objects or larger data structures from being accessed by other operations while the data objects are being modified [44]. These locking mechanisms may be implemented to ensure that no anomalies happen, or with implementations that theoretically allow some anomalies [4]. Typically, the business domain dictates what types of anomalies are tolerated.
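Write locking can be demonstrated with two connections to the same database. In the sketch below, SQLite is again used only as an illustration; it locks at the granularity of the whole database file, whereas server-class DBMSs typically lock at a finer granularity such as rows or pages, but the principle, one writer blocking another, is the same.

```python
import os
import sqlite3
import tempfile

# Two connections to the same on-disk database file.
path = os.path.join(tempfile.mkdtemp(), "demo.db")
writer = sqlite3.connect(path, isolation_level=None)  # explicit transactions
writer.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER)")
writer.execute("INSERT INTO account VALUES (1, 100)")

writer.execute("BEGIN IMMEDIATE")  # acquire the write lock
writer.execute("UPDATE account SET balance = balance - 10 WHERE id = 1")

# A competing writer waits briefly for the lock, then gives up.
other = sqlite3.connect(path, isolation_level=None, timeout=0.1)
try:
    other.execute("BEGIN IMMEDIATE")
    blocked = False
except sqlite3.OperationalError:  # "database is locked"
    blocked = True

writer.execute("COMMIT")  # releasing the lock lets other writers proceed
print(blocked)  # True
```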
Finally, as strong consistency models often require that transactions persist in the database and that the operations in a transaction either all succeed or all fail, locking is typically complemented by transaction logs. These logs are written before write operations are committed to the database, and can be used to reverse earlier write operations if a later write operation in the same transaction fails. All the considerations discussed in this section play a significant role from a performance perspective, which is discussed in the next section.
Further reading on database systems: for readers interested in the basics of database systems, the undergraduate-level textbooks by Connolly and Begg [17] and by Elmasri and Navathe [31] are excellent albeit lengthy introductions covering the topic from several points of view, with a focus on RDBMSs. For readers interested in query processing, we point to studies by Chaudhuri [10], and Hellerstein, Stonebraker, and Hamilton [44]. If you are interested in logical relational database design, the book by Date [21] is an in-depth resource covering both formal and informal approaches. For a survey of literature on NoSQL data models, the study by Davoudian, Chen, and Liu [22] is an accessible starting point.

Performance Measurement
In general, performance is a measurement of how efficiently a software system completes its tasks. Performance is typically measured in response time, throughput [44], or in some cases, utilization of computing resources [20, p.4]. Response time is the time taken for a call in the system to traverse to some other part of the system and back. This is also sometimes called latency [39, p.10], and in the context of database systems, the response time may be measured as the time to the first or the last result item [36]. In the broad perspective described in Fig. 1, the response time might be the time taken after the end-user sends a request to the software application (e.g., an online store), which passes the request to a DBMS, which returns a set of data to the software application, which finally presents the data to the end-user's device.
In database benchmarking, however, response time might be measured by running the benchmark on the same device on which the DBMS and the database reside, effectively eliminating inter-device-induced performance drawbacks such as network latency [62,23] and firewalls, and mitigating the effects of other software running on the devices. Although DBMSs perform other tasks besides querying, querying is typically what is measured in DBMS performance testing [26]. While response time is perhaps the least arduous performance metric to measure, it is often not enough for reliable measurement of transaction processing environments [26] (often dubbed online transaction processing, OLTP). That is, response time might be a metric better suited for long-running queries in decision support environments (often dubbed online analytical processing, OLAP); as transaction processing environments often process a large number of concurrent transactions, response time alone might not reliably account for the effects of concurrent transactions, unless it is measured as an average over multiple concurrent transactions.
Performance can also be measured by throughput, i.e., how many transactions the DBMS can execute in a given time frame. Throughput is often expressed as transactions per second [26] and requires a more sophisticated approach, e.g., benchmarking software. Again, throughput may be measured either locally (i.e., using only the hardware the DBMS and the database reside on), or over a network in case the database is distributed. Alternatively, throughput may be measured by connecting the benchmarking software to the software application, which simulates the throughput of the whole database system by accounting for, e.g., the network and the software application [e.g., 54,75]. Such an approach arguably requires significantly more investment, but provides a holistic perspective on the performance of the whole system, also uncovering potential performance issues unrelated to the DBMS and the database. Finally, performance may be measured by resource utilization: CPU time, I/O, memory allocation, or energy consumption [36], the last of which matters in systems striving for energy efficiency due to, e.g., limited battery power or environmental concerns [40].
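The two main metrics discussed above can be sketched in a toy measurement against an in-memory SQLite database. This is deliberately naive: a real benchmark would add concurrent clients, warm-up runs, and statistically sound repetition, which is precisely the gap this study repeatedly observes in the primary studies.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE kv (k INTEGER PRIMARY KEY, v TEXT)")
conn.executemany("INSERT INTO kv VALUES (?, ?)",
                 [(i, f"value-{i}") for i in range(10_000)])

# Response time: wall-clock time of a single query, first to last row.
start = time.perf_counter()
conn.execute("SELECT v FROM kv WHERE k = 1234").fetchall()
response_time = time.perf_counter() - start

# Throughput: completed operations divided by elapsed time.
n = 1_000
start = time.perf_counter()
for i in range(n):
    conn.execute("SELECT v FROM kv WHERE k = ?", (i,)).fetchall()
tps = n / (time.perf_counter() - start)

print(f"response time: {response_time * 1e3:.3f} ms, throughput: {tps:.0f} queries/s")
```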
In summary, we might consider the measurement of throughput a process that typically requires a simulation of some level, and the measurement of response time an exact or approximated mathematical method. The former approach requires relatively high investments into the development of such simulations [20, p.142], while the latter often relies on a set of assumptions that do not necessarily reflect real-world scenarios, due to inaccuracies in predicting what the real-world scenario ultimately is and how it can change.

Factors Affecting Performance
Hardware: An intuitive factor in performance is the power of the hardware [60, p.1], and while it is true that most of the local response time is attributed to time taken by CPU processing, memory and disk access, and software waiting for other tasks to complete [20, p.5], investing first in software performance rather than hardware performance is often more cost-effective. That being said, it is generally accepted that memory access is at least four orders of magnitude faster than disk access [e.g., 39, p.42]. That is, if memory access takes minutes (nanoseconds), disk access takes months (milliseconds). These numbers are largely dependent on the speed of the memory and the type of disk, but paint a picture of how zealously DBMS optimization strives to minimize disk access. Since memory is typically more expensive than disk storage, keeping the whole database in memory is often not feasible. Additionally, the underlying hardware is important, as, e.g., some DBMSs have been shown to utilize multi-processor or multi-core environments more effectively than others [81]. Intuitively, how well a DBMS can exploit parallelism affects the performance of query execution [76,80]. Ultimately, performance comparisons across hardware are about relative gains or losses in percentages, not absolute figures such as response times.
Data models: The data models described in Section 2.2 have indirect effects on DBMS performance. Relational databases often follow design guidelines that strive to minimize redundancy, both to eliminate potential data anomalies caused by redundant data [15,16] and to minimize the need for storage space; this in turn typically causes queries to run slower due to a larger number of table joins. In contrast, different NoSQL data models, especially key-value, document, and wide-column, follow design guidelines according to which data structures are designed to effectively satisfy predetermined business logic queries, with the elimination of redundant data being a secondary concern [22]. It follows that because many NoSQL data structures are designed to serve queries, queries are typically simple [25], and their execution requires less computational resources than complex queries in relational databases. As discussed in Section 2.3, locking data objects (both on disk and in memory, and both primary data structures as well as indices), logging write operations, and how memory is managed by the DBMS all play a significant role in DBMS performance [44,73]. For example, preventing write operation-induced anomalies is a costly action, and the granularity of database locks presents significant considerations for write operation performance, the weight of which is largely dictated by the ratio of read and write operations.
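The redundancy trade-off above can be sketched side by side. Below, a normalized relational design needs a join to answer a business query, while a document-style design duplicates the customer name into each order so the same predetermined query needs no join, at the cost of redundant data that must be kept consistent on update. The schema and field names are hypothetical, and SQLite plus a plain dictionary stand in for an RDBMS and a document store.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")

# Normalized relational design: no redundancy, but reading an order
# together with its customer name requires a join.
conn.executescript("""
CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY,
                     customer_id INTEGER REFERENCES customer(id),
                     total REAL);
INSERT INTO customer VALUES (1, 'Ada');
INSERT INTO orders VALUES (10, 1, 25.0);
""")
joined = conn.execute(
    "SELECT o.id, c.name, o.total "
    "FROM orders o JOIN customer c ON c.id = o.customer_id"
).fetchone()

# Document-style design: the customer name is duplicated into each order
# document, so the predetermined query is a single key lookup.
documents = {10: json.dumps({"order_id": 10, "customer_name": "Ada", "total": 25.0})}
doc = json.loads(documents[10])

print(joined)                # (10, 'Ada', 25.0)
print(doc["customer_name"])  # Ada
```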
Distribution: Write operations in distributed configurations pose non-trivial challenges to both performance and data consistency [23]. In distributed database systems, effectively all transactions must choose between data consistency and data availability [7,35]. The former guarantees that the data the end-user receives is not stale, at the cost of performance, while the latter guarantees to a degree that the end-user receives data faster, but with no guarantee that the dataset received is the most recent. The preferred approach is largely dictated by the business logic.
DBMS and OS parameters: Moving from data models and database system distribution to lower levels of abstraction, operating system (OS) and DBMS parameters and their interrelationships (e.g., page size) can have direct or indirect effects on performance [26]. Additionally, DBMS parameters such as the amount of memory the DBMS is allowed to use for data processing are typically closely tied to the amount of memory available. Furthermore, as a query is sent to the optimizer (cf. Fig. 1), it depends on the DBMS internals how efficiently the optimizer can select the most efficient physical operations to implement the query, and what physical operations are available to the optimizer in the first place [10]. For example, MySQL implemented only one physical operation for table joins until 2018, limiting the number of options the optimizer could choose from. Regarding query optimization, the optimizers of RDBMSs in particular are relatively mature and can spot some unnecessary complications in queries, while overlooking others [6]. Despite the benefits brought by the optimizers, some queries are inherently slow and can only be optimized through query rewrites.
Physical database design: Last, but definitely not least, physical database design plays a key role in DBMS performance. It has been argued that performance bottlenecks are difficult to find in large systems [2], and that efficiency is gained by focusing on the vital few areas instead of the trivial many [51, p.450]. One of the most vital areas in database systems is physical design. In relational databases, efficient physical design is largely achieved through indices, and in NoSQL databases, typically through database distribution over computing nodes. In contrast to a holistic system overview, performance bottlenecks may be easier to find in queries, since many DBMSs provide detailed information on query execution (Fig. 2). PostgreSQL (Fig. 2a) lists the physical operations used to execute the query, which of the operations took the most time, and which indices, if any, were used. For example, it can be seen in Fig. 2a that the sequential scan on line 12 accounted for approximately 94% of the execution time of the whole query (178 ms out of 189 ms), probably because the query fetched a large number of records from the database. The query could be optimized by, e.g., selecting a smaller number of records and showing the results to the end-user by paging them, i.e., showing a subset of results first and fetching more later if necessary. In NoSQL systems, the query optimizer plays a smaller role due to the typically less expressive query languages (cf. Fig. 2b). Some NoSQL systems such as Cassandra do not permit the execution of queries that do not utilize the physical structures effectively.
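The effect of an index on the chosen plan, and the paging optimization mentioned above, can be sketched as follows. SQLite is used purely for illustration (the paper's Fig. 2 uses PostgreSQL); the table and index names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(i, i % 100, float(i)) for i in range(1_000)])

def plan(sql):
    """Return the physical operation the optimizer chose for a query."""
    return conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()[0][3]

query = "SELECT * FROM orders WHERE customer_id = 42"
before = plan(query)  # without an index: a full table scan
conn.execute("CREATE INDEX idx_customer ON orders(customer_id)")
after = plan(query)   # with an index: an index search
print(before)
print(after)

# Paging: fetch a subset of matching rows first, more later if needed.
page = conn.execute(query + " ORDER BY id LIMIT 10 OFFSET 0").fetchall()
```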

Database Performance Benchmarks
There are several database performance benchmarks available, each typically consisting of a sample database and a workload that simulates how the database could be used [27,68]. The benchmarks usually measure the efficiency of querying while taking into account factors such as concurrency, but disregard other DBMS tasks such as efficiency in data structure definition or bulk loading [26].

In the more general DBMS domain, the Yahoo! Cloud Serving Benchmark (YCSB) is a framework for benchmarking transaction processing in systems with different data models and architectures [18]. Due to its extensibility, YCSB can be adapted to different NoSQL data models. YCSB contains different workloads, each with a different ratio of read and write operations. YCSB and its extensions such as YCSB+T typically utilize transactions which consist of single operations and do not enforce strong consistency [68,25]. The benchmarks described above are by no means an exhaustive list but cover the most popular benchmarks (cf. Section 2.1). Other benchmarks include LUBM [41], OLTP-Bench [27], and JOB [55]. Regardless of the data model and DBMS, transaction processing benchmarks have typically been the de facto method of comparing different DBMSs and hardware [80].
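The read/write-ratio idea behind YCSB workloads can be sketched in a few lines. The generator below is an illustrative toy in the spirit of YCSB, not the YCSB implementation itself: each "transaction" is a single read or write against a hypothetical dict-like store, and the ratio is a workload parameter (YCSB's workload A is roughly an even read/write mix, workload B read-heavy).

```python
import random

def run_workload(store, n_ops, read_ratio, n_keys, rng):
    """Execute n_ops single-operation transactions with the given read ratio."""
    reads = writes = 0
    for _ in range(n_ops):
        key = rng.randrange(n_keys)
        if rng.random() < read_ratio:
            store.get(key)          # read operation
            reads += 1
        else:
            store[key] = "payload"  # write operation
            writes += 1
    return reads, writes

store = {}
reads, writes = run_workload(store, 10_000, 0.95, 1_000, random.Random(42))
print(reads, writes)  # roughly a 95/5 read/write split
```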
Further reading on performance: for readers interested in physical database operations and query execution from a performance perspective, Graefe [36] provides an in-depth, DBMS-independent survey. For more information on physical database design, especially indices and how they work, the book by Lightstone, Teorey, and Nadeau [56] is a detailed and descriptive source. For a practical and concise guide on SQL query optimization, we point readers towards Winand's book [85]. Regarding NoSQL DBMS optimization, we suggest referring to the manual of the DBMS of your choice, always making sure that the source of information is current, as NoSQL systems tend to evolve rapidly.

Process and Criteria
The DBMSs in this study were selected based on the selected primary studies. That is, we did not choose, e.g., the most popular DBMSs to include, but reported the DBMSs yielded by the primary studies. The results herein may thus be considered to reflect the most popular DBMSs in terms of benchmarking reported in scientific studies. Fig. 3 describes the primary study selection process, starting from ACM Digital Library, IEEE Xplore, and ScienceDirect, complemented by subsequent Google Scholar searches. The search strings are detailed in Table 1. To account for potentially missed relevant studies, we conducted three rounds of backward snowballing (i.e., following the lists of references in selected studies), until snowballing revealed no additional studies. A total of 117 primary studies comparing DBMS performance were selected.
Table 2 describes the inclusion criteria applied in the primary study selection. The first four criteria are related to bibliographic details, while the last three criteria concern article focus and content. Regarding criterion #3, we excluded academic theses and dissertations [e.g., 13] as they are typically not peer-reviewed. We also excluded white and gray literature for the same reason, and because those studies are often written or published by partial parties, e.g., DBMS vendors. We only selected studies that compared query (i.e., retrieving or modifying data) execution performance, disregarding, e.g., database replication performance [32] or the performance of different join operations [53]. We also excluded studies that compared the performance of a single DBMS under different configurations such as hardware, replication strategy, database structure, or query language [45], and studies that compared a DBMS with different data-related platforms [65]. Studies that reported pseudonymized DBMS names were also excluded. Finally, we only included studies that reported results based on at least seemingly objective metrics and empirical results. That is, studies simply stating the opinions of the authors, such as "based on our experiences, we believe MySQL is faster than SQL Server", were not considered.

Selected Studies
The selected 117 primary studies compared the performance of a total of 44 different DBMSs. We categorized these DBMSs into three top-level types defined and discussed in Section 2.2: RDBMSs, NoSQL systems, and NewSQL systems. Five DBMSs not clearly pertaining to any of these three categories were categorized under other systems (Table 3). It is worth noting that these DBMS types are not always clear-cut due to the lack of specificity and changing nature of the definitions, and they should be interpreted merely as a means to compartmentalize the results of this study into a more readable form. Five selected primary studies did not report results implying the performance of one DBMS over another [168,182,151,118,138].
Fig. 4 shows the distribution of publication years and the types of DBMSs discussed in the selected studies. Although our criteria allowed for studies from the year 2000, the first studies selected were published in 2008. The figure shows that, generally, there is a somewhat constant number of DBMS performance comparison studies each year. It is worth noting that one study may pertain to several types of DBMSs.

Performance Comparison Results
The most popular DBMS performance comparisons compared one or several RDBMSs to one or several NoSQL systems, one NoSQL system to another NoSQL system, or one RDBMS to another RDBMS. A total of 48 studies compared solely read performance, while 6 studies compared solely write performance. The rest of the studies compared both read and write performance, with the exception of two studies [111,164] for which it was unclear whether they compared write operations. All comparisons and their results per DBMS type are summarized in Fig. 5. Fig. 6 presents an overview of which DBMSs and DBMS types the primary studies compared. The figure conveys how both other and NewSQL systems are typically compared within their respective DBMS type groups, while RDBMS and NoSQL systems are compared both within their respective groups and with each other. Additionally, the sizes of nodes such as MongoDB, Redis, Cassandra, and MySQL show that these DBMSs typically outperform the DBMSs they are compared to. Due to their length, the detailed results from the primary study comparisons are presented in Appendix A, which includes tables detailing which DBMSs outperformed which.
Regarding the benchmarks defined in earlier scientific literature, the most popular was YCSB, which was utilized by 15 primary studies (approximately 13%) [90,91,92,125,183,185,201,93,142,191,192,147,102,134,174]. The second most popular benchmark was the TPC-H benchmark and its variations, utilized by five primary studies (4%) [99,122,167,190,196]. It is worth noting, though, that two of the studies [167,196] seemed to have merely executed the queries of TPC-H, instead of running the benchmark and accounting for, e.g., the effects of concurrent transactions. One primary study utilized the OLTP-Bench benchmark [193], one the LUBM benchmark [124], and one, in addition to TPC-H, the JOB benchmark [190]. Regarding the benchmarks formulated by the primary study authors, 25 primary studies (21%) reported using ad hoc queries instead of earlier defined benchmarks to compare the performance of DBMSs. These queries were defined verbatim in the primary studies. In contrast, 70 of the primary studies (59%) compared DBMS performance using undisclosed ad hoc queries, likely formulated by the study authors. In other words, 22 primary studies (19%) used some type of earlier defined database benchmarking suite. The performance tests of these 22 primary studies and the aspects of the environment they reported are detailed in Table 4.

General Discussion
The difficulty of rigorous performance testing is perhaps one of the root causes of why optimization is difficult, and several studies have highlighted the complexity of performance testing due to, e.g., the effects of DBMS parameters [66], testing environment settings [83], and how well the data in the performance test database reflects the real application data [67]. It is also important whether an impartial actor has carried out the performance test, or whether the test results are published, e.g., by a DBMS vendor [24]. However, this is sometimes difficult to assess, and can be mitigated by simply reporting the test explicitly so that it can be replicated and verified by others.
Despite the fact that we were aware of some DBMS performance comparison studies, as they have been touched on in previous works, we were surprised by the extent to which the few examples presented in the previous works [69,83] generalize to so many studies on the subject. For example, in read operations, MongoDB outperforms Cassandra according to ten studies, Cassandra outperforms Redis according to four studies, and Redis outperforms MongoDB according to six studies (cf. Appendix A), leading to a situation of Mo > Ca > Re > Mo, where MongoDB is both the best and the worst performing DBMS. Furthermore, as discussed in Section 5, few of the selected studies reported the test setting in enough detail for replication. Unfortunately, without sufficient details for replicating an experiment, such experimental results can claim any outcome [69]. One aspect that was typically reported was some details about the hardware the test was run on, i.e., processor make and model, clock rate, memory size, and disk size. Without other details about the DBMS parameters, parallel execution, etc., these details are inconsequential. Despite the importance of the topic of DBMS performance comparisons, with the exception of one study [174], no primary studies were published in major data management fora such as ACM SIGMOD or VLDB.

Consider the Environments in Performance Testing Studies
Table 4: An overview of primary studies using previously defined benchmark software and which aspects of the testing environment they explicitly disclosed; performance measurements abbreviated as ET (execution time) and TP (throughput)

If the environment in which the performance testing was carried out is not reported in sufficient detail, then, whatever the study states, it is safest to interpret the results as if they do not generalize to other environments. That is, if you are in the process of deciding on a DBMS for your application, or perhaps considering changing one DBMS to another, consider whether the performance comparison study you are reading presents a similar use case. Compare your business domain to that presented in performance comparison studies, remembering that a single, sometimes even seemingly inconsequential parameter (cf. e.g., data types in SQLite [66]) may change the results. DeWitt and Levine [24] aptly describe performance comparisons as the maximum potential performance gain of one DBMS over another. The performance gain in your particular environment might be less, or the DBMS that performed better in the comparison might perform worse in your environment. One important aspect of the environment is the physical setup. Different hardware has been shown to affect DBMS performance, as some DBMSs exploit parallelism more efficiently than others [58,48], effectively meaning that if a test was performed on one
single-core CPU, the results might not generalize to distributed environments. Additionally, different hardware aspects such as the relative sizes of different CPU memory caches may significantly affect DBMS performance, making performance comparisons between different hardware a complex task [1,83]. In distributed environments, which were rarely tested in the primary studies, it is worth considering whether data availability is prioritized over data consistency, as the latter setup is typically significantly slower. Benchmarks that simulate concurrent users should also be considered separately from performance tests that merely execute queries sequentially. Concurrency introduces several challenges, many of which severely affect performance [83]. For example, SQLite uses database locking on a level of granularity which makes concurrent writes slow, but this has no negative effects on single-user writes [59]. Unfortunately, some studies have shown that developers do not widely understand concurrency-related security aspects [84], and that concurrency-related performance problems are understudied [87]. Some have even stated that the research has not been focusing on relevant issues [63].
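The coarse-grained write locking of SQLite mentioned above can be observed directly from Python's standard library; the following sketch (table and file names are illustrative) shows how a second connection attempting a concurrent write is locked out until the first writer commits, while a single writer is unaffected:

```python
import os
import sqlite3
import tempfile

# A file-backed database is needed: two connections to ":memory:" would
# open two separate databases.
path = os.path.join(tempfile.mkdtemp(), "demo.db")

# isolation_level=None: manage transactions manually with explicit BEGIN/COMMIT;
# timeout=0: fail immediately instead of waiting for the lock to be released.
writer = sqlite3.connect(path, timeout=0, isolation_level=None)
other = sqlite3.connect(path, timeout=0, isolation_level=None)

writer.execute("CREATE TABLE t (x INTEGER)")
writer.execute("BEGIN IMMEDIATE")           # take the database-wide write lock
writer.execute("INSERT INTO t VALUES (1)")

try:
    other.execute("BEGIN IMMEDIATE")        # second concurrent writer is locked out
    locked_out = False
except sqlite3.OperationalError:            # "database is locked"
    locked_out = True

writer.execute("COMMIT")
other.execute("INSERT INTO t VALUES (2)")   # succeeds once the lock is released
```

A sequential, single-user workload never hits this contention, which is why single-connection benchmarks say nothing about a DBMS's behavior under concurrency.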
Intuitively, different business domains have different databases and use them in different ways. For example, in some domains the end-users typically read data, while in others write operations are more common. The ratio of read and write operations in a performance test plays a crucial role, as some DBMSs are specifically designed for specific workloads [18]. The credibility of testing results is also related to how well the test database and the data therein represent the target environment [67]. Furthermore, in business domains such as online stores, there are typically popular products, and thus the data related to them are targets of a relatively large number of database operations. For generalizable benchmarking results, the benchmark must account for such skewness in database use, rather than, e.g., querying data objects at random. It is also worth considering how the performance tests have tested performance. For example, does the benchmark insert 10,000 rows in bulk, while your application inserts one row at a time as the data is generated? If the workloads do not match, you should not take such benchmark results as an indication of how well one DBMS performs compared to another in your particular business context. It is also worth considering that decision support benchmarks such as TPC-H test performance in environments that can be fundamentally different from transaction processing environments. Finally, even similar business domains can have a myriad of different technical implementations.
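The difference between uniform and skewed access patterns can be illustrated with a small sketch in the spirit of YCSB's request distributions (the key space size and the Zipf-like exponent below are illustrative assumptions, not YCSB's exact parameters):

```python
import random

N_KEYS = 1000
REQUESTS = 100_000
random.seed(42)  # deterministic sampling for reproducibility

# Zipf-like popularity: the r-th most popular key has weight 1/r.
weights = [1.0 / rank for rank in range(1, N_KEYS + 1)]

# Uniformly random key selection vs. skewed ("hot key") selection.
uniform = [random.randrange(N_KEYS) for _ in range(REQUESTS)]
skewed = random.choices(range(N_KEYS), weights=weights, k=REQUESTS)

# Share of requests hitting the 10 most popular keys (ids 0..9).
hot_uniform = sum(k < 10 for k in uniform) / REQUESTS
hot_skewed = sum(k < 10 for k in skewed) / REQUESTS
```

Under the uniform workload the ten hottest keys receive roughly 1% of the requests, whereas under the skewed workload they receive a large multiple of that, which changes what a DBMS's caches and indices actually have to serve.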
We have discussed some of the particulars involved in database system design in this subsection, and in Sections 2 and 3, from which one can infer what has often been repeated in database system research: environments and their optimization are so complex [36,18] that DBMS optimization is a whole profession [69]. It follows that there are several threats to rigorous DBMS benchmarking. Even though RDBMS optimization is widely and deeply studied in both academic and industry contexts, it remains a complex task. In the domain of NoSQL DBMSs, there exist far fewer scientific studies, simply due to the age of the NoSQL DBMSs and the heterogeneity of NoSQL data models. Additionally, there are several querying anti-patterns to avoid, such as performing joins in the software application instead of the DBMS, or paging query results by utilizing ordering and limiting. All these points considered, a reader of a performance comparison study must trust that the study authors have been able to optimize the database systems to a similar degree for the performance comparison results to be credible. This requires particularly specific, in-depth expertise when DBMSs with more than one data model are compared. Furthermore, decades of benchmarking software development by entire councils (e.g., TPC) cannot simply be skipped by writing a set of (often arbitrary) queries, running them on two or more DBMSs in a single-user environment, recording response times, and consequently stating that one DBMS is faster than another. Although this was the case in over 80% of the selected primary studies, we do not consider it sufficient.
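The paging anti-pattern mentioned above can be contrasted with keyset (seek) pagination in a minimal SQLite sketch; the table and column names are illustrative. Offset-based paging forces the DBMS to produce and discard all skipped rows on every page, whereas keyset pagination seeks directly past the last row of the previous page using an indexed column:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")
con.executemany(
    "INSERT INTO items (name) VALUES (?)",
    [(f"item{i}",) for i in range(100)],
)

# Anti-pattern: the DBMS materializes and throws away the first 20 rows.
page3_offset = con.execute(
    "SELECT id FROM items ORDER BY id LIMIT 10 OFFSET 20"
).fetchall()

# Keyset pagination: seek past the last id of the previous page (here 20),
# so the primary-key index positions directly at the page start.
page3_keyset = con.execute(
    "SELECT id FROM items WHERE id > ? ORDER BY id LIMIT 10", (20,)
).fetchall()
```

Both queries return the same page, but the cost of the offset variant grows with the page number, which matters for deep paging under real workloads.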
In summary, if changing even one of the environmental aspects discussed above may affect the performance test results significantly, it seems reasonable to argue that, no matter how many DBMS performance comparison studies state that one DBMS outperforms another, these DBMSs were not tested in an environment that is the same as your environment, and the results should thus carry little weight in deciding which DBMS is performance-wise the best fit for your environment.

Consider Other Aspects Besides Performance
There are other aspects besides response time or throughput to consider when choosing a DBMS. Performance gains, such as those provided by many NoSQL systems, rely heavily on redundant data to minimize the complexity of queries, thus providing faster response times. Naturally, storing redundant data increases the cost of storage, and may lead to data inconsistencies. Another comparison perspective is related to the features provided by the DBMSs compared. Intuitively, a DBMS that is tailored for a specific purpose outperforms a general-purpose DBMS [69,74]. For example, one primary study [104] noted that while MongoDB outperformed PostgreSQL/PostGIS in most of the tests, MongoDB provides only a subset of the geospatial operations provided by PostGIS. If the rest of the operations needed by the business domain have to be implemented in the software application, it is not realistic to assume that such a task is trivial to implement, let alone trivial to implement in a way that outperforms the solutions offered by existing DBMS features.
Another consideration is the availability of a suitable workforce, which is closely related to the DBMS technology and its maturity. It is not surprising that as query languages such as SQL have been a topic of effectively all information technology-related curricula in higher education for several years [50,78], there is a relatively large number of professionals fluent in SQL, as opposed to newer query languages. Some studies have also shown that strong consistency models [19] and the SQL language [8] are desired both as skills and as features in a DBMS. That is, it is worth considering how feasible it is to implement a database system with each specific technology; DBMS performance is only one of the important aspects to consider.
Finally, as the primary studies typically considered performance in terms of response time or throughput, we have approached the topic from a similar viewpoint. However, as discussed in Section 3.1, performance may also be measured by the usage of computing resources, which can be a goal conflicting with response time [10]. It is typical that increasing parallelism through multiple CPUs lowers response time, but increases the total amount of work due to the parallelism overhead [60, p. 13]. Finally, it has been shown that migrating data from one DBMS to another is far from trivial, and prone to fail due to a lack of clear methodologies [77], especially when the DBMSs differ in data models and query languages [52]. Therefore, migrations such as RDBMS ⇔ RDBMS or RDBMS ⇔ NewSQL are arguably less complex than migrations such as NoSQL ⇔ NoSQL, RDBMS ⇔ NoSQL, or NewSQL ⇔ NoSQL.

Consider Using Existing Guidelines for Testing and Reporting
Database benchmarking guidelines are not a novel invention in database system research: they were described in detail [37] and in short [26] in the early 1990s, and as a reader-friendly checklist later [69]. Additionally, benchmarking pitfalls have been discussed in numerous studies in respected database systems fora [83,30]. Based on the primary studies, however, neither of these lines of research has been widely applied in practice. Database benchmarking has been argued to be difficult [69], as environmental parameters such as the nature of the data [67], DBMS parameters [83], and data types [66] can all have significant impacts on performance testing results. Furthermore, benchmarking tools have received critique [71,38] despite the fact that some of the tools have been under development for decades. Therefore, we urge researchers, at the very least, to consider whether using a performance test suite of one's own ad hoc queries is credible when well-known performance benchmarks are freely available.
As for reporting, Raasveldt et al. [69] provide a 24-point checklist for fair benchmarking. Some of the points concern how performance is tested, and others how the testing is reported. A performance comparison that cannot be replicated can claim any outcome [69]. Furthermore, an empirical study without reproducible evidence should be considered an opinion of the authors rather than an empirical study. Indeed, at the start of the NoSQL movement, we witnessed several studies with high praise for the strengths of different NoSQL products, yet with little or no critical discussion of the acknowledged shortcomings of such DBMSs. Therefore, we would caution the reader against inferring from these results that one DBMS performs better than another. Rather, each such argument should be carefully scrutinized and interpreted in a specific context, as in the primary study assessing the performance of GPU DBMSs [190], in which performance between DBMSs was compared, but the comparison was merely one aspect of the study.

Consider a Different Approach to DBMS-DBMS Testing
Especially for a junior researcher, comparing the performance of one DBMS to another may seem like a relatively simple research setting to carry out, and also easy to justify given the prevalence of the DBMS industry. We hope that the arguments presented in previous studies, as well as here, have highlighted that neither of these points is clear-cut. Following the guidelines [e.g., 69] can make performance testing a time-consuming task, in many cases perhaps overly time-consuming, and given the considerations on the generalizability of the results, the results may not be of interest in other environments. Alternatively, not following the guidelines introduces significant threats to validity. While generalizability is hardly an intrinsic value, concluding that, e.g., MySQL outperforms PostgreSQL in "my webstore" but not in others unless they have similar data, hardware, number of end-users, etc., is hardly as scientifically impactful a result as stating that, e.g., MySQL will always outperform PostgreSQL. Therefore, we must either perform the performance comparisons with rigor and accept that the results probably do not generalize, or perform the comparisons without scientific rigor and state sophisms. Since the latter is hardly ethically sound, DBMS performance comparisons should be limited to domains where the goal of a study is not the generalizability of the results, but the betterment of the very particular domain the study is concerned with [e.g., 100].
Given the arguments above, we propose that future studies, if inter-DBMS performance must be compared, consider taking a different approach to performance testing. First, use a wide range of database system optimization experts to ensure that all aspects of each system are fairly optimized, avoiding situations where one system is optimized beyond diminishing returns while the other is barely optimized. We challenge research teams to explicitly disclose which authors optimized which systems, so that each author has a stake in the performance of the system they optimized. These solutions should be benchmarked by a party independent of both optimization teams, and fair benchmarking guidelines should be utilized. Second, after the benchmarking has been carried out, we urge researchers to consider what causes the differences in performance, and to critically compare those aspects as well, as gains in performance arguably have root causes such as loosened consistency or increased storage space. Nonetheless, performance comparisons of two or more DBMSs with different data models should be considered particularly complex. Unfortunately, such comparisons seem to be the most popular (cf. Fig. 5).

Consider Other Use Cases Besides DBMS-DBMS Testing Altogether
It is worth noting that benchmarking software has other use cases besides inter-DBMS performance comparisons. Instead of comparing one DBMS to another, researchers might consider testing the performance effects of different hardware [28], DBMS parameters [83], operating system parameters, query languages [45], physical configurations such as database distribution, physical structures such as different indices, or different levels of data consistency.
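As a small illustration of the last kind of study, the effect of a physical structure such as an index can be observed even without timing, by inspecting the query plan before and after creating the index; the sketch below uses SQLite's EXPLAIN QUERY PLAN, and the table, column, and index names are illustrative:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (a INTEGER, b TEXT)")
con.executemany("INSERT INTO t VALUES (?, ?)", [(i, "x") for i in range(1000)])

def plan(sql):
    # EXPLAIN QUERY PLAN rows are (id, parent, notused, detail);
    # only the human-readable detail column is of interest here.
    return [row[3] for row in con.execute("EXPLAIN QUERY PLAN " + sql)]

before = plan("SELECT * FROM t WHERE a = 5")   # full table scan
con.execute("CREATE INDEX idx_a ON t(a)")
after = plan("SELECT * FROM t WHERE a = 5")    # index search
```

The plan changes from a scan of the whole table to a search using idx_a, which is the kind of structural comparison that, unlike a raw response-time number, transfers between environments.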

Limitations and Threats to Validity
It may be that some relevant studies are missing from this survey. However, it was not our intention to select primary studies to quantitatively demonstrate that one DBMS outperforms another by the number of studies corroborating such an argument. Rather, the results verify previous observations [69] according to which many such comparisons are problematic and should be interpreted with care, if at all. Nevertheless, we have strived to include at least most of the primary studies that fit our criteria (Table 2) through several rounds of snowballing (Fig. 3) as well as a complementary literature search. Furthermore, as the DBMS classification (Table 3) and the interpretation of the primary study results (Appendix A) involve human judgment, it is possible that another group of researchers would arrive at slightly different results.

Conclusion
Several database management system performance comparisons have been conducted and published both as vendor white-papers and in scientific fora. The approaches and reporting in such studies have been criticized in previous literature. In this study, we systematically surveyed 117 DBMS performance comparison studies. What seemed to be common among the selected primary studies is that they lack sufficient detail for reproducibility. Scientific, peer-reviewed research of high external validity concerning database management system performance comparison is effectively scarce. Based on the survey, we presented several considerations for the industry as well as for database system researchers. Namely, we argued for (i) considering the environments (i.e., business domain, amount of data, number of concurrent users, hardware, database distribution, read/write operation ratio, etc.) when interpreting the results of DBMS performance comparison tests, (ii) considering other aspects besides DBMS performance when choosing a DBMS or changing one DBMS to another, (iii) using existing guidelines in performance testing and reporting the testing environments transparently, (iv) considering different approaches to performance testing when one DBMS is compared to another, and (v) considering other use cases for performance testing besides comparing the performance of one DBMS to another. The results highlight how rarely benchmarking software is used in performance testing, how often DBMSs with different data models are compared with each other, how often performance testing results in different studies conflict with each other, and why. This study is not an attempt to argue the performance gains of one DBMS over another using the primary studies. That is, please do not cite this study by consulting the Appendix and stating that one DBMS outperforms another.
(a) A PostgreSQL query execution plan shows the time units anticipated and taken by each phase of the query execution; (b) a tracing session listing each activity, its source, and the elapsed time

Figure 2 :
Figure 2: Query execution plans illustrating the physical operators such as hash join and seq scan chosen by the optimizer

Figure 4 :
Figure 4: The number of publications by publication year and DBMS type; the year 2022 was only considered until March

Figure
Figure 5: DBMS performance comparisons overview; a directed edge from node a to node b represents the number of studies according to which a system of type a outperformed a system or systems of type b in (r)ead and (w)rite operations, e.g., a NoSQL system outperformed a NewSQL system in read operations in one study, and in write operations in one study; thicker edges visualize the most popular comparisons

Figure 6: An overview of read operation performance comparisons between NoSQL systems (green, upper right), NewSQL systems (yellow, lower right), RDBMSs (red, lower left), and other systems (blue, upper left); a clockwise turning edge from node a to node b depicts node a outperforming node b, and the color of the edge corresponds to the type of the outperforming node, e.g., Caché outperforms PostgreSQL according to one or several studies; the size of a node represents out-degree, i.e., larger nodes have outperformed more systems than smaller nodes

Table 1 :
Search strings

Table 2 :
Primary study selection criteria

Table 3 :
DBMSs discussed in this study divided into four types
¹ The YCSB benchmark defines a single table with n columns (or loose equivalents in non-relational data models)