Evaluation of Relational and NoSQL Approaches for Cohort Identification from Heterogeneous Data Sources in the National Sleep Research Resource

Patient cohort identification across heterogeneous data sources is a challenging task, which may involve a complicated process of data loading, harmonization and querying. Most existing cohort identification tools use a relational database model implemented in SQL for storing patient data. However, SQL databases restrict the maximum number of columns in a table, which necessitates breaking high-dimensional data into multiple tables and, as a consequence, affects query performance. In this paper, we developed two NoSQL-based patient cohort query systems, based on an existing SQL-based system, for cross-cohort querying in the National Sleep Research Resource (NSRR). We used eight NSRR datasets to evaluate the performance of the NoSQL-based and SQL-based systems in data loading, harmonization and querying. Our experiments showed that the NoSQL-based approaches outperformed the SQL-based one and are promising for developing patient cohort query systems across heterogeneous data sources.


Introduction
Patient cohort identification is the selection of patient subgroups satisfying predefined criteria from a large population in Electronic Health Record (EHR) systems [1]. It is important for clinical trial recruitment [2], outcomes research [3] and other research studies [4,5]. However, the process of identifying patient cohorts can be challenging and expensive when the patient data come from heterogeneous sources. Researchers have developed automated tools and systems to query patient cohorts from disparate data sources [6-10].
Most existing cohort identification tools use SQL-based relational databases as the database model for managing patient data. However, one limitation of SQL databases is the restriction on the maximum number of columns that can be stored in a table. As a result, a single table may not be sufficient to store patient data of extremely high dimensionality (i.e., with a very large number of data elements). In such cases, splitting the data into multiple tables is an alternative strategy, but such splitting causes extra data loading effort and affects query performance for data elements spread across multiple tables. NoSQL databases may provide a better choice for handling such high-dimensional patient data.
This paper provides an evaluation of SQL and NoSQL approaches for patient cohort identification across multiple data sources. We compare three patient cohort identification systems, using MySQL, MongoDB and Cassandra as the backend database, respectively. We use eight de-identified patient datasets from the National Sleep Research Resource to compare the performance of data loading and querying across these three systems.

Background

National sleep research resource (NSRR)
Funded by the National Heart, Lung, and Blood Institute, NSRR was designed to share de-identified sleep data obtained from NIH-funded cohort studies and clinical trials with the sleep research community [11]. NSRR provides a web-based data portal [12] that aggregates and organizes signal and clinical data from over 26,000 patient subjects. NSRR has had over 2,500 registered users since its launch in 2014, and to date over 80 terabytes of data have been downloaded by the sleep research community.
Clinical data in NSRR are formatted as comma-separated values (CSV) files by patient visit. Each patient visit has a corresponding CSV file with all the clinical data elements collected for that visit. Note that an NSRR dataset may involve one or multiple visits. For example, the SHHS dataset has two patient visits: shhs1 (1,266 data elements) and shhs2 (1,302 data elements); the CHAT dataset has two visits: baseline (2,897 data elements) and follow-up (2,897 data elements); and the CFS dataset has a single visit: visit5 (2,871 data elements).

Specific challenges for identifying patient cohorts from heterogeneous sources
High-dimensional data: Dealing with high-dimensional data is one of the challenges for patient cohort identification using relational databases, due to the limit on the maximum number of columns in a table. For example, MySQL has a hard limit of 4,096 columns per table, but the actual maximum for a given table may be even lower considering the maximum row size and the storage requirements of individual columns [13-27]. High-dimensional (or column-intensive) data exceeding a single table's capacity need to be split into multiple tables. For instance, the "visit5" table in the CFS dataset needs to be split into three tables, with the de-identified patient identifiers connecting the separated tables (Figure 1). The consequence of such splitting is that querying data elements located in different tables becomes more computationally expensive, since it involves costly join operations over tables and matching of the unique identifiers. Therefore, the query performance may be significantly affected by the split.

*Corresponding author: Licong Cui, Institute for Biomedical Informatics and Department of Computer Science, University of Kentucky, Lexington, USA, Tel: (859) 257-3062; E-mail: licong.cui@uky.edu

Heterogeneous data: Querying heterogeneous data to find patient cohorts is also a challenging task, as disparate data sources may use different representations to express the same meaning. For example, in NSRR, different codings for patient gender are used in disparate datasets: 1 means male and 2 means female in the SHHS dataset, while 0 represents female and 1 represents male in the CHAT dataset. Such coding inconsistencies occur frequently as the number of disparate datasets increases, and thus need to be harmonized to guarantee accurate queries.

There are usually two ways to handle coding inconsistencies. One is to harmonize the inconsistencies in the data loading step, where the source data of each dataset are updated to share uniform codings across all the datasets. The other is to address the inconsistencies in the data query step, where a mapping from the heterogeneous codings in each dataset to the uniform codings is incorporated when the patient cohort identification system performs query translation. In this paper, we adopt the first way and perform harmonization in the data loading step, so that we can evaluate both the data harmonization and the query performance of the SQL- and NoSQL-based systems.

NoSQL databases: NoSQL [28] databases have rapidly emerged as a popular alternative to relational databases for storing, processing and analysing large-volume data. Without a fixed data schema, NoSQL databases are more flexible in dealing with various data sources and formats, and have shown potential in managing big biomedical data [29-31]. For example, Tao et al. [31] developed a prototype query engine for large clinical data repositories utilizing MongoDB as the backend database. Two NoSQL database systems are considered in this work: MongoDB and Cassandra.

1) MongoDB Database System: MongoDB [32] is a free, open-source and cross-platform NoSQL database. It is a mature document-oriented NoSQL database with well-written documentation and large-scale commercial use, and provides rich drivers for multiple programming languages. Two main components of MongoDB are relevant here: the MongoDB query language and the MongoDB data model.

• MongoDB Query Language: As a NoSQL database, MongoDB provides an expressive query language that is completely different from SQL. There are many ways to query documents: simple lookups, sophisticated processing pipelines for data analytics and transformation, faceted search, JOINs and graph traversals.

• MongoDB Data Model: MongoDB stores data as BSON documents, a binary extension of the widely used JSON (JavaScript Object Notation) representation that includes additional types such as int, long, date, floating point and decimal128. BSON documents contain one or more fields, and each field contains a value of a specific data type, including arrays, binary data and sub-documents. Documents that share a similar structure are organized as collections. Collections are analogous to tables in a relational database: documents are similar to rows, and fields are the equivalent of columns.

2) Cassandra Database System: Apache Cassandra [33] is another free and open-source distributed NoSQL database management system, designed to store large amounts of data across multiple servers. Cassandra can be considered a hybrid of key-value- and column-based NoSQL databases.

• Cassandra Data Model: Cassandra consists of nodes, clusters and data centres: a group of nodes (or even a single node) forms a cluster, a group of clusters forms a data centre, and clusters spanning multiple data centres are supported. The main components of the Cassandra data model are keyspaces, tables, columns and rows. A keyspace in Cassandra is a namespace that defines data replication on nodes; a cluster contains one keyspace per node. A table is a set of key-value pairs in which each row, identified by a unique row key, contains a set of columns. The first part of a table's primary key is the partition key, and the rows within a partition are clustered by the remaining columns of the key.

Materials and Methods

To evaluate SQL- and NoSQL-based approaches for patient cohort identification, we adapted the existing MySQL-based NSRR Cross Dataset Query Interface (CDQI) [31-35] and developed two NoSQL-based query systems using MongoDB and Cassandra, respectively. Figure 2 shows the general system architecture of the three systems. It consists of four major components: (i) the database management system; (ii) a Ruby driver for the database management system; (iii) query translation; and (iv) a web-based cross dataset query interface. The database component serves as the data warehouse storing the actual datasets. The web-based query interface receives queries composed by users, which are then translated into statements in the corresponding query language. The Ruby driver executes the translated query statements to retrieve data from the database. After receiving the query results, the interface presents them to the end users.

Web-based query interface
We adapted the code base of the SQL-based NSRR CDQI in Ruby on Rails (RoR) to develop the two NoSQL-based query interfaces. RoR follows the model-view-controller architectural pattern, providing rich interaction with different types of databases and supporting HTML, CSS and JavaScript for developing interactive user interfaces. The query translation, Ruby driver and backend databases were newly implemented for MongoDB and Cassandra, respectively.

Query translation-dynamic generation of database query statement
Each time a user initiates a query through the web-based interface, this query must be automatically translated (so-called query translation) into a statement in the corresponding database query language. We illustrate the MongoDB-based query translation in the following (the MySQL- and Cassandra-based translations are similar).
The dynamic query translation relies on predefined general templates of MongoDB statements, chosen according to the type of query. For example, the general template for querying a range of values for a numeric data element (or field) contains the variables <dataset.name> and <field_n>, representing the specific dataset and the field that the user intends to query, and <field_n_lower_value> and <field_n_upper_value>, representing the user-specified minimum and maximum values of the field, respectively. All the variables in angle brackets are replaced by real values to generate the actual MongoDB statement. For instance, the query "finding patients in the SHHS dataset aged (field_1) from 20 to 80 with height in centimetres (field_2) between 145 and 188" instantiates the template with <field_n_lower_value> = 20 and <field_n_upper_value> = 80 for field_1, and 145 and 188 for field_2.
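Since the template text itself is not reproduced above, the following Ruby sketch shows one plausible form of this translation. The filter shape is a standard MongoDB range query, but it is a hypothetical reconstruction, not necessarily the template CDQI uses verbatim:

```ruby
# Sketch of dynamic query translation for a numeric range query.
# Builds a MongoDB-style filter document of the (assumed) template shape:
#   { "source" => <dataset.name>,
#     <field_n> => { "$gte" => <field_n_lower_value>,
#                    "$lte" => <field_n_upper_value> }, ... }
def translate_range_query(dataset, ranges)
  filter = { "source" => dataset }
  ranges.each do |field, (lower, upper)|
    filter[field] = { "$gte" => lower, "$lte" => upper }
  end
  filter
end

# "Patients in SHHS aged (field_1) 20-80 with height (field_2) 145-188 cm":
filter = translate_range_query("shhs",
                               "field_1" => [20, 80],
                               "field_2" => [145, 188])
# The Ruby driver would then execute something like: client[:nsrrdata].find(filter)
```

The `$gte`/`$lte` operators are MongoDB's standard comparison operators for inclusive range queries; the `"source"` key matches the data model described later, while the field names are illustrative.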

Ruby driver for the database management system
As illustrated in Figure 2, we utilize each of the databases (MySQL, MongoDB and Cassandra) as a data warehouse storing the disparate datasets. All three database management systems support a Ruby driver, which works seamlessly with RoR to interact with the database management system. Taking MongoDB as an example, we use the MongoDB Ruby driver [36] (version 2.4.1), which connects to the MongoDB data warehouse and executes query statements to retrieve patient cohorts satisfying the query criteria.

Data modeling in NoSQL databases
Utilizing NoSQL databases requires data models different from those of relational SQL databases.
• MongoDB: The data schema for MongoDB in this study consists of one database, called "nsrr", and one collection, called "nsrrdata". All eight datasets were integrated into the "nsrrdata" collection. To differentiate records from different datasets, a key-value pair with the key "source" was inserted into each record during import to indicate its source dataset. For datasets with more than one visit, another key-value pair with the key "visitType" was inserted.
• Cassandra: The Cassandra database schema consists of a single cluster, called "nsrrcluster", a single keyspace, called "nsrrdata", and eight tables corresponding to the eight datasets. Similar to MongoDB, one extra column named "visitType" was added for those datasets with more than one visit. A keyspace in Cassandra is a namespace that defines data replication on nodes; the replication strategy for replicas and the replication factor are properties of the keyspace. By selecting the replication strategy, one can determine whether data are distributed across different networks. In this work, we chose SimpleStrategy [37] since the system runs on a single cluster, and since the main purpose of this study is to compare performance rather than fault tolerance, we set the replication factor to one. Another reason we used a single cluster is that a larger number of replicas would also interfere with the data loading time.
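The record tagging used in both data models can be sketched as follows; the keys "source" and "visitType" are those described above, while the patient fields shown are hypothetical:

```ruby
# Sketch of tagging a record with its provenance during import.
# "source" and "visitType" are the keys from the data models above;
# the record fields (nsrrid, gender, age_s1) are illustrative only.
def tag_record(record, source:, visit_type: nil)
  tagged = record.merge("source" => source)
  tagged["visitType"] = visit_type if visit_type # only for multi-visit datasets
  tagged
end

row = { "nsrrid" => 100001, "gender" => 1, "age_s1" => 55 }
doc = tag_record(row, source: "shhs", visit_type: "shhs1")
# Tagged documents can then be bulk-loaded, e.g. (MongoDB Ruby driver):
#   client[:nsrrdata].insert_many(docs)
```

In Cassandra the same information lands in the per-dataset table ("source" is implied by the table) plus the extra "visitType" column, so single-visit datasets such as CFS omit it.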

Comparison of relational and NoSQL databases
We compared the SQL and NoSQL databases in terms of data loading, data harmonization and query performance. For data loading, we compared the time spent importing data into MySQL, MongoDB and Cassandra, respectively. For data harmonization, we compared the number of concepts detected with coding inconsistencies, the detection time and the harmonization time. For query performance, we designed several sets of patient cohort queries composed of a single query concept or multiple query concepts and compared the query times. In the following, each reported time was obtained by performing the corresponding operation five times and taking the average.
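The measurement protocol (five runs, averaged) can be sketched in Ruby; the workload shown is a stand-in, not an actual database operation:

```ruby
# Minimal sketch of the timing protocol: run an operation five times and
# report the mean wall-clock time. A monotonic clock avoids drift from
# system clock adjustments.
def average_seconds(runs = 5)
  times = Array.new(runs) do
    t0 = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    yield
    Process.clock_gettime(Process::CLOCK_MONOTONIC) - t0
  end
  times.sum / runs
end

# Stand-in workload; in the study this would be a load, harmonization or
# query operation against MySQL, MongoDB or Cassandra.
avg = average_seconds { (1..10_000).reduce(:+) }
```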
Data loading: Table 3 shows the time taken to import each dataset into the three database systems. It took MongoDB a total of 419.2 seconds, MySQL 337.0 seconds and Cassandra 330.9 seconds to load the 39,342 records in the eight datasets; MongoDB thus took more time than MySQL and Cassandra for data loading.

Data harmonization:
Although the three systems utilize different databases, the first two steps of data harmonization were identical across them. We were able to detect coding inconsistencies for the same number (43) of concepts within the eight datasets in five seconds. Table 4 shows the time taken to perform data harmonization in each system. It took all three systems over 6 h to complete the harmonization. The runtime complexities were similar, since all three databases need to traverse all the records and update the corresponding column names and values (MySQL, Cassandra) or key-value pairs (MongoDB). Cassandra required the least time to harmonize the data, as it provides the best write performance.

Data integration-loading and harmonization
Integration of disparate datasets into a data warehouse usually involves data loading and data harmonization: • Data loading procedure: In MySQL-based NSRR CDQI, to load the NSRR datasets into databases, we need to perform data pre-processing. A dedicated program is needed to split the data "horizontally" into separate data files and store them in different tables. The detailed procedures for a given dataset are as follows. First, the program reads the CSV file of a patient visit in the dataset, calculates the required number of tables and splits the CSV file into multiple smaller CSV files. Then, the program reads the smaller files individually and imports them into corresponding tables. Apparently, the limitation of maximum table column count in MySQL does increase the complexity from the data loading point of view. Even though each of the eight datasets contains thousands of data elements or columns, importing data into NoSQL databases is fairly straightforward, since (1) following the data model mentioned above, we can easily import all eight datasets into the NoSQL databases; and (2) no data split is needed.
• Data harmonization procedure: We take three steps to harmonize coding inconsistencies before the data can be queried: (i) we run an inconsistency detection program to detect and extract all the inconsistent codings among the different datasets; (ii) we manually harmonize these inconsistent codings into uniform codings and maintain the mappings between them in a CSV file; and (iii) we run another program to update the harmonized codings in the corresponding tables stored in the different databases. All three query systems take similar steps to perform data harmonization.
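Under stated assumptions (hypothetical column names, an identifier key "nsrrid" and an illustrative gender mapping), the loading and harmonization steps above can be sketched as:

```ruby
# Sketch of the two procedures above. Column names, the identifier key
# and the mapping are illustrative, not the exact values used by CDQI.
MAX_COLUMNS = 4096 # MySQL's hard per-table column limit discussed earlier

# (1) MySQL loading: split a wide visit file into column chunks that fit
# the limit, repeating the patient identifier in every chunk so the
# resulting tables can later be joined back together.
def split_columns(header, id_column, max_cols = MAX_COLUMNS)
  data_cols = header - [id_column]
  data_cols.each_slice(max_cols - 1).map { |chunk| [id_column] + chunk }
end

# (2) Harmonization: recode source-specific values to the uniform coding
# (e.g., gender in CHAT: 0 -> 2 (female), 1 -> 1 (male)).
GENDER_MAP = { "chat" => { 0 => 2, 1 => 1 } }
def harmonize_gender(record, source)
  record.merge("gender" => GENDER_MAP[source][record["gender"]])
end

# A hypothetical 5,001-column visit file splits into two MySQL tables:
tables = split_columns(["nsrrid"] + (1..5000).map { |i| "elem#{i}" }, "nsrrid")
rec = harmonize_gender({ "gender" => 0, "age" => 9 }, "chat")
```

Each chunk becomes one MySQL table; the NoSQL loaders skip step (1) entirely, which is why no split is needed there.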

Results
In this section, we first present the results for data loading and harmonization of the eight NSRR datasets, and then we present the comparative evaluation of the three patient cohort query systems using MySQL, MongoDB and Cassandra, respectively. All these evaluations were conducted on a computer with Intel Core i5/2.9 GHz processor and 8 GB RAM.

Data loading and harmonization
We integrated a total of 39,342 patient records from the eight NSRR sleep datasets into MySQL, MongoDB and Cassandra, respectively. Table 2 shows the number of tables needed for each of the three systems: MySQL required twenty tables due to the limit on the table column count, while MongoDB required only one and Cassandra required eight.
We detected coding inconsistencies for 43 query concepts within the eight datasets. These coding inconsistencies were harmonized into uniform codings. Taking the heterogeneous codings for gender as an example, the harmonized coding is: 1-male and 2-female. For those datasets not consistent with this coding, harmonization was performed to update the source data with the harmonized coding.

Query performance: To evaluate the query performance of the SQL- and NoSQL-based systems, we conducted experiments performing patient cohort queries across the eight datasets. Each cohort query consists of one or more query concepts. Three sets of cohort queries were used: the first set involved only one concept, while the second and third sets involved two and four concepts, respectively. Note that due to the limit on the table column count in MySQL, data elements exceeding the limit need to be split into multiple tables. In addition, there may be multiple data elements corresponding to the same query concept. For instance, in the SHHS dataset, three data elements map to the query concept Hypertension, as follows.
• htnderv_s1: Hypertension status based on 2nd and 3rd blood pressure readings or being treated with HTN meds;
• srhype: Self-reported hypertension; and
• htnderv_s2: Derived hypertension classification (based on blood pressure measurements, history of HTN dx and medication use).
Such related data elements may be stored within the same table or across multiple tables. Therefore, a query concept may involve data elements within a single table or across multiple tables in the MySQL-based query system; we refer to query concepts involving data elements across multiple tables as cross-table query concepts. Table 5 presents the time taken for each query using the MySQL-based system. A highlighted time indicates that the corresponding query involves cross-table query concepts in the corresponding dataset. For example, in the SHHS dataset, "Age", "Asthma", "Hypertension" and "Time awake after sleep onset" are cross-table query concepts.
As can be seen from Table 5, when querying Age in the CFS dataset, the query time was relatively short, since Age was a within-table query concept there (Tables 5-7). Even when querying two or more concepts at the same time, as long as they came from the same table, the query times were almost always less than 0.1 seconds.
For the SHHS dataset, querying the within-table concept Gender took only 0.03 seconds. However, when executing "AND" queries combining two concepts from different tables in MySQL, the query took more than 3 seconds, and the situation worsens when the query contains multiple cross-table concepts: the four query concepts "Asthma", "Gender", "Hypertension" and "Time awake after sleep onset" took about 12 seconds to complete. This shows that the MySQL-based system encountered a dramatic query-time increase when querying cross-table concepts, chiefly because the relational database must perform costly JOIN operations for such queries.

Tables 6 and 7 show the query times for the MongoDB-based and Cassandra-based systems, respectively. No times are highlighted in these two tables, since no data splitting was needed for the two NoSQL databases. For the SHHS dataset, both MongoDB and Cassandra performed better on the MySQL cross-table concepts (see the highlighted times in Table 5); for single-table concepts, the performance varied. For the CHAT dataset, where all the queries involved MySQL cross-table concepts, MongoDB and Cassandra were sometimes better and sometimes worse than MySQL. This may be because the CHAT dataset contains a small number of patient records (917, see Table 1), in which case MySQL was still efficient in performing the JOIN operation on data across tables.

Figure 4 shows the average time taken for each query using the three database systems. Both MongoDB and Cassandra achieved consistently faster query performance than MySQL, with MongoDB demonstrating the best query performance, while MySQL's performance was highly dependent on the query concepts.

Statistical evaluation of average query time:
To evaluate the statistical significance of the differences in the average query times, we conducted a t-test using two independent means with a 0.05 significance level and a two-tailed hypothesis; if the p-value is less than 0.05, the query performances are considered significantly different. As shown in Table 8, the p-values are less than 0.05 for both MySQL vs. MongoDB and MySQL vs. Cassandra, indicating that the two NoSQL-based systems achieved significantly better query performance than the MySQL-based system. (In the tables, NA means unavailable information, and bold numbers indicate that the corresponding query concept(s) involve data elements from multiple tables.)
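The two-independent-means t-test can be sketched as follows; the sample arrays are illustrative, not the measured query times behind Table 8:

```ruby
# Sketch of a two-independent-means (Welch) t-test on average query times.
# The p-value is then obtained from the t-distribution with df degrees of
# freedom; only the statistic and df are computed here.
def mean(xs)
  xs.sum(0.0) / xs.size
end

def variance(xs) # unbiased sample variance
  m = mean(xs)
  xs.sum(0.0) { |x| (x - m)**2 } / (xs.size - 1)
end

def welch_t(a, b)
  va = variance(a) / a.size
  vb = variance(b) / b.size
  t  = (mean(a) - mean(b)) / Math.sqrt(va + vb)
  df = (va + vb)**2 / (va**2 / (a.size - 1) + vb**2 / (b.size - 1))
  [t, df]
end

# Illustrative samples (e.g., per-query average times for two systems):
t, df = welch_t([1, 2, 3, 4, 5], [2, 3, 4, 5, 6])
# For these samples, t = -1.0 with df = 8.0
```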

Scalability:
To evaluate the scalability of the SQL- and NoSQL-based systems, we conducted experiments performing patient cohort queries across SHHS datasets of different scales. The rationale for using the SHHS dataset for the scalability evaluation was twofold: (i) it contains the largest number of data records among the eight datasets; and (ii) it contains data elements mapping to both within-table and cross-table query concepts.
We scaled up the SHHS dataset by duplicating the original data records three, five and ten times, denoted SHHS × 3, SHHS × 5 and SHHS × 10, respectively. The duplicated records were assigned unique identifiers continuing from the last identifier of the original data records. The cohort queries were identical to those previously used for evaluating query performance. Table 9 shows the time taken for each query at the different scales using the MySQL-based system; each highlighted time indicates that the corresponding query involved cross-table query concepts.
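The duplication procedure can be sketched as follows; the record fields and the identifier key "nsrrid" are illustrative:

```ruby
# Sketch of scaling a dataset: duplicate the original records k times in
# total, assigning fresh identifiers that continue from the last original
# one so that every record remains uniquely identifiable.
def scale_dataset(records, k, id_key = "nsrrid")
  next_id = records.map { |r| r[id_key] }.max
  scaled  = records.dup
  (k - 1).times do
    records.each do |r|
      next_id += 1
      scaled << r.merge(id_key => next_id)
    end
  end
  scaled
end

shhs  = [{ "nsrrid" => 1, "gender" => 1 }, { "nsrrid" => 2, "gender" => 2 }]
shhs3 = scale_dataset(shhs, 3) # "SHHS x 3": three copies of each record
```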
As can be seen from Table 9, when querying "Gender" on these scaled datasets, the query times were short, since "Gender" was a within-table query concept. Even for a query with two or more concepts, the query time remained short as long as the concepts were within-table (e.g., "Gender" and "Weight"). However, for cross-table queries, the query times increased dramatically with the scale. For instance, when querying Age, the query times were 3.10 s, 31.76 s, 87.1 s and 318.56 s for SHHS, SHHS × 3, SHHS × 5 and SHHS × 10, respectively; the query time for Age exceeded 5 min when the number of data records was ten times larger. The situation was even worse for queries consisting of multiple cross-table concepts: querying the four concepts "Asthma", "Gender", "Hypertension" and "Time awake after sleep onset" took 917 seconds. This illustrates that the MySQL-based system did not scale well for high-dimensional data in our case.
Tables 10 and 11 present the query times for the MongoDB-based and Cassandra-based systems; for these NoSQL-based systems, there was no need to split the tables of a single dataset. To better demonstrate the scalability of the three systems, Figures 5, 6 and 7 show the query times on the differently scaled SHHS datasets for each query, where Q1 to Q12 correspond to the queries in Table 9 from top to bottom. The increase in query time with dataset size was small for both the MongoDB-based and Cassandra-based systems; these NoSQL-based systems thus demonstrated better scalability in query performance than the MySQL-based system.

Distinction with related work
Weber et al. [6] have developed a prototype Shared Health Research Information Network (SHRINE) based on i2b2 for the federated query of clinical data repositories. However, the i2b2/SHRINE system deals with uniform data across different i2b2 instances, where these instances share the same data structure. In this paper, we mainly focused on the heterogeneous and high-dimensional data across disparate datasets, where these datasets have different data structures.
Another related work is the MongoDB-based cohort query tool for clinical repositories [31], which queries a single data source. In this paper, we dealt with multiple data sources and explored an additional NoSQL-based approach.

Limitations
A limitation of this work is that the NSRR datasets are limited in the number of patient records (39,342). Although the NoSQL-based systems outperformed the SQL-based system on the NSRR datasets, it would be interesting to see how they perform when the number of patient records becomes extremely large, and to compare the actual storage required by the different databases. Another limitation is that we explored only two NoSQL database systems for patient cohort queries across disparate sources; how other NoSQL databases perform requires further investigation.

Conclusion
In this work, we developed two NoSQL-based patient cohort identification systems and compared them with a SQL-based system to evaluate their performance in supporting high-dimensional and heterogeneous data sources in NSRR. Utilizing NoSQL databases, we overcame the limit on the maximum table column count in traditional relational databases. We successfully integrated eight NSRR cross-cohort datasets into NoSQL databases, which largely enhanced query performance compared to the MySQL-based system while maintaining similar performance for data loading and harmonization. This study indicates that NoSQL-based systems offer a promising approach for developing patient cohort query systems across heterogeneous data sources.