ELII: A novel inverted index for fast temporal query, with application to a large Covid-19 EHR dataset

Graphical abstract


Introduction
Covid-19 is an unfolding global pandemic calling for urgent and accelerated efforts in identifying appropriate treatment strategies, developing accurate and rapid testing methods, and producing effective vaccines. Due to this urgency, clinical trials for treatment and prevention of Covid-19 must be complemented by population-based approaches. Clinical data about patients in electronic health records (EHR) provide an important source of information for Covid-19 research [1]. Benefits include those of traditional retrospective analyses such as identifying risk profiles, revealing health disparities, and understanding long-term health implications [2,3]. They also enable machine learning approaches for outcome predication, drug repurposing, and polypharmacy (combinational drug effects) investigation [4,5].
However, population-based Covid-19 study brings into sharp focus two unique query requirements in the context of EHR-derived big data. One is temporal query, particularly on medical events around a patient's Covid-19 diagnosis. The second is interface for interactive cohort exploration, which requires near real-time responses to user-specified queries to facilitate study design and data access.

Temporal query
Covid-19 studies often involve temporal relationships on patient phenotype and healthcare events (e.g. diagnosis, medication, lab test and procedure) before and after Covid-19 diagnosis [6][7][8]. Here are some sample cohorts specifications that involve temporal relationships: 1. All patients who received polymerase chain reaction (PCR) test in May 2020; 2. All patients who developed neurologic complications after extracorporeal membrane oxygenation (ECMO) for Covid; 3. All patients who had stroke within a month after Covid diagnosis; 4. All patients who did not have any cardiovascular condition before positive Covid diagnosis.
Example 1 involves absolute temporal query, while the rest involves relative temporal queries. Example 1 is an instance of query with negation.
Temporal query, an important query modality for population-based Covid-19 research, has not been a traditional focus of clinical query systems [9] which were mostly focused on patient-recruitment for clinical trials [10,11]. Fast temporal query on large EHR-derived datasets presents an emerging big data challenge, since temporal query is computationally expensive and takes too long to execute using a bruteforce approach. Specifically, relative temporal query involves pairwise comparison of dates between clinical events, so a new data structure and query execution strategy is required to achieve a suitable level of response speed. Near real-time interface response is a critical factor to achieve a sense of "interactivity" for cohort exploration.

Interactive cohort exploration
Two general types of data processing pipeline for population-based studies exist. One is ad-hoc data processing, which involves the development of study-specific data extraction programming scripts to run directly on source data to obtain specific patient cohorts of interest [12]. The second is cohort discovery based on a clinical query interface, allowing an investigator to interactively explore and formulate patient cohorts of interest [13][14][15]. Such interfaces are often built on top of databases constructed using common data models such as those populated through i2b2 [16], PCORnet [17], and OHDSI/MOP [18].
As pointed out in [15], the main distinction between the two pipelines can be seen in Fig. 1. The ad-hoc approach (Fig. 1, left) requires an investigator to communicate data request to a data analyst (1), who in turn implements the request as a data extraction script to run on the data source (which can be in database or file-based format) (2), and then obtains requested data and finally returns results for further analysis (3). The time span between steps 1 and 3 can be weeks if not months, and steps 1-3 often need to be iterated as study criteria are refined. The second approach (Fig. 1: right) supports a paradigm which allows investigators and data analysts (1)(2) to construct and issue query directly through a web interface without requiring the knowledge about how the backend data are structured and stored, whereby shortening the dataaccess life-cycle and facilitate collaborative data exploration (Fig. 1, right).
Many existing EHR data warehouses require the use of a commandline query language to extract data. Even for those that are equipped with a graphical user interface, there is a general lack of systematic and dedicated support for temporal queries in both the query language and the user interface [19]. Near real-time response to temporal query is one of the most computationally challenging aspects for interactive cohort exploration [20]. Methods for exploring, querying and interacting with data need to be improved to cope with the size and complexity of data. Indeed, an estimated 40% of study respondents reported that they sometimes gave up because the task was too time-consuming [19]. This query response latency challenge is amplified when using larger Covid-19 EHR dataset for population-based research.
To address these challenges, we introduce a novel approach called Event-level Inverted Index (ELII) to optimize time trade-offs between one-time batch preprocessing and open-ended, user-specified temporal queries. To demonstrate the feasibility of ELII, we developed an experimental query engine in a NoSQL (not only SQL) database using the ELII strategy to support the temporal query modality. Near-real-time performance was achieved on a large Covid-19 EHR dataset, with 1.3 million unique patients and 3.76 billion records. We evaluated the performance of ELII on five types of queries: non-temporal (classical), absolute temporal, relative temporal, query with negation, and patientlevel event sequence look-up. Our experimental results show that ELII handled all five types of queries in seconds, achieving average speed accelerations of 26.8 times on relative temporal query, 88.6 times on absolute temporal query, and 1037.6 times on classical query compared to a baseline approach without using ELII. To summarize, our main contributions are: • A novel inverted index, ELII, to support fast temporal query on clinical events; • A data preprocessing pipeline for a document-based data model for ELII implementation; • An experimental query engine to support common queries and related evaluation using a real-world, large Covid-19 EHR dataset to demonstrate the enhanced performance using ELII.

Database structure
Relational data store. Relational data store is a commonly used method for storing and managing clinical data [21]. EHR data is represented as tables, where each row in a table represents a record, and each column represents an attribute. Table 1 illustrates a clip of Lab  table from an EHR dataset, where the first column captures patient  identifier (ID).
Entity-attribute-value (EAV) data store. One limitation of the relational data store is the inflexible of attributes [22]. An attribute may have different types of data and some records may have missing data or without applicable attributes. EVA data model is an alternative rowbased design. In general, an EAV table stores the attribute-value pairs of an entity. An entity may have multiple tables separated by data types. In Table 2, for instance, a Lab record may have text, numeric and datetime data, and they are stored in three different EAV tables. In each table, the "ID" ( Table 2, first column) is the lab record ID which serves as a linkage between the EVA tables. i2b2, a popular clinical query engine, uses EAV data store in the so-called "star-schema" design [16,10].

Document-oriented data store and MongoDB
Document-oriented data store. Document-oriented data store, or document-oriented database, is a type of NoSQL database [23]. Different from traditional relational databases, document-oriented databases are designed to store, retrieve and manage information [24] represented as a collection of "documents" in JSON (JavaScript Object Notation) or XML (Extensible Markup Language) format [25]. A document can be large, complex, and semi-structured. It serves as the basic unit of data processing and conceptually is equivalent to a record in a relational data table. MongoDB is a document-oriented database system using JSONtype documents with optional schemas [26]. A group of documents in Fig. 1. Typical data processing pipelines for population-based studies (adapted from [15]). Left: ad-hoc scripting. Right: data exploration using a webbased interface. MongoDB is called a "collection." Each element or entry in a document is represented as a key-value pair. Key-value pairs can be nested to provide tremendous expressive power for capturing complex information.
MongoDB's collections do not enforce document structure; therefore, it is "schema-less," providing flexibility for data type and data-modeling choices to match application requirements. Table 3 shows three labrelated documents for the same patient (PT001) in previous examples using a document-oriented data store.
For this study, we use document-oriented data store implemented in MongoDB for the following reasons: 1. Rapid response to the Covid-19 pandemic requires data from multiple sites and sources to be pooled together in a short timeframe. This entails that not all source data can be mandated to follow the same data model, data format, or coding standard [27]. Documentoriented data store provides the flexibility for managing data format and coding variability. 2. MongoDB provides querying facility which can take advantage of customized indices, nested/embedded objects and arrays, and support for on-the-fly regular expression processing during query. MongoDB is also scalable in database size, with a built-in constraint that the maximum BSON (a binary representation of MongoDB's JSON document data records) document size is 16 MB. The number of levels of nesting for BSON documents is limited at 100. These generous constraints provide enough flexibility for most applications (but see Discussion on document size). 3. We have had prior successful experiences using MongoDB [28], so this database environment is familiar to our team, which can help facilitate rapid development through code reuse and software architecture repurposing.
Column-oriented data store. Column-oriented data store is an alternative data structure used for NoSQL databases. In a column-oriented database, each column is stored in a single table. Table 4 is an example of five lab records in the column-oriented data store. Two of the five records do not have "TYPE" data so only other three records were stored in the colum "TYPE." For a query on "TYPE," three records with "TYPE" column will be fetched instead of all five records with entire columns. Such flexibility can help accelerate query execution on sparse dataset, or when not all the columns are involved in the execution of a query.
Though not used in this study, we mention graph-based data store for completeness. Graph-based data store is used in graph databases (GDB), which uses graph structure for semantic query and uses nodes, edges and attributes to represent and store data [29]. A node in the GDB represent an instance, similar to a record or a row in a relational database (or a document in a document-store database), such as person, business, and account. An edge represent a relationship connecting two nodes, an abstraction not directly implemented in a relational model or a document-store model. Resource Description Framework (RDF) is special type of GDB that uses XML syntax to describe the characteristics of web resources and the relationship between resources [30]. The main idea of RDF is to create statements about resources (in particular web resources) in expressions of the form (subject, predicate, object), known as triples [31]. This model provides an infrastructure for metadata of different web applications [32]. A collection of RDF statements intrinsically represents a labeled, directed multi-graph, and this makes an RDF data model better suited to certain types of knowledge representation [33,34].

Inverted index
Originating from the field of information retrieval, a forward index, , associates each document D (or its identifier-ID) to the list of words that the document contains. For example, to answer the query "Which documents contain word X," the forward index requires exhaustive iteration through each document and each word to locate a hit.
, is a common technique used for enhancing query performance. For text-based search, an inverted index consists of a list of all the unique words appearing in a document collection, and for each word, a list of identifiers for those documents that contain the word. An inverted index is usually implemented as a hash map: the key is the word Table 2 Lab test in EHR represented as EAV data tables. ID -lab record ID; A -Attribute; V -Value.  Table 3 Lab test in EHR represented as a document-based data store.
In the context of document-oriented EHR dataset, document identifiers can be patient IDs and document content can be a list of records for each patient. Inverted index for lab test consists of, for each lab test, a list of patients (IDs) who had the corresponding lab test. For example, if patients PT001, PT002, PT003 had Covid-19 PCR test, then this would be captured in the inverted index as a corresponding entry: This comes handy when querying for patients who had Covid-19 PCR test: it is merely a lookup (for Covid-19 PCR) to retrieve the patient list. In contrast, forward index requires searching the entire patient list, as well as each patient's lab records to look for Covid-19 PCR test, which becomes extremely time-consuming for large EHR dataset.

Temporal information in EHR data
There are two conceptual types of temporal information in EHR. The first is event-type, where each occurrence is associated with a single timestamp. Multiple events can be aligned in a one dimensional timeline [36]. In an EHR dataset, lab test and medication are event-type data. The second is interval-type, where each occurrence is associated with a start timestamp and an end timestamp. Basic interval relations are often captured in Allen's interval algebra [37]. Encounters are typical interval-type occurrences in a patient's medical record. For this study, we are interested in the representation and query of event-type information in EHR dataset.
A challenge of temporal query using the conventional data model involves comparison of timestamps between two events with a specified temporal relation. Such a query involves quadratic time complexity of O(n 2 ), where n is the number of events for a single patient. In extreme cases, a patient may have thousands of records, and the number of comparison will be millions. For millions of patients, this operation quickly becomes intractable. Therefore, optimization is necessary for temporal query on large datasets. Indexing time (on the timestamp column) is the most commonly used strategy to speed up temporal query [38]. Preprocessing query before execution is another approach to reduce condition evaluation effort [39]. Approach also has been proposed [40] to optimize the data model by pre-computing the relation between two events and cache such information for future use. Another study [41] demonstrated that complete in-memory processing can improve query execution time on a dataset with millions of rows, although memory remains a relatively expensive commodity today. Our ELII approach is different from those proposed in such studies in that we address the temporal query performance challenge using an innovative collection of forward and inverted indices specifically designed to handle EHR data and clinical events (see Discussion for general applicability).

Methods
Typical EHR data contain four types of source files about a patient: demographics (I), diagnosis (D), medication (M), and lab test (L). We use such file types to introduce our concept of event-level inverted index.

Event-level inverted index
An event-level inverted index consists of four main components: 1. Patient Timeline, which contains all the clinical events and related information (e.g., date and time) for each patient; 2. Conventional Inverted Index, which includes the inverted indices of time-invariant variables, especially for demographic data, such as "gender" and "race;" 3. Timeline Inverted Index (tiII), which consists of inverted indices of time-dependent variables (i.e., event labels with timestamps), such as "diagnosis code" and "lab test;" and 4. Global Lookup Table, which is a forward index of all variables and associated inverted indices.
Patient Timeline Pt. We use multi-level nested documents (supported by MongoDB) to store patients' events where "Patient ID" serves as the "primary key" of documents. Each timeline document consists of all clinical events of a patient. Each patient may have multiple events and each event may have multiple attributes. The result follows a structure of JSON-like nested key-value (a, v) pairs: We use a column-oriented store for the values of each event attribute. In Patient Timeline for PT002 (Fig. 2), for example, the "TEST RESULT" field of "Covid-19 pcr" only contains the value in the column "TEST RESULT" with "Patient ID" equals to "PT002." Query performance will benefit from a column-oriented data store because we only need to access the values of the query attributes instead of loading the entire document for event-specific queries.
For time-related event, we sorted the "Date" attribute for each event, which is another key idea for speeding up temporal query. In Fig. 2, for example, "TEST DATE" field of "Covid-19 pcr" is sorted from the oldest to the latest, and the values of "TEST RESULT" are sorted with the same order of "TEST DATE." Some temporal queries only involve the first or the latest event, which can be retrieved in constant time. Using sorted array, searching for a specific date has worst time-complexity O(log n), where n is the number of elements in the array.
Conventional Inverted Index. Conventional inverted indices are used for time-invariant variables, such as those involved in file type I. In the demographic source file (Table 5), each column represents a different demographic variable as an attribute, such as gender and race, and each row represents a patient. This inverted index has a key-value pair structure: {attribute value : patient ID list}, which the key (attribute value) is the value of attribute and value (patient ID list) is the list of patient IDs. For Table 5, the corresponding inverted indices for "GENDER" looks like: Timeline Inverted Index (tiII). There are two types of tiII: 1) singleattribute tiII; and 2) multi-attribute tiII.
For each event, the single-attribute tiII is structured as an attribute column (e.g., "TEST NAME" or "DIAGNOSIS CODE"), stored as keyvalue based documents. Each document consists of event name, the first date of event for all patients, and the last date of event for all patients. A 2-dimensional array (named Temporal patient list) is used to capture all patients who have had this event. This 2-dimensional array divides and sorts patient by event date chronologically: Therefore, multi-attribute tiII has additional fields for multiple event labels. Multi-attribute tiII provides a mechanism for pre-computed multi-event join using inverted index.   Fig. 3. Construct Multi-attribute tiII based on the Patient Timeline. Fig. 3 shows an example of multi-attribute tiII for "Covid-19 pcr" test combined with test result ("positive" or "negative"). The dotted box on top is an example of Patient Timeline containing three patients who had "Covid-19 pcr" test. The sample event has "TEST NAME" being "Covid-19 pcr" and "TEST RESULT" being "positive," with the corresponding multi-attribute tiII in the second dotted box. Temporal patient list is shown in the last dotted box, which represents consecutive dates of all occurrences from the first date of the event to the last date of the event.
In this example, Temporal patient list with id = 1 (Fig. 3, below) stores the identifiers of patients who had "positive" in "Covid-19 pcr" test from "2020-05-21" to "2020-06-30." Each item of the array is a list, which stores the identifiers of all patients who had the event on this particular date. For example, the 34 th item of the array indicates PT002 and PT003 were tested "positive" for "Covid-19 pcr" in 2020-06-23 (34 days after 2020-05-21). If there is no event record of any patient on a specific day, this array item will be empty (a place holder).
Global Lookup Table. This is a global forward index designed for inverted index management. This global forward index makes it straightforward to look up and access all inverted indices using all the original variable in the EHR data. For example, the documents for inverted index of "gender" can be looked up in the collection of demographic inverted indices.

Query execution
The template for temporal query can be structured as:

"Find patients who had [events] with [temporal constraints]."
User input parameters are "events" and "temporal constraints" (optional). "Events" is one or multiple attributes and value pairs such as diagnosis code: U071, diagnosis status: diagnosis of.
"Temporal constraints" provide the specification for the time interval in which the events take place. For example, a temporal constraint can be a period or temporal relations of two events such as "any stroke diagnosis AFTER first Covid-19 diagnosis BETWEEN 2020-02-01 and 2020-05-01." Fig. 4 presents the architecture of our query engine. For database content (Fig. 4, right), a NoSQL database with ELII is constructed from a collection of EHR source files. On the frontend (Fig. 4, left), a user builds a query using a web-based graphic interface. The query engine translates a user's query as a group of database statements for the "Query Backend." The Query Backend then executes the database statements according to statement type by consulting the Global Lookup Table and combines the results according to query logic. The final result is then presented back to the user in the web-interface. Query results can be exported and downloaded for further analysis. Three basic types of query are available: 1. Classical (non-temporal), 2. Absolute temporal, and 3. Relative temporal.
Classical. Classical query involves only non-temporal attributes, such as demographics and clinical events without time constraint. Conventional inverted index are used to support this kind of query. Pseudocode for classical query appears in Appendix A, Algorithm 1. Absolute temporal. Absolute temporal query contains input parameters that restrict the times for events, executable using tiII. The pseudocode for absolute temporal query execution appears as Appendix A, Algorithm 2. Fig. 6 demonstrates the main steps involved in executing absolute temporal query "Find patients who tested positive in any Covid-19 pcr test between 2020-06-01 and 2020-06-30." These steps are instantiated in the following statements in Algorithm 2:  Relative temporal. A relative temporal query contains two events: A and B, instead of a single event in the previous two query types. The query also specifies a temporal relation between A and B. In EHR, events of the same type may have multiple occurrences. Therefore, further elaboration of the temporal relation, as shown in Table 6, is needed. In total, there are six possible relations between two clinical events with multiple occurrences (and corresponding timestamps).
The pseudocode for query with relative temporal constraints appears as Appendix A, Algorithm 3. Fig. 7 shows the main steps involved in executing relative temporal query "Find patients with any diagnosis U071 (Covid-19) before diagnosis I63 (stroke)." These steps are instantiated in the following statements in Algorithm 3: 1. In addition to these basic types of query, the query engine also supports full boolean queries and individual patient Pt lookup.
Boolean query. Boolean query involves a combination of multiple sub-queries using logical operators (AND and OR). Sub-queries are executed independently first. Then, set intersection is performed on the resulting patient identifiers resulting from sub-queries for AND and set union (with unique patient identifiers) is performed on the resulting patient identifiers resulting from sub-queries for OR. For example, result for the following OR query "Find patients who had a Covid-19 diagnosis OR a Covid-19 test with a positive result"  will be the union of patient identifiers from sub-queries "patients with any Covid-19 diagnosis" and "patients with any positive Covid-19 test." Negation. Negation is a useful operation for clinical studies to find patients without certain conditions. For EHR data, negation is handled by appropriately interpreting "lack of data or information." Three typical scenarios are possible with the lack of information: 1. using "Closed-World Assumption," lack of information entails negation. For example, if a patient did not have cancer diagnosis in medical record, it is safe to infer that the patient did not have cancer. 2. No information. For example, if a patient did not have HIV test, we have no way of knowing if the person has HIV or not. 3. No available result. For example, a patient may be coded as a smoker, but we have no "packs per day" information at all.
With these possible interpretations in mind, "lack of data or information" query can be executed by forming two groups of patients: a group Z of patients to exclude, and the group of all patients U. For example, the query "Find patients who did NOT have any diagnosis U071 (Covid-19) before diagnosis I63 (stroke)" can be performed by following these steps: 1. Z = patients with any diagnosis U071 before diagnosis I63 2. U = all patients in the database 3. R = U⧹Z Individual patient data look up. ELII is also designed for fast individual patient history data retrieval, another important query modality. It uses the Patient Timeline forward index Pt ̅→ . The query input is a patient ID, with other options including specific events the user wants to show in the output, and a period of time the events fall within.

Evaluation method
Data source. We used OPTUM® de-identified Covid-19 EHR dataset with Aug 6, 2020 release date. This dataset consisted 1.3 million patients who had either Covid-19 related diagnoses (U071, U072 and U073) or had Covid-19 specific lab tests (positive or negative). 16 individual source files were contained in the release, for different types of EHR records, such as patient demographics, diagnosis, medication, and lab. Each source file comes with several types of attributes (i.e., the "columns"). For instance, the PATIENT source file contains demographic attributes such as gender, age, and race. The DIAGNOSIS source file contains attributes such as diagnosis code, diagnosis code type, and diagnosis status. The MED ADMINISTRATIONS source file contains attributes such as drug name, National Drug Code (NDC), quantity of dose, and dose frequency. Definition of these attribute types are given in the accompanying data dictionary provided by OPTUM®. In total, 132 such attribute types are suitable for query construction. Table 7 lists the numbers of attributes and lines of records in the main source files, status of relevant date fields, and file size.
Query performance was evaluated using MongoDB with the entire dataset imported (1.3 million total patients). To test the performance of ELII's four components (patient timeline, inverted index, timeline inverted index and global lookup table), three types of query (described in Section 3.2) were tested: after 2020-02-20; 3. Relative temporal: Find patients who diagnosed/tested with [xxx] after 2020-02-20 and before first Covid-19 diagnosis.
For these three types of queries, we selected test events from Lab and Diagnosis for two reasons: 1) they are two commonly used record types for temporal query; 2) they are the top three largest record types in data size and the record number of the OPTUM® Covid-19 dataset version 20200806. As listed in Table 7, "Lab" has the largest data size and the second largest record number, and the data size and record number of "Diagnosis" are both the third largest.
The "[xxx]" in each test query template is the place holder for query parameter. Since the output of the temporal query is a list of patients, we selected 19 query parameters with different number of patient counts. We first picked 5 events with the number of patients ranging from 10 to 100,000 in base 10 logarithmic scale.
For events involving a number of patients greater than 100,000, we picked additional four events with a number of patients from 100,000 to 500,000 in linear scale (every 100,000 patients). We also tested an extreme case of "Lab" records: "test name" = "Oxygen saturation (SpO2).pulse oximetry," which has the largest number of patients in this data release. The last event is Covid-19 diagnosis ("diagnosis code" = "U071"), used as Event B for relative temporal query testing. Summary Table 7 OPTUM® Covid-19 dataset summary statistics.   statistics for each event tested are shown in Table 8. For each event, we executed all three query testing templates and recorded the execution times for MongoDB statement translation, MongoDB running time, and result array conversion. All queries were repeated 10 times consecutively and the average time were reported. Baseline database. For baseline performance, we imported the OPTUM® Covid-19 EHR dataset into MongoDB and kept its original row-oriented data format. Each row in the source files is stored as a document (the document keys are contained in the first column of each source file). To optimize query performance for the baseline database, we use MongoDB's standard built-in indexing for all query experiments for a fair comparison. In this setup, classical query shows how the conventional inverted index in ELII improves performance. Temporal query shows how "timeline inverted index" and "patient timeline" in ELII improves performance, at a different scale.

Data preprocessing
We developed an automated script for preprocessing OPTUM® Covid-19 EHR dataset version 20200806 into MongoDB. 3.76 billion records from 360 GB of text files were processed, creating event timelines for 1.3 million patients. The global lookup table contained 132 attribute types (diagnosis code, test code, test name, etc.), based on which we built single-attribute tiII for all 1.94 million attribute-value pairs. Single-attribute tiII is indexed for only one data field, and multiattribute tiII is for multiple-field queries. For example, to query the patients with "Covid-19 pcr" test and "positive" result, we needed twoattribute tiII for the "TEST NAME" and the "TEST RESULT." Creating multi-attribute tiII will increase the performance for multi-field queries but it requires extra pre-processing time. Table 9 shows the performance result for queries without temporal constraints. Conventional inverted index worked well: all queries were completed in less than 0.1 s for classical query. Table 10 shows the performance result for absolute temporal query. ELII performed an average 88 times faster than the baseline setup. It was not as dramatic as in the case of classical query because query execution time for the baseline setting is reduced by filtering out more records with Notes: Baseline: baseline model using row-oriented data store, ELII: our method using document-oriented data store with ELII, Saved: ELII Time -Baseline Time, X-faster: Baseline Time ÷ ELII Time.  the time constraints. Meanwhile, query time for ELII was slightly increased since an additional step was involved to slice the patient list according to time period. Table 11 shows the performance result for relative temporal query. The baseline result shows relative temporal query was the most timeconsuming in our experiments. ELII achieved even better performance than the other two query types. For the large scale query L10, it had a significantly improvement over the baseline, achieving an average 26 times faster and reduced query execution time by 35 s on average.

Query performance
Our ELII-based query engine surpassed the baseline in almost every single testing query by execution time except for L1 (negligible difference for a small query). Our approach demonstrated more significant time reduction as the number of records became larger. The most timeconsuming query was "test name" = "Oxygen saturation (SpO2).pulse oximetry," which had 120 million records and 1.1 million patients. For all three tested query types, baseline setup for SpO2 took 345.9, 319.2, and 493.3s respectively, while ELII reduced the time to 0.1, 2.1, and 6.6s, demonstrating dramatic performance improvements for larger queries.
We performed validation of ELII result for query accuracy by comparing the results with the same query using the baseline setup. For all 57 tested queries (19 query events in Table 8 tested on three query types), the results of the two approaches are exactly the same in both number of patients and sets of patient IDs. We also randomly selected and manually validated query results using the patient timeline lookup function. By reviewing the records of a set of randomly selected patients, we were able to independently confirm that the query results met our query conditions.

Query interface
To demonstrate the feasibility of ELII for interactive cohort exploration, we developed a web-based user interface called CovidSphere with which to build temporal queries for cohort exploration. The main idea and design of CovidSphere followed the best practice of our previous established query interface design experiences. These include those reported in publications such as "MEDCIS' [42], "X-search" [13], as well as "DataSphere" [28]. Fig. 8 shows the query builder interface layout with three areas annotated. In Fig. 8 area 1, a user may select an attribute type among "query terms," such as diagnosis code or test name, to construct classic and absolute temporal queries. In Fig. 8 area 2, a query template for relative temporal query is provided, where a user may select different settings within four components: i) time period (the start date and the end date), ii) events that include attribute type and attribute value (i.e., event A and event B), iii) temporal relation between two events (BEFORE or AFTER), and iv) operator mode for the temporal relation as specified in Table 6. The example shown in Fig. 8 area 2 represents a query to find all patients who have ALL diagnosis code: I63 AFTER diagnosis code: U071 between Feb. 1, 2020 and Aug. 1, 2020. By clicking the "QUERY" button, the query engine will translate the user input into a backend query statement and obtain the corresponding list of patient identifiers. Query result information is displayed in area 3 of Fig. 8, which includes the number of patients meeting the query criteria. A user may download the resulting list of patients with demographics information in comma separated value format by clicking "EXPORT DATA." Our experimental Covid-19 EHR database was implemented using MongoDB 4.4 Community Edition on MacOS. The query library for performance evaluation was written in Python (version 3.8.3). The ELII query engine was built and evaluated on a Mac Pro 2019 with 2.7 GHz 24-Core Intel Xeon processor and 768 GB 2933 MHz DDR4 memory.

Preprocessing effort
We developed ELII to enhance query performance and demonstrate this for a large Covid-19 EHR dataset. To achieve this goal, we needed to pre-compute ELII indices as part of data conversion to MongoDB data. Our approach for building ELII involved two steps: (1) building a hash map in such a way that the hash key is an existing event and the hash value is a list of patient IDs with the associated records containing the corresponding event. The time complexity for this step is O(n), where n is the total number of data elements in the dataset. (2) sorting the events associated with each patient, which has time complexity ∑ m i=1 l i logl i , where m is the total number of events in the database and l i is the total number of records that contain the i − th event. Additional preprocessing time of 22 h was spent in building all single-attribute tiII of the 360 GB text files. This one time, preprocessing can be speeded up by using parallel computing.
Although our experiments were carried out in a typical high-end desktop machine, the query engine can also be hosted in a more powerful environment, such as a computer cluster. One of our contributions is that, using commonly available computer hardware configurations, ELII can already achieve near-realtime query performance for a large OPTUM® Covid-19 EHR dataset.

MongoDB Cold-start
MongoDB uses Memory Mapped Storage Engine (MMAP) which maps files in the disk to memory for faster process. If the documents of a Fig. 8. Layout of our temporal query interface. query are not in memory, MongoDB will perform the mapping first and then execute the query, which is called a cold-start. Experiments for this study were performed with warm-start by making sure data was already loaded in memory, in order for the results to only reflect the execution time for queries. Even with cold-start, the ELII approach remains speedy because it covers most queries without prompting MongoDB to load all documents in memory.

MongoDB limit on document size
The maximum document size for MongoDB is 16 megabytes. This restriction was intended for managing RAM and I/O bandwidth. In extreme cases of the Covid-19 EHR dataset, a patient may have an extremely large number of events or an event may have taken place for an extremely large number of patients. If the corresponding document is larger than the allowable size, such a document can be split into several. Splitting will reduce query performance, but only for queries that triggered the splitting scenario. For example, {"diagnosis code":"I10"}(D9 in Table 10) had 12,954,359 records that were split into 34 documents using our automated script. As a consequence, this absolute temporal query (D9) took much longer than other events involving fewer patients.

Patient timeline export
Query result consists of a list of patient IDs. A users can retrieve all the records from any individual patient in the result immediately after a query. The export function helps users to not only validate whether patients match their search criteria, but also inspect and make sense of the clinical event sequence in the specific cohort.

Time complexity estimation
MongoDB used B+ Tree for indexing, so a "find" statement is O(log x) in time complexity, where x is the number of records in a document. Let n be the total number of patients, and m be the total number of unique events. Then the time complexity for classical and absolute temporal query is O(log m), and the time complexity of individual patient timeline lookup is O(log n). Relative temporal query is O(xy(log n)), where x is the number of resulting patients, and y is the maximum number of an individual patient's events.

General applicability
Although we primarily designed ELII for Covid-19 EHR data to improve the performance of temporal query, our approach can be applied to other standard and typical EHR data for chronic diseases and conditions, with or without temporal query. In fact, our earlier work [28] demonstrated that a NoSQL (e.g., MongoDB) approach can cut query construction time by half while improving the speed of representative queries by a magnitude.
ELII also is applicable to other data models such as those for i2b2, PCORNet, and OMOP. Query engines built for these widely used data models have not provided sufficient support for temporal query. ELII offers a pathway to creating MongoDB-like query engines for databases using i2b2, PCORNet, and OMOP data models. The general steps involved would consist of exporting data from i2b2, PCORNet, and OMOP into text files for patient demographics, diagnosis, medication, and labs, with appropriately linkable identifies for each record line in the file. Then these source text files can be processed using our preprocessing and ELII-construction scripts, and imported into a Mon-goDB environment, with a CovidSphere-like query interface. Limitations and topics for further study include the need to reduce pre-processing time using parallel computing, incremental database updates without the need for full database reconstruction when updated source data become available, and inefficiency in space utilization that is typically associated with a document-oriented data store.

User interface usability evaluation
This is a substantial topic beyond the scope of the current paper. However, the design of our CovidSphere user interface followed the best practice from our previous established query interface design experiences. These include interfaces reported in publications such as MEDCIS [42], X-search [13], as well as DataSphere [28], and we expect to have similar outcomes. We recognize that some additional query functionalities would be desirable, such as a query manager to save and share queries, interfaces supporting case-control design and exploration, builtin tools for visualization, and more flexibility for customized data export. Using a similar strategy as our previously developed cohort query and exploration tools, we plan to develop a comprehensive interactive cohort exploration interface based on this new ELII data model and our preliminary CovidSphere interface for OPTUM® Covid-19 data. We plan to report the design, implementation, and user evaluation of the system in a separate future paper.

Research utility
We have not performed a formal research utility assessment to provide publishable evidence supporting our original design motivation in enabling population-based Covid-19 research. However, the temporal query functionality in CovidSphere using ELII is already benefiting several ongoing studies locally on the impact and consequence of Covid-19 on topics such as outcome differences between sex groups, and longterm neurological impacts. Our approach provided an intuitive way to communicate and refine study requirements for clinical investigators. It provided live feedback on "what if" questions. Results on such topics will be reported in disease-specific venues.

Conclusion
To better leveraging large EHR data for Covid-19 research, we developed an innovative inverted index system to support fast temporal query. Experimental results showed that a set of temporal queries tested on 1.3 million patients resulted in average execution time of seconds or less. Our study suggests that ELII is a promising approach supporting fast temporal query, an important mode of cohort development for Covid-19 and other population-based research.

Contributors
GQZ developed the ELII concept. YH and XJL designed and refined the ELII system and evaluation. YH wrote code for data preprocessing and ELII backend. XJL implemented the ELII interface. All authors contributed to writing and editing the manuscript.

Dataset availability
The dataset used for experiments in this study is provided by OPTUM®, a third-party vendor. The University of Texas Health Science Center at Houston licensed this dataset.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.