A practical approach to storage and retrieval of high-frequency physiological signals

Objective: Storage of physiological waveform data for retrospective analysis presents significant challenges. Resultant data can be very large, and therefore becomes expensive to store and complicated to manage. Traditional database approaches are not appropriate for large scale storage of physiological waveforms. Our goal was to apply modern time series compression and indexing techniques to the problem of physiological waveform storage and retrieval. Approach: We deployed a vendor-agnostic data collection system and developed domain-specific compression approaches that allowed long term storage of physiological waveform data and other associated clinical and medical device data. The database (called AtriumDB) also facilitates rapid retrieval of retrospective data for high-performance computing and machine learning applications. Main results: A prototype system has been recording data in a 42-bed pediatric critical care unit at The Hospital for Sick Children in Toronto, Ontario since February 2016. As of December 2019, the database contains over 720,000 patient-hours of data collected from over 5300 patients, all with complete waveform capture. One year of full resolution physiological waveform storage from this 42-bed unit can be losslessly compressed and stored in less than 300 GB of disk space. Retrospective data can be delivered to analytical applications at a rate of up to 50 million time-value pairs per second. Significance: Stored data are not pre-processed or filtered. Having access to a large retrospective dataset with realistic artefacts lends itself to the process of anomaly discovery and understanding. Retrospective data can be replayed to simulate a realistic streaming data environment where analytical tools can be rapidly tested at scale.


Introduction
A large amount of heterogeneous data is generated as part of the process of patient care. Such multimodal data streams are integrated and utilized particularly by clinicians in the intensive care environment to understand patients' clinical condition and response to treatment (Rush et al 2018, Charlton et al 2016, Maslove et al 2017). There has been an increase in research approaches that incorporate these kinds of data into analysis (Johnson et al 2016, Orphanidou 2019, Yang et al 2019). One of these data types is physiological waveform data, which is generated by the interaction of biological systems with technological devices. These information-rich streams are sampled and displayed by medical devices, typically at frequencies up to 500 Hz (Brossier et al 2018, Jacono et al 2010). These waveforms are produced, for example, by continuously measuring mechanical pressures within the arteries as an arterial blood pressure (ABP) waveform (McGhee and Bridges 2002). Other examples of waveform data collected in an intensive care environment include electrocardiograms (ECG), electroencephalograms (EEG) (Brinkmann et al 2009), ventilator pressure-volume loops (Nichols and Haranath 2007), and oxygen saturation and pulse pressure waveforms (pulse oximetry, plethysmography) (Allen 2007).
Physiological waveforms are continuously monitored in environments such as the intensive care unit, where their density, continuity and granularity provide an optimal means of longitudinally monitoring patients at risk of sudden or unpredictable deterioration. These individual waveform signals contain information which can facilitate diagnosis. For example, a large number of quantitative physiological descriptors can be extracted from an ECG waveform, including heart rate (HR) and heart rate variability (HRV) (Holder et al 2015, Acharya et al 2006, Stein 2013). The waveform morphology itself can also be used to diagnose an abnormal heart rhythm (Hannun et al 2019). The interactions between these signals can also shed light on coupled physiological subsystems that have the potential to modify each other's behavior in a clinically relevant manner. An example of such coupling is the set of cardiopulmonary interactions between the lungs and the cardiovascular system (Kiviniemi et al 2010, Penzel et al 2016, Novak et al 1993, Clark et al 2011).
Despite the potential utility for research and clinical purposes, storage of physiological waveform data for retrospective analysis presents significant challenges. Ironically, the complexity and granularity that make this such an appealing data type for analysis also mean that the resultant data can be very large, and therefore become expensive to store and complicated to manage (Burykin et al 2011). Additionally, access to data may be inhibited by industry vendors who may view data as proprietary information rather than understanding it to be the property of the patient (Wetzel 2018, Cosgriff et al 2019). Thus, collating archives of usable information in a clinical setting where comprehensive data collection is difficult presents a significant challenge for database administrators (Rodriguez et al 2018, Karhonen et al 2001). One solution to the problem of storage is to keep the data for only a certain period and then purge it. The difficulty with this approach is that the resulting datasets are incomplete, as they contain high resolution data only for recent admissions. Another approach for physiological signals is to store variables derived from waveforms, for example heart rate derived from ECG, or respiratory rate derived from the respiratory impedance (RI) waveform (Holder et al 2015). The rationale behind this approach is that derived data values tend to be relatively low frequency (0.0003 Hz-1 Hz) (Brossier et al 2018), resulting in significantly lower storage requirements than if the original waveforms had also been retained. However, these systems do not permanently retain the raw waveforms from which the variables were derived, and therefore implicitly assume that all the features of the waveform are known at the time of extraction. Additionally, data stored at lower temporal resolutions may not adequately capture the variability of the underlying signal (Eytan et al 2019, Horvat and Ogoe 2019).
Derived variables may also be irreversibly affected by the algorithms that generate them, for example limitations of the inbuilt beat detection algorithms used in patient monitors may momentarily affect reported heart rate values (Berntson and Stowell 1998), or artefacts in an ABP waveform may momentarily affect the derived diastolic and systolic blood pressure readings (Li et al 2009).
Our research is predicated on the premise that there is additional information contained in the original waveform data beyond the standard suite of derived variables. Furthermore, analytical approaches will likely benefit from more robust approaches to the extraction of familiar variables, for example re-deriving heart rate from ECG waveforms (Manikandan and Soman 2012). However, in order to explore these new techniques, the waveform data must be retained at its original sample rate for future exploration.
Continuous high-frequency physiologic signals are characterized by being large in volume and high in variability. In the department of Critical Care Medicine at the Hospital for Sick Children (HSC), Toronto, Ontario, Canada, we use these signals for research with the goal of developing tools for real-time and online analysis available to clinicians at the bedside. To achieve this we first made the central philosophical decision that all of the physiologic data and waveforms are owned by the patient, not by the monitoring vendor or hospital. This meant that we had to capture and store these data permanently, and that the storage format and data access had to make the data readily discoverable (Lawrence 2017, Gil et al 2007). This is a unique approach to take, and in this report we describe the bespoke architecture that was developed to solve the problems of managing large volume time series physiologic data.

Previous work in this field
A variety of proprietary middleware platforms exist to collect and aggregate data from patient monitors and other devices, e.g. ViNES (Baxter, Deerfield, IL), Capsule SmartLinx (Capsule, Andover, MA), Bedmaster EX (Excel Medical Electronics, Jupiter, FL, USA), Bernoulli data collection system (Capsule Technologies Inc. Andover, MA) and iXellence (Ixitos, Berlin).
Numerous database systems and approaches have been developed specifically for storage of time series data (Bader et al 2017, Jensen et al 2017). Each of these systems is optimized for data with particular characteristics, for example floating-point or integer values; regularly sampled time series or aperiodic data; and long term versus short-term storage (Pelkonen et al 2015, Lautenschlager et al 2017, Yang et al 2014, Andersen and Culler 2016, Blalock et al 2018, InfluxDB 2019). Some time series data formats have been developed specifically with physiological data in mind (Brinkmann et al 2009, Moody 2019, Cabeleira et al 2018, Lee and Jung 2018, Trigo et al 2011, Bond et al 2011, Chen et al 2017). Other groups have leveraged a wide range of approaches to large scale storage of physiological waveform data, including ASCII files (Goldstein et al 2003, Rehm et al 2017, Liu et al 2012, Fanelli et al 2018), an IBM DB2 relational database (Blount et al 2010), compressed binary data in floating-point format (Lee and Jung 2018, Liu et al 2018), compressed .wav files (Blum et al 2015), and Microsoft SQL Server (Matam et al).

Our approach
In 2015, when we were initially faced with the challenge of collecting waveform data from the Critical Care Unit at the HSC, many of these technologies and systems were unavailable, immature, or otherwise unsuitable for our particular situation (De Georgia et al 2015). Our goal was to apply modern time series compression approaches to the problem of physiological waveform storage and retrieval.
As part of the design process we established that an ideal physiological waveform data database would have the following characteristics.
(a) Device Agnostic Data Collection: signals should be able to be collected from a variety of devices, platforms, and vendors such that data may be synchronized and integrated into a unified collection.
(b) Compact Representation: it should be possible to compress waveform data to a size that is cost-effective to store long term.
(c) High Fidelity: information should be stored directly to disk exactly as supplied from the monitors/sensors, with no loss of information.
(d) Long Term Storage: stored data should be retained indefinitely.
(e) Indexing: a structure should be imposed that permits creation of metadata facilitating quality control and research cohort identification without the need to access the raw data directly.
(f) Efficient Retrieval: data should be able to be retrieved rapidly for use in analysis and discovery. Programmatic data retrieval should be enabled and optimized for parallel and distributed computing applications.
We set out to design and deploy a physiological data collection system that exhibited all the above characteristics. We named the system AtriumDB™ as a reference to the atrium of the heart and the large open area at the Elizabeth Street entrance of the HSC.

Overview
Patients in a critical care setting are heavily monitored by a wide range of electronic sensors and devices (Rodriguez et al 2018, Zaleski 2015). These devices are primarily designed to display clinically relevant variables in real time to carers at the bedside. Secondary systems may be deployed to collect and store the data streams generated by such devices for later use. Such data collection systems consist of a device integration stage which collects data streams from bedside monitors and other devices and aggregates them into a unified data stream (Karhonen et al 2001). These aggregated data streams are often temporarily stored in an uncompressed text format for readability and simplicity; however, such representations contain redundant information which can easily be compressed or removed. Efficient data storage requires data compression to reduce the size of the data and minimize data storage costs. While compressed data is compact, it may require specialized software to make it available for clinical and research purposes. This paper focuses primarily on the data compression, storage, and retrieval functionality provided by the AtriumDB system.

Device integration
We deployed ViNES Device Integration Software connected to 42 Philips monitors (Intellivue MP70, Philips, Amsterdam, Netherlands, software version J.10.50) in the critical care unit at the Hospital for Sick Children. Ancillary devices, such as ventilators, and regional saturation monitors (near-infrared saturations, or NIRS), were interfaced directly to the Philips monitors using the Intellibridge system. With this approach the monitor acts as an aggregator of data and allows simultaneous and synchronized export of both monitor data and information from these peripheral devices. The ViNES system software uses the Philips Data Export Protocol (Philips 2015) to request data from the RJ45 serial port at the back of each monitor, transmitting it to the ViNES server via a Lantronix Serial to Ethernet adapter. The ViNES server unifies the signals, converts the data to JavaScript Object Notation (JSON) format, and stores it in a RabbitMQ (RabbitMQ 2019) cache for retrieval by downstream software. Non-waveform physiological data are retrieved from the Philips Central Server via HL7 feeds by the Trajectory, Tracking, and Trigger (T3 TM ) Platform (Etiometry, Boston, MA). An overview of the system architecture is shown in figure 1.

Data collection
A service that polls the ViNES Rabbit queue and retrieves the individual JSON messages was written in the PHP scripting language (PHP: Hypertext Preprocessor 2019). PHP was initially chosen because several of the team members were familiar with this programming language, which enabled rapid prototyping. Each monitoring device emits one numeric data message every 1024 ms, containing multiple parameters such as heart rate (HR), blood pressure (BP) and respiratory rate, and one message for every 256 ms of each unique waveform signal, containing between 32 and 128 values depending on the waveform sample frequency. A typical throughput of the system in our 42-bed installation was around 600 JSON messages per second, with up to 1000 messages per second at peak data flow rates. An overview of the different kinds of signals, and the amount of each signal collected, is presented in the appendix.
Typical collection rates in this system are around 100 million time-value pairs per hour from an average census of 35 occupied beds. Peak collection times may result in data collection from all 42 beds, which generates around 130 million time-value pairs per hour. Examples of some of the signals collected are shown at different time resolutions in figure 2.
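As a rough sanity check, the reported rate is consistent with a plausible per-bed signal mix. The mix below is an illustrative assumption for demonstration only, not the actual unit configuration:

```python
# Back-of-envelope check of the ~100 million time-value pairs per hour
# reported above. Assumed per-bed mix: a few waveforms (e.g. ECG at
# 500 Hz and 250 Hz, pleth at 125 Hz, respiration at 62.5 Hz) plus
# roughly 30 numeric parameters at ~1 Hz.
waveform_hz = [500, 250, 125, 62.5]
numeric_hz = [1.0] * 30
pairs_per_bed_per_hour = sum(hz * 3600 for hz in waveform_hz + numeric_hz)
beds = 35  # average census
total_per_hour = beds * pairs_per_bed_per_hour
print(f"~{total_per_hour / 1e6:.0f} million time-value pairs per hour")
```

With these assumptions the estimate lands on the order of 100 million pairs per hour, matching the reported figure.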
The number of time-value pairs collected each week since the system's inception is shown in figure 3. Several notable stages of data collection can be seen in this figure. After the initial installation in February 2016, data was collected from a small group of four beds for a period of ten weeks to test the data collection system. After this trial additional beds were brought online throughout June to October 2016. By June 2017 all 42 beds had been connected to the system. In December 2017 we introduced a 'priority list' that allowed us to specify the relative importance of each signal that we collected. This reduced the number of redundant signals collected, for example preventing collection of both 500 Hz ECG and 250 Hz ECG from the same lead. Introduction of the priority list reduced the average rate of data collection by around 20%, but improved the overall continuity and quality of the signals being collected. Two months of data (October and November 2016) were irreversibly lost during a hardware failure event. Additional backup measures were introduced after this event to prevent similar incidents happening in the future. Other variations in figure 3 are due to minor hardware outages or changes in patient census and acuity.
JSON messages are examined and filtered upon retrieval from the Rabbit queue and cached on a secure server as an uncompressed text file, with a single file containing only messages from a unique bed-signal combination for that hour. At peak load this results in around 700 unique text files being generated each hour. At the end of the hour an asynchronous process, also written in PHP, parses each file and compresses it into a custom file format (see section 2.5). After the conversion to compressed file format, historical JSON files are moved to a different server for long term storage and backup. Temporary text files are deleted once this process has been completed.

Data compression
An ideal storage format must be both compact and highly accessible. A wide variety of different storage solutions were considered that would achieve these goals. For planning purposes, we estimated the rate of data ingest to be approximately 1.0 × 10^12 time-value pairs per year. Traditional database approaches are not suitable for storing such large volumes of time series information (Stonebraker and Cetintemel 2005).
Recently there has been a push towards NoSQL type architectures for storage of time series data (Dunning and Friedman 2014). Such databases exploit inherent redundancies in sequences of values in the time series to provide more compact representations of the data. In these systems, intervals of contiguous data are compressed together into a single chunk or block. This facilitates a wider range of compression options, and allows indices to be more coarsely constructed, reducing overhead and leading to lower latency queries.
Medical devices sample and deliver physiological waveforms in a way that makes delta encoding particularly effective (Chang and Lin 2010). Physiological waveforms are often smoothed and pre-processed in the sensor or monitor such that iterative application of delta encoding results in a more compressible byte stream. Delta encoding records an initial value, and then only stores the difference between subsequent values (Shafer et al 2013, Engelson et al 2000, Singh et al 2015). This approach results in lossless compression when applied to arrays of integers (Lemire et al 2015). In our system, each signal type was profiled to identify the optimal sequence of pre-processing steps, for example selecting the number of iterations of delta encoding to apply.
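Iterated delta encoding and its inverse can be sketched in a few lines. This is a minimal illustration of the technique described above, not the production implementation:

```python
def delta_encode(values, iterations=1):
    """Iteratively replace an integer array by its successive differences,
    keeping the first element of each pass so the process is reversible."""
    initials, out = [], list(values)
    for _ in range(iterations):
        initials.append(out[0])
        out = [b - a for a, b in zip(out, out[1:])]
    return initials, out

def delta_decode(initials, diffs):
    """Reverse delta_encode by cumulatively summing from each stored
    initial value, in reverse order of the encoding passes."""
    out = list(diffs)
    for first in reversed(initials):
        acc = [first]
        for d in out:
            acc.append(acc[-1] + d)
        out = acc
    return out

# A smooth (pre-processed) signal yields small second-order differences,
# which pack into fewer bytes downstream.
signal = [1000, 1010, 1021, 1033, 1046, 1060]
initials, diffs = delta_encode(signal, iterations=2)
assert delta_decode(initials, diffs) == signal  # lossless round trip
```

Note how two passes reduce a smoothly rising signal to a run of small constants, which is exactly the redundancy the secondary compression stage exploits.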
We utilize a custom designed file format that segments the signals into a series of intervals of user defined length. Each individual block of data may be pre-processed in a variety of ways to decrease its entropy and therefore increase its compressibility (Kriechbaumer et al 2019). The pre-processed arrays are then further compacted using an open source compression algorithm.
This compression approach is similar to that employed by Blalock et al (2018). Each sequence of time-value pairs is first separated into two distinct arrays of equal length: one array contains the values, while the other stores the time associated with each value in the first array. Times are stored explicitly so that gaps in the data can be faithfully represented. This differs from other approaches that implicitly store time by assuming a constant sample rate. The number of bytes required to store the delta encoded array is determined based on the range of values in the resultant array, and the most compact representation is used for each separate block of data. Binary representations of the delta encoded arrays are subsequently compressed using the BZip2 algorithm (Seward 2000). This additional step further reduces the size of each block's payload by applying run length encoding (RLE) and otherwise removing any remaining redundancy in the data representation. Alternative open source compression algorithms (for example GZip, ZStandard, LZ4, or Snappy) may be employed at this stage in place of BZip2 (Ziv et al 1977, LZ4 2019, zlib 2019, ZStandard 2019, Snappy 2019), which may result in slightly larger files but faster encode/decode operations. Compressed data files can be re-encoded using a different block size or using different compression options at a later time to meet the size and speed requirements of a particular situation.
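The block pipeline just described can be sketched end to end. The helper below is a simplified illustration under our own assumptions (width selection, defaults, and payload layout are ours, not the TSC format's):

```python
import bz2
import struct

def compress_block(times_ms, values, delta_iters=2):
    """Illustrative block compression: split time-value pairs into two
    arrays, delta-encode each, pack with the narrowest signed integer
    width that fits, then apply BZip2 to the combined payload."""
    def encode(arr):
        initials, out = [], list(arr)
        for _ in range(delta_iters):
            initials.append(out[0])
            out = [b - a for a, b in zip(out, out[1:])]
        lo, hi = min(out), max(out)
        # choose the narrowest signed width (1, 2, 4 or 8 bytes) that fits
        for width, fmt in ((1, "b"), (2, "h"), (4, "i"), (8, "q")):
            if -(1 << (8 * width - 1)) <= lo and hi < (1 << (8 * width - 1)):
                break
        return initials, width, struct.pack(f"<{len(out)}{fmt}", *out)

    t_init, t_width, t_raw = encode(times_ms)
    v_init, v_width, v_raw = encode(values)
    payload = bz2.compress(t_raw + v_raw)
    # effectiveness measured as bits per time-value pair, as in the text
    bits_per_value = 8 * len(payload) / len(values)
    return payload, bits_per_value

# Regularly sampled, smooth data compresses to a few bits per pair.
times = list(range(0, 10_000, 2))            # 500 Hz over 10 s
vals = [i // 10 for i in range(len(times))]  # slowly rising ramp
_, bpv = compress_block(times, vals)
```

For this synthetic ramp the second-order deltas are almost all zero, so BZip2 reduces the block to well under one byte per time-value pair.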
Each time-value pair stored in AtriumDB consists of a millisecond precision timestamp coupled with an associated integer value. Signals that arrive as floating-point values are converted to an integer by multiplying by a power of 10, with the multiplying factor chosen such that the resulting integer is representative of the underlying precision of the data (Isenburg 2013). Ideally the raw integer values collected by the A/D converters on the various sensors are available directly or can be reverse engineered. These integer values may be subsequently re-scaled to a floating-point representation with relevant units, such as mmHg for pressure waveforms or mV for ECG. The scaling factors required for these conversions are stored as tables in a MariaDB database (MariaDB 2019).
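As a toy illustration of this scaling scheme (the factor of 100 here is our example, not a vendor-defined value):

```python
# Floating-point ABP samples converted to integers for storage; the
# scale factor (a power of 10 matching the data's precision) is kept in
# a metadata table so the conversion is exactly reversible on retrieval.
scale = 100                               # illustrative: 0.01 mmHg precision
raw_mmhg = [120.25, 120.27, 120.24]       # as supplied by the device
stored = [round(v * scale) for v in raw_mmhg]
restored = [v / scale for v in stored]    # re-scaled to mmHg with units
assert restored == raw_mmhg               # lossless at this precision
```

Storing integers rather than floats also makes the delta-encoding stage exactly lossless, which a floating-point representation would not guarantee.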
Use of a 'compression ratio' to measure data compression is problematic because the signals exist in a number of formats before being compressed. It is therefore not clear which of these formats should be used for comparison, so the effectiveness of our compression is instead reported in bits per value. This approach provides an objective way to assess the amount of compression achieved when averaged across a large amount of data from a given signal type.
The database currently comprises 194 different signal types, including 48 waveforms and 146 unique 1 Hz signals. The signals exhibit a wide range of compressibility, as the compressed size is roughly proportional to the entropy of the signal (Balakrishnan and Touba 2007). Compression results achieved for some commonly collected waveforms are shown in table A1, and for other commonly collected physiological parameters in table A2 in the appendix. Labels used in these tables may be vendor-specific, and further work is required to unify nomenclature (Kasparick et al).

TSC file format
We designed a file structure that allows multiple compressed segments of a signal to be stored in a single file. These segments, called blocks, can vary in size but typically contain between 1000 and 100,000 time-value pairs. The block size parameter is set globally for each signal type when the system is initialized; in our prototype system a block size of 10 minutes' duration was used for all signal types. In order to access any value (or any sequence of values) within a block, the entire block must be decompressed. Use of smaller blocks facilitates more granular access but may result in an increase in handling and container overhead (Kriechbaumer et al 2019). Longer queries may span multiple blocks, allowing parallelized decompression. Multiple channels of data are supported in the file format, but not used in our current system. The files have a .tsc extension, where tsc stands for time series compression. The structure of the TSC file format is shown in figure 4.
Each block header contains all information required to decompress that individual block. In this way individual blocks can be passed across a network and decompressed on a remote system. This approach to data transfer reduces the load on the central server by offloading the computational load associated with de-compression to a client application or a distributed compute node.
Information stored in the block header includes the number of elements in the stored arrays, the size of the compressed payload in bytes, the initial values required for reversing the delta encoding, the number of iterations of delta encoding applied, the number of bytes required to store each value in the delta encoded array, and the type of open source algorithm employed in the secondary compression stage. Summary statistics such as the number of values, maximum, minimum, mean and standard deviation of the values, are also stored in the block headers. The main file header lists the number of blocks, the start and end times of each block, and the location (in bytes) of the beginning of each block of data in the file. A single block may contain up to 256 channels, and a single TSC file may contain up to 65,536 blocks.
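A block header of this kind can be represented with a fixed binary layout. The struct below is purely illustrative; field order, widths, and the single initial value shown are our assumptions, not the actual TSC on-disk layout:

```python
import struct

# Hypothetical fixed-layout block header carrying the fields listed
# above; little-endian, no padding. Not the actual TSC format.
BLOCK_HEADER = struct.Struct(
    "<I"   # number of elements in the stored arrays
    "I"    # compressed payload size in bytes
    "q"    # initial value for reversing the delta encoding (first pass)
    "B"    # iterations of delta encoding applied
    "B"    # bytes per value in the delta-encoded array
    "B"    # secondary compression algorithm id (e.g. 0 = BZip2)
    "qq"   # minimum and maximum of the stored values
    "dd"   # mean and standard deviation of the stored values
)

header = BLOCK_HEADER.pack(5000, 812, 512, 2, 1, 0, -40, 960, 132.5, 88.1)
count, payload_size, *_ = BLOCK_HEADER.unpack(header)
```

Because each header is self-describing in this way, a block can be shipped across the network and decompressed entirely on the receiving node, as described above.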
Although this data format was designed for high-frequency physiological data (i.e. waveforms), it can also be used for lower frequency data (≤ 1 Hz). Storage of data in this format results in one compressed file per device-signal-hour. Each of the compressed files is stored on a secure server, with a folder structure that is organized by date. Details regarding each compressed file are recorded as a row in a MariaDB relational database. Metadata about the file, such as file size, number of values, statistical properties of the values in the file, and the start and end times of the data stored in that file, are also recorded in the database.
DeviceName is used as the primary key in the database instead of patientID. This approach ensures that retrospective adjustments can be made to the admission/discharge information without altering the folder structure or file names. Patients are mapped to a particular bed space as a function of time via a separate table in the MariaDB database.

Throughout this paper, the term 'bits per value' refers to the average number of bits required to store each time-value pair across the database (or a given subsection thereof); one bit is 1/8th of a byte.

High fidelity data
By design, AtriumDB aggregates and stores all information collected in the form in which it was provided by the medical devices, including its flaws and inconsistencies. Information is stored in the same data types that the monitors and other medical devices produce, and lossless compression is employed. In this sense 'high fidelity' means that the information is stored in its raw form, with the goal of minimizing loss of information up to the point of storage on disk. Stored data are immutable, so the database represents an indelible record of what was collected. The utility of this approach is outlined in section 4.1.
Over time algorithms may be developed to detect and characterize known data quality issues. These algorithms are implemented in-line at the time of retrieval and may be modified or improved over time. These signal quality indices (SQIs) are discussed in section 4.2.

Data intervals
AtriumDB uses the concept of data intervals to delineate contiguous areas of interest in the time series data. Each interval is a period from t1 to t2 for a given bedspace-signal combination. Intervals are useful abstract data types as they often form the foundation for subsequent data requests.
Although intervals may be specified manually by a user, the most common method of interval generation is via an automated process within the database API, particularly when accessing large portions of the database. For example, a user or computer program may ask for a list of time intervals where 'continuous' 500 Hz ECG Lead II waveform is available for a particular patient. Because the stored data contain many small gaps, this request may return dozens, or indeed thousands, of intervals, each indicating a segment of time where data is guaranteed to be available. In this sense the intervals are a property of the data, as opposed to a user defined period. These data structures are explicitly available to users of the database by design. An interval may be used as a foundation for a subsequent data request, via an automated or manual process. The size of the gaps in the data may range from very small (a few hundred milliseconds) to very large (minutes or hours), so the definition of 'continuous' may be variable and context dependent. When generating lists of intervals, AtriumDB allows the user to specify a 'gap tolerance' parameter that defines which gaps are considered to be significant (see section 2.8).

Data continuity
Gaps in physiological signals may occur for a variety of reasons, including the intrinsic absence of data, networking issues, removal of sensors for cleaning, or as the result of a particular sampling approach. A suite of standard SQIs is applied during the data ingest process; these include identification of flatline signals and clipping waveforms (see section 4.2). Other custom quality indices are being actively developed and applied retrospectively. These SQIs may be used as constraints when selecting data, resulting in the removal of portions of the signal and introducing additional gaps.
Some subsequent use cases may not be affected by small amounts of missing data, for example a gap of 500 ms duration, but might be adversely affected by longer gaps. With this in mind, portions of data with small gaps may still be considered to be 'continuous' in some contexts.
AtriumDB parametrizes continuity by defining a concept called 'gap tolerance'. The AtriumDB API can return a list of continuous intervals where the required continuity (as specified using the gap tolerance parameter) is guaranteed. For example, if we request all intervals of 'continuous' data with a gap tolerance of two seconds, then the database will return a list of time periods from t1 to t2, where each interval is guaranteed not to contain any data-free period longer than two seconds. A smaller gap tolerance results in a larger number of intervals, and vice versa.
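The behaviour can be sketched as follows; this is a minimal illustration of the gap tolerance concept, not the AtriumDB implementation:

```python
def continuous_intervals(timestamps_ms, gap_tolerance_ms):
    """Collapse sorted sample timestamps into (t1, t2) intervals,
    splitting wherever consecutive samples are separated by more than
    the gap tolerance."""
    if not timestamps_ms:
        return []
    intervals = []
    start = prev = timestamps_ms[0]
    for t in timestamps_ms[1:]:
        if t - prev > gap_tolerance_ms:
            intervals.append((start, prev))
            start = t
        prev = t
    intervals.append((start, prev))
    return intervals

samples = [0, 500, 1000, 1500, 60_000, 60_500]  # a 58.5 s gap mid-stream
assert continuous_intervals(samples, 2_000) == [(0, 1500), (60_000, 60_500)]
assert continuous_intervals(samples, 120_000) == [(0, 60_500)]
```

The two assertions show the trade-off stated above: tightening the gap tolerance splits the same data into more, shorter intervals.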

Composite signals
Physiological models or analytical techniques may require multiple signals as inputs. These models are unable to operate unless all necessary signals are available in the input data. For a given application we can request a list of time intervals where all required input signals are available, and where they meet the quality criteria required for that analytical technique. We refer to these groups of signals as a 'composite signal'. A composite signal comprises two or more signals from a given bedspace, each of which may have a different specified gap tolerance. Queries to the API can define the list of required signals and their required gap tolerance, and optionally also constraints as defined by one or more SQIs (see section 4.2).
A hypothetical example of a composite signal is shown in figure 5. It shows the intervals of continuous time for blood pressure, heart rate, and an ECG waveform, with a gap tolerance of 10 s, 60 s, and 60 s respectively. The composite intervals in this example are defined by the intersection of the input components' continuous time intervals. These intervals can optionally be further subdivided by providing a parameter that describes the maximum number of time-value pairs that an interval may contain, or by specifying the maximum interval duration. Intervals of this kind, containing multiple signals, may be used as a substrate for training machine learning models.
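The intersection step that defines composite intervals can be sketched with a standard sorted-interval sweep (illustrative only; in AtriumDB this is performed by the API):

```python
def intersect(a, b):
    """Intersect two sorted, non-overlapping lists of (t1, t2) intervals.
    Folding this pairwise across all input signals yields the composite
    intervals where every required signal is simultaneously available."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        t1 = max(a[i][0], b[j][0])
        t2 = min(a[i][1], b[j][1])
        if t1 < t2:
            out.append((t1, t2))
        if a[i][1] < b[j][1]:   # advance whichever interval ends first
            i += 1
        else:
            j += 1
    return out

bp = [(0, 100), (200, 300)]   # continuous BP intervals (gap tolerance 10 s)
ecg = [(50, 250)]             # continuous ECG intervals (gap tolerance 60 s)
assert intersect(bp, ecg) == [(50, 100), (200, 250)]
```

Applying `intersect` across all required signals reproduces the intersection shown in figure 5; further subdivision by maximum duration or pair count could then be applied to each result.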

Data retrieval
Once we have identified intervals of time where data of sufficient quality and continuity are available, a secondary request may be issued to the API requesting the raw data from those times. This two-step process eliminates the risk of requesting data from a time where no information is available, or where the available data are of insufficient quality. In the current implementation data requests can be made using a Representational State Transfer (REST) based API, and data are returned via HTTP in JSON format.
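A client-side sketch of the two-step flow is shown below. The host, paths, and parameter names are invented for illustration; the actual AtriumDB REST API may differ:

```python
from urllib.parse import urlencode

base = "https://atriumdb.example.org/api/v1"  # hypothetical endpoint

# Step 1: request intervals where suitable data is guaranteed to exist.
intervals_url = f"{base}/intervals?" + urlencode({
    "signal": "ECG_II",
    "bed": "PICU-07",
    "gap_tolerance_ms": 2000,
})

# Step 2: request raw data only from within a returned interval.
data_url = f"{base}/data?" + urlencode({
    "signal": "ECG_II",
    "bed": "PICU-07",
    "start_ms": 1550000000000,
    "end_ms": 1550000060000,
})
```

Because step 2 only ever queries times returned by step 1, a client never issues a data request that can come back empty or fail a quality constraint.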
A secondary method of accessing the database utilizes a Software Development Kit (SDK) written in the Julia programming language (Bezanson et al 2017). The Julia SDK interfaces easily with other programming languages such as C, C++, R, and Python (Lobianco 2019). This interface is designed for high-performance computing applications where the speed of data delivery may be the rate limiting factor. Data blocks are decompressed in parallel across many CPU cores, and data arrays are decompressed directly into RAM, bypassing the computationally expensive step of generating JSON and transmitting it across a network. This approach to data access is between 10 and 100 times faster than the HTTP based API.

Data labels
Many machine learning applications rely on data 'labels' that define a particular event or outcome such as true states of clinical outcomes or the true disease phenotypes (Xiao et al 2018). For example, mortality is often used as a label, dividing a cohort into two groups, patients who survived, and those who did not. Supervised machine learning techniques rely on labelled data for model training, creating a need for collection of, or automatic generation of labelled events that can be used to provide context to physiological signals (Komorowski 2019). Labeling of clinical data is often expensive, labor intensive, and consensus is difficult to obtain (Johnson et al 2016).
Labels for training a supervised machine learning model can be generated from a wide variety of sources. They can be collected in real time at the bedside, for example as part of an ongoing project aimed at recording cardiac arrests that occur within the unit (Tonekaboni et al 2018). They can come from the EHR and include such labels as diagnosis, age, length of stay, readmission, and surgical procedure. In some cases, a small retrospective training dataset is manually labeled through visualization of the physiological data. This approach was used, for example, when building a training dataset for automated labelling of line access events in the arterial blood pressure waveforms (Nagaraj et al 2019).

Ethical framework
We have taken the approach that data housed within AtriumDB is the property of the patient and is fully discoverable. Data is stored losslessly in its raw form and is not purged, meaning that no bias is introduced during the data collection process. These data are available for clinical use, so consent is not obtained. Physiological data are merged with an admit, discharge and transfer (ADT) data feed from the electronic health record (EHR) to correctly associate data with a patient. AtriumDB is housed within the clinical data center at the Hospital for Sick Children behind the hospital firewall and is protected in a similar manner to other clinical systems. The database is only accessible by authorized individuals using appropriate security credentials.
Secondary research utilization of AtriumDB is regulated by a governance structure similar to that of a biobank. This governance structure covers the use of anonymized or anonymous patient information for research purposes. Following de-identification, the data is time shifted in a manner that is patient or study specific in order to prevent inadvertent re-identification of patients based upon temporal associations. Keys permitting re-identification of patient data by re-association of a subject identifier with a medical record number, and re-alignment in time, are retained, allowing re-identification of the data should the need arise. This governance mechanism facilitates clinical research, as well as data-driven exploration and hypothesis generation.
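A patient- or study-specific time shift of the kind described above can be sketched as follows. The derivation scheme (hashing a secret study key with a subject identifier) is an illustrative assumption, not the mechanism the paper specifies; timestamps are epoch seconds for simplicity:

```python
import hashlib

def study_offset_seconds(subject_id: str, study_key: str,
                         max_days: int = 365) -> int:
    """Derive a deterministic time shift (whole days, as seconds) from a
    secret study key: the same subject is shifted consistently within a
    study, but differently across studies, hindering re-identification
    by temporal association."""
    digest = hashlib.sha256(f"{study_key}:{subject_id}".encode()).digest()
    days = int.from_bytes(digest[:4], "big") % max_days
    return days * 86400

def shift(epoch_seconds, subject_id, study_key):
    return epoch_seconds + study_offset_seconds(subject_id, study_key)

def unshift(epoch_seconds, subject_id, study_key):
    # Re-alignment is possible only for holders of the retained study key.
    return epoch_seconds - study_offset_seconds(subject_id, study_key)
```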

Results
Physiological waveform collection via ViNES/AtriumDB began in the critical care unit in February 2016, ramping up to having all 42 beds connected by April 2017. Data collection has continued since that time, and as of December 2019 the database contained over 720 000 patient-hours of data, collected from over 5300 patients. These data are represented as 2.8 × 10¹² time-value pairs stored in approximately 14.3 million compressed files. The dataset comprises almost 200 unique signal types, including waveform data for the complete duration of each patient's stay.

Data compressibility
As of December 2019, the database is 820 GB in size and growing by around 285 GB per year. At this rate of collection, we can continue to store the complete dataset at full resolution. With a small disk footprint it becomes affordable to store the data on high-performance disk drives. We utilize an array of solid-state drives (SSDs) to store the compressed data files which further improves the input/output (I/O) performance of the database.
The average compression rate for all signals combined was calculated by dividing the sum of the sizes of all TSC files by the total number of time-value pairs stored in the database. Using this approach resulted in an average of 2.57 bits per time-value pair in the installation at the HSC. Similarly, we calculated a compression rate for each signal type by averaging across the entire database for that particular signal. ECG Lead II sampled at 500 Hz, for example, is compressed to 2.34 bits per value, or around 526 KB per hour. Compressibility of selected signals is shown in tables A1 and A2 in the appendix.
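The quoted per-hour figure follows directly from the sampling rate and the average compressed size per value; a quick arithmetic check:

```python
sample_rate_hz = 500       # ECG Lead II sampling frequency
bits_per_value = 2.34      # average compressed size per time-value pair
values_per_hour = sample_rate_hz * 3600
bytes_per_hour = values_per_hour * bits_per_value / 8
print(f"{bytes_per_hour / 1000:.1f} KB per hour")  # → 526.5 KB per hour
```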

Retrieval performance
The HTTP based API was load tested and shown to be able to deliver approximately 800,000 time-value pairs per second to a single endpoint application. This is fast enough to provide data to a web based visualization, or for exporting small amounts of data for explorative purposes. This is not indicative of the maximum throughput of the system, however, as the server is able to process many requests simultaneously. Database performance was stress tested by setting up multiple applications to simultaneously request data via the HTTP based API. During this test ten separate computers were programmed to simultaneously and repeatedly request a random hour of 500 Hz ECG data from the system. The maximum throughput of the system under this parallel load was approximately 5 million time-value pairs per second via the HTTP based API. This test better simulates server conditions in a multi-user or distributed computing environment.
Data access rates using the Julia library are influenced by a wide range of factors including the number of CPU cores available and the performance characteristics of the data storage medium. Under a typical query load data access rates of 50 million time-value pairs per second can be achieved using a desktop computer with eight cores reading from a local cache of TSC files housed on a SSD.

Discussion and future directions
Most ICUs offer 24/7 monitoring of vital signs but lack the capability for data collection, integration, and analysis of physiological data (Mann et al 2020). Recent advances in computing have enabled the long-term storage of high fidelity physiological signals (Dunning and Friedman 2014). The availability of large, high-frequency physiological datasets has opened up new possibilities for deep learning and other data driven approaches (Hong et al 2019, Naylor 2018, Wang et al 2019).
The application of artificial intelligence and machine learning in healthcare has grown rapidly in recent years. These models are data driven, and model training times are often dictated by data access speed. A major limitation in utilizing high-frequency physiologic waveforms for research and clinical purposes is the lack of a scalable architecture that can store these data permanently while also providing data retrieval rates that are fast enough to allow practical exploration and/or utilization of the entire dataset.
AtriumDB addresses these limitations by providing cost effective data storage coupled with high-performance data access. The availability of a large, readily accessible, high throughput data storage engine for physiological signals leads to faster model training times, and enables larger cohorts to be used during the model training phase.
It takes time to accumulate a critical mass of data to apply to a research question, and now that we have a corpus of information established, numerous projects are underway assessing the utility of the data for research purposes. For example, see Eytan et al (2016), Tonekaboni et al (2018), Nagaraj et al (2019) and Goodfellow et al (2018).

Data utility
Because the data streams are collected and stored exactly as they were produced in the critical care unit, they provide an ideal substrate for development and testing of 'real time' data algorithms, as retrospective data can be replayed to simulate a streaming data environment. In this way any model developed with the intention of deployment in a streaming data environment can first be tested retrospectively to see how it deals with the wide variety of signal quality and data availability issues that are present in a real ICU.
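A replay harness of this kind can be sketched in a few lines. This is a minimal illustration, not the actual AtriumDB tooling; the consumer would be the same code that subscribes to the live feed:

```python
import time

def replay(records, speedup=1.0):
    """Yield (timestamp, value) pairs, sleeping between samples so the
    stream arrives with the same relative timing as the original
    recording, optionally accelerated by `speedup`."""
    prev_t = None
    for t, v in records:
        if prev_t is not None and speedup > 0:
            time.sleep((t - prev_t) / speedup)
        prev_t = t
        yield t, v
```

Because the replayed stream preserves the original gaps and artefacts, a model consuming it is exercised against exactly the conditions it would face prospectively.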
Our experience has been that having access to a large dataset with realistic artefacts lends itself to the process of anomaly discovery and understanding. Anomaly detection and removal algorithms can be iteratively developed and tested on a broad range of data conditions (see section 4.2). The approaches developed during this process can be deployed in-line as data is exported from the database or applied prospectively as a pre-processing step to streaming data being used as input for real time algorithms.
The initial prototype of AtriumDB was deployed without a user interface. The goal was to expose a highly functional API that would facilitate rapid development of multiple visualization tools, data exploration tools, and user interfaces. Several proof of concept applications have been developed, and a range of user interfaces are being developed that will provide clinicians and researchers the opportunity to request data directly or visualize signals via a web-based portal.

Signal quality indices
Physiological signals exhibit a wide variety of artefacts and noise, meaning that data may be unusable for some analytical and modelling purposes (Nizami et al 2013, Takla et al 2006). There are many approaches to assessment of physiological signal quality, for example see Clifford et al (2012), Elgendi (2016). We designed the AtriumDB system to be able to store SQIs at scale, and to have them readily available when selecting data. For example, our group has been developing approaches for real time detection of artefacts from streaming data. One example of this is automated identification of line access artefacts in arterial blood pressure waveforms (Nagaraj et al 2019). By adopting this approach, we can optionally utilize one or more SQIs as constraints or indexes while requesting data or scoping research projects.

Automatic generation of retrospective cohorts
If the data input requirements for a given algorithm are known, they can be specified as data availability, quality, and continuity constraints. This information enables AtriumDB to provide a report on how many patients match those constraints and provide a distribution of how many hours of data are available for each of those patients. The generalizability of a model can be explored through this process, allowing a researcher to examine the trade-off between a highly specific model that may have onerous input requirements, and a model that is designed to run on whatever data streams are most commonly available.
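A cohort-scoping report of this kind can be sketched as follows. The data shapes and field names are invented for illustration, and the usable-hours figure is approximated by the scarcest required signal, which is a simplifying assumption rather than AtriumDB's actual method:

```python
def cohort_report(availability, required_signals, min_hours):
    """availability: {patient_id: {signal_name: total_hours_available}}.
    Return, for each patient meeting the constraints, an estimate of
    the hours usable by a model needing all required signals."""
    report = {}
    for patient, signals in availability.items():
        if all(signals.get(s, 0) >= min_hours for s in required_signals):
            # approximate joint availability by the scarcest signal
            report[patient] = min(signals[s] for s in required_signals)
    return report
```

From such a report a researcher can read off both the cohort size and the distribution of available hours per patient, and so weigh a demanding input specification against a more permissive one.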

Creating derivative databases
Subsections of the database relating to a specific study or project can be exported and stored in a secondary database using the TSC file format. Parameters such as block size and compression approach, can be optimized for each specific use case. Physiological signals in the derivative database may be optionally cleaned, pre-processed, and de-identified, producing an optimal data distribution format for research groups who are focusing on a small portion of the database (Ferrao et al 2016).

Limitations
AtriumDB is optimized for high-performance retrospective access. Data becomes available to users approximately one hour after collection, so a separate system must be used for 'real time' applications. Applications that require access to real-time, streaming data can pull messages directly from a separate and dedicated RabbitMQ feed. Data for currently admitted patients are stored in an InfluxDB database (InfluxDB 2019) which allows very recent data to be queried if required.
Deploying this system at a different hospital would require development of numerous bespoke components, for example to integrate laboratory and admission/discharge information from a different source than used at the HSC. System initialisation also requires installation of new wiring systems and servers to facilitate data collection and integration.

Future directions
The speed at which data can be exported can be a rate limiting step in a high-performance computing environment; some aspects of the system may therefore be rewritten in a compiled language such as C/C++ for faster performance.
Several projects are underway that will allow researchers and clinicians to pro-actively label events at the bedside, providing further context about patient care and providing labels that can be used for supervised machine learning. Now that we have established a proven and stable architecture, a longer term goal is to allow clinicians at the bedside to access the data real time, and to integrate the data seamlessly into predictive models that can be incorporated into online clinical protocols that enhance decision making for critically ill patients in the intensive care unit.

Conclusions
AtriumDB is a time series storage engine that was designed from first principles and specifically optimized for physiological waveforms. The highly compact data representation facilitates continuous collection and storage of physiological waveforms and other information collected in an ICU environment. The web based API provides versatile data access, while the SDK facilitates rapid programmatic retrieval of data for use in high-performance computing applications. The system has been used to compile a large dataset that is representative of the streaming data environment in an ICU, and as such can be used to meaningfully test and tune 'real time' algorithms on retrospective data before deployment.