Data integration and analysis for circadian medicine

Data integration, data sharing, and standardized analyses are important enablers for data‐driven medical research. Circadian medicine is an emerging field with a particularly high need for coordinated and systematic collaboration between researchers from different disciplines. Datasets in circadian medicine are multimodal, ranging from molecular circadian profiles and clinical parameters to physiological measurements and data obtained from (wearable) sensors or reported by patients. Uniquely, data spanning both the time dimension and the spatial dimension (across tissues) are needed to obtain a holistic view of the circadian system. The study of human rhythms in the context of circadian medicine has to confront the heterogeneity of clock properties within and across subjects and our inability to repeatedly obtain relevant biosamples from one subject. This requires informatics solutions for integrating and visualizing relevant data types at various temporal resolutions ranging from milliseconds and seconds to minutes and several hours. Associated challenges range from a lack of standards that can be used to represent all required data in a common interoperable form, to challenges related to data storage, to the need to perform transformations for integrated visualizations, and to privacy issues. The downstream analysis of circadian rhythms requires specialized approaches for the identification, characterization, and discrimination of rhythms. We conclude that circadian medicine research provides an ideal environment for developing innovative methods to address challenges related to the collection, integration, visualization, and analysis of multimodal multidimensional biomedical data.

K E Y W O R D S
chronomedicine, data science, data integration, data visualization, time-series data

1 | INTRODUCTION AND OBJECTIVE
The circadian clock system is a network of tissue timekeepers ("clocks") that regulate most aspects of human physiology 1 and thus generate 24-h rhythms in many biological processes. The emerging field of circadian medicine leverages the interaction between the circadian clock system and human physiology in health and disease for prevention, diagnosis, and treatment. 2 To advance circadian medicine, there is a particularly high need for coordinated and systematic collaboration between researchers from different disciplines to uncover relationships between circadian biology and health, as well as to translate them into clinical practice. Data relevant to circadian medicine are multimodal, ranging from molecular circadian profiles to clinical parameters (also including what is often called physiological measurements) and data obtained from (wearable) sensors or data reported by patients. 3 As rhythms are central to circadian medicine, large parts of the data are collected in the form of time series.
To study relationships between circadian properties and physiology, the circadian clock needs to be profiled across both time and space, 4 all the while confronting the heterogeneity of humans and their clocks. This means that relevant data types must be integrated, visualized, and analyzed across different temporal resolutions ranging from milliseconds and seconds to minutes and up to several hours. 5 Important challenges encountered when integrating such types of data include the need for specialized storage solutions, a lack of common standards, transformations needed for intuitive visualizations, and data protection. On the data analysis side, challenges include the identification of periodic patterns, their characterization (e.g., with regard to phase, amplitude, period, and waveform), as well as their differential analysis across conditions. When studying human circadian rhythms, there are additional challenges due to the variation of clock parameters in different dimensions and the inability to measure relevant clock tissue repeatedly, which means that standard approaches need to be modified and population-based approaches need to be adopted. The aim of this article is to provide an overview of the state of the art in data integration and analysis for circadian medicine and to highlight challenges and possible directions for future work.

CIRCADIAN MEDICINE
Research into and applications of circadian medicine require the measurement of potentially heterogeneous clock parameters and their influences on a range of physiological processes. As illustrated in Figure 1, this requires the integrated analysis of a broad spectrum of different data types.
Due to the cell-autonomous gene-regulatory basis for circadian rhythms, molecular data (omics data, reporter assays) are essential to quantify circadian rhythms. Indeed, not only have circadian rhythm disruptions been associated with human pathologies 6 but also variations in clock-related genes and their protein products have been associated with increased or decreased risk of disorders. 7 Timing of drug administration 8 and timing of clinical procedures 9 can determine the clinical outcome, and mining of routine clinical data 10 can help generate new hypotheses for reverse translation. Wearable sensors as well as questionnaires administered through mobile devices are essential to capture patient-reported data on circadian rhythms, environmental factors, sleep quality, and functional status outside of the healthcare setting. 3

| Molecular data
Modern molecular measurement technologies are key to identifying and explaining variation between individuals. 11 In circadian medicine, molecular data are usually collected over a 1- or 2-day period with sampling intervals of between 2 and 12 h. Common types of molecular data include genomics, gene expression/transcriptomics, proteomics, metabolomics, and lipidomics data as well as time series from reporter assays. While transcriptomic and proteomic data provide a relatively focused readout of the clock output, metabolomic and lipidomic data provide a much more dynamic readout, which aggregates the molecular and metabolic state of the cell/tissue. Circadian rhythms are present in almost all cell types in humans 12 and regulate most physiological processes. 1 Thus, measurements from different tissues are relevant for studying different pathologies, adding yet another dimension to the data. Although relatively new in circadian research, single-cell sequencing technologies allow profiling across cell types within a tissue, 13 only some of which might be relevant to a pathology, adding yet more dimensions to the molecular data. Omics data are also a rich source to mine for new clock components 14 and biomarkers for different clock parameters, such as the circadian phase. [15][16][17] Tissues cultured in vitro have also been shown to retain the circadian properties of the human donor. 18 Thus, time series recordings from cell/tissue cultures transfected with a clock gene/protein reporter are another source of circadian molecular data. In summary, various types of molecular data are important for studying interactions between circadian rhythms and environmental, behavioral, and lifestyle patterns as well as disease development and progression. 5

| Clinical data
Clinical data refer to all data from healthcare processes, clinical trials, and patient registries, including data on diagnosis, treatment, drugs, laboratory tests, and physiological monitoring data collected at varying frequencies. 19 While vital parameters, such as heart rate, core body temperature 20 (CBT), and blood pressure are usually measured in intervals of several minutes to hours, other measurements require a higher frequency and precision, such as electrocardiograms (ECG) or electroencephalograms (EEG), which have sampling intervals of several milliseconds (see also next section). Circadian rhythmicity is commonly determined in the clinical setting from CBT measurements 21,22 with the minimum of CBT serving as a phase marker. However, this requires a sufficiently high sampling rate of CBT. 23 Measurement of the hormones relevant to circadian rhythms, cortisol and melatonin, from saliva, urine, or blood is common for chronobiological profiling of subjects, which is essential for human circadian analysis (see Section 5.4); the current gold-standard phase marker, dim light melatonin onset (DLMO), 24 is based on melatonin. Similarly, altered peaks or troughs of circadian blood pressure rhythms have long been used to flag worsened cardiovascular outcomes. 25 Thus, clinical data collected at different frequencies and over different lengths of time are essential for research in circadian medicine and the translation of its results.

| Sensor data
Continuous data relevant to circadian medicine can come from many sources, ranging from patient monitoring systems in critical care to wearable sensors capturing data on circadian rhythms and environmental factors outside of healthcare settings and research facilities. 5 Analogously, there is a broad scale of measurements that can be obtained, including information on light and sound, vital parameters, CBT, or EEG with sampling rates ranging from milliseconds to minutes or hours. Wearable sensors can measure many of the same physiological parameters collected under clinical conditions with a considerably lower subject burden allowing for longer monitoring times. 26 For studying sleep and circadian rhythms, there is a wide range of metrics that can be collected from such devices, such as measurements of light exposure, geographical location, body temperature, heart rate, skin conductivity, and blood oxygen levels, which are all associated with the circadian system. 27 Actigraphy uses accelerometry to record activity patterns and body movements, such as exercise, and can be used to predict circadian phase as accurately as the clinical gold-standard DLMO assay. 28 For detecting sleep-wake patterns actigraphy is a sensitive and reliable instrument that has high agreement rates with polysomnography. 29 Wearable heart rate data might also provide a stable but independent phase marker compared to DLMO. 30

| Patient-reported data
Patients can report their perceived health status in combination with data on circadian rhythms during their daily life in questionnaires. These patient-reported outcomes (PROs) can be generic, capturing functional status and quality of life, or be disease-specific and tailored toward specific conditions and symptoms. The PROs can be scored in so-called PRO measures (PROMs) in validated questionnaires. 31 The common validated circadian and sleep assessment questionnaires are the Horne-Östberg Morningness-Eveningness, Seasonal Pattern Assessment, Pittsburgh Sleep Quality Index, and the Munich Chronotype Questionnaires, 3 which are also commonly used for chronobiological profiling. Increasingly, such feedback from subjects or patients is obtained via custom-designed or generic wellness mobile apps. 5 While patient-generated data provide important information on outcomes of circadian medicine, they are subjective by nature and care must be taken in using and interpreting them. 32

CHALLENGES
The comprehensive integration of relevant data types is an essential foundation for performing the multidimensional and multimodal data analyses required for the study of human circadian rhythms and the implementation of related medical applications. Figure 2 illustrates the major steps required for this purpose.
Data integration challenges range from efforts to access relevant data sources through existing interfaces to data standardization and harmonization, as well as the provision of appropriate visualization and analysis capabilities. Along all those dimensions, there are cross-cutting requirements that need to be considered, 33 such as data quality management. 19 Furthermore, circadian medicine is centered around data from patients and study subjects, and hence, requires adequate information security and privacy protection measures to be put in place. 34

F I G U R E 1 Illustration of primary data types in circadian medicine.

| Molecular data integration
Compared to other types of data, individual types of omics data are relatively well standardized with formats such as FASTQ being able to represent a wide range of molecular measurements 35 and projects such as Ensembl providing systems and ontologies of stable identifiers for genes, transcripts, proteins, and exons. 36 However, there is a lack of standards for multi-omics data. 37 In regard to tooling, a wide variety of solutions for multi-omics integration, analysis, or visualization have been developed and made available to the community. 38 However, not many tools specific to circadian data integration and visualization exist. A more general example is the cBioPortal, an open-source solution that allows users to explore and analyze large-scale cancer genomics data sets. It has recently been adapted to visualize data regarding the circadian regulation of cancer hallmarks. 39 Specific solutions for circadian medicine exist, but they often focus on a limited set of studies from certain labs and cannot easily be extended. Important examples are CircaDB, 40 which is a database of circadian gene expression profiles, and CircadiOmics, 41 a large data repository and web-based analytics tool for high-throughput omic circadian time series data. When aiming to support a broad range of circadian medicine projects, the integration with phenotypic data, the particularly high security and data protection requirements for many types of human omics data, 37 and the temporal alignment of multimodal data pose particular challenges. Additionally, the storage and processing of omics data are also very resource intensive.

| Clinical data integration
Clinical data often suffers from a high degree of structural heterogeneity and a lack of standard terminologies. 42 The reasons for this are manifold, ranging from the autonomy with which data are collected in different healthcare processes and research projects, to a lack of support by vendors and lack of awareness among researchers. While approaches such as common data models exist to cope with this challenge, data are usually transformed into such representations retrospectively, potentially leading to information loss due to fundamental heterogeneity that cannot be easily resolved. 19 A common approach for integrating data for research purposes is to establish data warehouses, which are specific types of databases designed for analytical processing of heterogeneous data. To reduce the efforts required to transform data into common representations, pay-as-you-go approaches can be utilized. 43 These approaches are based on the principle of not aiming for a fully integrated dataset from the beginning, but instead incrementally harmonizing it into a common representation as it is needed, for example, for a circadian medicine research project together with domain experts. This implies that efforts are invested when needed, leading to a more efficient integration process. Modern interoperability standards, like Health Level 7 (HL7) Fast Healthcare Interoperability Resources (FHIR), 44 have the potential to improve the homogeneity of clinical data, as they can foster their structured and standardized collection at the source and simplify their exchange with other clinical information systems and research contexts. 45 Another issue, particularly important to circadian medicine research, is that timestamps of clinical samples or measurements might not be semantically consistent (e.g., time of sample collection, analysis, or result) making a joint analysis across centers and healthcare institutions challenging. 
Additionally, systemic biases, like the timing of clinical samples (usually collected in the morning, and at other times only due to urgent clinical concerns), or dosing time of medication may exist due to clinical processes and are confounding factors that need to be considered when integrating and interpreting population patterns. 46

| Sensor data integration
A particular challenge in regard to the integration of sensor data is the need to calibrate scales of measurements from different devices, analogous to what needs to be done to make laboratory values comparable across institutions. The challenge is aggravated by the rapid development of sensor technology, algorithms, and methods, which need to be considered when integrating data from different devices, studies, and populations. 3 Also, the standardization of the collected data can be an issue. 47 As with the other data types, multimodal integration is necessary to put sensor data into context with other critical sources of information to facilitate their interpretation. In the context of remote patient monitoring and wearable sensors, raw sensor data (e.g., accelerometry) might not be easily accessible from many consumer-oriented devices due to proprietary components. 3,27 In addition, it is important to apply data quality control and normalization techniques, as well as to validate consumer-oriented device data against their clinical equivalents before incorporation into clinical research. 5 Due to the high frequency of the collection process, sensor data can also lead to unusual storage requirements. These challenges, while typical for data from wearable technologies and often relevant in study contexts, also apply to sensor data collection and integration in clinical routine settings.

| Patient-reported data integration
Capturing and integrating patient-reported data, which provide insights into quality-of-life and environmental factors, come with a range of associated challenges. On the data level, there is heterogeneity in how PROs are measured through various types of questionnaire instruments and are evaluated by the research community as well as heterogeneity in the data representations of questionnaire software and their licensing models. 3 Recent standardization efforts, such as the inclusion of questionnaire structures into the HL7 FHIR model 45 and the representation of items from the Patient Reported Outcomes Measurement Information System (PROMIS) 48 as Logical Observation Identifiers Names and Codes (LOINC) 49 provide an important step forward. Still, the joint use and common analysis of patient-reported data usually require researchers to align regarding the instruments used upfront. 50 Collecting patient-generated data through apps and web-based surveys also comes with specific infrastructural requirements and information security challenges. 5 The distinguishing characteristic of these types of data is that they are collected through an information channel into the domestic environment of the patients and study subjects. Such connections can be complex to implement for healthcare institutions, which often use highly protected and isolated IT environments due to the sensitivity of the data managed. This means that different security perimeters must be bridged during integration, which requires particular attention. 51 Furthermore, the data might be influenced by various types of self-reporting bias 52 or privacy concerns, which might lead to the patients providing incomplete or inaccurate information.

TIME SERIES DATA INTEGRATION, VISUALIZATION, AND PROTECTION
As outlined in the previous sections, studying circadian mechanisms requires integrating different types of data captured at different temporal resolutions. Additionally, since the study of circadian rhythms requires measurements to be collected over time, as well as over "space", that is, samples from different tissues, which are relevant to different aspects of physiology, large amounts of data are collected for the same modality. This leads to special requirements for biomedical integration architectures, which are often centered around relational data representations that are well suited for capturing longitudinal clinical data or around specific types of data stores, for example, Clinical Data Repositories for HL7 FHIR or systems for images or omics data.
Storing, querying, and analyzing multimodal time series data with different temporal resolutions is better supported by specialized data representations or specific database management systems, such as time series databases. Bringing both worlds together can be challenging, 53 with solutions ranging from modules for relational database management systems, such as TimescaleDB for PostgreSQL, 54 to specialized database management systems, such as InfluxDB. 55 Also, integrating multimodal data is complicated by the fact that no common standards are able to capture all types of data in an integrated manner. 56 Since physiology is time-dependent, data obtained from different sources, collections, and modalities need to be temporally aligned and registered to allow for a meaningful analysis, which is especially true for molecular and sensor data collected in clinical settings, as well as from wearable technologies. Further, as they likely include data points sampled at different frequencies, specialized transformation steps are needed to convert variable-length time series into common, fixed-frequency representations without loss of information. 57 On the visualization side, a wide range of scalable open-source solutions have become available in recent years, many of which were originally developed for displaying monitoring and telemetry data about computer systems. However, effective visualizations can be challenging, because the display of time already accounts for one dimension of the 2-dimensional image, which makes even the visualization of a single RNA-seq experiment with multiple features difficult.
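As a minimal sketch of the conversion to a common, fixed-frequency representation described above, the following Python example averages irregularly sampled measurements into fixed-width time bins and aligns two modalities on the shared grid. Bin averaging is only one simple choice and, unlike the specialized transformations cited above, it discards within-bin detail. The function names, the bin width, and the data layout of (timestamp in seconds, value) pairs are assumptions made for illustration.

```python
def resample_mean(samples, bin_seconds):
    """Average (timestamp_seconds, value) pairs into fixed-width bins.

    Returns a dict mapping the start of each non-empty bin to the mean value.
    """
    sums = {}
    for ts, value in samples:
        start = (ts // bin_seconds) * bin_seconds  # left edge of the bin
        total, count = sums.get(start, (0.0, 0))
        sums[start] = (total + value, count + 1)
    return {start: total / count for start, (total, count) in sums.items()}


def align_series(series_by_name, bin_seconds):
    """Resample several named series and join them on the common time grid.

    Bins in which a modality has no measurement are filled with None
    rather than interpolated.
    """
    resampled = {
        name: resample_mean(samples, bin_seconds)
        for name, samples in series_by_name.items()
    }
    grid = sorted(set().union(*[r.keys() for r in resampled.values()]))
    return [
        {"bin_start": t, **{name: r.get(t) for name, r in resampled.items()}}
        for t in grid
    ]


# Hypothetical toy data: heart rate sampled densely, temperature sparsely.
rows = align_series(
    {
        "heart_rate": [(0, 60.0), (60, 62.0), (3600, 70.0)],
        "temperature": [(1800, 36.5), (5400, 36.7)],
    },
    bin_seconds=3600,
)
for row in rows:
    print(row)
```

Leaving empty bins as None keeps missing data explicit, so that any later interpolation is a deliberate analysis decision rather than a side effect of integration.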
An important example of an open-source solution is Grafana, 58 which has recently also been adopted for visualizing a wide range of biomedical information. Examples of applications of Grafana in the healthcare and biomedical research context include the work by Drake et al., 59 which focuses on health-related data from smartphone sensors, the work by Cruz et al. 60 on real-time processing and visualization of intensive care data, and the work by Çalhan et al., 61 which focuses on patient monitoring. An example visualization and comparison of circadian rhythms with Grafana is provided in Figure 3, showing log2 relative gene expression levels for CRY1, CRY2, DPD, FKBP5, NR1D2, PER1, and PER2 across four healthy male subjects. 17 Using this and related frameworks for circadian medicine data may need the inclusion of dimensionality reduction techniques considering time dependencies. 62 For example, composite phase deviations (CPD) density plots have been developed to support the visualization of circadian misalignment. 63

F I G U R E 3 Visualization of relative gene expression levels series across multiple subjects using Grafana. The data shows the gene expression in monocytes of four young healthy males under constant routine conditions. 17

Data protection is an important orthogonal topic when human biomedical data are collected, shared, and integrated for research purposes. Due to the need to integrate different types of data of the same individual as well as data about cohorts of selected individuals, integration architectures for circadian medicine require services for securely managing the identity of patients and study participants 34 as well as consistent pseudonymization 64 and data-linkage functionalities. While these requirements can be fulfilled using central services typically available at research institutions, there are some specific challenges arising from the nature of the data required.
For example, secure communication channels are needed to collect patient-reported data and data from wearable sensors in remote settings (see above). 5 Also, as medical data are highly sensitive, care must be taken to preserve the individuals' privacy. 65 For example, sharing time series data with other researchers in a privacy-preserving manner can be challenging if its dimensionality is high. 66 However, even coarse data do not guarantee privacy. 67 We therefore expect that research on circadian medicine will particularly benefit from innovative Deep Learning-based data synthetization mechanisms for time series data, that is, methods to produce artificial data from original data by training a model to reproduce its essential properties, 68 which is currently a very active field of research.
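To illustrate what consistent pseudonymization and data linkage can mean in practice, the following Python sketch derives a stable pseudonym from a participant identifier with a keyed hash (HMAC-SHA256) and uses it to group records from different sources. This is one common technique and is not taken from the cited work. The function names, the record layout, the truncated pseudonym length, and the demonstration key are assumptions. A production setup would typically rely on a dedicated identity management or trusted third-party service with proper key management.

```python
import hashlib
import hmac


def pseudonymize(identifier: str, secret_key: bytes, length: int = 16) -> str:
    """Derive a stable pseudonym from an identifier using HMAC-SHA256.

    The same identifier and key always yield the same pseudonym, which is
    what makes linkage across data sources possible.
    """
    digest = hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:length]


def link_records(sources, secret_key):
    """Group records from several sources under one pseudonym per participant.

    `sources` maps a source name to a list of dicts that each contain a
    'participant_id' field; that field is dropped from the linked output.
    """
    linked = {}
    for source_name, records in sources.items():
        for record in records:
            pid = pseudonymize(record["participant_id"], secret_key)
            payload = {k: v for k, v in record.items() if k != "participant_id"}
            linked.setdefault(pid, {})[source_name] = payload
    return linked


# Hypothetical example data and key, for illustration only.
demo_key = b"replace-with-a-securely-stored-key"
linked = link_records(
    {
        "wearable": [{"participant_id": "P001", "mean_hr": 61}],
        "questionnaire": [
            {"participant_id": "P001", "chronotype": "late"},
            {"participant_id": "P002", "chronotype": "early"},
        ],
    },
    demo_key,
)
print(linked)
```

Because the pseudonym depends on the secret key, data released under different keys cannot be linked with each other through the pseudonym alone.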

CIRCADIAN MEDICINE
Circadian rhythms are defined as (i) endogenously generated periodic patterns that repeat approximately every 24 h under constant conditions, (ii) entrain to external stimuli, and (iii) are temperature compensated. 69 While these rigorously defined rhythms have been validated in model organisms, studies in humans typically test (i) and assume the rest hold true as well; no study to our knowledge has tested (i), (ii), and (iii) simultaneously in humans. Moreover, it is debatable whether rhythms under constant conditions ("circadian") or under normal periodic environments of light, sleep, feeding, and temperature ("daily/diurnal") are more relevant to circadian medicine or indeed biomedical research. We therefore take a broader view of circadian medicine in this review as encompassing physiology and medicine associated with the circadian clock rather than the strict definition given above. The dynamic nature of this phenomenon poses challenges not only to diagnostics but also to experimental design, data collection, and data analysis of circadian studies. The analysis of the data generated typically involves the identification, characterization, and discrimination of rhythms, as illustrated in Figure 4A.

| Experimental design
The study of dynamic phenomena such as circadian rhythms requires collecting time series of measurements. The design of experiments on circadian rhythms involves important decisions regarding the total number of time samples, the spacing between samples, and the number of rhythm periods to sample as well as the total number of biological (and technical) replicates. At the same time, ethical, practical, and cost constraints must be accounted for. Standard power calculations are therefore not suited for the design of circadian experiments. Adequate power is nonetheless vital, as statistically underpowered studies cannot conclusively test the desired hypotheses or detect desired effects. The analysis approaches and their statistical properties (see Section 5.2) must also be borne in mind when calculating effect and sample sizes.
There are consensus guidelines for the design of circadian experiments in model organisms. 71 Sample sizes can be calculated for such situations using computational tools, such as TimeTrial 72 for transcriptomic data. Nevertheless, key challenges remain to the design of human circadian medicine experiments due to the unavailability of effect sizes needed for power calculations and the lack of statistical tools to incorporate the heterogeneity of human clocks into the calculations.

| Identification of rhythms
The first step in any circadian analysis is to determine whether measurements show patterns of near 24-h periodicity. The standard approach to test this hypothesis requires a time series of measurements, that is, a sequence of measurements made at preferably regular intervals. The measurements must be made at sufficiently small sampling intervals and often multiple cycles of the periodic pattern are required to draw reliable conclusions. The approaches to identifying rhythms differ based on the type of data. Behavioral assays, such as actigraphy, produce long recordings at relatively high resolution, while omics time series data are typically short and often span only one cycle of the rhythm.
For long high-resolution time series, classical approaches include the χ² periodogram 73 and a standard periodogram via a Fourier transform. A modified Fourier approach termed Lomb-Scargle periodogram 73 even works with missing time points and can be considered the best option. Wavelet analyses that resolve rhythms in both frequency and across the length of the time series can be applied to measurements that vary across the recording length, which is very common and often of interest in biological settings. Wavelet analysis needs multiple cycles of data at high resolution, but the approach is effective. Implementations of this sophisticated approach are available in easy-to-use software packages, such as CIRCADA 74 and pyBOAT. 75 For short time series, the statistical power can be improved with biological replicates. For this purpose, Cosinor analysis 76 is a fast, robust, and highly flexible method that is based on linear regression. The only drawback of this approach for rhythm detection is the need for the user to specify the frequency (or cycle period) of interest. Cosinor analysis further assumes a sinusoidal waveform with normally distributed noise in the measurements. 77 More complex waveforms can be identified by Cosinor analysis using more frequencies at the cost of estimating more parameters (which needs more measurements) and ambiguously defined circadian parameters. 77 However, rhythms with harmonic periods (6, 8, and 12 h) are also 24-h periodic patterns and are often overlooked. These possibly physiologically important 78 rhythms can be detected using a different cosinor formulation. 79 Two other methods circumvent some or most of these assumptions. The very popular JTKcycle algorithm 80 assumes only that the waveform is sine-like. A related approach, RAIN, 81 and an improvement on JTKcycle's statistical properties, eJTKcycle, 82 relax this assumption even further to allow other asymmetric waveforms.
Since the previous three approaches are only concerned with the ordering of values, they do not explicitly incorporate measurement uncertainty, a shortcoming that the BooteJTK algorithm rectifies in the context of genome-wide expression studies. 83 The latter four approaches can search for rhythms over a range of periods and can be applied to any data type with almost no knowledge of the underlying distribution of the measurements. The computational burden of calculating p-values increases steadily from JTKcycle through eJTKcycle and BooteJTK to RAIN. For non-programmers, Nitecap 84 provides web-based interactive access to these tools, including visualization, albeit for simpler, standard study designs.
We also highly recommend filtering identified rhythms using a biologically informed amplitude threshold to consider effect size (amplitude in our case, see Section 5.2) and hence biological relevance in addition to p-values. Small effect sizes also result in failed validation. 85
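The cosinor-based detection and the amplitude filtering recommended above can be sketched in a few lines of Python. The example fits a single-component cosinor model with a user-specified period by ordinary least squares, tests the rhythm with the standard F-test for zero amplitude (whose p-value has a closed form for two numerator degrees of freedom), and then applies both a p-value cutoff and an amplitude threshold. It is a simplified illustration and not the implementation of any of the tools cited above. The function names and the thresholds `alpha` and `min_amplitude` are assumptions.

```python
import math


def least_squares(X, y):
    """Solve the normal equations X'X b = X'y by Gauss-Jordan elimination."""
    p = len(X[0])
    A = [
        [sum(row[i] * row[j] for row in X) for j in range(p)]
        + [sum(row[i] * yi for row, yi in zip(X, y))]
        for i in range(p)
    ]
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(p):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][p] / A[i][i] for i in range(p)]


def fit_cosinor(times, values, period=24.0):
    """Fit y = mesor + beta*cos(wt) + gamma*sin(wt) and test for a rhythm."""
    w = 2 * math.pi / period
    X = [[1.0, math.cos(w * t), math.sin(w * t)] for t in times]
    mesor, beta, gamma = least_squares(X, values)
    fitted = [mesor + beta * row[1] + gamma * row[2] for row in X]
    rss = sum((y - f) ** 2 for y, f in zip(values, fitted))
    mean = sum(values) / len(values)
    tss = sum((y - mean) ** 2 for y in values)
    df = len(values) - 3  # residual degrees of freedom
    f_stat = max(0.0, ((tss - rss) / 2) / (rss / df))
    # Survival function of the F(2, df) distribution in closed form.
    p_value = (df / (df + 2 * f_stat)) ** (df / 2)
    return {
        "mesor": mesor,
        "amplitude": math.hypot(beta, gamma),
        "acrophase_h": (math.atan2(gamma, beta) / w) % period,  # peak time
        "p_value": p_value,
    }


def is_rhythmic(result, alpha=0.05, min_amplitude=0.5):
    """Combine statistical significance with an effect-size threshold."""
    return result["p_value"] < alpha and result["amplitude"] >= min_amplitude


# Hypothetical series: 2-h sampling over 48 h with a peak at 8 h.
times = [2.0 * i for i in range(24)]
values = [
    10 + 2 * math.cos(2 * math.pi * (t - 8) / 24) + 0.1 * (-1) ** i
    for i, t in enumerate(times)
]
result = fit_cosinor(times, values)
print(result, is_rhythmic(result))
```

In high-throughput settings the p-values would additionally need to be corrected for multiple testing, which is omitted here for brevity.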

| Characterization of rhythms
After their detection, rhythms need to be characterized by determining their parameters. As is shown in Figure 4A, the canonical circadian parameters are the rhythm period, amplitude, acrophase or peak phase, and mesor or magnitude. Typical rhythm detection methods provide an estimate of the period, with different accuracies depending on the data. The remaining rhythm parameters are mathematically well-defined only when a sinusoidal waveform is assumed. They can be estimated using the cosinor method introduced in the previous section, with the period either obtained from rhythm detection or fixed at 24 h.
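For reference, the relation between the linear-regression coefficients of a single-component cosinor fit and the circadian parameters can be written out as follows. The symbols ($M$ for mesor, $A$ for amplitude, $\varphi$ for acrophase, $\tau$ for period, $\beta_1$ and $\beta_2$ for the regression coefficients) and the sign convention for the phase are assumptions made here; other sign conventions are in use.

```latex
\begin{aligned}
y(t) &= M + A\cos\!\left(\frac{2\pi t}{\tau} - \varphi\right) + \varepsilon
      = M + \beta_1 \cos\!\left(\frac{2\pi t}{\tau}\right)
          + \beta_2 \sin\!\left(\frac{2\pi t}{\tau}\right) + \varepsilon, \\
\beta_1 &= A\cos\varphi, \qquad \beta_2 = A\sin\varphi, \\
A &= \sqrt{\beta_1^2 + \beta_2^2}, \qquad
\varphi = \operatorname{atan2}(\beta_2, \beta_1), \qquad
t_{\text{peak}} = \frac{\varphi\,\tau}{2\pi} \pmod{\tau}.
\end{aligned}
```

With $\tau$ fixed, the model is linear in $M$, $\beta_1$, and $\beta_2$, which is why these parameters can be estimated by ordinary least squares.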

| Discrimination of rhythms
Circadian medicine research requires comparisons of circadian rhythms between different conditions. Intuitively, it appears correct to perform rhythm identification in each condition separately and then compare the results. However, it was recently shown that this approach is flawed and results in a high rate of false positives. 70 This flaw is exacerbated in high-throughput studies, where the comparison is often visualized using Venn diagrams, as illustrated in Figure 4B. Thus, readers might treat Venn diagrams as a flag to inspect more closely the method used for comparison. To control false positives at the desired level, the parameters of the circadian rhythms (see previous section) must be directly compared across conditions. In the case of two conditions, this can be achieved within either a hypothesis testing or a model selection framework. In the former, the null hypothesis that the coefficients of the cosinor fits to the two conditions are the same is tested. In the latter, a collection of linear models representing the four categories of rhythmic patterns in the two conditions (a gain of, a loss of, a change of, or unaltered rhythms) is fitted, and the "best" category (model) is selected based on a Bayesian model selection criterion. The two approaches are available in different implementations, 86-88 whose relative merits are discussed elsewhere. 70 Both approaches are implemented in an easy-to-use R package, compareRhythms, 70 which also features amplitude-based effect-size filtering of the identified rhythms for improved biological relevance.
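The hypothesis-testing route can be sketched as a comparison of two nested linear models: a reduced model in which both conditions share the cosinor coefficients, and a full model with condition-specific coefficients. The sketch below is an illustrative assumption and not the implementation used in compareRhythms or the other cited tools; in particular, it allows the mesor to differ between conditions and tests only the two rhythmic coefficients. The function name and the synthetic data are hypothetical.

```python
import numpy as np

def differential_rhythm_test(t, y, group, period=24.0):
    """F-test of H0: both conditions share the same cosine and sine coefficients.

    group is 0 for the first condition and 1 for the second.
    """
    t = np.asarray(t, dtype=float)
    y = np.asarray(y, dtype=float)
    g = np.asarray(group, dtype=float)
    w = 2.0 * np.pi / period
    c, s = np.cos(w * t), np.sin(w * t)
    reduced = np.column_stack([np.ones_like(t), g, c, s])  # shared rhythm
    full = np.column_stack([reduced, g * c, g * s])        # condition-specific rhythm

    def rss(X):
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        return np.sum((y - X @ coef) ** 2)

    rss_reduced, rss_full = rss(reduced), rss(full)
    dof = len(y) - full.shape[1]
    f_stat = max(((rss_reduced - rss_full) / 2) / (rss_full / dof), 0.0)
    # Closed-form upper tail of the F distribution with 2 numerator
    # degrees of freedom (two interaction coefficients are tested)
    p_value = (1.0 + 2.0 * f_stat / dof) ** (-dof / 2.0)
    return f_stat, p_value

# Assumed data: same amplitude in both conditions, peak shifted by 6 h
rng = np.random.default_rng(2)
t = np.tile(np.arange(0.0, 24.0, 4.0), 3)  # 18 samples per condition
w = 2.0 * np.pi / 24.0
y_a = 5 + 2 * np.cos(w * (t - 6)) + rng.normal(0, 0.3, t.size)
y_b = 5 + 2 * np.cos(w * (t - 12)) + rng.normal(0, 0.3, t.size)

f_stat, p_value = differential_rhythm_test(
    np.concatenate([t, t]), np.concatenate([y_a, y_b]), np.repeat([0, 1], t.size)
)
print(f"F = {f_stat:.1f}, p = {p_value:.2g}")
```

Note that both conditions here are rhythmic with the same amplitude, so a comparison of per-condition detection results would report no difference, whereas the direct comparison of coefficients detects the phase shift.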

| Rhythm analysis in humans
Circadian rhythm analysis in humans brings its own challenges because humans and their internal clocks are heterogeneous (as compared to isogenic animal models). Even in healthy individuals, circadian clock parameters vary across age, sex, genetic background, 89 seasons, and environmental light exposure. 90 There are several reasons to account for this heterogeneity: (i) to quantify this feature of the clock, (ii) to eliminate it in order to get at the underlying circadian phenomena, and (iii) to personalize circadian medicine. In disease, the combination of disease-clock interactions, psychological stress, effects of therapeutics, comorbidities, Zeitgeber-disrupted environments (e.g., intensive care units 91 ), and altered rhythmic behaviors can further compound the heterogeneity. The standard approaches described in the previous sections therefore need to be modified in several ways in the context of human studies. Since studies in humans are often longitudinal, measurements from each subject are correlated, unlike (typical) samples from model organisms. As one solution, cosinor analysis can be made to account for these correlations using linear mixed models. 16 Rhythms can be averaged across study subjects to reduce the impact of noise and to assess a population rhythm if the study cohort is sufficiently homogeneous. In circadian studies, this homogeneity extends to the ability to align the internal time of subjects by correcting the sampling times in order to characterize the underlying common circadian rhythm. 16 This often requires a complete chronobiological profile of the subjects. When the subjects under study are highly heterogeneous and chronobiological profiling is infeasible, circadian rhythms can be analyzed at the individual level using amplitude as the metric. 92 This latter approach makes fewer assumptions and allows stratification of the cohort according to chronobiological factors, but it is less statistically powerful.
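The effect of correcting sampling times to internal time can be illustrated with a small, noise-free example. The per-subject phase offsets used below are hypothetical inputs (in practice they would have to come from chronobiological profiling), and the helper names are assumptions; the sketch shows only why pooling misaligned subjects attenuates the population amplitude.

```python
import numpy as np

def internal_time(wall_time_h, phase_offset_h):
    """Shift wall-clock sampling times by each subject's phase offset (hours)."""
    return (np.asarray(wall_time_h, float) - np.asarray(phase_offset_h, float)) % 24.0

def population_amplitude(time_h, y, period=24.0):
    """Amplitude of a single cosinor fit to the pooled samples of all subjects."""
    time_h = np.asarray(time_h, dtype=float)
    w = 2.0 * np.pi / period
    X = np.column_stack([np.ones_like(time_h), np.cos(w * time_h), np.sin(w * time_h)])
    coef, *_ = np.linalg.lstsq(X, np.asarray(y, dtype=float), rcond=None)
    return float(np.hypot(coef[1], coef[2]))

# Hypothetical cohort: five subjects whose clocks are shifted by -4 ... +4 h
# relative to a reference, each sampled every 4 h at the same wall-clock times
offsets = np.array([-4.0, -2.0, 0.0, 2.0, 4.0])
wall = np.tile(np.arange(0.0, 24.0, 4.0), offsets.size)
subject_offset = np.repeat(offsets, 6)
# Every subject has the same unit-amplitude rhythm in internal time
y = np.cos(2.0 * np.pi * (wall - subject_offset - 8.0) / 24.0)

amp_wall = population_amplitude(wall, y)                                     # about 0.75
amp_internal = population_amplitude(internal_time(wall, subject_offset), y)  # about 1.0
print(f"amplitude on wall time: {amp_wall:.2f}, on internal time: {amp_internal:.2f}")
```

A linear mixed model, as mentioned above, would additionally account for the correlation of repeated measurements within a subject; that step is omitted from this sketch.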

| Population approaches for human circadian analysis
The other key challenge in studying human rhythms is our inability to repeatedly sample most tissues from single individuals. If one settles for population rhythms instead, only one tissue sample per human would be needed. This depends critically on the circadian variation exceeding the inter-individual variation, which has been shown to be the case in human skin. 16 In fact, such single tissue biopsies are available in many public databases, such as The Cancer Genome Atlas, or are part of tissue banks. Of course, accurate time labels for these samples are still needed to construct a population-level time series. Unfortunately, such time labels are generally missing as they were not collected with circadian studies in mind.
In 2017, cyclic ordering by periodic structure (CYCLOPS), 93 a linear auto-encoder with a circular node, was used to reorder unlabeled human samples to construct population-level time series datasets from postmortem human tissue expression data. 12 The key idea is to "code" a sample consisting of multiple features (e.g., the transcriptome measured at a single time) as two numbers that are constrained to lie on a circle and represent the time (or phase) of the sample. Although the authors test their approach on mouse data and a single human dataset where the ordering is known, it is unclear if and how well their approach works in general. Moreover, prior biological data from mice need to be used to filter the search among conserved rhythmic genes, which makes this approach potentially biased. Recently, the fact that this human tissue dataset included multiple samples from the same human was used to improve the time labeling by means of an expectation-maximization algorithm. 94 Similarly, Oscope 95 was designed to find oscillatory genes in unsynchronized single-cell RNA-seq data by finding pairs of oscillating genes that can provide good circular ordering followed by clustering of the pairs and nearest-neighbor insertion to find time labels for individual samples. The main drawback of this approach is the combinatorial growth in the search space for good pairs of genes for ordering. Thus, not only are improved computational algorithms necessary to reconstruct rhythms from population-sampled data despite the above-described heterogeneity, but validation studies must also be undertaken to evaluate the true accuracy of these reconstructions.
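The idea of coding each sample as two numbers constrained to a circle can be sketched with a simple linear stand-in. The example below uses a principal component projection followed by normalization onto the unit circle; this is a deliberately simplified substitute and not the CYCLOPS auto-encoder or the Oscope procedure. The function name and the synthetic expression matrix are assumptions.

```python
import numpy as np

def circular_phase_estimate(expr):
    """Estimate one phase (radians) per sample from a samples x genes matrix."""
    z = (expr - expr.mean(axis=0)) / expr.std(axis=0)  # standardise each gene
    u, s, _ = np.linalg.svd(z, full_matrices=False)
    scores = u[:, :2] * s[:2]                          # two-number linear code per sample
    scores = scores / np.linalg.norm(scores, axis=1, keepdims=True)  # constrain to unit circle
    return np.arctan2(scores[:, 1], scores[:, 0]) % (2.0 * np.pi)

# Assumed data: 24 samples taken at shuffled (in practice unknown) times,
# 40 rhythmic genes with evenly spread peak phases and modest noise
rng = np.random.default_rng(3)
true_time = rng.permutation(np.arange(0.0, 24.0, 1.0))
gene_phase = np.linspace(0.0, 2.0 * np.pi, 40, endpoint=False)
expr = np.cos(2.0 * np.pi * true_time[:, None] / 24.0 - gene_phase[None, :])
expr = expr + rng.normal(0, 0.1, expr.shape)

phase = circular_phase_estimate(expr)
order = np.argsort(phase)  # reconstructed cyclic ordering of the samples
```

The recovered phase is defined only up to an arbitrary zero point and direction of rotation, so external information (for example, a gene with a known peak time) is needed to anchor it. The sketch also relies on all features being rhythmic, which is why a pre-selection of rhythmic genes matters in practice.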

| Novel biomarkers for human circadian clock analysis
In recent years, several laboratories have developed methods to read the circadian clock in humans, in particular its phase (i.e., chronotype). Molecular biomarkers identified using machine learning can quantify the circadian clock phase using only one or a few biosamples (for reviews, see Münch and Kramer 15 and Dijk and Duffy 96 ). This is a major advance because traditionally the analysis of the human circadian system required the acquisition of long time series of markers, such as melatonin or CBT, which in turn required specific laboratory protocols to exclude environmental and behavioral influences on rhythmicity (e.g., the constant routine (CR) protocol 97 ). These protocols are still the gold standard but have the disadvantage that they are time-consuming and expensive, and are therefore suitable neither for larger studies nor for everyday clinical use. To date, it remains unclear how robust the newly developed biomarkers are, and they still need to be validated against gold-standard methods in different patient populations and under varying non-laboratory conditions. Moreover, it is imprecise to refer to "reading the circadian clock" in humans because there is not a single circadian clock but rather a network of tissue clocks. These tissue clocks are separated by fixed time (phase) differences that possibly vary with health status, environmental conditions (e.g., during jet lag), and between individuals. Current phase biomarkers measure the clock in peripheral tissue (blood, skin) and infer the central clock time (time according to DLMO) or wall time, assuming a fixed phase relationship between them. In which tissue time is in fact measured, and which time ought to be measured for a given application of circadian medicine, remain open questions.
Finally, only the circadian phase has received attention thus far, although circadian amplitude (and associated biomarkers) might be at least as important, if not more so; for example, there is a loss of circadian central clock amplitude with aging 98 in mice.

| SUMMARY AND CONCLUSION
In this article, we have presented an overview of requirements, available solutions, and open challenges for data integration and analysis for studying, implementing, and advancing circadian medicine. Considering the types of data particularly relevant to research in this field and the fact that those need to be captured and analyzed in the form of time series with various resolutions and rhythmic properties, there is a range of challenges associated with building adequate research platforms and handling associated data quality issues. On the data integration level, these challenges range from a lack of standards that can be used to represent all necessary data types in a common, interoperable form, to challenges associated with data storage, to the need to perform time-dependent transformations for integrated visualizations, and to data protection requirements. On the data analysis level, challenges exist with regard to the determination of periodicity in repeated-measures designs and to differential rhythmicity analysis when rhythms are not sinusoidal. Moreover, specific challenges in the study of human circadian rhythms relate to the heterogeneity of clock parameters and the inability to repeatedly sample most human tissues.
Many of these challenges point toward important future research directions for data platforms supporting circadian medicine. These are also highly relevant for translating circadian medicine into clinical practice. While this review focuses mostly on clinical research, there are related technical and organizational challenges that need to be considered in clinical routine processes and systems, pointing toward synergies that could be leveraged. For example, performing circadian medicine research with clinical routine data requires timestamps for sample collection, analysis, and result reporting that are accurate and consistent across institutions and systems. The same is true for integrating circadian assessments into healthcare processes and for implementing a range of other clinical mechanisms, such as alert systems that are based on lab measurements. As with other novel types of information with clinical relevance, integrating circadian data into electronic health record systems and aggregating information relevant to clinical workflows can be challenging. As a consequence, flexible data standards are needed that can be used to integrate new data types while keeping up with medical developments, and automated decision support systems must be developed and validated to provide only information that is relevant at the point of care. In the opposite direction, electronic health record systems already store a wide range of data that is relevant to circadian medicine research, and pragmatic interfaces and pay-as-you-go approaches are needed to make it available in this context. Furthermore, data collected for circadian medicine is highly sensitive medical data collected not only in the clinical setting but also in other environments using various sensors.
Consequently, adequate information security needs to be maintained, for example by establishing secure communication channels, and data protection challenges need to be resolved to ensure individuals' privacy while preserving linkability requirements.
We point out that, to the best of our knowledge, chronobiological analyses are currently not used in routine healthcare. Even in the research context, due to the reasons outlined above, there are currently significant limitations to the application of rhythm analysis to humans and only a few papers describe analysis approaches that are closely associated with translational applications. We conclude that circadian medicine, as an emerging field with collaborations between researchers and clinicians from different disciplines, provides an ideal environment for developing innovative methods for solving challenges associated with the collection, integration, visualization, and analysis of multi-modal biomedical data.