Are ultra-short heart rate variability features good surrogates of short-term ones? State-of-the-art review and recommendations

Ultra-short heart rate variability (HRV) analysis refers to the study of HRV features in excerpts of length <5 min. Ultra-short HRV is increasingly used in many healthcare applications for monitoring individuals' health and well-being status, especially in combination with wearable sensors, mobile phones, and smart-watches. Long-term (nominally 24 h) and short-term (nominally 5 min) HRV features have been widely investigated and physiologically justified, and clear guidelines for analysing HRV in 5 min or 24 h are available. Conversely, the reliability of ultra-short HRV features remains unclear and many investigations have adopted ultra-short HRV analysis without questioning its validity. This is partially due to the lack of accepted algorithms guiding investigators to systematically assess ultra-short HRV reliability. This Letter critically reviewed the existing literature, aiming to identify the most suitable algorithms and harmonise them to suggest a standard protocol that scholars may use as a reference in future studies. The results of the literature review were surprising because, among the 29 reviewed papers, only one used a rigorous method, whereas the others employed methods that were partially or completely unreliable due to the incorrect use of statistical tests. This Letter provides recommendations on how to assess ultra-short HRV features reliably and proposes an inclusive algorithm that summarises the state-of-the-art knowledge in this area.


Introduction:
The dynamic modulation of heart rate (HR) is controlled by several voluntary and involuntary mechanisms, including respiration, thermoregulation and the interaction of sympathetic activity (which has a response time in the order of a few seconds) and parasympathetic activity (which works much faster: response time 0.2-0.6 s) [1]. Those modulations result in HR fluctuation or variability in time. Whereas the measure of HR is a static index of autonomic input to the sinus node, which does not provide direct information on sympathetic or parasympathetic functions, HR variability (HRV) analysis provides a quantitative assessment of cardiac autonomic regulation [2].
According to [3], HRV refers to the time series of variations in the interval between consecutive heartbeats, and it can be analysed in the time, frequency and non-linear domains [3,4]. Common HRV features extracted from HRV excerpts are reported in Table 1. HRV analysis can be performed on nominal 24 h recordings (referred to as long-term HRV analysis), 5 min recordings (referred to as short-term HRV analysis) or shorter recordings [3], which in this review are referred to as ultra-short term HRV analysis. Since clear guidelines on ultra-short HRV analysis are not yet available, this review aimed to explore to what extent ultra-short HRV features can be used to estimate short-term ones, which are still to be considered the benchmark for HRV analysis. In medicine, and particularly in clinical trial design, the concept of a surrogate endpoint (or marker) was introduced to cope with this kind of problem [5,6]. A surrogate measure is a marker that is used to estimate a real clinical endpoint when the latter is undesired (e.g. death) or cannot be directly observed or measured. Several regulatory bodies (e.g. FDA and NICE) have started to accept evidence from clinical trials that show a direct clinical benefit in using surrogate markers. Proving whether a marker is a good surrogate of a real clinical outcome can be quite difficult, and a combination of appropriate statistical and correlation tests is required. Although a rich literature has been produced on this question, some authors still appear to be confused, and the sentence 'a correlate does not make a surrogate', first used by Fleming et al. [5], has become a mantra in this field. In fact, there is a common misconception that if a marker correlates with the true clinical outcome, it can be a valid surrogate endpoint, replacing the true clinical outcome. However, a much stronger condition than correlation is required to be sure that a surrogate is valid and can be used to replace a real clinical outcome.
Another common misconception is that a marker X can be considered a good surrogate of a clinical outcome Y if statistical null-hypothesis tests show no significant differences between X and Y. This is a major misconception because statistical differences may reveal themselves only in particular conditions (e.g. when a sufficient number of measures are observed). In addition, both correlation and statistical tests are often used improperly (e.g. parametric tests used for non-normally distributed features).
From a theoretical point of view, it is well known that some HRV features lose significance if computed in the ultra-short term [3]. For instance, it is recommended that spectral analyses be performed on stationary recordings lasting at least 10 times the period of the slowest significant oscillation. In the case of short-term HRV analysis, the slowest significant oscillations in the so-called low-frequency (LF) band have a period of 25 s (i.e. a frequency of 0.04 Hz). Thus, in order to measure the entire LF power spectrum of HRV excerpts (i.e. including the slowest components), at least 250 s of HRV signal are required. In the same manner, at least 1 min is required to compute the high-frequency (HF) power [3]. Therefore, LF and HF power computed in excerpts shorter than 1 min lead to erroneous results. Non-linear HRV features have been less explored in the existing literature. Moreover, approximate entropy (ApEn) has been shown to be unreliable in excerpts lasting <3 min [7].
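The rule of thumb above (at least 10 periods of the slowest oscillation) amounts to simple arithmetic, sketched below. The function name `min_recording_length_s` is hypothetical, and the HF lower band edge of 0.15 Hz is the conventional value, taken here as an assumption because the text only states the resulting duration of about 1 min.

```python
def min_recording_length_s(lowest_freq_hz: float, n_periods: float = 10.0) -> float:
    """Shortest stationary recording (in seconds) that spans n_periods
    of the slowest oscillation in a frequency band."""
    if lowest_freq_hz <= 0:
        raise ValueError("lowest_freq_hz must be positive")
    return n_periods / lowest_freq_hz


# Lower band edges in Hz (conventional values, treated as assumptions here).
LF_LOWER_HZ = 0.04
HF_LOWER_HZ = 0.15

lf_min_s = min_recording_length_s(LF_LOWER_HZ)  # 10 periods of 25 s
hf_min_s = min_recording_length_s(HF_LOWER_HZ)  # 10 periods of about 6.7 s

print(f"LF needs at least {lf_min_s:.0f} s, HF needs at least {hf_min_s:.0f} s")
```

Under these assumptions, the LF requirement matches the 250 s stated above and the HF requirement is roughly 1 min.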
The demand for ultra-short term HRV analysis for monitoring individuals' health and well-being status in real life is increasing significantly, especially in relation to wearable sensors and mobile applications. Out of the lab, the conventional 5 min HRV recordings might be unsuitable because of real-time requirements.
Ultra-short recordings may allow continuous and quasi-real-time monitoring of an individual's well-being status (i.e. mood, attention, and stress levels) [8]. Many apps and wearable devices are being released into the market, claiming to perform HRV analysis in real time (from 10 s to 1 min). Although there is a clear need for such technologies, two problems remain unsolved: (i) there are not yet clear guidelines on how to analyse HRV in the ultra-short term and on which ultra-short HRV features can be considered good surrogates of short-term ones; (ii) there is no clear algorithm to identify reliable ultra-short HRV features for the detection of an event.
In analogy with evidence-based medicine, this Letter provides a critical review of the state-of-the-art methods used to assess ultra-short HRV validity, providing key recommendations on how to identify ultra-short HRV features that are good surrogates of short-term ones. As described by Grant et al. [9], there are different typologies of literature reviews. According to our previous experience [8,10-12], the typology of the review depends strongly on the heterogeneity and the quality of the available published literature, which in this case did not allow a more mathematical pooling (e.g. a meta-analysis). Nonetheless, several authors made effective methodological contributions, although in a fragmented manner. Consequently, this Letter aimed to harmonise these contributions into a comprehensive algorithm that can guide scholars in future studies.

Methods and materials:
Relevant studies on the use of ultra-short HRV analysis were first identified and selected by searching the PubMed and OvidSP databases. Articles were searched using Boolean combinations of the following keywords or their equivalent medical subject heading terms: heart rate variability, HRV, ultrashort, and very short. Title, abstract, and full text were chosen as fields of the search. However, due to the lack of guidelines on how to analyse HRV in the ultra-short term, the nomenclature used in many scientific papers was very heterogeneous, if not misleading. For instance, many studies performing HRV analysis on segments <5 min did not use the tag 'ultra-short term' or did not mention the length of the HRV excerpts analysed (i.e. ultra-short, short- or long-term analysis) in the study. Therefore, a linear search of the references of retrieved articles was required and performed.
The heterogeneity and quality of the available literature led us to conduct a state-of-the-art review to address the current concerns and offer a comprehensive perspective on the issue [9].
To limit the linear search, the following criteria were utilised: papers published in the last 15 years (since 2003), focusing on healthy and non-pregnant adult humans. Shortlisted papers were considered suitable for this review if they met the following criteria:
† the subjects were human beings over 18 years old;
† HRV was analysed on excerpts <5 min;
† HRV features were extracted with reliable methods and reported with sufficient statistical quality [3].
Results and discussion:
Since 2003, 29 papers were identified, as shown in Fig. 1. An overview of the methods employed in the 29 shortlisted papers to assess the validity of ultra-short HRV features is reported in Fig. 1, whereas the characteristics of the reviewed studies are reported in Table 2.
Eleven of the identified studies [15,20,21,25,29,31,32,35,36,38,41], including the five mentioned in the previous sentence [20,25,35,38,41], performed only a partial assessment, using either statistical significance tests alone or correlation tests alone. Three of these 11 studies [20,32,41] employed statistical significance tests to show that there were no statistically significant changes in HRV features between short and ultra-short term, assuming short-term HRV analysis (i.e. 5 min) as the benchmark. They concluded that ultra-short HRV features were good surrogates of short-term ones if no significant differences were observed, i.e. if the p-value exceeded the 0.05 significance threshold (p > 0.05).
Unfortunately, this result is questionable because, although a p-value <0.05 is conventionally used to support the hypothesis that two distributions are significantly different, it is well known that no conclusion can be drawn from a p-value >0.05, as detailed in [42]. For instance, two distributions could yield a p-value >0.05 simply because of their cardinalities (i.e. too few observations). In particular, one of those three studies [20] also assessed ultra-short term HRV features in two conditions (i.e. rest and stress) using a non-parametric test (p < 0.05) to find the shortest duration needed to distinguish the two conditions. Nevertheless, in this case too the results are questionable, as the study [20] explored only those HRV features that had been judged good surrogates on the basis of a p-value >0.05 between short and ultra-short term. Furthermore, one study [21] used one-way analysis of variance (ANOVA) to determine which HRV features (i.e. those computed at 220, 150, 100 or 50 s) could discriminate between rest and stress sessions with p < 0.05. However, because HRV features are non-normally distributed (especially in the frequency domain), a non-parametric test should have been used instead, or the HRV features should have been log-transformed before applying the ANOVA test.
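The dependence of a non-significant result on the number of observations can be illustrated with a toy example. The sketch below uses an exact paired sign-flip (permutation) test, chosen only because it needs nothing beyond the standard library; it is not a test used by the reviewed studies, and the differences are hypothetical values, not data from any paper.

```python
from itertools import product


def sign_flip_p_value(paired_differences):
    """Exact two-sided sign-flip (permutation) p-value for paired differences.

    Under the null hypothesis the sign of each difference is arbitrary, so
    every one of the 2**n sign assignments is enumerated.
    """
    observed = abs(sum(paired_differences))
    extreme = 0
    total = 0
    for signs in product((1, -1), repeat=len(paired_differences)):
        total += 1
        flipped_sum = sum(s * d for s, d in zip(signs, paired_differences))
        if abs(flipped_sum) >= observed:
            extreme += 1
    return extreme / total


# Hypothetical differences (ultra-short minus short-term feature, in ms):
# every subject shows a positive bias.
five_subjects = [4, 6, 3, 5, 7]
ten_subjects = five_subjects * 2  # same bias pattern, twice as many subjects

p_small = sign_flip_p_value(five_subjects)
p_large = sign_flip_p_value(ten_subjects)
print(p_small, p_large)
```

With five subjects the p-value is 2/32 = 0.0625, which would be read as "no significant difference", whereas the same systematic bias observed in ten subjects gives 2/1024, well below 0.05. The bias is identical in both cases; only the cardinality changed.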
On the other hand, seven studies [15,25,29,31,35,36,38] employed only correlation tests to show that ultra-short term HRV features behaved as short-term ones; they concluded that ultra-short HRV features were good surrogates of short-term ones if they were significantly correlated with the equivalent short-term HRV features. As anticipated in the introduction, this result is questionable because, as stated by Fleming et al. [5], 'a correlate does not make a surrogate', although an appropriate correlation test is the first step in the identification of a good surrogate.
Only two studies [30,37] performed both a statistical significance test and a correlation analysis. Unfortunately, in these two studies as well, the statistical significance analysis consisted only of checking whether the p-value was >0.05, which is not a suitable method for the reasons discussed above.
Employing invalid statistical significance analysis led to unreliable results, especially regarding frequency-domain HRV features. In fact, Baek et al. [37] and Salahuddin et al. [41] computed very low frequency (VLF) power in excerpts of 270 and 50 s although, as also reported in [3], VLF is only reliable in long-term HRV analysis. De Rivecourt et al. [25] and Salahuddin et al. [41], who employed only correlation analysis and an inaccurate statistical significance test (i.e. p > 0.05), respectively, reported that LF and HF are reliable in segments shorter than 30 s, whereas at least 250 and 60 s are necessary for LF and HF, respectively [3].
Finally, only one study [33] investigated the validity of ultra-short HRV features in a more rigorous way. Munoz et al. [33] compared 10, 30, and 120 s HRV features with 5 min ones, using Pearson's correlation test (after having normalised HRV features with a log-transformation), Bland-Altman plots and Cohen's d statistic. Unfortunately, Munoz et al. reported results for only two time-domain HRV features under one condition (i.e. resting), and it was not clear whether other features were computed but not reported, or not computed at all. In the former case, a correction to the p-value should also be applied [43,44].
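The text does not state which p-value correction is meant. As one common possibility, the following minimal sketch applies a Bonferroni adjustment, which is an assumption made here for illustration; the function name and the example p-values are hypothetical.

```python
def bonferroni_adjust(p_values):
    """Bonferroni-adjusted p-values: each raw p-value is multiplied by the
    number of features tested and capped at 1.0."""
    n_tests = len(p_values)
    return [min(1.0, p * n_tests) for p in p_values]


# Hypothetical raw p-values for three HRV features tested on the same data.
raw_p_values = [0.01, 0.04, 0.5]
adjusted_p_values = bonferroni_adjust(raw_p_values)
significant = [p < 0.05 for p in adjusted_p_values]
print(adjusted_p_values, significant)
```

In this toy case the second feature (raw p = 0.04) no longer falls below 0.05 once all three tests are taken into account, which is why it matters whether unreported features were also tested.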
Hence, among the 29 identified papers, one paper justified the adoption of ultra-short HRV features with a rigorous method, but reported on only two time-domain HRV features. Conversely, seven papers did not provide any justification, eight papers based their choice on unreliable articles, 11 papers performed only a partial assessment (i.e. either statistical significance or correlation tests) and two papers performed a complete assessment (both statistical significance and correlation tests) but used the statistical significance tests improperly. Overall, none of the 29 studies proposed a valid method to identify reliable subsets of ultra-short HRV features, or surrogates of the short-term HRV features, that allow the detection of the event of interest (i.e. two different conditions). Therefore, future studies in this area are required. Independently of the methods used (e.g. statistical test and correlation, only statistical or only correlation analysis) and of their rigour (e.g. a parametric test used for non-normally distributed features, p > 0.05), the reviewed studies presented other methodological ambiguities. Twenty studies investigating ultra-short HRV analysis in two conditions (e.g. rest versus stress) compared ultra-short HRV features inter-group (e.g. 'HRV features at 1 min during rest versus stress' compared with 'HRV features at 5 min during rest versus stress') without performing intra-group comparisons (e.g. 'HRV 1 min at rest' versus 'HRV 5 min at rest'). In fact, both inter-group comparisons (i.e. comparing HRV features between two conditions at different lengths) and intra-group comparisons (i.e. comparing the coherence of HRV features at different lengths in the same condition) should be performed using the proper statistical tests and correlation analyses. This is fundamental in order to judge the internal validity of the technique.
Overall, the reviewed literature highlighted that some valuable methodologies are available and already in use, but in a very fragmented way, resulting in improper or inaccurate practices. This body of evidence can be summarised and standardised in an algorithm, as represented in Fig. 2.
The algorithm represented in Fig. 2 highlights that authors cannot use statistical or correlation tests alone to explore whether ultra-short HRV features can be considered good surrogates of short-term ones. Before performing statistical tests, authors should check whether features are significantly correlated at different time scales. A significant correlation suggests that there is a significant association. Nonetheless, this association could be biased. The Bland-Altman analysis estimates this bias and how it diverges as the magnitude of the short-term feature (i.e. the benchmark) increases. According to this analysis, two features are considered unbiased if the dispersion of their differences remains within a conventional threshold [i.e. the 95% limits of agreement (LoA)] [45]. Once a correlation has been proven and bias excluded, the statistical significance can be explored. Munoz et al. [33] proposed the use of Cohen's d statistic to quantify the agreement of HRV features at different time scales relative to their within-group variation [46]. Therefore, according to the proposed algorithm, an ultra-short feature can be considered a good surrogate if it is correlated with, unbiased with respect to, and in significant agreement with the corresponding short-term feature.
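The three quantities involved in this sequence (correlation, Bland-Altman bias with 95% limits of agreement, and Cohen's d) can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the implementation used in [33]: the function names and feature values are hypothetical, the limits are taken as bias ± 1.96 standard deviations of the differences, Cohen's d uses the pooled standard deviation, and the p-value of the correlation is not computed.

```python
from math import sqrt
from statistics import mean, stdev


def pearson_r(x, y):
    """Pearson correlation coefficient between two paired samples."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den


def bland_altman(benchmark, surrogate):
    """Bias (mean difference) and 95% limits around it for paired samples."""
    diffs = [s - b for b, s in zip(benchmark, surrogate)]
    bias = mean(diffs)
    half_width = 1.96 * stdev(diffs)
    return bias, bias - half_width, bias + half_width


def cohens_d(x, y):
    """Cohen's d: difference of means divided by the pooled standard deviation."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * stdev(x) ** 2 + (ny - 1) * stdev(y) ** 2) / (nx + ny - 2)
    return (mean(x) - mean(y)) / sqrt(pooled_var)


# Hypothetical values of one feature (ms) for five subjects.
short_term = [50, 42, 61, 38, 55]   # 5 min benchmark
ultra_short = [48, 43, 58, 37, 52]  # e.g. 1 min excerpt

r = pearson_r(short_term, ultra_short)
bias, loa_low, loa_high = bland_altman(short_term, ultra_short)
d = cohens_d(ultra_short, short_term)
print(f"r={r:.3f}, bias={bias:.2f}, LoA=({loa_low:.2f}, {loa_high:.2f}), d={d:.3f}")
```

Keeping the three steps as separate functions mirrors the order of the algorithm: correlation first, then bias, then effect size.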
The algorithm reported in Fig. 2 can be further articulated for the case in which the ultra-short HRV features are non-normally distributed (Fig. 3). As far as correlation tests are concerned, several non-parametric tests have been proposed. Alternatively, HRV features can be log-transformed before using a parametric test. The Bland-Altman analysis is parametric too, as it calculates the 95% LoA around the mean. In the case of non-normally distributed features, authors should use the same analysis, but investigate the dispersion around the median, and not the mean, when computing the 95% LoA. Finally, Cohen's d statistic assumes that the input features are normally distributed; therefore, it is strongly recommended to apply a log-transformation to HRV features before computing it. Alternatively, Cliff's delta statistic should be used for non-normally distributed data, as it is a non-parametric effect size measure that quantifies the amount of difference between two groups of observations beyond the interpretation of p-values [47].
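For the non-parametric branch, Cliff's delta can be computed directly from its definition: the proportion of pairs in which the first group exceeds the second, minus the proportion in which it is smaller. The following sketch uses hypothetical feature values; `log_transform` is included only to illustrate the alternative route mentioned above.

```python
from math import log


def cliffs_delta(x, y):
    """Cliff's delta: (#pairs with x > y - #pairs with x < y) / (len(x) * len(y))."""
    greater = sum(1 for a in x for b in y if a > b)
    less = sum(1 for a in x for b in y if a < b)
    return (greater - less) / (len(x) * len(y))


def log_transform(values):
    """Natural-log transform, defined only for strictly positive features."""
    return [log(v) for v in values]


# Hypothetical values of a positively skewed feature in two groups.
group_a = [120, 340, 95, 410]
group_b = [80, 150, 60, 200]

delta_raw = cliffs_delta(group_a, group_b)
delta_log = cliffs_delta(log_transform(group_a), log_transform(group_b))
print(delta_raw, delta_log)
```

Because Cliff's delta depends only on the ordering of the observations, it is unchanged by the log-transformation, whereas Cohen's d would generally differ before and after it.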
In case two different conditions are explored (e.g. stress versus rest), both Figs. 2 and 3 require a further adjustment. In fact, in analogy to the best available medical practice [48], scholars should follow the algorithm proposed in Fig. 4, proving that:
† ultra-short HRV features behave as short-term ones in the same condition (i.e. at rest or during stress): intra-group assessment;
† ultra-short HRV features maintain different behaviours in the two conditions at different lengths (i.e. if StdNN diminishes during stress, this change should be observed at both short and ultra-short term): inter-group assessment.
As the first step, surrogate features have to be correlated with benchmark ones (i.e. short-term HRV) both in a control condition (e.g. rest phase) and during the event to be detected (e.g. stress phase). This can be verified using intra-group correlation analysis at different time lengths, i.e. in the same condition. For instance, StdNN (as well as any other HRV feature) extracted from 5 min excerpts during rest (or stress) has to be significantly correlated with StdNN extracted from any shorter (<5 min) excerpt during rest (or stress).
Fig. 2 Standard algorithm to assess if ultra-short HRV features can be considered good surrogates for short-term ones when investigating one condition (e.g. only at rest). rho: correlation coefficient; p-val: p-value associated with correlation analysis; LoA: limits of agreement in Bland-Altman plot
As a second step, a visual investigation of the bias between means (or medians for non-normally distributed features) has to be performed via Bland-Altman plots in each condition.
As the third step, the set of surrogate features has to preserve a large portion of the information on the event to be detected (i.e. significance test at each time scale and/or trend analysis). This can be verified using inter-group statistical tests at each time length, i.e. between the different conditions. Therefore, scholars should verify, using a non-parametric test (unless HRV features are log-transformed or normally distributed), which ultra-short HRV features maintain statistical evidence that the median differs significantly between the two conditions (p < 0.05) across time windows. Pereira et al. [21] attempted to investigate which HRV features could discriminate between rest and stress using ANOVA for each selected time window.
As the fourth and last step, the trends of the HRV features (i.e. whether HRV features decrease or increase during stress) should remain consistent across time lengths. In fact, an HRV feature can be assumed to maintain the same behaviour across different time lengths if the statistical significance test has a p-value <0.05 between the control and the experimental conditions at each time scale, and if the ultra-short HRV feature changes between the control and the experimental conditions in the same direction as the equivalent short-term HRV feature (e.g. if MeanNN decreases significantly during stress at 5 min [10], this significant trend has to be consistently maintained at shorter time lengths). Once these four steps have been performed, it can be assumed that an ultra-short HRV feature is a good surrogate for the equivalent short-term one if:
† the ultra-short HRV feature maintained the same behaviour between control and experimental conditions as the benchmark;
† the ultra-short HRV feature was highly and significantly correlated (e.g. correlation coefficient greater than a given threshold, such as 0.7, and p-value <0.05) with the corresponding short-term feature in both control and experimental conditions.
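The final decision rule can be written as a small predicate. The sketch below is one possible reading of the two closing criteria, not a reference implementation of Fig. 4: the class names, field names and example numbers are hypothetical, and the defaults for the correlation threshold (0.7) and significance level (0.05) are the example values quoted above.

```python
from dataclasses import dataclass


@dataclass
class IntraGroup:
    """Correlation of ultra-short versus short-term feature within one condition."""
    rho: float
    p_value: float


@dataclass
class InterGroup:
    """Control versus experimental comparison at one excerpt length."""
    p_value: float
    change: float  # signed change of the feature (experimental minus control)


def is_good_surrogate(intra_control, intra_experimental, inter_short, inter_ultra,
                      rho_min=0.7, alpha=0.05):
    """True if the feature is highly and significantly correlated in both conditions
    and keeps the same significant trend at short and ultra-short length."""
    correlated = all(r.rho > rho_min and r.p_value < alpha
                     for r in (intra_control, intra_experimental))
    discriminates = inter_short.p_value < alpha and inter_ultra.p_value < alpha
    same_trend = inter_short.change * inter_ultra.change > 0
    return correlated and discriminates and same_trend


# Hypothetical results for one feature (e.g. MeanNN at 1 min versus 5 min).
accepted = is_good_surrogate(
    IntraGroup(rho=0.92, p_value=0.001), IntraGroup(rho=0.88, p_value=0.003),
    InterGroup(p_value=0.01, change=-45.0), InterGroup(p_value=0.02, change=-39.0),
)
rejected = is_good_surrogate(
    IntraGroup(rho=0.92, p_value=0.001), IntraGroup(rho=0.55, p_value=0.04),
    InterGroup(p_value=0.01, change=-45.0), InterGroup(p_value=0.02, change=-39.0),
)
print(accepted, rejected)
```

The Bland-Altman step is deliberately left out of the predicate because the text describes it as a visual investigation; an automated bias check would be a further assumption.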

Conclusion:
This review demonstrates that there is a clear lack of rigorous methods to assess the validity of ultra-short HRV features in a control situation and to identify reliable ultra-short HRV features. One of the reasons could be the lack of a clear algorithm guiding scholars in how to identify good surrogates. Therefore, this Letter proposed, in analogy with evidence-based medicine, three algorithms that scholars may follow to assess whether ultra-short HRV features can be considered good surrogates of short-term ones. Recommendations are given in this regard: which method should be used in each step, when intra-group or inter-group correlation and statistical tests are required, and whether those tests should be parametric or non-parametric.