Extended approach to sum of absolute differences method for improved identification of periods in biomedical time series

Time series are a common data type in biomedical applications. Examples include heart rate, power output, and ECG. One of the typical analysis methods is to determine longest period a subject spent over a given heart rate threshold. While it might seem simple to find and measure such periods, biomedical data are often subject to significant noise and physiological artifacts. As a result, simple threshold calculations might not provide correct or expected results. A common way to improve such calculations is to use moving average filter. Length of the window is often determined using sum of absolute differences for various windows sizes. However, for real life biomedical data such approach might lead to extremely long windows that undesirably remove physiological information from the data. In this paper, we:• propose a new approach to finding windows length using zero-points of third gradient (jerk) of Sum of Absolute Differences method;• demonstrate how these points can be used to determine periods and area over a given threshold with and without uncertainty.We demonstrate validity of this approach on the PAMAP2 Physical Activity Monitoring Data Set, an open dataset from the UCI Machine Learning Repository, as well as on the PhysioNet Simultaneous Physiological Measurements dataset. It shows that first zero-point usually falls at around 8 and 5 second window length respectively, while second zero-point usually falls between 16 and 24 and 8–16 s respectively. The value for the first zero-point can remove simple measurement errors when data are recorded once every few seconds. The value for the second zero-point corresponds well with what is known about physiological response of heart to changing load.


Method details
Signals from sensors usually contain some amount of noise that can have a negative impact on further data analysis. Usually, such signals are pre-processed with a form of a low-pass filter to remove unwanted elements. One of the typical and simplest low-pass filters is a moving average filter. In this filter a window of a certain length is continuously applied through the signal and value for the current time step is substituted for average of current and adjacent steps. The length of the window is decided either using domain's rule of thumb or by using Sum of Absolute Differences (SAD) approach. However, SAD approach might lead to too large window length for some biomedical signals, e.g. heart rate.
In the regular SAD approach, a point is identified at which the SAD curve flattens. This point is then selected as the window length. We propose an extension to the SAD approach with use of third gradient, also called jerk, of the SAD curve. Zero-points of the third gradient become suggested window lengths. First point would be considered a conservative value, which preserves most information, but might also not remove all the noise. While second zero-point would be considered a liberal value, which guarantees that most noise is removed while still preserving most important information.

General approach to sum of absolute differences
In SAD approach, a set of filtered versions of the signal is calculated for varying window lengths. Then each of filtered signals is subtracted from the original one, point by point. Absolute values of each point difference are then summed. This results in set of sums, each for different window length. These sums when plotted form a curve that usually flattens from a certain window length, suggesting that there is no further filtering effect for longer windows. This window length is then selected as the optimal for filtering the signal for further analysis.
However, SAD approach might lead to too large window length for some biomedical signals, e.g. heart rate. As we demonstrate in Figs. 1-5 using PAMAP2 [5] dataset and in Figs. 12-14 using "Simultaneous physiological measurements with five devices at different cognitive and physical loads" dataset from PhysioNet database [11 , 12] the resulting window length would become 210-230 or 490-590 s respectively for each dataset using the established approach. It is not only intuitively too long, but also would mask physiological response to a change in physical exertion that becomes visible in less than a minute [9] . Beyond that point a change can still be observed, but it is then related to cardiac drift when maintaining a level of increased exertion [4] . In such a case, selecting window length based on the SAD approach would result in significant loss of information in the available data.
The loss of information is clearly visible in the examples. In Fig. 1 a significant increase in heart rate at point P1 from the original dataset completely disappears after applying Moving Average (MA) filter with window length 230 s, obtained using regular SAD approach. A drop in heart rate value at Heart rate, sum of absolute differences, and heart rate with moving average filter with window length based on usual SAD approach, for Person 1.

Fig. 2.
Heart rate, sum of absolute differences, and heart rate with moving average filter with window length based on usual SAD approach, for Person 2. point P2 is represented, but its lowest value is around 25 bpm higher than in the original dataset. Such difference is physiologically significant.
In Fig. 2 we focus on two significant spikes in heart rate at points P1 and P2. Their value is around 40 bpm higher than the surrounding values. However, after applying MA filter with window length 230 s, the curve tells a completely different story. Points P1 and P2 are part of a consistent heart rate increase. As a result, filtering might lead to a complete change in the interpretation of these data.
In Figs. 3-5 we further observe how the wrong choice of window length alters characteristic points of the signal, to an extent that these points blend with the remaining signal. It happens both for spikes and drops. For point P1 in each of these three figures we notice that a strong spike in heart rate is greatly reduced. While still visible, it hardly stands out of the general trend. For point P2 in each of these three figures a significant drop is reduced to a minor one. Reduction in relative value, to the spikes that often surround such drops, ranges between 50 and 30 bpm.
While it might seem that a general shape of the heart rate curve is preserved after applying MA filter with a window length based on regular SAD approach, we observe that many significant details are lost. This observation is consistent with typical physiological response of heart rate that should be visible in less than a minute [4] . What suggests, therefore, that a window length for a MA filter should not be longer. Heart rate, sum of absolute differences, and heart rate with moving average filter with window length based on usual SAD approach, for Person 3.

Fig. 4.
Heart rate, sum of absolute differences, and heart rate with moving average filter with window length based on usual SAD approach, for Person 4.

Zero-Points of third gradient (Jerk)
If we consider continuous function g(x) , its derivative can be defined as its rate of change with respect to variable x . If function g(x) is a function of one variable, its derivative is the same as partial derivative, that can be calculated as a dot product of the gradient of the function g(x) and unit vector u [1] . In case of uniformly distributed time series with step equal to 1, we can assume that the unit vector u is equal to 1. It leads us to conclusion that in case of discrete function g(i) with step h = 1 , its derivative is equal to the gradient of this function.
The first gradient of the discrete function g(i) is calculated as presented in Eg. (1) [2] : The calculated gradient may be burdened with an error defined by the limiting behavior of a step function h , denoted by O(h 2 ) . In case of regular sampling of the discrete time series g(i) with h = 1 the gradient of g(i) can be calculated as shown in Eq. (2) : We can detect finer changes in the behavior of SAD curve using zero-points of the third gradient of the curve. The third gradient of a discrete function g(i) can be considered as the third derivative of g(i) , should g(i) be a continuous function. It can be understood more intuitively by comparison to the third derivative of position in physics, which is known as jerk, the time derivative of acceleration  SAD_curve.append((n, sum(abs(TS-moving_avg(TS,n)))) 4 3_gradient : = gradient(gradient(gradient(SAD_curve))) 5 ZP : = where_diff(sign(3_gradient))[:3] [3] . For example, the effect of acceleration is the feeling of being pressed into the seat of a car. If the acceleration is not constant, we can talk about jerk that is felt as an increasing or decreasing force on the body.
Calculation of zero crossing point of the third gradient of function g(i) can be expressed by Eq. (3) : If the subsequent zero-points are located less than 3 time-steps apart, then we suggest combining them with the resulting value calculated as a ceiling of an arithmetic mean of these zero-points ( Eq. (4) ).
In Algorithm 1 we present an outline of a simple algorithm to calculate first three zero-points of third gradient of SAD curve for a given signal TS . First, in addition to the input signal, we specify N as a maximum length of SAD curve. In line 1 we create an empty list that will later hold values of sum of absolute differences for different window length of a moving average filter. These values are calculated in lines 2 and 3.
In line 4 we calculate third gradient of the SAD curve. Definition of gradient function is provided earlier in this section and implementation of such function is available in major data analytics tools, e.g. in NumPy. Finally, in line 5, we extract only the sign of third gradient values and check for change in this sign. We select first three such places in the curve. The algorithm can be further extended by combining zero-points located closely to each other.

Selecting moving average window size and calculating typical measures
Later, in method validation, we demonstrate that the first zero-point of jerk corresponds to 5-8 s long window for moving average filter. It is therefore safe to assume that any signal variation in that range would be related to simple measurement errors when data are recorded once every few seconds. Second zero-point of jerk corresponds to 16-24 s window for moving average. In that range physiological effect should already be visible [9] .
As a result, there are at least two values that could be applied in further processing. We suggest two possible approaches to how these values can be used. First approach uses one of these two zero-points of jerk as the window size for a moving average filter and any further calculations are performed in a standard way on the filtered signal. Second approach is approximate and combines the two values together with an uncertainty factor inspired by the interval type-2 fuzzy set footprint of uncertainty [10] .
Two of the most typical cumulative measures used to evaluate the condition of a person based on the heart rate trace are: (1) longest period above threshold value of HR, and (2) area under the plot for HR greater than the threshold value. We present how these measures are formally calculated, before using them for validation.
In order to calculate the longest period above the threshold value thr of HR we define S to be the set of periods s m for the time series HR , where each s m is a series of subsequent values from HR having values HR ( i ) ≥ thr ( Eq. (5) ). Total number of periods s m in given HR series is assumed to be equal to M .
Single element s m of set S has length l m that is defined by Eq. (6.) : We define L as the set of lengths l m . The longest period above the threshold value is denoted as l max and is the maximum value of a set L ( Eq. (7) ): Energy is considered as the area A under the plot above the threshold value and is calculated as shown in Eq. (8) : Number of elements in the HR sequence is defined as I and n is the number of samples in sequence HR greater or equal to thr value. Area under the plot can be expressed in beats: /60. In case of the area under the plot and above the threshold value, it is equal to the number of total beats over the threshold.
In Fig. 6 we present a typical plot of type-2 fuzzy set membership function with footprint of uncertainty applied to two zero-points of third gradient of SAD curve.
We define length adjusted by the uncertainty factor as given in Eq. (9) : where: l 1 -length of the longest segment above the threshold for sampling time equal to 1st zero-point of jerk l 2 -length of the longest segment above the threshold for sampling time equal to 2nd zero-point of jerk We define uncertainty factor in Eqs. (10) and 11 : where: a = 1 b 1 -distance between 0 and point P 1 b 2 -distance between points P 1 and P 2 P 1 and P 2 -window sizes corresponding to 1st and 2nd zero-point of jerk For example, for P 1 = 5, P 2 = 22, l 1 = 950 , and l 2 = 1298; we have b 1 = 5 and b 2 = 17 and consequently: We define area adjusted by the uncertainty factor in Eq. (12) : where:

Method validation
We demonstrate the validity of the proposed method in three different ways. First, we show that most important characteristics of the heart rate curve are preserved, when using first or second zeropoint of third gradient of SAD as the window length for a MA filter. We also make a connection between values of the zero-points and physiological phenomena related to heart rate development. Subsequently, we demonstrate improved results in calculating longest period over a given threshold value, based on the filtered data. Finally, we discuss an impact of MA filtering on integral of heart rate over a given threshold value. It is another common statistic, that might be impacted in a non-obvious way by filtering.

Part 1 of validation -shape of HR curve
We observed earlier, in Figs. 1-5 , that applying MA filter with window length obtained using regular SAD approach leads to loss of significant details in heart rate signal. Moreover, such window length appears also to conceal physiological phenomena.
We applied the modified SAD approach using zero-points of third gradient, as defined in the previous section, to the same PAMAP2 data. First three zero-points were calculated. For the first zeropoint values ranged from 5 to 8 s for different persons. For the second zero-point values ranged from 16 to 24 s. For the third zero-point values ranged from 21 to 32 s.
It is interesting to notice that all these zero-points lie with the physiological response range for the heart rate [4] . First zero-point is at the very beginning of that range, so it can ensure that all physiological phenomena will still be visible in the filtered data. Window with the length corresponding to the second zero-point offers more smoothing, since it is almost triple the value, but it still fits well within the physiological range. Using the third zero-point might still be useful, but in some cases the window might be considered too long. Especially, when it exceeds 30 s.
Based on these observations, first and second zero-points seem to be of most interest. With first zero-point being a safe choice. While the second zero-point is potentially optimal, balancing smoothing effect with detail preservation.
In Figs. 7-11 we apply the modified SAD approach to the same persons from PAMAP2 dataset as in Figs. 1-5 . Data filtered with MA filter with window length selected using standard SAD approach are presented in the left part of each figure. First three zero-points of third gradient of SAD curve are calculated and presented in the middle part of each figure. Finally, the filtered signal using second zero-point is presented in the right part of each figure.
We can notice, that in each case the shape of the heart rate curve is well preserved. All characteristic points that were misrepresented using standard SAD approach, are represented correctly with the modified SAD approach.   9. Heart rate, sum of absolute differences with third gradient, and heart rate with moving average filter with window length based on second zero-point of third gradient, for Person 3 from PAMAP2 dataset.
In Fig. 7 a significant increase in heart rate at point P1 from the original dataset is well represented applying Moving Average (MA) filter with window length of 8 s, obtained using modified SAD approach. A drop in heart rate value at point P2 is represented with little difference in the absolute value.
In Fig. 8 we focus on two significant spikes in heart rate at points P1 and P2. Their value is around 40 bpm higher than the surrounding values. This characteristic is preserved in the signal filtered with window length of 25 s, corresponding to the second zero-point of third gradient of the SAD curve. So, the interpretation of these data is preserved after the filtering.
In Figs. 9-11 we further observe how the window length corresponding to the second zero-point preserves characteristics of the signal. Both spikes and drops maintain the right shape. For point P1 in each of these three figures we notice that a strong spike in heart rate is preserved. For point P2 in each of these three figures significant drops are reduced only by a few bits per minute.
The proposed approach was further validated using PhysioNet dataset "Simultaneous physiological measurements with five devices at different cognitive and physical loads". Results for three example subjects are presented in Figs. 12-14 . In this case, window length calculated using standard SAD approach was even longer, ranging from 490 to 590 s. We can observe very similar loss of detail as in the case of PAMAP2 dataset. Many potentially useful heart rate fluctuations are lost, while only general trend of heart rate is preserved. Fig. 10. Heart rate, sum of absolute differences with third gradient, and heart rate with moving average filter with window length based on second zero-point of third gradient, for Person 4 from PAMAP2 dataset. Fig. 11. Heart rate, sum of absolute differences with third gradient, and heart rate with moving average filter with window length based on second zero-point of third gradient, for Person 5 from PAMAP2 dataset.
Window length based on the 2nd zero point of 3rd gradient of SAD curve ranged from 8 to 16 s. The resulting heart rate signal preserved all the major variations of the original, while removing the noise.
Based on the shape of the heart rate curve, we conclude that using modified SAD approach as proposed in the previous section, preserves characteristic points including shape, absolute value, and relative values.

Part 2 of validation -longest period
Heart-rate related measures are one of the most commonly used markers in monitoring performance, fitness and fatigue of a person during physical effort [6] . One of the methods of evaluating person's fitness and performance, as well as planning physical training, is based on heart rate zones [7] . Heart rate zones are determined using predefined percentage of maximum heart rate of a person. In automatic analysis of time spend during extended physical effort in certain heart rate zone it is crucial to obtain the whole period of HR above given threshold. For unfiltered HR signal such periods may be erroneously divided into shorter segments due to HR sensor inaccuracy or a single sample below the selected threshold. Application of the proposed method allows to minimize or eliminate such errors.
For an average middle age person zone 2 of the heart rate, corresponding to the middle exercise intensity, is a typical aerobic target. The bottom value of heart rate in zone 2 is equal to the 60% of a maximum heart rated. We approximate this value to be equal to 105 bpm for average young to Heart rate, heart rate with moving average filter with window length based on usual SAD approach, sum of absolute differences with third gradient, and heart rate with moving average filter with window length based on second zero-point of third gradient, for Person 1 from simultaneous physiological measurements dataset. Fig. 13. Heart rate, heart rate with moving average filter with window length based on usual SAD approach, sum of absolute differences with third gradient, and heart rate with moving average filter with window length based on second zero-point of third gradient, for Person 2 from simultaneous physiological measurements dataset.      threshold value for the second zero-point of the third gradient. This increase is equal to 90%, 36% and 35% respectively. Including the uncertainty factor UF defined in the previous section, this increase was lower, equal respectively to 8%, 1.7% and 1.8% for persons 1, 2 and 5, as showed in Table 1 .
In case of person 1 we can observe a drop in the heart rate around 2600s., however the heart rate drops slightly below the threshold, by about 1-2 bpm. The behavior of the HR curve suggests that this decrease is insignificant and should not influence the period of heart rate above the threshold of 105 bpm. In case of the original signal and signal filtered using window length corresponding to the first zero-point of the third gradient, this decrease in the heart rate is reducing the period of the HR above the threshold by almost half, from about 43 min to 22 min, what can be crucial during training of a middle aged person.
Cases for person 2 and person 5 are similar, in terms of a slight and short decrease of heart rate, to person 1. In both cases the HR drop below the threshold is by 1-2 bpm and should not be considered in the determination of the value of the longest period above 105 bpm. The increase of this period for persons 2 and 5 for filtered signal with window length equal to the second zero-point is not as large as in case of person one, but still significant, equal to around 5 min. For persons 3 and 4 the proposed method did not influence the length of the longest period of heart rate above the threshold. Both HR curves show moments of decreased activity, lasting for few seconds, before the considered period in case of person 3, and after the considered period in case of person 4.
It is clearly visible that the longest period above the threshold value calculated for window length equal to 230 is significantly longer in case of persons 1, 2, 3 and 5. For person 1 this value is very similar to the result calculated for the window length corresponding to the second zero-point of the third gradient. However, in case of persons 2 and 5 these values cannot be taken into consideration in analysis of the heart rate in terms of person's performance and fatigue. As we can observe on the first plot of Figs. 16 and 19 , there are segments of reduced HR: for person 2 this significant decrease of heart rate is present in segment between 20 0 0s. and 280 0s, while in case of person 5 the period of rest occurs between 2600s. and 2900s. In Table 1 we can see that the increase of the period length for person 2 is by 15 min in comparison to the result for the window length based on the second zeropoint and by 20 min in comparison to the value obtained for the original signal. In case of person 5 these differenced are even more significant, equal respectively to 21 min and 27 min. Differences at this level of values make filtering of the signal with window length of 230 unreliable.
The method was also validated using Physionet dataset "Simultaneous physiological measurements with five devices at different cognitive and physical loads". The threshold was set to 95 bpm to evaluate the heart rate signal recorded during the period of five-minute walking on the treadmill at a moderate speed. As it can be observed in Figs. 20-22 , the signal is very irregular. In case of the original signals the elevated pulse was evaluated to last for less than 3 min in all three cases. For the signals processed using filters based on the second zero-point of the third gradient the resulting longest period increased by 38%, 70% and 53% for persons 1, 2 and 3 respectively. Including the uncertainty factor UF defined in the previous section, this increase was lower, equal respectively to 6%, 68% and 17%, as showed in Table 2 .
The longest period above the threshold value calculated for window length equal to 540 for the three analyzed signals is presented in Fig. 23 . It can be observed that the signal is significantly smoothed, and its level decreased, what resulted in identification of the region of interest to the incorrect signal part, namely the region when the persons performed uphill walking on the treadmill. Therefore, consideration of the longest period above the threshold, as well as the area under the HR curve and above the threshold in case of the window of such large length is pointless.

Part 3 of validation -area under the curve
One of the parameters recommended in analysis of serial measurements in medical research is area under the curve. It can be interpreted as a kind of weighted average of the analyzed signal and can be used for finding and evaluating patterns among groups of monitored patients [8] .The area between the HR curve and the threshold line was calculated for the original signal, signal with Moving Average filter with window length based on first and second zero-point of third gradient, and length equal to 230. Obtained values are presented in Table 3 . We can observe that for the original signal and signals with applied Moving Average filter with window length based on first zero-point of third gradient, are almost the same, with the decreasing tendency for increasing the window length. The reason for     Table 4 .
Obtained results show that proposed method of filtering does not affect significantly the area under the HR curve and this parameter can be used in analysis of the time series. On the other hand, we can see that filtering affects the amplitude of the signal and applying Moving Average filter with  long window is not recommended as it significantly affects the area under the curve. This problem might manifest often when measuring heart rate under a load lasting more than several minutes. Due to aforementioned cardiac drift, average heart rate will increase with time even without changes to the exercise load. It will result in the time series of heart rate becoming non-stationary. What will lead to distortion of measurements if a Moving Average filter with long window is applied.
The area under the HR plot and above the threshold value, as well as the total area under the HR plot, were evaluated also for the PhysioNet Simultaneous Physiological Measurements dataset and are presented in Table 5 and Table 6 . It is clearly visible that for all the analyzed signals there is an increase of the area under the HR plot and above the threshold value as the length of the longest period above the threshold increases as a result of applying the filter based on the first and second zero-point of the third gradient.
The values of the total area under the HR plot have slightly decreasing tendency as the size of the window increases, however this decrease is not very significant for the filters based on the first and second zero-point of the third gradient. It is much greater in case of the window size of 540 and is equal to 20-30%, as it can be seen in Table 6 .

Concluding remarks
Sum of absolute differences is an established approach to selecting window length for a moving average filter. The underlying properties of a process generating the data might, however, render results of such approach detrimental to further analysis for some signals. Biomedical signals, such as heart rate, are a good example, since underlying physiology is often well understood. As we demonstrated regular SAD approach results in long windows that undesirably remove physiological information from the data.
The method outlined in this paper provides a formalized alternative to the regular sum of absolute difference approach. It extends it with calculation of zero-points of third gradient, sometimes called jerk, to detect finer variations in the shape of SAD curve.
We demonstrated validity of this approach on the PAMAP2 Physical Activity Monitoring Data Set, an open dataset from the UCI Machine Learning Repository, as well as on the PysioNet Simultaneous Physiological Measurements dataset. It showed that first zero-point usually falls at around 8 and 5 second window length respectively, while second zero-point usually falls between 16-24 and 8-16 s respectively. The value for the first zero-point can remove simple measurement errors when data are recorded once every few seconds. The value for the second zero-point corresponds well with what is known about physiological response of heart to changing load [9] .