Beat-to-beat heart rate estimation fusing multimodal video and sensor data

Abstract: Coverage and accuracy of unobtrusively measured biosignals are generally low compared to clinical modalities. This can be improved by exploiting redundancies in multiple channels with methods of sensor fusion. In this paper, we demonstrate that two modalities, skin color variation and head motion, can be extracted from the video stream recorded with a webcam. Using a Bayesian approach, these signals are fused with a ballistocardiographic signal obtained from the seat of a chair, yielding a mean absolute beat-to-beat estimation error below 25 milliseconds and an average coverage above 90% compared to an ECG reference.


Introduction
Unobtrusive acquisition of biosignals for health and wellness applications has become increasingly popular in recent years [1][2][3][4]. In particular, monitoring of the heart rate and its variability outside the classical scenarios such as hospitals and sleep laboratories is an active area of research. It offers great medical potential, as the heart rate variability (HRV) has a wide range of applications from work stress analysis [5] to the prediction of sudden cardiac death in chronic heart failure patients [6].
While unobtrusive measurement modalities greatly increase comfort and the number of application scenarios, they are easily disturbed by motion artifacts and have a lower signal-to-noise ratio (SNR) than clinical modalities such as the conductive electrocardiogram (ECG) or the finger-attached photoplethysmogram (PPG). Thus, when analyzing unobtrusively acquired measurement data, episodes that contain no valid information can occur and must be excluded from subsequent processing. At the same time, the more data is excluded, the greater the chance that important information is missed. An unobtrusive measurement system is therefore often evaluated in terms of both accuracy and coverage, as one can often only be improved at the cost of the other.
To overcome this, the fusion of biosignals obtained from multiple channels and multiple modalities has proven to be a promising approach. The basic idea is that each acquired biosignal originates from a single common source [7], see Fig. 1. This source, the human heart, can be modeled to create a virtual signal that triggers several physical responses. With different sensors, these multimodal signals can be acquired, for example, a differential electric potential (ECG), a change in optical property (PPG), and a mechanical impulse (ballistocardiography, BCG). While the signal morphology as well as the relative peak location differs, the beat-to-beat interval is the same in each channel.
In unobtrusive monitoring, video-based methods play an increasingly important role due to the low cost and ubiquitous availability of sensors in the form of, for example, webcams and smartphones [1,2]. Additionally, since no direct contact with the subject is necessary, no concerns in terms of safety and hygiene arise. At the same time, this makes video-based methods very susceptible to motion artifacts. Moreover, some sort of illumination and a direct line of sight are required, which complicates applications like sleep monitoring. Here, other modalities that can easily be integrated into the mattress, such as capacitive electrocardiography (cECG) [4] or BCG [3], have proven to allow the accurate and robust determination of beat-to-beat intervals (BBI) for HRV analysis.
Vast portions of daily life in post-industrial countries are spent sitting, often in front of computers. At the same time, diseases of the cardiovascular system as well as negative stress due to excessive workload pose serious threats to public health. Thus, constant unobtrusive monitoring of seated subjects has great potential and several concepts have been proposed. Most of those, however, are based on single modalities or do not allow HRV analysis with beat-to-beat accuracy.
According to the definition introduced above, the video signal obtained from a subject's head is by itself a multimodal biosignal, as it contains the involuntary head motion [8] that originates from the mechanical impulse of the heart as well as the changes in the optical properties of the skin [9] caused by the superficial perfusion. Noise and artifacts might influence the channels to varying degrees: While the BCG and the pulse-related head oscillations are very susceptible to motion artifacts, advanced video processing methods allow the extraction of PPG signals even if the subject is moving [10]. On the other hand, a remote PPG signal is more difficult to obtain from subjects with a higher melanin concentration, i.e., darker skin. It is reasonable to assume that the melanin content has no influence on head motion tracking, and it obviously has none on the BCG signal. Stationary, periodic changes in illumination can be compensated in remote PPG sensing [11], while non-stationary lighting conditions are probably harder to deal with in motion estimation and especially in remote PPG sensing. The signal of a mechanically coupled BCG sensor in the seat of a chair is obviously insensitive to the lighting conditions but might be corrupted by motion of the lower extremities. Thus, it can be assumed that even if no strong artifacts are present, exploiting redundancies in multimodal signals can help improve coverage and accuracy in the processing of unobtrusively acquired biosignals with a low SNR.
In this paper, we bridge the gap between video-based and contact-based unobtrusive monitoring modalities by multimodal sensor fusion of video and BCG data. We demonstrate that beat-to-beat intervals can be estimated with an average absolute error below 25 ms and coverage above 90 % when compared to an ECG reference. We further show that even the fusion of only the motion and photoplethysmographic components of the video data greatly improves coverage while maintaining high accuracy.
The paper is structured as follows: In the next section, the measurement setup as well as the algorithmic details are described. Results and discussion are presented in section 3, and the paper is concluded in section 4.

Method and materials
The measurement system setup is visualized in Fig. 2. For video acquisition, the webcam "C260" by Logitech International S.A., Apples, Switzerland, was used. Images were acquired at f_s,video = 30 Hz with a resolution of 800 by 600 pixels. Regular environmental light consisting of a mixture of sunlight and light from fluorescent tubes installed in the ceiling was used for illumination.
The ballistocardiographic signal was obtained by placing an "EMFi transducer" mat (L-Series by Emfit Ltd, Vaajakoski, Finland) on the seat of a chair. A custom-built analog amplifier / bandpass with a passband of 0.01 to 200 Hz and a gain of 24.61 dB was used for analog signal conditioning. Analog-to-digital conversion was performed using the "NI USB-6212" (National Instruments, Austin, Texas, USA) data acquisition system (DAQ) at the sampling frequency f_s,DAQ = 1000 Hz.
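As a quick plausibility check, the stated gain can be converted to a linear factor. The text does not say whether the figure is a voltage or a power gain; the line below assumes a voltage (amplitude) gain, i.e., the 20 log10 convention:

```latex
G_{\mathrm{lin}} = 10^{G_{\mathrm{dB}}/20} = 10^{24.61/20} \approx 17.0
```

Under this assumption, the amplifier scales the transducer signal by a factor of roughly 17 before digitization.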
For reference, a single-channel ECG was recorded using the "IntelliVue MP70" patient monitor (Philips, Amsterdam, The Netherlands). Its analog output was sampled in parallel to the BCG signal and QRS-peaks were extracted using the "Open Source ECG Analysis Software" by EP Limited (http://www.eplimited.com/).
Four healthy volunteers with fair skin were asked to sit perfectly still in front of the camera (Trial 1), to perform a reading task from the computer screen without motion (Trial 2), and to read from the computer screen without further instructions (Trial 3). In every trial, recordings were performed for two minutes. An additional fifth volunteer with darker skin and an extremely low resting heart rate participated once to demonstrate the algorithm's wide range of detectable intervals. The average heart rates for all participants and trials as determined by the ECG are listed in Table 1.
In the following, an algorithm that performed very successfully in the beat-to-beat analysis of multichannel BCG data [3] is recapitulated and augmented by an adaptive Gaussian prior. As argued above, cardiac signals acquired through different modalities might exhibit various waveforms while the underlying periodicity is the same in all channels. It is further reasonable to assume that one heart beat will show great similarity to the one before and that the interval between beats can be determined by analyzing each signal's self-similarity.
Various metrics for the assessment of self-similarity exist. Among them, the short-time autocorrelation (STA) function is widely used. Let x(n) be a time-discrete signal from which an analysis window with index i, centered around n_i, is taken (Fig. 3(A)). The index is omitted in the following for better readability. Commonly, the STA is evaluated for each lag η over a window of constant length L, see also Fig. 3(B). This implies that for each candidate lag η, L − η samples are considered. This approach can be used successfully if the average interval length within the window is of interest. If, however, the interval between exactly two consecutive beats is of interest, the lag-adaptive short-time autocorrelation (LASTA) ensures that exactly the number of samples necessary for each candidate lag η is considered, see also Fig. 3(C).
Another metric to assess self-similarity is the modified average magnitude difference function (AMDF) used in speech processing for pitch extraction [12]. The modified AMDF also uses the lag-adaptive window and is inverted so that it assumes larger values for lags that indicate more self-similarity (Fig. 3(D)). As a third metric, the maximum amplitude pairs (MAP) function can be considered an indirect peak detection. Like LASTA and AMDF, the MAP function assumes large values for lags that indicate self-similarity, see Fig. 3(E).
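The three lag-adaptive measures can be illustrated with a short sketch. The exact formulas are not reproduced in this text, so the definitions below are assumptions modeled on the verbal description: for each candidate lag η, the η samples after the window center are compared with the η samples before it, using a mean product (LASTA), an inverted mean absolute difference (AMDF) and the largest pairwise sum (MAP). The function name `self_similarity`, the normalization by η and the small constant guarding the division are illustrative choices.

```python
import numpy as np

def self_similarity(x, center, eta_min, eta_max):
    """Return lags and assumed LASTA, inverted-AMDF and MAP curves around `center`."""
    x = np.asarray(x, dtype=float)
    if center - eta_max < 0 or center + eta_max > len(x):
        raise ValueError("window does not fit into the signal")
    lags = np.arange(eta_min, eta_max + 1)
    lasta = np.empty(len(lags))
    amdf = np.empty(len(lags))
    mapf = np.empty(len(lags))
    for k, eta in enumerate(lags):
        right = x[center:center + eta]   # eta samples after the window center
        left = x[center - eta:center]    # eta samples before the window center
        lasta[k] = np.mean(right * left)                          # lag-adaptive autocorrelation
        amdf[k] = 1.0 / (np.mean(np.abs(right - left)) + 1e-12)   # inverted AMDF
        mapf[k] = np.max(right + left)                            # maximum amplitude pair
    return lags, lasta, amdf, mapf

# Artificial signal: beats at 350, 500, 700 and 850 samples, window centered at 600.
n = np.arange(1200)
x = sum(np.exp(-(n - p) ** 2 / 50.0) for p in (350, 500, 700, 850))
lags, lasta, amdf, mapf = self_similarity(x, 600, 100, 300)
print(lags[np.argmax(lasta)], lags[np.argmax(amdf)], lags[np.argmax(mapf)])
```

Because only 2η samples around the center enter each lag, the neighboring 150-sample intervals of the artificial signal do not contribute to the lag that spans the two central beats.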
To exploit the different noise characteristics of the three estimators and allow multimodal signal fusion, a Bayesian approach is chosen. This can be achieved by interpreting the estimator results as a-posteriori probability density functions (PDF) in a loose statistical sense. Thus, p(η|S_LASTA), p(η|S_AMDF) and p(η|S_MAP) represent the probability that η is the correct interval length according to the respective estimator. Through linear shifting and scaling of each estimator result, the properties of a proper PDF, i.e., positivity and unit area under the curve, can be achieved. If we assume a uniform a-priori distribution p(η), we get

$$p(\eta \mid S_{\mathrm{Fusion}}) \propto p(\eta \mid S_{\mathrm{LASTA}}) \, p(\eta \mid S_{\mathrm{AMDF}}) \, p(\eta \mid S_{\mathrm{MAP}})$$

by applying Bayes' theorem. Finding the optimal interval η_opt is thus reduced to finding the maximum of the product of the scaled estimator outputs, see Fig. 3(F).

Fig. 3. Visualization of the different self-similarity measurements and their fusion. In (A), a window centered between two beats of an artificial signal with an interval of η = 200 samples is shown. Since the window contains multiple beats, the STA (B) will be dominated by the two 150-sample intervals surrounding the interval of interest. On the other hand, LASTA (C), AMDF (D) and MAP (E) will show a peak at the correct location. Fusing the three estimators (F) results in an even more distinct peak. All curves are in arbitrary units; the quality metric Q is defined as the ratio of the peak height to the area under the curve.

The extension towards a multimodal, N-channel setting [13], where each estimator is calculated for each of the N channels, is straightforward: the scaled estimator outputs of all channels are multiplied. To determine whether or not an estimated interval is reliable, the quality metric Q, i.e., the ratio of the peak height to the area under the curve, is calculated. If p(η|S_Fusion) does not show a clear maximum, Q is small. Only intervals with Q > Q_th are accepted and thus, the choice of Q_th determines the trade-off between coverage and accuracy as introduced above.
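The fusion step and the quality metric can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: each curve is shifted by its minimum and divided by its sum (one possible "linear shifting and scaling"), the area under the curve is approximated by a plain sum over the lag grid, and the helper names `to_pdf`, `fuse`, `quality` and `estimate_interval` are hypothetical. Because the numerical scale of Q depends on how the area is computed, the threshold used in the text need not carry over to this sketch.

```python
import numpy as np

def to_pdf(curve):
    """Shift and scale a curve so it is non-negative and sums to one (assumed scaling)."""
    c = np.asarray(curve, dtype=float)
    c = c - c.min()
    total = c.sum()
    if total == 0:
        return np.full(c.shape, 1.0 / c.size)  # a flat curve carries no information
    return c / total

def fuse(curves):
    """Multiply the scaled curves of all estimators (and, if present, all channels)."""
    fused = np.ones(len(curves[0]))
    for c in curves:
        fused = fused * to_pdf(c)
    return fused

def quality(fused):
    """Ratio of the peak height to the area (here: sum) under the fused curve."""
    area = fused.sum()
    return float(fused.max() / area) if area > 0 else 0.0

def estimate_interval(lags, curves, q_th):
    """Return the most probable lag, its quality Q and whether Q exceeds q_th."""
    fused = fuse(curves)
    q = quality(fused)
    return lags[int(np.argmax(fused))], q, q > q_th

# Three synthetic estimator curves that agree on a lag of 200 samples.
lags = np.arange(100, 301)
a = np.exp(-(lags - 200) ** 2 / 200.0)
b = np.exp(-(lags - 200) ** 2 / 2000.0)
c = a + 0.5 * np.exp(-(lags - 150) ** 2 / 200.0)
print(estimate_interval(lags, [a, b, c], q_th=0.0))
```

For N channels, the same `fuse` call would simply receive the 3N scaled curves in one list.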
Here, a fixed Q_th = 0.3 is used unless noted otherwise. The window length L determines the maximum interval, i.e., the minimum heart rate that can be detected. In this approach, it was set to 1500 ms, corresponding to 40 BPM. The shift period of the moving window was 200 ms and a low-pass filter with a cutoff frequency of 10 Hz was applied to all signals before interval estimation. Moreover, the maximum detectable heart rate was set to 140 BPM. Apart from this, no temporal constraints are put on the algorithm and it is thus capable of processing severely arrhythmic data.
However, applications like stress or workload monitoring exist, where the HRV itself can be expected not to be pathologically high. In this case, an interval has a high probability of lying within a certain range determined by the preceding intervals. For such applications, we assume an adaptive Gaussian prior centered on the exponentially weighted moving average of the previously estimated intervals, with weighting factor α and standard deviation σ. Here, α = 0.7 was used and σ was chosen to be 0.05 seconds, which corresponds to approximately 3 beats per minute at a heart rate of 60 BPM. Choosing a smaller σ would make the algorithm more robust towards outliers. At the same time, it would limit its ability to process severely irregular cardiac signals and track fast changes.
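The adaptive prior can be sketched in a few lines. The formulas are not reproduced in this text, so two conventions below are assumptions: the prior is a Gaussian centered on the running mean of the accepted intervals, and α weights the history while 1 − α weights the newest interval. The function names are hypothetical.

```python
import numpy as np

def bpm_to_interval_s(bpm):
    """Convert a heart rate in beats per minute to a beat-to-beat interval in seconds."""
    return 60.0 / bpm

def gaussian_prior(lags_s, mu_s, sigma_s=0.05):
    """Gaussian prior over candidate intervals, normalized to unit sum."""
    p = np.exp(-0.5 * ((np.asarray(lags_s, dtype=float) - mu_s) / sigma_s) ** 2)
    return p / p.sum()

def update_mean(mu_prev_s, accepted_interval_s, alpha=0.7):
    """Exponentially weighted moving average (assumed: alpha weights the history)."""
    return alpha * mu_prev_s + (1.0 - alpha) * accepted_interval_s

def apply_prior(fused, lags_s, mu_s, sigma_s=0.05):
    """Multiply a fused estimator curve with the prior and return the best lag."""
    posterior = np.asarray(fused, dtype=float) * gaussian_prior(lags_s, mu_s, sigma_s)
    return lags_s[int(np.argmax(posterior))], posterior

# Candidate intervals between the limits given in the text (140 BPM and 40 BPM).
lags_s = np.linspace(bpm_to_interval_s(140), bpm_to_interval_s(40), 1101)
# A fused curve with two equally high peaks: the true interval and half of it.
fused = np.exp(-((lags_s - 0.5) / 0.02) ** 2) + np.exp(-((lags_s - 1.0) / 0.02) ** 2)
eta, _ = apply_prior(fused, lags_s, mu_s=0.97)
print(round(float(eta), 3))
```

In this toy example the prior resolves the ambiguity between the two peaks in favor of the one close to the recent average, which is the kind of outlier suppression described above.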
The one-dimensional BCG signal only needs to be bandpass filtered before interval estimation. The two-dimensional video signal, on the other hand, requires more extensive preprocessing. First, the face is identified using the Viola-Jones face detector [14] in the first frame. In subsequent frames, the KLT feature tracker [15-17] is used to ascertain the face position and determine head motion. The latter consists of a cardiac component and of motion artifacts. To extract the signal of interest, the approach described in [8] is used. Principal component analysis (PCA) is performed and the first five principal components are analyzed via the fast Fourier transform (FFT) to find the most harmonic signal, i.e., the signal most likely to contain the cardiac component, which is termed "Video Motion". A more recent approach relying on the discrete cosine transform [18] was not found to improve the result.
To extract the photoplethysmographic information from the video stream, several approaches are described in the literature [9,19]. Recent developments include the use of independent component analysis (ICA, [1]), PCA [20] or a combination of the RGB channels; these were implemented and evaluated on our data. The best results, however, were obtained by the straightforward approach of taking the spatial average of the green channel of a tracked region of interest (ROI) placed on the subject's forehead as illustrated in Fig. 4. Other methods, for example including ROIs on the subject's cheeks, showed no improvement. To reduce the respiratory component and high-frequency noise, a bandpass filter with a passband of 0.7 to 3.6 Hz was applied. This channel is termed "Video PPG" in the following.
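The selection of the most harmonic principal component can be sketched as follows. The text does not define how harmonicity is scored, so the score used here is an assumption: the share of total spectral power contained in the strongest in-band frequency and its first harmonic. The band limits of 0.67 to 2.33 Hz are taken from the 40 to 140 BPM range mentioned earlier, and the function names are hypothetical. Feature tracking itself is not shown; the input is assumed to be an array of tracked coordinates.

```python
import numpy as np

def pca_components(traj, n_components=5):
    """traj: (n_frames, n_features) coordinates. Returns component time series."""
    centered = traj - traj.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    k = min(n_components, vt.shape[0])
    return centered @ vt[:k].T

def periodicity(sig, fs, f_lo=0.67, f_hi=2.33):
    """Share of power in the strongest in-band frequency and its first harmonic."""
    spec = np.abs(np.fft.rfft(sig - np.mean(sig))) ** 2
    freqs = np.fft.rfftfreq(len(sig), d=1.0 / fs)
    band = np.flatnonzero((freqs >= f_lo) & (freqs <= f_hi))
    if spec.sum() == 0 or band.size == 0:
        return 0.0
    k = band[np.argmax(spec[band])]
    power = spec[k]
    if 2 * k < len(spec):
        power = power + spec[2 * k]
    return float(power / spec.sum())

def most_periodic_component(traj, fs, n_components=5):
    """Return the component with the highest periodicity score and its index."""
    comps = pca_components(traj, n_components)
    scores = [periodicity(comps[:, j], fs) for j in range(comps.shape[1])]
    best = int(np.argmax(scores))
    return comps[:, best], best

# Synthetic example: a weak 1.2 Hz oscillation hidden under a large slow drift.
rng = np.random.default_rng(0)
t = np.arange(600) / 30.0
cardiac = np.sin(2 * np.pi * 1.2 * t)
drift = 5.0 * np.sin(2 * np.pi * 0.1 * t)
traj = (np.outer(cardiac, [1, -1, 1, -1, 1, -1, 1, -1])
        + np.outer(drift, np.ones(8))
        + rng.normal(0.0, 0.05, size=(600, 8)))
signal, index = most_periodic_component(traj, fs=30.0)
print(index, abs(np.corrcoef(signal, cardiac)[0, 1]))
```

The example illustrates why the component with the largest variance is not necessarily the one of interest: the drift dominates the variance, while the cardiac oscillation dominates the periodicity score.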
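The green-channel approach can be sketched with NumPy alone. Two simplifications are assumptions of this sketch: the ROI is a fixed rectangle rather than a tracked region, and the bandpass is realized by zeroing FFT bins outside 0.7 to 3.6 Hz, since the filter design is not specified in the text. The function names and the frame layout (frames, height, width, RGB) are illustrative.

```python
import numpy as np

def green_roi_mean(frames, roi):
    """frames: (n_frames, height, width, 3) RGB array; roi: (top, bottom, left, right)."""
    top, bottom, left, right = roi
    return frames[:, top:bottom, left:right, 1].mean(axis=(1, 2))

def fft_bandpass(sig, fs, f_lo=0.7, f_hi=3.6):
    """Zero-phase bandpass by removing all FFT bins outside [f_lo, f_hi] (assumed design)."""
    sig = np.asarray(sig, dtype=float)
    spec = np.fft.rfft(sig - sig.mean())
    freqs = np.fft.rfftfreq(len(sig), d=1.0 / fs)
    spec[(freqs < f_lo) | (freqs > f_hi)] = 0.0
    return np.fft.irfft(spec, n=len(sig))

# Synthetic video: constant brightness plus pulse, respiration and high-frequency parts.
fs = 30.0
t = np.arange(300) / fs
pulse = np.sin(2 * np.pi * 1.2 * t)
green = 100.0 + pulse + 5.0 * np.sin(2 * np.pi * 0.2 * t) + 0.5 * np.sin(2 * np.pi * 6.0 * t)
frames = np.zeros((300, 8, 8, 3))
frames[..., 1] = green[:, None, None]
video_ppg = fft_bandpass(green_roi_mean(frames, (2, 6, 2, 6)), fs)
print(float(np.max(np.abs(video_ppg - pulse))))
```

The 0.2 Hz term stands in for respiration and the 6 Hz term for high-frequency noise; both lie outside the passband and are removed, while the 1.2 Hz pulse component is retained.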

Results and discussion
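For all 13 two-minute recordings, six scenarios were evaluated. Interval estimation was performed using 1. the BCG signal only (BCG), 2. the Video PPG signal only (Video PPG), 3. the Video Motion signal only (Video Motion), 4. the fusion of the two video signals, 5. the fusion of all three signals, and finally 6. all three signals assuming the adaptive Gaussian prior described above (Fusion all w. Prior).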
In Table 2, all results are listed in terms of mean absolute interval estimation error (E_abs) in milliseconds as well as coverage in percent. The information is presented graphically in Fig. 5. Additionally, a gross analysis is performed, i.e., all 1733 intervals of all recordings are analyzed together. The result is presented in terms of coverage as well as median, 5th and 95th percentile of the absolute interval error in milliseconds in Fig. 6. In Fig. 7, the Bland-Altman plot for the scenario "Fusion all w. Prior" displaying all intervals is shown. Finally, for the same scenario, one recording with excellent coverage and error as well as the recording with the highest mean absolute error are displayed in Fig. 8.
Several observations can be made. First, the three modalities show very different results. In the gross statistic (Fig. 6), Video Motion shows the lowest coverage. The coverage of BCG and Video PPG as well as the median error is comparable, while the 95th percentile is more than five times higher for BCG. Comparing the single recordings in Table 2 further reveals that there is no optimal modality per se: The best error and coverage are achieved via BCG for Trial 1 #1 and via Video PPG for Trial 1 #2. For Trial 3 #1, Video Motion achieves the lowest error, BCG the highest coverage.
If multiple channels are fused, coverage is always improved. The mean absolute error of the fused signal is always improved compared to the worst channel but might be inferior to the best single channel in some cases, while in others, the fused result might be better than any single channel. When averaging over all recordings, the mean error as well as the coverage is improved beyond the individual modalities if all are fused together. Additionally, when introducing the adaptive Gaussian prior, coverage as well as mean error could be further improved. Examining this scenario in the Bland-Altman plot (Fig. 7), an insignificant bias of +4.37 ms can be determined.
Additionally, no systematic difference between the estimation of relatively small intervals of subjects 1 to 4 (about 900 ms) and relatively large intervals of subject 5 (about 1400 ms) can be observed. Encouragingly, Fig. 8 reveals that even in the worst recording of this scenario, the average heart rate as well as its trend can be tracked accurately.
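The evaluation measures used in this section can be sketched as follows. Coverage is not formally defined in the text; the sketch assumes it is the percentage of reference intervals for which an estimate was accepted, with rejected estimates marked as NaN. The Bland-Altman bias is the mean difference between estimate and reference, and the limits of agreement use the common 1.96 standard deviation convention, which is also an assumption here. The function name `evaluate` and the example numbers are made up for illustration.

```python
import numpy as np

def evaluate(ref_ms, est_ms):
    """Compare estimated intervals with the reference; NaN marks rejected estimates."""
    ref = np.asarray(ref_ms, dtype=float)
    est = np.asarray(est_ms, dtype=float)
    valid = ~np.isnan(est)
    coverage = 100.0 * valid.mean()
    if not valid.any():
        return {"coverage": coverage, "e_abs": np.nan, "bias": np.nan, "loa": (np.nan, np.nan)}
    diff = est[valid] - ref[valid]
    bias = float(diff.mean())
    # Sample standard deviation needs at least two accepted intervals.
    spread = 1.96 * float(diff.std(ddof=1)) if diff.size > 1 else np.nan
    return {
        "coverage": float(coverage),
        "e_abs": float(np.abs(diff).mean()),        # mean absolute interval error
        "bias": bias,                                # Bland-Altman bias
        "loa": (bias - spread, bias + spread),       # limits of agreement
    }

# Made-up example: four reference intervals, one estimate rejected.
result = evaluate([900, 910, 890, 905], [905, np.nan, 880, 905])
print(result["coverage"], result["e_abs"], round(result["bias"], 2))
```

Raising the quality threshold turns more estimates into NaN in such an evaluation, which lowers coverage and typically lowers the error of the remaining intervals.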
Analyzing subject 5 in detail, some interesting observations can be made: While the results of the BCG signal almost perfectly resemble the mean values in terms of error and coverage, the video results are far worse. Although Video PPG shows a slightly better-than-average error, coverage is only 9% while the average for this modality is 50%. This can be explained by the subject's darker skin tone. It does not, however, explain why the error for Video Motion is almost an order of magnitude worse than the average. Analyzing the morphology of the signal reveals that for this particular subject, one heartbeat is often represented by two very similar peaks. This causes the estimator to wrongly estimate half the interval, although the signal shows the main peak at the location of the true average heart rate in the Fourier spectrum. In pitch estimation, this is known as the "octave error", which leads to a very high mean absolute interval estimation error. Encouragingly, the fusion of all three modalities reduces the error for subject 5 to 24 ms with a coverage of 82%.
To further analyze the influence of motion on both video-based interval estimations, Fig. 9 shows E_abs and the coverage for Video Motion and Video PPG over the mean feature movement. Subject 5 is excluded since this subject only participated in one trial and the low performance of the video-based channels can be considered an outlier as argued above. Feature movement is defined as the standard deviation over time of the horizontal coordinate of each feature point. Over all features, the mean is calculated; r indicates the correlation coefficient. One can see that a positive correlation between mean feature movement and E_abs as well as a negative correlation between mean feature movement and coverage exists for Video Motion. For Video PPG, no correlation can be observed. Thus, a negative influence of (even arguably small) motions on the tracking of cardiac-related head oscillations can be shown, while no such influence exists on the Video PPG signal. The influence of the quality threshold Q_th is demonstrated in Fig. 10 for BCG, Video PPG and Video Motion. The trade-off between accuracy and coverage is visible. Moreover, the influence of Q_th clearly depends on the modality to be analyzed.
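The feature movement measure defined above is simple enough to state in code. The sketch assumes that the tracked horizontal coordinates are available as an array of shape (frames, features) and that r is the Pearson correlation coefficient, which the text does not state explicitly. The function names and the numbers in the example are hypothetical.

```python
import numpy as np

def mean_feature_movement(x_coords):
    """x_coords: (n_frames, n_features) horizontal coordinates of tracked features."""
    per_feature_std = np.std(np.asarray(x_coords, dtype=float), axis=0)  # std over time
    return float(per_feature_std.mean())                                  # mean over features

def pearson_r(a, b):
    """Pearson correlation coefficient between two equally long sequences."""
    return float(np.corrcoef(a, b)[0, 1])

# One feature oscillating by one pixel around its mean, one feature at rest.
coords = np.array([[0.0, 10.0], [2.0, 10.0], [0.0, 10.0], [2.0, 10.0]])
print(mean_feature_movement(coords))

# Made-up per-recording values, only to show how r would be obtained.
movement = [0.2, 0.4, 0.6, 0.8]
error_ms = [20.0, 35.0, 30.0, 50.0]
print(pearson_r(movement, error_ms))
```

Computed once per recording, the movement value can then be correlated with E_abs and with coverage across recordings, separately for each video channel.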

Conclusion
In this paper, a novel multimodal sensor fusion approach for unobtrusive vital sign monitoring is presented. The motion and photoplethysmographic components originating from cardiac activity are extracted from a webcam video stream and fused using a Bayesian approach. This improved the coverage of the beat-to-beat interval estimation from 25 % (only motion) and 50 % (only PPG) to 75 % while maintaining a low error of 32 ms compared to an ECG reference. This is a very promising result, considering that the sampling interval of the video signal was 33 ms. By fusing an additional ballistocardiographic signal unobtrusively acquired with a mat placed on the seat of a chair, coverage and error were further improved to 84 % and 25 ms, respectively. As no constraints are put on beat-to-beat intervals except a maximum interval length, severely arrhythmic data could be processed in theory. We could show that our interval estimation approach can exploit redundancies in multimodal biosignals even if individual SNRs forbid beat-to-beat interval estimation. A coverage of 90 % at only 24 ms average absolute error could be achieved by introducing a novel adaptive Gaussian prior. Although this constrains the modified algorithm, it only limits its ability to process cardiac signals exhibiting fast changes in HR, i.e., cases with a very high HRV. If the HRV lies in a normal range or is reduced, this constraint does not negatively affect the estimation. This is promising, as it is often a reduced HRV that is associated with an unhealthy state [5,6].
As of now, the approach has only been tested on a small group of healthy volunteers. To test its ability to correctly estimate beat-to-beat intervals in severely arrhythmic data in practice, a large study containing healthy and non-healthy subjects is necessary. Moreover, several motion scenarios as well as subjects with various skin tones should be included. To further improve numeric results, more advanced algorithms for video preprocessing could be evaluated. Thus far, none of the several methods proposed in the literature for preprocessing the color information improved on simply using the green channel. Hence, we suspect that more room for improvement lies in the advancement of the facial tracking, as this would result in more reliable motion estimation and stable photoplethysmographic information simultaneously. Finally, more unobtrusive measurement modalities such as the cECG should be integrated into the measurement setup to further increase accuracy and coverage.