Detail-preserving pulse wave extraction from facial videos using consumer-level camera

: With the popularity of smart phones, non-contact video-based vital sign monitoring using a camera has gained increased attention over recent years. Especially, imaging photoplethys-mography (IPPG), a technique for extracting pulse waves from videos, conduces to monitor physiological information on a daily basis, including heart rate, respiration rate, blood oxygen saturation, and so on. The main challenge for accurate pulse wave extraction from facial videos is that the facial color intensity change due to cardiovascular activities is subtle and is often badly disturbed by noise, such as illumination variation, facial expression changes, and head movements. Even a tiny interference could bring a big obstacle for pulse wave extraction and reduce the accuracy of the calculated vital signs. In recent years, many novel approaches have been proposed to eliminate noise such as ﬁlter banks, adaptive ﬁlters, Distance-PPG, and machine learning, but these methods mainly focus on heart rate detection and neglect the retention of useful details of pulse wave. For example, the pulse wave extracted by the ﬁlter bank method has no dicrotic wave and approaching sine wave, but dicrotic waves are essential for calculating vital signs like blood viscosity and blood pressure. Therefore, a new framework is proposed to achieve accurate pulse wave extraction that contains mainly two steps: 1) preprocessing procedure to remove baseline oﬀset and high frequency random noise; and 2) a self-adaptive singular spectrum analysis algorithm to obtain cyclical components and remove aperiodic irregular noise. Experimental results show that the proposed method can extract detail-preserved pulse waves from facial videos under realistic situations and outperforms


Introduction
Long-term and regular monitoring of vital signs such as heart rate (HR), heart rate variability (HRV), blood oxygen saturation (SpO2), blood viscosity and blood pressure (BP) are important for the early warning of cardiovascular diseases. Currently, the gold standard techniques to measure the vital signs are based on contact sensors or specific devices such as electrocardiogram (ECG) probes, blood pressure cuffs and pulse oximeters. However, these sensors or devices are inconvenient for daily use and may cause skin irritation in case of treatment for newborns.
In the past decade, researchers have focused on non-contact detection methods and found that the change of blood volume in the capillaries underneath the skin leads to small change in the skin color which can be captured by consume-level cameras. This discovery directly motivated the development of imaging photoplethysmography (IPPG), a technique for extracting pulse waves from videos [1][2][3]. In 2011, Poh et al. first realized facial heart rate detection using a webcam [11], since then, non-contact heart rate detection based on IPPG technology has become a research hotspot.
The main challenge for accurate pulse wave extraction from facial videos is that the color change of the face due to cardiovascular activities is subtle and many factors could contaminate the pulse wave and make the extraction task difficult, for example, both environmental illumination variations and subjects' motions could affect the gray value of the face region significantly, such as the flicker of light, the inner noise of the digital camera, the facial expression changes and the unconsciously shaking of head. Recent years, many new methods have been proposed to detect pulse wave which are mainly focused on HR estimation and overlooked the preservation of pulse details [9][10][11][12][13][14][15]. Yet, effective methods for facial videos under realistic situations are scarce. Li et al. proposed an anti-interference method called normalized Least Mean Square (NLMS) adaptive filter which could rectify the illuminance variation [16], but a smooth rectifier in background is critical for establishing the desired signal as the input of the NLMS adaptive filer, which is hard to realize practically. Afterwards, some novel methods were proposed, in 2014, Yu et al. proposed the filter bank method and achieved satisfying accuracy of HR estimation [17], this method divides the bandwidth from 0.8 Hz to 4 Hz into 16 sub-bands (0.02 Hz for each sub-band), then use the sub-bands to bandpass the source signal and get 16 filtered components, finally, the component with the highest energy is selected as pulse wave and HR can be estimated according to the central frequency of the corresponding sub-band. Due to the usage of 0.02Hz narrowband, the pulse wave extracted by filter bank approaching sine wave and cannot be used for the measurement of many vital signs such as blood viscosity. In 2015, Kumar et al. proposed the Distance-PPG method which combines skin-color change signals from different tracked regions of the face using a weighted average, where the weights depend on the blood perfusion and incident light intensity in the region [18]. Distance-PPG method has excellent anti-noise performance, but the pulse wave extracted by Distance-PPG could not see obvious dicrotic wave. Another shortage is that Distance-PPG algorithm is time-consuming which makes it unsuitable for real-time vital signs calculation on smart phones. To the best of our knowledge, no method has been demonstrated to be able to extract detail-reserved standard pulse wave from facial videos under realistic conditions.
In this paper, a new method is proposed to realize detail-preserving pulse wave extraction from facial videos under realistic situations. Firstly, the subject's facial region is tracked to remove rigid motion disturbance, and the original pulse wave is obtained by calculating the mean value of green channel pixels in the facial region. Afterwards, baseline cancelation, spike smoothing and five-point cubic smoothing algorithms are performed to remove baseline offset and high-frequency random noise. The self-adaptive singular spectrum analysis algorithm is then adopted to remove irregular noise caused by non-rigid motions and illumination changes while keep useful details reserved. Experimental results show that the proposed method outperforms state-of-the-art method in real time HR (RTHR) estimation and detail-preserving. Furthermore, our method has already been transplanted to smart phone and realized very accurate HR calculation. Our fitness diagnosis APP (android version) will be released in the near future.

Method
Pulse wave is one of the most important human body signals which contains abundant physiological information and can be used for diagnosing cardiovascular diseases or mental sickness [4][5][6][7]. As shown in Fig. 1, the pulse wave features such as ascending limb, descending limb and dicrotic wave are important for assessing cardiovascular condition [8], for example, the gradient of ascending limb is closely related to systolic blood pressure, and the sharpness of dicrotic wave is associated with arterial stiffness. Therefore, accurate pulse wave extraction from facial videos with useful details reserved is crucial for non-contact vital sign monitoring and is also the focus of this paper.
The overall framework of the proposed method is shown in Fig. 2 with an example of low signal to noise ratio (SNR). This section will explain our method in detail to show how it works.

Raw pulse acquisition
Face detecting and tracking are essential for raw pulse acquisition as head movements and background noise could affect pulse wave significantly. Hence the Viola-Jones face detector of OpenCV is adopted to detect the face rectangle for each video frame [20] and only the pixels in the face rectangle participating the construction of raw pulse.
The selection of color channel is crucial too. There are mainly three color models applied in raw pulse acquisition: Red-Green-Blue (RGB), Hue-Saturation-Intensity (HSI), and YCbCr where Y stands for luminance component and Cb and Cr refer to blue-difference and red-difference chroma components respectively. For HSI model, only the H channel can be used for pulse extraction and it is motion sensitive. According to Sahindrakar et al., YCbCr produces better results in detecting HR than HSI with limited movements [21]. Among these three models, the most robust model is still RGB and the green channel of RGB gives the strongest signal-to-noise ratio (SNR) [22][23][24][25]. Therefore, the raw pulse is constructed by calculating the mean value of the green channel values of pixels in the detected face rectangle.

Preprocessing
Even if the face region is tracked precisely, it may still be challenging to extract clean pulse wave during motion. As shown in Fig. 3(a), the average intensity within the facial region could experience a significant sudden drop or rise and the cause of such a shift is often due to facial expression changes or a motion-induced camera focus change [26]. To address this problem, several useful algorithms are applied including baseline cancelation, spike smoothing and five-point cubic smoothing.

1) Baseline cancelation:
This algorithm is mainly used for eliminating the trend and sharp drop or rise in raw pulse. Firstly, the raw pulse is cut into several M-point non-overlapping segments and calculate the mean value of each segment. Then, detrend the signal by making the mean value of each segment equals to the mean value of overall raw pulse. According to our experiments, it is suggested to use signal sampling rate as the value of segment length M. The signal after baseline cancelation is shown as Fig. 3(b).

2) Spike smoothing:
Although the signal after baseline cancelation is an approximation to the ideal pulse signal, one spike noise, which comes from the demean of sharp drop, exists. Hence the spike smoothing algorithm is proposed to deal with it. First, the raw pulse is cut into several M-point non-overlapping segments and the standard deviation of each segment is calculated. If the standard deviation of a segment exceeded two times of overall standard deviation, the corresponding segment is considered highly noisy and a compression method is employed to smooth the segment. Equation (1) is the core of the compression method: where CR is the compress rate used to smooth the noisy segment, sd i is the standard deviation of the noisy segment, and sd is the overall standard deviation of the pulse signal. It is obvious that the calculated CR increases with the increase of sd i . All pulse wave values in the noisy segment are multiplied by (1-CR) to achieve spike smoothing, and the actual effect of the proposed spike smoothing algorithm is shown in Fig. 3(c).

3) Five-point cubic smoothing:
This algorithm is used to remove high frequency random noise while keep the original curve characteristics unchanged. Firstly, a cubic polynomial is used to fit experimental data: Then compute the undetermined coefficients in Eq. (2) by defining a least square formula shown below: To make φ(a 0 , a 1 , a 2 , a 3 ) minimum, we compute partial derivative for a k (k = 0,1,2,3) and let the partial derivative be zero, the following equation will be obtained: The undetermined coefficients a k (k = 0, 1, 2, 3) can be calculated from Eq. (4), and the five-point cubic smoothing formulas below are obtained by substituting a k (k = 0, 1, 2, 3) into Eq. (2). Notice that Eq. (5), Eq. (6), Eq. (8) and Eq. (9) are used for the four points in both ends, and Eq. (7) is used for all the other points.
Five-point cubic smoothing algorithm utilizes polynomial least square to approximate sampling points, it can produce much better results than conventional moving average filter in detailpreserving. The signal after five-point cubic smoothing is shown as Fig. 3(d).

Self-adaptive singular spectrum analysis
A new method that combines singular spectrum analysis with self-adaptive function (self-adaptive SSA) is proposed to remove irregular noise and extract clean pulse wave with useful details reserved. Self-adaptive SSA is a global analysis method based on phase space reconstruction which decomposes original signal into multiple variable components [27] and choose appropriate components automatically to reconstruct pulse wave. The main steps of the proposed self-adaptive SSA method include trajectory matrix construction, singular value decomposition, self-adaptive components selection, and signal reconstruction. Through the use of self-adaptive SSA, main periodic components in original signal could be extracted to reconfigure pulse wave.

1) Trajectory matrix construction:
Suppose the total length of time series x(t) is N, the window length used to construct trajectory matrix is L, and define K = N-L + 1, then the trajectory matrix can be represented as: It is obvious that the trajectory matrix is a Hankel matrix. The main concern here is the choose of window length L, and according to Mahmoudvand et al. [28], when L takes the median value of N, the corresponding singular value will get the maximum, so the literature recommended that in most cases the median value of N is the best choice for L.

2) Singular value decomposition:
This method has already been widely used in principal component analysis (PCA), pattern recognition, data compression, and so on. The singular value decomposition of trajectory matrix X is given as: where d is the number of non-zero singular values of X, and it is obvious that d = rank(X) ≤ min(L, K). Here, λ 1 is the largest singular value, and the value of λ i decreases as the index i increases, this means components with lower index number contributes to the origin signal more. U i and V i are the left and right singular vectors of X respectively.

3) Self-adaptive components selection:
It is generally believed that the main energy of a signal is concentrated on the first r (r < d) larger singular values, while the smaller singular values are regarded as noise components. Figure 4 shows an example of original signal and its 5 components reconstructed from the former 5 singular values respectively, and it is obvious that the first component has the highest energy, while the last component has the lowest energy. The purpose of components selection is to determine an appropriate value for r, so that the noise-free pulse wave can be reconstructed from the former r singular values. If r is too small, the useful details like dicrotic wave will be lost, and if r is too large, the noise reduction performance will be poor, thus the exact augmented Lagrange multiplier algorithm (EALM) is adopted to achieve self-adaptive components selection [29]. High-dimensional data can be considered sparse, and its sparsity is reflected in its low rank property, thus trajectory matrix X can be approximated by its best low rank matrix S. Suppose X is contaminated by slight Gauss noise E and sparse big noise W, and ||E|| F ≤ δ where ||E|| F represents the nuclear norm of E, then matrix denoise problem can be described by the following optimization problem: where m, n denotes the number of rows and columns of matrix X respectively, ||W|| 0 denotes the zero-norm of matrix W. To solve this optimization problem, the following Lagrange function is defined: where µ is the punishment coefficient, and the initial value of Lagrange multiplier Y can be represented as: where sgn is signum function, ||X|| ∞ denotes the infinite-norm of X, and ||X|| 2 denotes the 2-norm of X. The following EALM algorithm could be used to iteratively solve Eq. (13): The convergence condition of EALM algorithm is that L(S, W, Y, µ) derives S equals to zero and H(X)≤δ. Reference [30] proved theoretically that the Lagrange multiplier Y is sufficient to guarantee the linear convergence of the EALM algorithm when the function H(X) is continuously differentiable. When the algorithm converges, the output matrix S k is the best low rank matrix to approximate the source matrix X, and it is recommended to reconstruct pulse wave with the first rank(S k ) singular values, in other words, the appropriate value for r is rank(S k ). Notice that EALM algorithm requires µ to iterate from a smaller positive number so that noise matrix W can be well estimated. For an extreme counterexample, if the initial value µ 0 is infinite, the algorithm converges in the first iteration and the noise matrix W is not estimated at all. The coefficient ρ determines the convergence rate which should be set based on the balance between time and accuracy of the algorithm, and the typical value for ρ is between 1.1 and 2.

4) Signal reconstruction:
Signal can be reconstructed from the former r larger singular values using the matrix RCA shown in Eq. (15). RCA is also called reconstruction matrix, which is defined as the product of first r columns of U and first r rows of V T . The dimension of matrix RCA is L×K, and furtherly define L* = min(L, K), K*=max(L, K), and y i,j represents the value of column j in row i of matrix RCA, then the reconstructed signal y rc can be obtained by: After the above steps, the denoised pulse wave could finally be obtained and Fig. 5 shows the actual performance of the proposed self-adaptive SSA method. Here, the total length of pulse signal is 160, the window length L is set to 80, and the number of singular values used to reconstruct pulse wave is 8 (r = 8). After using self-adaptive SSA, the irregular noise caused by motions or illumination changes are reduced significantly, while useful details like dicrotic wave preserved.

Experimental setup
MAHNOB-HCI is a public database which contains videos of 30 subjects recorded at 61 fps with a resolution of 780×580 pixels, and the synchronized ECG signals are provided as the ground truth [19]. Our method is evaluated on MAHNOB-HCI database for the following reasons: 1) it contains abundant facial videos recorded under realistic situations, and both illumination variations and subjects' motions are involved; 2) ground-truth HR values are recorded by ECG simultaneously; 3) it is a public database that can be easily accessed by all researchers who would like to do further comparison fairly [19].
The performance of our method as well as other state-of-the-art methods [9,11,16,17,18] are compared on average HR, and real time HR (RTHR) using MAHNOB-HCI database. One video corresponds to one average HR calculated by our improved peak detection algorithm which uses both amplitude and time thresholds to pick out main peaks in pulse wave and compute the mean value of peak to peak intervals for HR estimation. RTHR values are obtained from a 5-second sliding window (4-second overlap) also using our improved peak detection algorithm.
The Pearson correlation coefficient, pulse rate variability (PRV), ascending limb precision, descending limb precision, dicrotic duration precision and dicrotic amplitude precision are used as the indicators to evaluate detail preserving ability. Here, ground-truth pulse waves are recorded from earlobe at 200 Hz using a PPG sensor, while facial videos are recorded at 30 fps simultaneously using a consume-level camera of Logitech.
Furthermore, our method is also evaluated on smartphone to verify its accuracy in practice. The video recording parameters for smartphone is set to 30 fps with a resolution of 1280×720 pixels, and the ground truth HR are recorded simultaneously using the pulse oximeter of Heal Force.

Experimental results on the MAHNOB-HCI database
MAHNOB-HCI database contains abundant facial videos recorded under realistic situations, and both illumination variations and subjects' motions are involved which makes pulse wave extraction challenging. In this section, 208 videos are picked from MAHNOB-HCI database for experiments, and the time duration of each video varies from 13 seconds to 20 seconds. Each subplot gives the mean error between estimated RTHR sequence and ground truth RTHR sequence, and the ground truth RTHR sequence is obtained from synchronized ECG signal.
In order to evaluate the performance of RTHR estimation, a 5-second sliding window (4-second overlap) is adopted to get RTHR sequence, and the consistence between the estimated RTHR sequence and ground truth RTHR sequence is assessed using Bland-Altman plot as shown in Fig. 6 where our method and other state-of-the-art methods are compared together. The two dotted lines in each subplot represent the confidence range [µ-1.96σ, µ+1.96σ], and only the points between the dotted lines are considered highly credible. Results in Fig. 6 show that our method outperforms other state-of-the-art methods in RTHR estimation with 2 values slightly out of bounds. The result of Kumar et al. [18] is also acceptable with 5 values slightly out of bounds.
The proposed method is also compared with other state-of-the-art methods on average HR estimation, and Table 1 shows the mean absolute error (MAE), standard deviation (SD), root mean square error (RMSE) and Pearson correlation coefficient ρ between estimated HR and ground truth. Experimental results show that our method achieves superior performance to the methods in [9,11,16,17] and similar to that of Kumar et al. [18] under realistic situations in terms of average HR estimation. Furthermore, the performance comparison for different skin tones is also studied. MAHNOB-HCI database contains samples of different countries which covers nearly all kinds of skin tones and thus can be divided into white, olive and black according to the shade of color. 12 subjects are picked out (4 white, 4 olive and 4 black) for tests and Fig. 7 shows the agreement between measured and ground-truth RTHR of different skin tones. Although the white skin tone category got the best agreement, the proposed method performs well for the other two categories too, which verified its effectiveness in processing low SNR signals. Fig. 7. Comparison of HR measured from consumer-level camera and from ground truth pulse oximeter using the proposed method for different skin tones: white, olive and black.

Experimental results of detail-preserving capability
As for detail-preserving ability, the extracted pulse wave of our method is compared with three state-of-the-art methods [16,17,18]. Methods in [9,11] are both based on blind source separation which combines signals in red, green and blue channels to get one independent component, but any interference would affect the three channels at the same time, thus methods in [9,11] would unlikely be able to recover pulse wave when head motions or illumination changes are involved. Therefore, we believe our method outperforms [9,11] in real mobile scenarios and thus not making experimental comparison with them. Figure 7 is the test scenario for evaluating detail-preserving ability.
In this experiment, 5 individuals were tested (3 males and 2 females) and each subject received 10 tests with slightly head motions and facial expression changes permitted. The calculated Pearson correlation coefficient is used to evaluate pulse recovery ability of each method, and the higher the coefficient value, the more accurate the extracted pulse wave is. The root mean square error (RMSE) of PRV estimation is also compared to show how accuracy the method could achieve. Ascending limb precision (P u ), descending limb precision (P d ), dicrotic duration precision (P t ) and dicrotic amplitude precision (P h ) are defined in Eq. (17), Eq. (18), Eq. (19) and Eq. (20) respectively.
where n is the number of tests for each subject. U p and U g represent the time duration of ascending limb of extracted pulse wave and ground-truth respectively, while D p and D g represent the time duration of descending limb of extracted pulse wave and ground-truth. T p and T g are the time duration between systolic peak and dicrotic peak of extracted pulse wave and ground-truth respectively. AR p and AR g are the amplitude ratio of systolic peak to dicrotic peak of extracted pulse wave and ground-truth. Note that, the single period waveforms with no dicrotic wave will be discarded when calculating P t and P h . It is evident that P u and P d are related to the accuracy of overall shape, while P t and P h can characterize the location and amplitude accuracy of dicrotic wave respectively. Table 2 shows the test result of each method and Fig. 9 gives an example of extracted facial pulse wave using each method. Experimental results in Table 2 and Fig. 8 demonstrated that our method performs best in detail-preserving with relevance of 0.91 and smaller values for the parameters RMSE, P u , P d , P t and P h achieved. Due to the usage of narrowband filter, the pulse wave extracted by Yu et al. approaching sine wave. Method of Li et al. is acceptable with inapparent dicrotic wave reserved. The extracted pulse wave of M. Kumar has obvious dicrotic waves but exists distortion compared with ground-truth.

Experimental results on smartphone
Our algorithms were also transplanted to mobile phone (HUAWEI P20) and 5 individuals were tested (3 males and 2 females). Everyone was asked to do two sets of tests: one for stationary test where the face of subject keeps motionless, and another for dynamic test where the subject is free to speak and move head. Each subject received 10 stationary and dynamic tests respectively, and the ground truth values are recorded by pulse oximeter of Heal Force simultaneously. Figure 9 shows how the subjects were tested. Experimental results are shown in Table 3, our method achieved satisfying accuracy of HR estimation on smartphone in both stationary and dynamic scenarios. For stationary scenario, the mean absolute error (MAE) of tested subjects is lower than 2 bpm which reaches medical standard, and for dynamic scenario, the MAE is about 3 bpm which is acceptable for daily HR monitoring. The developed android APP will be online soon.

Conclusion
In this paper, a new method for detail-preserving pulse wave extraction from facial videos is introduced. The method consists of preprocessing procedure and a newly proposed self-adaptive SSA algorithm. Our experiments are based on a public database called MAHNOB-HCI so that all researchers who would like to do further comparison fairly could access this database easily. Experimental results show that the proposed method outperforms state-of-the-art methods in average HR, RTHR, and detail-preserving. The mean absolute error of the estimated HR using our method is 2.05 bpm while that of best-performance contrasted method is 2.47 bpm. As for detail-preserving capability, the pulse wave extracted by our method shows a very high correlation of 0.91 with ground-truth. Another contribution of the proposed method is that it enabled the non-contact estimation of atrial fibrillation, heart rate variability, blood pressure, as well as other physiological parameters that require standard pulse wave on portable devices. Our approach has already been implemented on smartphone (android version) and achieved satisfactory HR estimation. The developed android APP will be online soon.
Although the proposed method meets the requirement of accurate pulse wave extraction under realistic situations, the presence of large motions, such as intense head swing, is still a remaining challenge and a large motion disturbance detection and disposal algorithm should be included in further work.