A wavelet-based decomposition method for a robust extraction of pulse rate from video recordings

Background Remote photoplethysmography (rPPG) is a promising optical method for non-contact assessment of pulse rate (PR) from video recordings. In order to implement the method in real-time applications, the rPPG algorithms must be capable of eliminating as many distortions from the pulse signal as possible. Methods In order to increase the degrees of freedom of the distortion elimination, the dimensionality of the RGB video signals is increased by wavelet transform decomposition using the generalized Morse wavelet. The proposed Continuous-Wavelet-Transform-based Sub-Band rPPG method (SB-CWT) is evaluated on 101 publicly available RGB facial video recordings and the corresponding reference blood volume pulse (BVP) signals taken from the MMSE-HR database. The performance of SB-CWT is compared with the performance of the state-of-the-art Sub-band rPPG (SB). Results The median signal-to-noise ratio (SNR) for the proposed SB-CWT ranges from 6.63 to 10.39 dB, and for SB from 4.23 to 6.24 dB. The agreement between the PRs estimated from the rPPG pulse signals and from the reference signals, in terms of the coefficients of determination, ranges from 0.81 to 0.91 for SB-CWT and from 0.41 to 0.47 for SB. All the correlation coefficients are statistically significant (p < 0.001). The Bland–Altman plots show that the mean differences range from 5.37 to 1.82 BPM for SB-CWT and from 22.18 to 18.80 BPM for SB. Discussion The results show that the proposed SB-CWT outperforms SB in terms of SNR and in the agreement between the PRs estimated from the RGB video signals and the PRs from the reference BVP signals.


INTRODUCTION
Remote photoplethysmography (rPPG) is a non-contact optical method that measures the intensity of the light reflected from the skin by means of a digital camera. The formation of the rPPG signal is thought to be attributable to the pulsatile variations of the light absorption due to the pulsatility of the blood volume inside the capillaries, which is in turn due to the pumping of the blood by the heart. However, the results of some research indicate that the signal is also affected by the intensity and spectra of the light source, skin tone, skin composition and motion artifacts (Wang, 2017). The general framework for PR extraction from video recordings consists of five steps: definition of the region-of-interest (ROI) that includes skin pixels, pre-processing of the raw RGB signal extracted from the ROI, extraction of the rPPG signal, post-processing of the extracted rPPG signal and PR estimation.
An important issue regarding the feasibility of implementing rPPG in real-world applications is its high sensitivity to different sources of noise. Most of the existing algorithms derive the pulse signal from three-dimensional RGB signals, and therefore only two distortions can be eliminated by a linear combination of the three color channels, which is usually not sufficient. This can be solved either by using a camera with multiple bands (McDuff, Gontarek & Picard, 2014) or by algorithmically increasing the dimensionality of the given RGB signals (Wang et al., 2017c). The latter is done by decomposing the raw RGB signals into multiple components by means of the Fourier transform. The distortion signals are then suppressed in each of the components, and the resulting signals are combined into a single one-dimensional rPPG signal. This algorithm is known as Sub-band rPPG (SB) (Wang et al., 2017c).

Outline of the paper
In this paper, a novel algorithm called Continuous-Wavelet-Transform-based Sub-Band rPPG (SB-CWT) is proposed. SB-CWT uses a different filter bank for the decomposition of the RGB signals than the SB algorithm, namely the Continuous Wavelet Transform (CWT) using the analytic generalized Morse wavelet. In CWT, the target signal is correlated with shifted (translated) and compressed/stretched (also termed dilated or scaled) versions of the analyzing wavelet. By varying the values of the scale and translation parameters, the CWT coefficients are obtained. Multiplying these coefficients with their corresponding scaled and translated wavelets provides the constituent wavelets of the analyzed signal. The wavelet transform has the potential to increase the number of degrees of freedom of the distortion elimination in comparison with the Short-Time Fourier Transform (STFT) that was applied in SB. Additionally, the wavelet transform provides more accurate information about the time and frequency content of an analyzed signal than the STFT. This is achieved by varying the aspect ratio (i.e., the ratio between the length of the time window and the width of the frequency band) (Ganesan, Das & Venkataraman, 2004). The proposed algorithm is evaluated on the publicly available MMSE-HR dataset and compared with the state-of-the-art SB algorithm (Wang et al., 2017c).
The structure of the article is as follows. In the next section, Materials and Methods, the MMSE-HR dataset is presented, the state-of-the-art SB algorithm used for the evaluation of the proposed algorithm is described, the proposed SB-CWT is presented step by step, the preprocessing steps of the reference signals are shown and the evaluation metrics are defined. In the Results and Discussion sections, the experimental results are presented and discussed. The conclusions are drawn in the last section.

Dataset description
In our research, a subset of the Multimodal Spontaneous Emotion database (Zhang et al., 2016), which includes 2D frontal face video recordings and blood volume pulse (BVP) measurements (MMSE-HR), was used. This dataset consists of 102 RGB image sets and the corresponding physiological signals of 40 subjects (23 females and 17 males of various ethnic/racial ancestries) performing six different tasks. Table 1 lists all the tasks together with the emotions they arouse. Image sets (image size: 1,040 × 1,392 pixels) were recorded with the RGB 2D color camera (frame rate of 25 fps) of a Di3D dynamic imaging system (Dimensional Imaging Ltd, Glasgow, Scotland). BVP signals were measured using a Biopac NIBP100D (Biopac Systems, Inc., Goleta, CA, USA) with a sampling rate of 1 kHz and collected by a Biopac MP150 data acquisition system (Biopac Systems, Inc., Goleta, CA, USA). All the recorded data were synchronized.

Sub-Band rPPG (SB) method
The proposed SB-CWT method is benchmarked against the SB algorithm (Wang et al., 2017c). This method uses the Fourier transform to transform the raw RGB signals into the frequency domain. The signals are then projected onto the plane termed POS (Wang et al., 2017a), the axes of which most likely encapsulate the pulsatile region. The exact projection direction is then localized using alpha-tuning (De Haan & Jeanne, 2013). Projection onto POS together with alpha-tuning serves for the extraction of the rPPG (pulse) signal from the decomposed raw RGB signals. Next, the individual sub-band pulse signals are weighted using the ratio between the pulsatile amplitude and the intensity-variation amplitude as the weighting function. Finally, the inverse Fourier transform is used to transform the signal back into the time domain.

Choosing the analyzing wavelet
The general framework of the proposed SB-CWT algorithm is shown in Fig. 1 and the schematic representation of the algorithm is shown in Fig. 2. The framework consists of 13 steps, all of which are explained throughout this section. The first question that emerges in any wavelet-based research is which wavelet is the most appropriate for a given application. In the case of CWT there are several candidate analytic wavelets, such as the Morlet, derivative of Gaussian, Shannon, log-Gabor, etc. It was shown that there is a wavelet superfamily called generalized Morse wavelets, under which all of the aforementioned wavelets can be subsumed (Lilly & Olhede, 2012). Furthermore, it is even possible to objectively define which wavelet within this family is the most appropriate one to use (Lilly & Olhede, 2012). The general Fourier-domain form of the generalized Morse wavelets is:

\hat{\psi}_{\beta,\gamma}(v) = U(v)\, a_{\beta,\gamma}\, v^{\beta} e^{-v^{\gamma}}    (1)

where U(v) is the unit step function, a_{\beta,\gamma} is the normalizing constant, v denotes frequency, β is the compactness or order parameter and γ is the symmetry or family parameter. Increasing β results in an increased number of oscillations in the time domain and a consequent narrowing of the wavelet in the frequency domain. The symmetry parameter γ defines the shape of the wavelet in the frequency domain. Eq. (1) holds for v > 0, with \hat{\psi}_{\beta,\gamma}(v) = 0 elsewhere (due to the unit step function).

Generalized Morse wavelets are controlled by the two parameters β and γ. The next step was therefore to adjust their values. The symmetry parameter γ was set to 3, because the family of wavelets with this γ value possesses a few desirable properties, which make it recommended for general use (Lilly & Olhede, 2012). The γ = 3 family is known as the Airy family. Interested readers may refer to Lilly & Olhede (2012) for more details. The only parameter left to be defined was β. This parameter can be expressed as P²/γ, where P² is the time-bandwidth product. In our case, the P² value was defined for each of the four window lengths l (32, 64, 128 and 256 frames) in such a way that the minimum wavelet bandpass frequency corresponds to the minimum frequency of the human heart rate band (40–240 BPM). The β value was determined accordingly. The relation between a scale and a pseudo-frequency v_a (this notation is used because the mapping between a wavelet scale and a frequency can only be done in a general sense) is defined as:

v_a = \frac{f_s^{cam}}{a \cdot FF}    (2)

where f_s^{cam} is the camera frame rate, a denotes a scale and FF is the Fourier factor, defined as 2\pi/(\beta/\gamma)^{1/\gamma}. The maximum scale (corresponding to the minimum wavelet bandpass frequency) can be calculated as N/(2σ_t), where N is the signal length and σ_t is the time-domain standard deviation of the generalized Morse wavelet. The latter can be approximated by \sqrt{P^2/2}. For the definition of σ_t refer to Eq. (21) in Lilly & Olhede (2009). For example, given v_a = 40 min⁻¹ and f_s^{cam} = 25 fps, the maximum scale that corresponds to v_a equals 6.8224 and is reached at P² = 11.

Figure 2 Schematic representation of the proposed SB-CWT method. The illustration shows the entire procedure of extracting the pulse rate signal from facial recordings using the proposed SB-CWT. Notations above the arrows refer to the steps of the proposed algorithm: s2 refers to the face detection, s3 to the ROI size adjustment, s4 to the spatial averaging of the pixel values inside the ROI in each frame, s5–7 to the signal inversion, windowing (l = 256) and temporal normalization, s8 to the continuous wavelet transform, s9–10 to the projection of the wavelet coefficients on the POS plane and alpha-tuning, s11 to the application of the weighting function to the wavelet coefficients, and s12–13 to the inverse continuous wavelet transform with scale-dependent thresholding and overlap-adding. The numbering of the steps closely follows the numbering in the framework of the proposed algorithm depicted in Fig. 1.

The next required step was to define how finely the scales were to be discretized. This step was important in our application, because the discretization indirectly defines the number of sub-bands. In CWT the base scale is usually defined as 2^{1/ν}, where ν is the number-of-voices-per-octave parameter. The name of this parameter refers to the fact that ν intermediate steps (i.e., scales) are needed if the scale is increased by an octave (i.e., if it is doubled). By increasing the value of ν, the scales are discretized more finely, which results in more sub-bands and consequently in more degrees of freedom for noise suppression. In the present research ν was set to 10, which means that the base scale was set to 2^{1/10}. The reason for choosing ν = 10 (common values of ν are 10, 12, 14, 16 and 32) was that we did not want to increase the computational time; increasing ν requires more computations, which results in a longer processing time of the algorithm. The described procedure of selecting the wavelet parameters is depicted by step 1 in Fig. 1.
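The scale selection described above can be illustrated with a short numerical sketch (the study itself was implemented in MATLAB; this Python/NumPy fragment, with illustrative function names, reproduces only the arithmetic of the pseudo-frequency relation and the voices-per-octave discretization):

```python
import numpy as np

def morse_fourier_factor(beta, gamma=3.0):
    """Fourier factor FF = 2*pi / (beta/gamma)**(1/gamma) of a generalized Morse wavelet."""
    return 2 * np.pi / (beta / gamma) ** (1.0 / gamma)

def pseudo_frequencies(fs_cam, scales, beta, gamma=3.0):
    """Map wavelet scales (in samples) to pseudo-frequencies v_a = fs / (a * FF), in Hz."""
    return fs_cam / (np.asarray(scales) * morse_fourier_factor(beta, gamma))

nu = 10                                   # voices per octave
scales = 2.0 ** (np.arange(40) / nu)      # base scale 2**(1/10), four octaves
P2 = 11.0                                 # time-bandwidth product (example from the text)
beta = P2 / 3.0                           # beta = P^2 / gamma, with gamma = 3 (Airy family)
freqs_bpm = 60 * pseudo_frequencies(25.0, scales, beta)   # 25 fps camera
in_band = (freqs_bpm >= 40) & (freqs_bpm <= 240)          # human heart-rate band
```

With these illustrative settings, 26 of the 40 scales fall inside the 40–240 BPM band, roughly consistent with the 25–26 sub-band counts reported in the Discussion.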

Estimation of the pulse rate from video recordings
Viola–Jones face detector (Viola & Jones, 2001) was used to detect faces in the first image of each image set (step 2 in Fig. 1). The width of the box containing the detected face was resized to 60%, while the height was kept unchanged (Poh, McDuff & Picard, 2011) (step 3 in Fig. 1). This was done to decrease the number of non-facial pixels in the ROI. The position of the ROI was kept constant throughout the entire image set. Raw RGB signals were then obtained by spatial averaging of the pixel values inside the ROI in each frame (step 4 in Fig. 1):

C_i = \frac{1}{JK}\sum_{j=1}^{J}\sum_{k=1}^{K} I_{ijk}    (3)

where C_i denotes the i-th spatial RGB mean color channel signal (also termed the raw RGB signal), I_{ijk} denotes the intensity value of a pixel located in the j-th row and k-th column in the i-th color channel, J and K are the numbers of ROI pixel rows and columns, and i = 1, 2, 3 denotes the R, G and B channels, respectively. Spatial averaging reduces the camera quantization noise. In the next step, the raw RGB signals were inverted (step 5 in Fig. 1) in order to enable a better comparison with the reference pulse signals derived from the BVP signals. Pulse signals were namely obtained in reflection rPPG mode, so the signal reaches its peaks in each end-diastole phase, when the intensity of the diffuse- or body-reflection component is maximal. On the other hand, the BVP signals from the MMSE-HR database were acquired using a measurement device based on the Penaz principle (Penaz, 1973), or Vascular Unloading Technique. This is an approach in which the digital blood flow is held constant by automatic inflation or deflation of a cuff attached to the finger. The applied pressure is directly related to the arterial pressure and is the highest during the ejection phase of the systole. Further processing was not carried out on the full-length raw RGB signals, but on windowed signals using four different window lengths, l = 32, l = 64, l = 128 and l = 256 (step 6 in Fig. 1). The windowed signals were temporally normalized (De Haan & Jeanne, 2013) (step 7 in Fig. 1) in order to remove the dependency of the raw RGB signals on the average skin reflection color.
This normalization is defined as:

C^n_{wi} = \frac{C_{wi}}{\mu(C_{wi})}    (4)

where C^n_{wi} is the i-th temporally normalized windowed RGB color channel signal, C_{wi} is the i-th windowed spatial RGB mean color channel signal, μ(·) is the averaging operator, and i = 1, 2, 3 denotes the R, G and B channels, respectively. The temporally normalized windowed signals were then decomposed using the CWT (step 8 in Fig. 1). Since all C^n_{wi} signals are discrete, and the generalized Morse wavelet is defined in the Fourier domain, the discretized version of the CWT, expressed as an inverse Fourier transform, was used:

W_i(a, b) = \sum_{k=0}^{N-1} \hat{C}^n_{wi}(k)\, \hat{\psi}^{*}_{a}(k)\, e^{i \omega_k b \Delta t}    (5)

where a and b (b = 0, 1, 2, …, N−1) are the dilation and translation parameters, respectively, \hat{C}^n_{wi}(k) is the discrete Fourier transform of the i-th temporally normalized windowed RGB signal, \hat{\psi}^{*}_{a}(k) is the complex conjugate of the Fourier-domain form of the wavelet ψ_a, N denotes the signal length and k denotes the frequency index (k = 0, 1, 2, …, N−1). To ensure unit energy at each scale, the following normalization was used:

\hat{\psi}_{a}(k) = \left(\frac{2\pi a}{\Delta t}\right)^{1/2} \hat{\psi}(a\, \omega_k)    (6)

where Δt = 1/f_s^{cam} and ω_k = 2πk/(NΔt). Combining Eq. (5) with Eq. (6), the CWT can be expressed as:

W_i(a, b) = \sum_{k=0}^{N-1} \hat{C}^n_{wi}(k) \left(\frac{2\pi a}{\Delta t}\right)^{1/2} \hat{\psi}^{*}(a\, \omega_k)\, e^{i \omega_k b \Delta t}    (7)

Eq. (7) was evaluated for all scales a and all values of b, and the final result W_i, where i = 1, 2, 3, was a matrix of wavelet coefficients with the number of rows equal to the total number of scales and the number of columns equal to the signal length N.
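The decomposition step can be sketched numerically as follows (a Python/NumPy illustration of an FFT-based CWT with a Fourier-domain Morse wavelet and unit-energy scaling; the sampling interval is taken as one sample, so scales are expressed in samples, and the function names are ours):

```python
import numpy as np

def morse_wavelet_hat(omega, beta, gamma=3.0):
    """Fourier-domain generalized Morse wavelet (unnormalized), zero for omega <= 0."""
    psi = np.zeros_like(omega)
    pos = omega > 0
    psi[pos] = omega[pos] ** beta * np.exp(-omega[pos] ** gamma)
    return psi

def cwt_fft(x, scales, beta, gamma=3.0):
    """Discretized CWT computed as an inverse FFT: one row of coefficients per scale."""
    n = len(x)
    omega_k = 2 * np.pi * np.fft.fftfreq(n)      # angular frequency 2*pi*k/N, rad/sample
    x_hat = np.fft.fft(x)
    coeffs = np.empty((len(scales), n), dtype=complex)
    for i, a in enumerate(scales):
        # unit-energy normalization (2*pi*a)**0.5, with the sampling interval as one sample
        psi_hat = np.sqrt(2 * np.pi * a) * morse_wavelet_hat(a * omega_k, beta, gamma)
        coeffs[i] = np.fft.ifft(x_hat * np.conj(psi_hat))
    return coeffs
```

For a 1 Hz test tone sampled at 25 fps, the coefficient magnitudes peak near the scale a ≈ f_s/(v·FF) ≈ 4.25 samples, in line with the pseudo-frequency relation given earlier.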
In order to obtain the pulse signal, all the wavelet coefficients were then projected on the POS plane (Wang et al., 2017a) (step 9 in Fig. 1). POS is based on physiological reasoning and uses a projection plane that is orthogonal to the temporally normalized skin-tone direction. The POS algorithm was chosen because it offers the best general performance in extracting PR from RGB signals (Wang et al., 2017a). The axes of the POS projection plane (P_p) are defined as:

P_p = \begin{pmatrix} 0 & 1 & -1 \\ -2 & 1 & 1 \end{pmatrix}    (8)

Finžgar and Podržaj (2018), PeerJ, DOI 10.7717/peerj.5859

Projection of the wavelet coefficients on P_p results in two projected components, S_1 and S_2:

\begin{pmatrix} S_1 \\ S_2 \end{pmatrix} = P_p \begin{pmatrix} W_1 \\ W_2 \\ W_3 \end{pmatrix}    (9)

The components S_1 and S_2 have in-phase pulsatile and anti-phase specular components, which can be algorithmically separated (Wang et al., 2017a). This can be done by means of the so-called alpha-tuning (De Haan & Jeanne, 2013) (step 10 in Fig. 1), which can, for the case of the applied POS algorithm, be expressed as (Wang et al., 2017a):

h = S_1 + \alpha S_2    (10)

where α = |S_1|/|S_2| and |·| denotes the complex magnitude operator. When the pulsatile variation is stronger than the specular one (S_1 and S_2 are in phase), the strength of the resulting signal h is enhanced (Wang et al., 2017a). When the specular variations prevail (S_1 and S_2 are in anti-phase), α moves the specular variation strength of one signal closer to the other, which results in the specular variation being cancelled out (due to the addition of two signals that are in anti-phase) (Wang et al., 2017a). In the next step, the obtained sub-band pulse signals were weighted using the global (scale-dependent) energy distribution function (step 11 in Fig. 1), defined as the sum of the original wavelet coefficients (the ones obtained by the CWT) at each scale (Bousefsaf, Maaoui & Pruski, 2013).
The i-th weighted sub-band (i.e., the sub-band at the i-th scale) is defined as:

P^w_i = h_{i*} \sum_{j=1}^{n} |h_{ij}|    (11)

where P^w_i denotes the i-th weighted sub-band pulse signal, h_{ij} denotes the CWT coefficient at the i-th scale and j-th translation parameter, h_{i*} denotes all CWT coefficients at the i-th scale and n is the total number of translation parameters. After the weighting, the discretized Inverse Continuous Wavelet Transform (ICWT; step 12 in Fig. 1) together with scale-dependent thresholding was performed in order to transform the weighted sub-band pulse signals back into the time domain. Scale-dependent thresholding allows transformation over the selected scales only. The inverses of the selected scales (i.e., frequencies) lie within the range of the human heart rate band (i.e., 40–240 BPM). The transformation was performed by summing the scaled weighted sub-band signals P^w_{scaled} over the selected scales only:

P = \sum_{i=i_{min}}^{i_{max}} P^w_{scaled,i}    (12)

where i runs from i_min, equal to the index of the scale that corresponds to the pseudo-frequency 240 BPM, to i_max, the index of the last scale, which corresponds to the pseudo-frequency 40 BPM. The scaled weighted sub-band signals are defined as:

P^w_{scaled,i} = \frac{1}{C_\psi}\,\frac{\mathrm{real}(P^w_i)}{\sqrt{a_i}}    (13)

where real(·) denotes the operator that takes only the real part of the complex input and C_ψ is the admissibility constant, defined as a_{\beta,\gamma}/\Gamma(\beta/\gamma) (Γ denotes the gamma function). This constant ensures that the wavelet of interest has zero mean. The final pulse signal was obtained by overlap-adding zero-mean segments of the pulse signal (P) using a sliding window with a sliding step of one time sample (De Haan & Jeanne, 2013; Wang et al., 2017a) (step 13 in Fig. 1).
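Steps 9–12 can be sketched as follows (a simplified Python/NumPy illustration with our own function names; the admissibility constant is folded into a single `c_psi` argument and defaults to 1 for brevity):

```python
import numpy as np

# POS projection-plane axes (Wang et al., 2017a)
P_p = np.array([[0.0, 1.0, -1.0],
                [-2.0, 1.0, 1.0]])

def pos_alpha_tune(W):
    """Project per-channel CWT coefficients W (3 x n_scales x N, for R, G, B)
    on the POS plane (step 9) and combine S1, S2 via alpha-tuning (step 10)."""
    S1 = W[1] - W[2]                        # axis (0, 1, -1)
    S2 = -2.0 * W[0] + W[1] + W[2]          # axis (-2, 1, 1)
    alpha = np.abs(S1) / (np.abs(S2) + 1e-12)
    return S1 + alpha * S2

def weight_subbands(h):
    """Weight each sub-band pulse signal by the summed magnitude of its
    coefficients (step 11, global scale-dependent energy distribution)."""
    return h * np.abs(h).sum(axis=1, keepdims=True)

def icwt_in_band(h_w, scales, in_band, c_psi=1.0):
    """Sum the real parts of the weighted sub-bands over the in-band scales only,
    scaled by 1/sqrt(a) (step 12, scale-dependent thresholding)."""
    scaled = np.real(h_w) / np.sqrt(np.asarray(scales))[:, None]
    return scaled[in_band].sum(axis=0) / c_psi
```

The overlap-add step (step 13) is omitted here; it simply accumulates the zero-mean windowed outputs with a one-sample sliding step.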
Average PR estimation from the obtained rPPG pulse signals was based on the average length of the interbeat intervals (IBIs) of the entire pulse signal. The IBIs are defined as the intervals between consecutive systolic peaks. These were detected using a derivative-based method for finding local maxima. The minimum peak-to-peak distance was set to 0.25 s, which corresponds to the upper limit of the human heart rate band (i.e., 240 BPM). This was the only constraint applied in the rPPG signal peak detection algorithm. Extraction of the pulse signal using the state-of-the-art SB method closely followed its original algorithm (Wang et al., 2017c). The only difference is that before the average PR estimation, the pulse signal was inverted, for the same reason as explained above.
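The IBI-based rate estimate can be sketched as follows (Python/NumPy; SciPy's `find_peaks` stands in here for the derivative-based local-maximum search described above, with the same 0.25 s minimum peak distance):

```python
import numpy as np
from scipy.signal import find_peaks

def average_pr_bpm(pulse, fs):
    """Average pulse rate from the mean interbeat interval (IBI) between peaks.
    A minimum peak-to-peak distance of 0.25 s caps the detectable rate at 240 BPM."""
    peaks, _ = find_peaks(pulse, distance=int(round(0.25 * fs)))
    ibis = np.diff(peaks) / fs                 # interbeat intervals in seconds
    return 60.0 / ibis.mean()

# Example: a clean 1.2 Hz tone at 25 fps should come out near 72 BPM
fs = 25.0
t = np.arange(0, 30, 1 / fs)
pr = average_pr_bpm(np.sin(2 * np.pi * 1.2 * t), fs)
```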
To summarize, in the proposed SB-CWT the raw RGB signals extracted from the facial images were decomposed applying the CWT. Wavelet transform coefficients were then projected to the POS plane, the axes of which most likely encapsulate the pulsatile region (Wang et al., 2017a). Next, alpha-tuning (De Haan & Jeanne, 2013) was used to localize the exact projection direction. After that the projected coefficients were weighted using the global (scale-dependent) energy distribution function and finally inverted using the ICWT. The comparison of the algorithmic steps of SB-CWT and SB is shown in Table 2. From this table it can be seen that the two methods differ in signal decomposition step, weighting of the sub-band signals and in the transformation of the sub-band signals back into the time domain.

Processing of the reference signals
Pre-processing of the reference signals included the following steps: bandpass filtering, clipping and squaring. The applied bandpass filter was a zero-phase, first-order Butterworth filter with a lower cut-off frequency of 0.67 Hz and an upper cut-off frequency of 2 Hz (these frequencies correspond to 40 and 120 BPM). The filtered signals were then clipped in order to keep only the parts of the signals that are above zero. By subsequently squaring the signals, noise and diastolic waves were suppressed, while the systolic peaks were emphasized. All the pre-processing steps served for a robust detection of the systolic peaks needed for the estimation of the reference average PR. This was done in the same manner as described above for the rPPG signals, except that the minimum peak-to-peak distance between two consecutive peaks was empirically set to 0.3 s.
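These pre-processing steps can be sketched as follows (Python/SciPy; `filtfilt` provides the zero-phase response of the first-order Butterworth design, and the function name is ours):

```python
import numpy as np
from scipy.signal import butter, filtfilt

def preprocess_bvp(bvp, fs=1000.0):
    """Bandpass (zero-phase, first-order Butterworth, 0.67-2 Hz), clip below zero,
    then square to suppress noise/diastolic waves and emphasize systolic peaks."""
    b, a = butter(1, [0.67, 2.0], btype='bandpass', fs=fs)
    filtered = filtfilt(b, a, bvp)       # forward-backward filtering: zero phase
    return np.clip(filtered, 0.0, None) ** 2
```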

Evaluation metrics
In order to evaluate the performance of the proposed algorithm, several metrics were used: Signal-to-Noise Ratio (SNR), Mean Absolute Error (MAE), Mean Percentage Error (MPE), Root Mean Square Error (RMSE) and the coefficient of determination r². SNR was defined as the ratio between the energy of the spectrum within five bins around the PR frequency (defined as the peak frequency in the spectrum of the reference BVP signal) and the remaining energy contained in the spectrum. Because the pulse signal is pseudo-periodic, SNR was calculated within a time interval defined by a 256-frames-long sliding window with a sliding step of one frame. The chosen SNR metric is a modified version of the metric proposed by De Haan & Jeanne (2013) and is expressed as:

SNR = 10\log_{10}\!\left(\frac{\sum_{f}\big(w_t(f)\,S(f)\big)^2}{\sum_{f}\big((1-w_t(f))\,S(f)\big)^2}\right)    (14)

where w_t(f) is a binary window, S(f) is the spectrum of the rPPG signal obtained by the Fourier transform and f is the frequency in BPM.

Mean absolute error measures the average magnitude of the absolute errors:

MAE = \frac{1}{n}\sum_{i=1}^{n}\left|PR_{rPPG_i} - PR_{ref_i}\right|    (15)

where PR_{rPPG_i} denotes the i-th average PR value obtained from the rPPG signals, PR_{ref_i} is the i-th reference average PR value and n denotes the total number of facial recordings.

Root mean square error is defined as the square root of the average of the squared differences between the estimated (PR_{rPPG_i}) and actual (PR_{ref_i}) values:

RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(PR_{rPPG_i} - PR_{ref_i}\right)^2}    (16)

Mean percentage error is the average of the percentage errors by which the estimated values (PR_{rPPG_i}) differ from the reference values (PR_{ref_i}):

MPE = \frac{100\%}{n}\sum_{i=1}^{n}\frac{PR_{rPPG_i} - PR_{ref_i}}{PR_{ref_i}}    (17)

The coefficient of determination r² defines the proportion of the variance in the dependent variable that is predictable from the independent variable. In the case of a linear least-squares regression, r² is equal to the square of the Pearson correlation coefficient r, which is a measure of the linear dependence of two variables:

r = \frac{\sum_{i=1}^{n}\left(PR_{rPPG_i} - \overline{PR}_{rPPG}\right)\left(PR_{ref_i} - \overline{PR}_{ref}\right)}{\sqrt{\sum_{i=1}^{n}\left(PR_{rPPG_i} - \overline{PR}_{rPPG}\right)^2}\sqrt{\sum_{i=1}^{n}\left(PR_{ref_i} - \overline{PR}_{ref}\right)^2}}    (18)

where \overline{PR}_{rPPG} and \overline{PR}_{ref} are the mean values of the estimated and reference PR values, respectively.
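The metrics above can be computed directly (a small Python/NumPy helper with illustrative names):

```python
import numpy as np

def snr_db(spectrum, w):
    """SNR in dB: energy inside the binary window w vs the remaining energy."""
    s = np.abs(spectrum)
    return 10 * np.log10(((w * s) ** 2).sum() / (((1 - w) * s) ** 2).sum())

def pr_metrics(pr_rppg, pr_ref):
    """MAE, RMSE, MPE and r^2 between estimated and reference pulse rates."""
    pr_rppg = np.asarray(pr_rppg, dtype=float)
    pr_ref = np.asarray(pr_ref, dtype=float)
    err = pr_rppg - pr_ref
    mae = np.abs(err).mean()
    rmse = np.sqrt((err ** 2).mean())
    mpe = (100.0 * err / pr_ref).mean()
    r = np.corrcoef(pr_rppg, pr_ref)[0, 1]   # Pearson correlation coefficient
    return mae, rmse, mpe, r ** 2
```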
The agreement between the reference pulse signal and rPPG signals was assessed using the Bland-Altman plots. In the original plot the differences between the results of the two methods are plotted against their averages (Bland & Altman, 1986). We however used the plots in which the differences between the results are plotted against the reference method (Krouwer, 2008).

RESULTS
We used 101 RGB image sets from the MMSE-HR database in our study. One image set was eliminated (subject F020 performing task T1) because its corrupt BVP reference signal prevented the successful extraction of the reference pulse signal. The average size of the ROI was 712 ± 44 × 427 ± 27 pixels (expressed as mean ± standard deviation). Figure 3 compares the performance of the state-of-the-art SB method and the proposed SB-CWT method for four different window lengths. Figure 3A shows the results of the full original implementations of SB and SB-CWT, whereas Fig. 3B shows the results when weighting (step 11 in Fig. 1) was not applied in either of the algorithms. The removal of the weighting step was used to study only the effect of the number of sub-bands on the PR estimation. Results in Fig. 3A show that the proposed SB-CWT algorithm outperforms the SB algorithm in terms of the median SNR values for all the chosen window lengths. The lowest SNR for both algorithms is achieved at the shortest window length (6.63 dB for SB-CWT and 4.23 dB for SB). The SNR increases with the increasing window length and at l = 256 equals 10.39 dB for SB-CWT (6.24 dB for SB). The highest SNR achieved by SB (at l = 256) is exceeded by SB-CWT at the shortest window length, l = 32. Results of the paired t-test for all window lengths reject the null hypothesis that the pairwise difference between the SNR data from SB and SB-CWT has a mean value equal to zero (p < 0.001). These results indicate that there is a statistically significant difference between the SNR values of SB and SB-CWT. In the case when the weighting function was not applied in either of the two studied algorithms (Fig. 3B), SB-CWT still outperforms SB for all window lengths. The highest SNR for SB-CWT is achieved at l = 32 (5.94 dB) and for SB at l = 256 (5.53 dB). The trend of increasing SNR with increasing window length from Fig. 3A is not present in Fig. 3B for SB-CWT.
There is a statistically significant difference between the SNR values of SB and SB-CWT without the applied weighting functions at l = 32, l = 64 and l = 128. Figure 4 shows the scatter plots examining the relation between the estimated average PRs obtained by both rPPG methods (denoted by PR rPPG) and the reference average PRs (PR ref) for different window lengths. Linear regression lines together with their equations and goodness of fit (coefficients of determination) are also shown. Figures 4A–4D show the results of the full implementation of both rPPG algorithms, whereas Figs. 4E–4H show the results of both studied algorithms without the applied weighting function. Results in Figs. 4A–4D show that with the increasing window length the agreement between the PR rPPG estimated using SB-CWT and the PR ref increases. In the case of SB the described trend is disrupted at l = 256, since the agreement is lower than at l = 128. The proposed SB-CWT outperforms SB for all window lengths. As was the case with the SNR, SB-CWT at the shortest window length outperforms SB at the longest window length. All the correlation coefficients from which the coefficients of determination were calculated are statistically significant (p < 0.001). Results in Figs. 4E–4H show that the agreement between the PR rPPG estimated using SB-CWT and the PR ref decreases with the increasing window length. In the case of the SB method, there is no evident trend with the increasing window length. The error metrics are summarized in Table 3. MAE, MPE and RMSE values decrease with the increasing window length. At l = 256, the MAE, MPE and RMSE of the PRs estimated using SB-CWT are all more than seven times lower than those of the PRs estimated using SB. These results all indicate the better performance of SB-CWT compared with SB.
The computational times of the SB-CWT and SB algorithms for the chosen facial recording of subject F010 performing task 11 for all four window lengths are shown in Table 4. The chosen recording consists of 999 frames, which translates to a total duration of 39.96 s. This video was selected because its duration is equal to the median duration of all the used recordings from the MMSE-HR database. The computational times are expressed as mean ± standard deviation and are calculated based on the times of five consecutive executions of the entire code that extracts the final pulse signal from the facial recordings.
The results show that SB-CWT is computationally more demanding than the SB algorithm. All the computations were performed on a PC with an Intel Core i7 processor and 16 GB of RAM. The software was written in MATLAB R2018a (MathWorks, Inc., Natick, MA, USA).

DISCUSSION
The main reason for comparing the proposed SB-CWT with SB was that SB outperforms the POS algorithm (Wang et al., 2017c), whereas POS was reported to have the best overall performance in a large benchmark (Wang et al., 2017a). POS was benchmarked against the single-wavelength method (G) (Verkruysse, Svaasand & Nelson, 2008), the method combining two channels (G-R) (Hülsbusch, 2008), and other algorithms. All these algorithms were the most commonly used state-of-the-art algorithms at that time.
The proposed SB-CWT algorithm was evaluated on the publicly available MMSE-HR dataset. There are numerous datasets that are in general suitable for the evaluation of rPPG algorithms (see Table 5), but to the best of our knowledge MMSE-HR is the only one that offers uncompressed facial recordings. This type of recording is needed for the evaluation of methods that exploit model-based algorithms for rPPG signal extraction (as is the case with the POS algorithm applied in both SB-CWT and SB). Compression may namely eliminate the pulsating component of the rPPG signal that is related to the blood pulsations and can also pollute the signal with additional artifacts (Wang et al., 2017a). The main reason for publishing the list of datasets that can be used in rPPG studies is to encourage the rPPG community to start using them more commonly. These databases are unfortunately only rarely used by rPPG researchers, even though most of the proposed algorithms are not based on model-based methods for rPPG signal extraction.
Results from Figs. 3A, 4A–4D, 5A–5D and Table 3 indicate that the proposed SB-CWT outperforms the state-of-the-art SB algorithm. The reasoning behind the better performance is multifaceted. Firstly, at l = 32, l = 64 and l = 128, SB-CWT segments the input rPPG signal into more sub-bands, which in turn increases the effectiveness of the separation of the pulse signal from the noise components. SB-CWT (SB) offers 25 (5) sub-bands at l = 32, 26 (10) at l = 64, 26 (21) at l = 128 and 26 (43) at window length l = 256. It can be seen that with increasing window length the number of sub-bands in SB increases, while it does not change in SB-CWT. The only exception is at l = 64, where SB-CWT has one sub-band more than at l = 32. This is, however, attributable to the discretization of the scale values and their correspondence with the limits of the human heart rate frequency band. Secondly, in the SB algorithm the weighting function is based on the assumption that noise related to motion causes large intensity variations (Wang et al., 2017c). In the proposed SB-CWT, the weighting function instead assumes that the amplitudes of the pulse signals are larger than those of the noise signals (Bousefsaf, Maaoui & Pruski, 2013). Since the MMSE-HR database does not include very challenging recordings from the perspective of motion artifacts, the selected weighting function in the proposed SB-CWT is likely more appropriate than the weighting function applied in SB. This is partially confirmed by the fact that at l = 256 the number of sub-bands in SB-CWT is lower than in SB, but SB-CWT still performs significantly better.
In order to assess only the effect of the number of sub-bands on the performance of both algorithms, the SB-CWT and SB methods were modified in such a way that the weighting functions were not applied (step 11 in Fig. 1 was omitted). The results of the modified SB and SB-CWT algorithms are shown in Figs. 3B, 4E–4H and 5E–5H. It is interesting that the median SNR value of SB-CWT at l = 256 is comparable to that of SB (see Fig. 3B), even though SB has 17 more sub-bands at that window length. The reason behind this phenomenon seems to be that there are more degrees of freedom than independent distortions; when this is the case, increasing the number of sub-bands does not improve the performance of the algorithm. If the weighting functions are applied (Fig. 3A), however, SB-CWT shows significantly better performance than SB at l = 256. The reasoning behind this fact is that the weighting function applied in SB-CWT assumes that the pulse signal exhibits the strongest amplitude compared with the noise signals, which is most likely the case in the studied facial recordings. The results from Figs. 4E–4H and 5E–5H also show that the agreement between the PR rPPG and PR ref does not increase with the increasing window length. This is due to the fact that the applied weighting functions in SB and SB-CWT serve for combining the individual sub-band pulse signals into a single output pulse signal. If no weighting is applied, the noise-dominated sub-band signals are not suppressed from the individual sub-band pulse signals, which results in a noisier output pulse signal. This is more pronounced at longer window lengths, when the number of frequency bins and consequently the number of sub-band signals