Pulse rate estimation based on facial videos: an evaluation and optimization of the classical methods using both self-constructed and public datasets

Author contributions: Chen JX and Wang X conceived the study; Wu CY and Zhou L carried out the experiments and wrote the manuscript; Wang X and Chen AP edited the manuscript; Wu CY, Chen Y, and Chen AP recorded the facial videos and reference signals. All authors read and approved the final manuscript.
Competing interests: The authors declare no conflicts of interest.
Acknowledgments: This work was supported by the Key Research Program of the Chinese Academy of Sciences (grant number ZDRW-ZS-2021-1-2).
Peer review information: Traditional Medicine Research thanks Zhao Chen, Gang-Gang Li, and other anonymous reviewers for their contribution to the peer review of this paper.
Abbreviations: PR, pulse rate; HR, heart rate; HRV, heart rate variability; TCM, traditional Chinese medicine; PPG, photoplethysmography; ECG, electrocardiography; MSSD, Multi-Scene Sign Dataset; ICA, independent component analysis; POS, plane-orthogonal-to-skin; BVP, blood volume pulse; ROI, region of interest; invFT, inverse Fourier transform; SNR, signal-to-noise ratio; FFT, fast Fourier transform; STFT, short-time Fourier transform; MAE, mean absolute error; PCC, Pearson correlation coefficient; TPR, true prediction rate; PURE, Pulse Rate Detection Dataset; COH, COHFACE; Chrom, chrominance-based method; MCI, hci-tagging database; DEAP, DEAP dataset; LED, light-emitting diode.
Citation: Wu CY, Chen JX, Chen Y, Chen AP, Zhou L, Wang X. Pulse rate estimation based on facial videos: an evaluation and optimization of the classical methods using both self-constructed and public datasets. Tradit Med Res. 2024;9(1):2. doi: 10.53388/TMR20230704001.


Introduction
Pulse rate (PR) is a crucial parameter in traditional Chinese medicine (TCM) pulse diagnosis, as it provides valuable insights into the nature of cold and heat in diseases. According to TCM theory, a slow PR is associated with pathogenic cold, while a rapid PR is indicative of pathogenic heat. In individuals in good health, PR generally aligns with heart rate (HR), barring certain conditions such as atrial fibrillation. HR, an important physiological signal, reflects the overall health status of the human body. A rapid resting HR has been identified as a robust predictor of underlying hypertension and metabolic disorders, and as an important risk factor for the development of atherosclerosis, large-artery stiffness, and cardiovascular diseases [1]. The correlation between resting HR and mortality is particularly pronounced in patients with cardio-cerebrovascular diseases [2]. TCM practitioners traditionally assess PR characteristics through pulse palpation; like HR monitors based on electrocardiography (ECG), this contact-based approach is unsuitable for remote diagnosis and health monitoring. Exploring remote measurement technologies for HR or PR therefore holds significant importance for the advancement of remote diagnosis and treatment. By enabling non-invasive and convenient monitoring of HR or PR from a distance, these technologies can make healthcare practices more accessible and efficient for patients and practitioners alike. They can bridge geographical gaps and provide healthcare services to individuals in remote or underserved areas, improving healthcare accessibility on a global scale. Moreover, they offer the potential for continuous, real-time monitoring of HR or PR, allowing early detection of abnormalities and timely interventions. This proactive approach can improve patient outcomes and reduce healthcare costs by preventing the progression of diseases and complications.
In most studies, the frequency of peripheral vascular pulsation cycles predicted from facial videos is referred to as HR rather than pulse rate. To ensure consistency and facilitate comparison, this article adopts the term "heart rate" for this quantity throughout, which maintains uniformity and avoids confusion arising from varying terminologies.
Traditional contact-based HR measurements, such as ECG and contact photoplethysmography (PPG), often require a professional technician and can be uncomfortable for participants. In contrast, noncontact HR measurements eliminate the need for electrodes, chest straps, and clips, providing a more comfortable experience. In recent years, noncontact HR monitoring based on facial videos has gained significant popularity as a research field. The process typically involves two stages: extracting the blood volume pulse (BVP) signal from the video images and calculating HR from the BVP signal. The extraction of BVP signals relies on detecting small morphological changes, known as ballistocardiography, and on capturing the light intensity changes in peripheral blood vessels caused by the heartbeat, referred to as remote PPG [1,2]. By decomposing the color channels of the videos, the BVP signals are calculated as the average value of all pixels within the regions of interest (ROIs) in each frame. Noncontact HR monitoring from facial videos offers numerous advantages. It eliminates the need for physical contact, enhancing user comfort and convenience, and it enables applications in various scenarios, such as telemedicine, remote health monitoring, and aerospace environments. Its noncontact nature also reduces the risk of infection transmission, making it particularly suitable for healthcare settings. However, noncontact HR monitoring also faces certain challenges and limitations. Factors such as variations in lighting conditions, motion artifacts, and facial expressions can affect the accuracy and reliability of the measurements. Robust algorithms and advanced signal processing techniques are therefore required to extract accurate HR information from facial videos. Ongoing research and technological advancements aim to enhance the accuracy and usability of noncontact HR monitoring, further expanding its potential applications in healthcare and other domains.
The weak nature of the BVP signal, along with its susceptibility to ambient light changes and motion artifacts, has driven research on video-based HR estimation towards algorithm optimization. Studies have focused on filtering out nuisance signals and extracting high-quality BVP signals [3][4][5][6][7][8]. Additionally, to broaden the application scenarios, some studies have explored improvements in imaging equipment, such as high-precision cameras for long-distance scenes and infrared cameras for low-light or nighttime settings, neonatal intensive care unit monitoring, and fatigue-driving monitoring [9][10][11][12].
While significant progress has been made with these techniques, variations in evaluation indicators, ROIs, color spaces, imaging equipment, and environments make it difficult to evaluate their practical performance. There is therefore a need for a systematic exploration and evaluation of the different algorithms and research strategies for estimating HR from facial videos. Furthermore, the limited sample sizes (fewer than 20 participants) in 76% of previous studies have hindered the generalizability of the methods. Additionally, the inaccessibility of private datasets and code in most studies has made it challenging to replicate and compare results [13,14].
In the present study, we comprehensively evaluated the factors influencing BVP signal extraction, including face detection and tracking, ROI selection, color channel separation, and original BVP signal extraction. We estimated HR from facial videos using four public datasets and our private dataset. To enhance performance, two optimization strategies were implemented: the waveform quality of the original BVP signal was improved through the inverse Fourier transform (invFT), and a sliding threshold on the signal-to-noise ratio (SNR) was used to filter out signal in low-quality time windows. This study represents a valuable exploration based on artificial intelligence technology, aiming to integrate TCM observation diagnosis and pulse palpation diagnosis, thereby expanding the scientific connotation of TCM's "understand the condition of the disease by observation."

Datasets

In addition to our self-constructed Multi-Scene Sign Dataset (MSSD), four public datasets were used: the Pulse Rate Detection Dataset (PURE), COHFACE (COH) (https://www.idiap.ch/en/dataset/cohface), the hci-tagging database (MCI) (https://mahnob-db.eu/hci-tagging/), and the DEAP dataset (DEAP) (http://www.eecs.qmul.ac.uk/mmv/datasets/deap/) [15][16][17][18]. MSSD is a multimodal facial video dataset constructed under different motion scenarios and lighting environments. A total of 107 participants (37 males and 70 females; mean age 23.98 ± 4.61 years) were recruited from Beijing University of Chinese Medicine. Among them, 33 participants were recorded under a standard light source, while 74 were recorded under a non-standard light source. The camera was positioned approximately 50 cm in front of each participant. Two cameras running OpenCV 2 were used to simultaneously record participants' facial videos, and a finger-clip pulse oximeter (Contec CMS 60 or SOMNO V6) recorded the reference PPG signal at 60 Hz. Each participant was asked to record approximately 3 minutes of facial video with the two cameras under four scenarios: resting, talking, deep breathing, and resting after exercise. The videos of resting after exercise were recorded after completing 200 steps on a stair climber. Additionally, HR measurements were taken with an Omron electronic blood pressure monitor (HEM-7320) before and after video recording.

The MindVision industrial camera (MV-SUA134GC/MT, 1280 × 960 resolution) and the Aoni web camera (C27Pro full HD Video, 640 × 480 resolution) were used to record the facial videos. The standard light source for video recording was a 26 cm ring light-emitting diode (LED) lamp (D60, 5500 K, 95Ra8), which served as the only auxiliary light source; this ring LED was commissioned from Shanghai Bengu Intelligent Technology Center. The non-standard-light-source recordings were collected in an indoor environment where natural light and fluorescent lamps provided favorable lighting conditions alongside the ring LED. The collection setup is illustrated in Figure 1. After removing recordings with anomalies in data collection or storage, the MSSD dataset comprised 816 facial videos with corresponding PPG signals. All participants provided written informed consent before data collection. The Ethics Committee of Beijing University of Chinese Medicine approved this study (No. 2019BJZYYLL0101).
The combination of the public datasets and the MSSD dataset yielded a total of 2,435 facial videos from 209 participants. Table 1 provides a detailed description of each dataset (HR, heart rate; PPG, photoplethysmography; ECG, electrocardiography; MSSD, Multi-Scene Sign Dataset; PURE, Pulse Rate Detection Dataset; MCI, hci-tagging database; DEAP, DEAP dataset; COH, COHFACE). The distribution of reference HR ranged from 45 to 135 bpm, with a mean ± standard deviation of 73.33 ± 11.42 bpm. Additional information can be found in Appendix A. Due to equipment limitations, the real-time frame rate of the videos in the self-collected dataset was approximately 10 fps. To standardize the frame rate across datasets, cubic interpolation was applied to resample the video data to 30 fps, producing smoother and more consistent sequences for further analysis and processing.
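The frame-rate standardization step can be sketched as follows. This is a minimal illustration that interpolates a per-frame ROI mean trace (rather than full video frames) onto a uniform 30 fps grid with a cubic spline; the function name and the synthetic irregular timestamps are ours, not from the paper:

```python
import numpy as np
from scipy.interpolate import CubicSpline

def resample_trace(t_src, x_src, fps_out=30.0):
    """Resample an unevenly sampled trace (e.g. per-frame ROI means
    from a ~10 fps recording) onto a uniform fps_out grid."""
    t_out = np.arange(t_src[0], t_src[-1], 1.0 / fps_out)
    return t_out, CubicSpline(t_src, x_src)(t_out)

# demo: irregular ~10 fps samples of a 1.2 Hz (72 bpm) pulse
rng = np.random.default_rng(0)
t = np.cumsum(rng.uniform(0.08, 0.12, 120))
t -= t[0]
x = np.sin(2 * np.pi * 1.2 * t)
t30, x30 = resample_trace(t, x)
```

Interpolating the per-ROI traces (instead of every pixel) keeps the cost proportional to the number of ROIs while preserving the pulse band well below the ~5 Hz Nyquist limit of the original 10 fps recording.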

Data processing
The datasets underwent several processing steps, including face extraction, ROI processing, color space processing, signal pre-processing, BVP extraction, post-processing, and HR comparison. The main research framework is illustrated in Figure 2. Face extraction was performed using a multi-task learning-based algorithm with a 68-feature-point detection approach to detect and track faces [19].
For ROI processing, eight rectangular ROIs and a fusion region were generated based on the 68 facial feature points. These ROIs comprised the forehead (ROI 1), glabella (ROI 2), left cheek (ROI 3), right cheek (ROI 4), nose (ROI 5), lips (ROI 6), chin (ROI 7), and the entire face (ROI 8). Additionally, ROI 1, ROI 3, ROI 4, ROI 5, and ROI 7 were combined into a new fusion region (ROI 9), representing the corresponding organs according to TCM theory. Color space processing transformed the color facial videos from the RGB color space of the ROIs into the HSV, Lab, and YUV (YCbCr) color spaces using OpenCV. Signal pre-processing and BVP extraction involved calculating the pixel-wise average value in the ROIs for the specific color space, separating the original BVP signal of each color channel, and utilizing Algorithm 1 to extract the BVP.
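As a rough illustration of the color-space and averaging step, the sketch below converts ROI pixels from RGB to YUV with the BT.601 transform matrix and averages each channel over the ROI mask. OpenCV's `COLOR_RGB2YUV` applies the same matrix plus an offset on the 8-bit chroma channels, which we omit here; the function and variable names are illustrative:

```python
import numpy as np

# BT.601 RGB -> YUV matrix (the same linear part OpenCV applies)
RGB2YUV = np.array([[ 0.299,  0.587,  0.114],
                    [-0.147, -0.289,  0.436],
                    [ 0.615, -0.515, -0.100]])

def roi_channel_means(frame_rgb, mask):
    """Average each YUV channel over the ROI pixels of one frame;
    collecting one such triple per frame yields the raw
    per-channel traces used for BVP extraction."""
    yuv = frame_rgb[mask] @ RGB2YUV.T   # (n_roi_pixels, 3)
    return yuv.mean(axis=0)             # (Y, U, V)

# demo: a uniform "skin patch" frame with a rectangular ROI mask
frame = np.full((4, 4, 3), [180.0, 120.0, 100.0])
mask = np.zeros((4, 4), bool)
mask[1:3, 1:3] = True
y, u, v = roi_channel_means(frame, mask)
```

The same pattern applies to the HSV and Lab conversions, only with the corresponding (nonlinear) transforms in place of the matrix product.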
Post-processing of the BVP signals was carried out using Algorithm 2 to enhance the BVP signals. HR comparison was performed using Algorithm 3 to compare the different methods for estimating HR. Furthermore, different time windows (5 s, 10 s, 15 s, 30 s, 45 s, and 60 s) were selected for calculating HR from the facial videos to investigate the effect of the time window on HR calculations.
Extract BVP signal. Four different algorithms were used to extract BVP signals: fast independent component analysis (ICA), the chrominance-based method (Chrom), plane-orthogonal-to-skin (POS), and a proposed Single Channel algorithm as a baseline. In the RGB color space, the green channel (G) is considered the ideal choice for extracting the original BVP signal, as hemoglobin and melanin absorb green light strongly at around 550 nm [3,5,6,20]. However, in other color spaces such as HSV, Lab, and YUV, there is no standardized criterion for color channel selection. Therefore, Algorithm 1 was employed to select the color channel with the highest SNR for extracting the original BVP signal in each color space [5], since a higher SNR indicates a better-quality BVP signal, as shown in Algorithm 1.
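Of the listed extraction algorithms, Chrom is compact enough to sketch. The following is a reimplementation from the published description of the chrominance method (temporally normalized RGB means, two chrominance projections, and an alpha-weighted combination), not the authors' code; the synthetic demo adds an in-band intensity artifact to show the distortion cancellation, and the pulse amplitudes per channel are illustrative:

```python
import numpy as np
from scipy.signal import butter, filtfilt

def bandpass(x, fs, lo=0.7, hi=3.0, order=3):
    b, a = butter(order, [lo, hi], btype="band", fs=fs)
    return filtfilt(b, a, x)

def chrom_bvp(rgb, fs):
    """CHROM-style BVP extraction: project temporally normalized
    RGB means onto two chrominance axes and combine them with an
    alpha ratio that cancels common intensity/motion distortion.
    rgb: (n_frames, 3) per-frame ROI means."""
    n = rgb / rgb.mean(axis=0)                    # temporal normalization
    xs = 3 * n[:, 0] - 2 * n[:, 1]
    ys = 1.5 * n[:, 0] + n[:, 1] - 1.5 * n[:, 2]
    xf, yf = bandpass(xs, fs), bandpass(ys, fs)
    return xf - (xf.std() / yf.std()) * yf

# demo: 75 bpm pulse plus a stronger in-band intensity artifact
fs = 30.0
t = np.arange(0, 20, 1 / fs)
pulse = 0.01 * np.sin(2 * np.pi * 1.25 * t)       # 75 bpm BVP
motion = 0.05 * np.sin(2 * np.pi * 0.9 * t)       # common-mode artifact
rgb = np.stack([120 * (1 + 0.5 * pulse + motion),
                 90 * (1 + 1.0 * pulse + motion),
                 70 * (1 + 0.6 * pulse + motion)], axis=1)
bvp = chrom_bvp(rgb, fs)
```

Because the artifact enters all channels with the same relative strength, it appears identically in both chrominance projections and is cancelled by the alpha combination, leaving the pulse frequency dominant.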
Enhance BVP signal. The BVP signal was subjected to several post-processing steps to enhance its quality. A median filter was first applied to detrend the BVP signal, removing low-frequency disturbances [21]. Then, a Hull moving average filter was used to reduce high-frequency disturbances. To further refine the BVP signal, a Butterworth bandpass filter was applied to suppress frequencies outside the range of 0.7-3.0 Hz, corresponding to the expected HR range of 42-180 bpm. However, despite these filtering steps, the BVP signal still had limitations for calculating instantaneous HR and heart rate variability (HRV). To improve the quality of the waveform, the invFT of the BVP signal was performed. The process involved the following steps: (1) Fourier transform of the BVP signal; (2) enhancing the spectral amplitude corresponding to the HR signal based on the SNR and suppressing the spectrum outside the HR range; and (3) performing the invFT of the modified spectrum, as shown in Algorithm 2.
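The three invFT steps can be illustrated as follows. The text does not give the exact amplification schedule, so this sketch simply boosts a narrow band around the dominant in-band peak by a factor in the 2-3x range discussed later and zeroes the out-of-band spectrum; the function name and the 0.1 Hz half-bandwidth are our assumptions:

```python
import numpy as np

def enhance_bvp(bvp, fs, lo=0.7, hi=3.0, gain=2.5, half_bw=0.1):
    """invFT enhancement sketch: (1) Fourier transform, (2) amplify
    the dominant HR component and suppress out-of-band spectrum,
    (3) inverse transform back to the time domain."""
    spec = np.fft.rfft(bvp)
    f = np.fft.rfftfreq(len(bvp), 1 / fs)
    band = (f >= lo) & (f <= hi)
    spec[~band] = 0                           # suppress outside 42-180 bpm
    f0 = f[np.argmax(np.abs(spec))]           # dominant HR component
    spec[np.abs(f - f0) <= half_bw] *= gain   # amplify the HR peak
    return np.fft.irfft(spec, n=len(bvp))

# demo: pulse at 1.2 Hz plus low-frequency drift at 0.3 Hz
fs = 30.0
t = np.arange(0, 20, 1 / fs)
raw = np.sin(2 * np.pi * 1.2 * t) + 0.5 * np.sin(2 * np.pi * 0.3 * t)
clean = enhance_bvp(raw, fs)
```

On this synthetic input, the drift is removed entirely and the pulse component returns scaled by the gain factor, which is the behavior the amplitude-selection discussion in the Limitations section refers to.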
HR estimation. Two main methods were used to calculate HR from the extracted BVP signal: frequency-domain analysis and time-domain analysis; Algorithm 3 outlines the process. The time-domain method calculates HR with peak detection algorithms that analyze the intervals between successive peaks in the BVP signal, and can therefore be affected by the quality of the waveform. Frequency-domain analysis converts the BVP signal from the time domain to the frequency domain using Fourier transform techniques, allowing HR to be calculated from the spectral content of the signal within the HR bandwidth. Three specific frequency-domain methods were explored: the fast Fourier transform (FFT), the short-time Fourier transform (STFT), and Welch's method, which involve different Fourier transform techniques and provide different insights into the spectral characteristics of the BVP signal. The study aimed to compare the performance of these frequency-domain methods and the peak detection algorithms in estimating HR from the BVP signal.
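A minimal sketch of the two HR-calculation routes (a Welch-based frequency-domain estimate and a peak-interval time-domain estimate), using SciPy routines rather than the paper's Algorithm 3; the segment length and minimum peak distance are illustrative choices:

```python
import numpy as np
from scipy.signal import welch, find_peaks

def hr_welch(bvp, fs, lo=0.7, hi=3.0):
    """Frequency-domain HR: peak of the Welch PSD restricted to
    the 42-180 bpm band, returned in beats per minute."""
    f, pxx = welch(bvp, fs=fs, nperseg=min(len(bvp), 10 * int(fs)))
    band = (f >= lo) & (f <= hi)
    return 60.0 * f[band][np.argmax(pxx[band])]

def hr_peaks(bvp, fs):
    """Time-domain HR: mean interval between successive BVP peaks,
    enforcing a minimum spacing of 1/3 s (max 180 bpm)."""
    peaks, _ = find_peaks(bvp, distance=int(fs / 3.0))
    return 60.0 / (np.mean(np.diff(peaks)) / fs)

# demo: a clean 78 bpm BVP signal
fs = 30.0
t = np.arange(0, 30, 1 / fs)
bvp = np.sin(2 * np.pi * 1.3 * t)
```

On a clean waveform the two routes agree; the peak-interval route degrades faster on noisy waveforms, which is why the waveform-quality enhancement above matters for instantaneous HR and HRV.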

Statistical analysis
The study evaluated HR estimation from facial video using the mean absolute error (MAE), the Pearson correlation coefficient (PCC), and the true prediction rate (TPR). HRV was assessed using the average of normal-to-normal (NN) intervals and the standard deviation of NN intervals. The specific formulas for these metrics can be found in Appendix B.
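The three HR metrics can be sketched as follows. MAE and PCC are standard; the TPR tolerance is not specified in this excerpt (the formulas are in the paper's Appendix B), so the ±5 bpm used here is an assumption:

```python
import numpy as np

def mae(pred, ref):
    """Mean absolute error in bpm."""
    return np.mean(np.abs(np.asarray(pred) - np.asarray(ref)))

def pcc(pred, ref):
    """Pearson correlation coefficient between estimates and reference."""
    return np.corrcoef(pred, ref)[0, 1]

def tpr(pred, ref, tol=5.0):
    """True prediction rate: share of estimates within +/- tol bpm
    of the reference (tol is an assumed definition)."""
    return np.mean(np.abs(np.asarray(pred) - np.asarray(ref)) <= tol)

# demo with illustrative reference/estimated HR values
ref  = np.array([72.0, 80.0, 65.0, 90.0])
pred = np.array([70.0, 83.0, 66.0, 99.0])
```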

Results
This study extensively investigated five key factors in HR estimation from facial videos: the BVP extraction algorithm, region of interest, time window, color space, and HR calculation method. In cases where the effects of these factors varied across datasets, a voting method was employed to compare their impact on HR prediction. Additionally, the influence of individual factors was examined using the control variable method, and the optimization effects were analyzed through SNR threshold sliding and BVP signal enhancement. For detailed results of the optimization effect, please refer to Appendix C.

Comparison of HR estimation based on multiple influencing factors
Based on the comparative experimental results, the BVP extraction algorithm, ROI, and time window were identified as crucial influencing factors. The Lab color space and the FFT algorithm yielded satisfactory results for HR calculation, so we conducted a comprehensive evaluation of the other basic influencing factors using the Lab color space and the FFT algorithm. The results are presented in a heat map (Figure 3). The MCI and DEAP datasets, which lack accurate synchronization between the reference signals and video frames, were not suitable for HR estimation; thus, the COH, PURE, and MSSD datasets were primarily used to explore the influencing factors. Comparing the MAE, PCC, and TPR of HR estimation across these datasets, we found that the performance on the MSSD dataset fell between PURE and COH. Despite its lower performance compared with the PURE dataset, the larger sample size of our MSSD dataset (107 participants vs. 10 participants) may enhance the generalizability of the findings.

Comparison of HR estimation based on well-performing factors
Based on the optimization results and the factors that performed well, we identified the following parameters for HR estimation from facial video: the Lab color space, the Chrom algorithm, a time window of 30 s, ROI 9, the FFT algorithm, and an SNR threshold of 0.3. Table 2 illustrates the performance of these parameters on the four public datasets (PURE, COH, MCI, and DEAP) and our MSSD dataset. Among the datasets, the PURE dataset achieved the best HR estimation performance, with an MAE of 1.763, a PCC of 0.989, and a TPR of 0.949. The MSSD dataset also demonstrated satisfactory results, with an MAE of 1.962, a PCC of 0.885, and a TPR of 0.929.

Comparison of HR estimation based on cross-dataset and subgroup
Table 3 reveals that specific motion subgroups such as speaking, deep breathing, and head transition or rotation had a detrimental effect on HR estimation, despite using an SNR threshold of 0.3, BVP signal enhancement, and the well-performing factors. In the MSSD dataset, the most accurate HR estimation occurred during rest after exercise, achieving an MAE of 1.414, a PCC of 0.950, and a TPR of 0.955. In the PURE dataset, the most reliable HR estimation was observed during the steady state, with an MAE of 1.200, a PCC of 0.996, and a TPR of 1.000. Adequate and evenly distributed ambient lighting improved the accuracy of HR estimation from facial videos. In the MSSD dataset, videos recorded under the non-standard light source outperformed those under the standard light source. The standard-light-source recordings were captured in a dimly lit room, with only the ring LED in front of the participant as auxiliary lighting; because excessive brightness caused discomfort to the participants, a low light intensity was used. The non-standard-light-source conditions involved natural light or auxiliary lighting from fluorescent and LED lamps, resulting in better light distribution across the participants' faces. In the COH dataset, the natural scenario was illuminated by natural light, which overall provided insufficient lighting for accurate HR estimation, whereas the clean scenario used both natural light and spotlights as auxiliary light sources. Furthermore, the industrial camera yielded better results than the webcam, possibly because the industrial camera retains more of the original information from uncorrected video images.

Discussion
In recent years, the field of artificial intelligence has made significant progress, with advancements like ChatGPT, and is expected to have a profound impact on the medical field. TCM can benefit from this technological development by emphasizing both inheritance and innovation. Artificial intelligence, particularly computer vision-based techniques, can expand the scope of TCM observation and diagnosis, moving towards the goal that "superior doctors could understand disease conditions through observation." Facial videos have shown promise for measuring various physiological indicators, such as arterial oxygen saturation, tissue oxygen saturation, pulse rate, and respiratory rate [22].
However, despite the growing interest in remote HR estimation from facial videos, most studies have been limited to small private datasets and lack standard research designs and data processing protocols, hindering their application in real-world scenarios. Additionally, there is a lack of research on single-color-channel extraction of BVP signals for HR estimation. It is therefore crucial to investigate the influencing factors in HR estimation from facial videos. In this study, we systematically examined the effects of various factors, including datasets, BVP signal extraction algorithms, HR calculation algorithms, ROIs, time windows, color spaces, and scenarios. Furthermore, we proposed an algorithm with signal enhancement and validated its positive impact on HR estimation.
Our findings indicate that the performance of HR estimation from facial videos varies across datasets. The PURE dataset is the most favorable, although it has a limited number of participants. This aligns with previous research, which identified PURE as an excellent dataset while reporting lower performances for the COH and MCI datasets [14]. Our optimization algorithm demonstrated improved HR estimation results on the COH and MCI datasets. The choice of color space for HR estimation from facial videos depends on the dataset and algorithm used. In our study, the Lab color space yielded relatively optimal results, whereas other studies have reported the superiority of the Hue channel or the Cb + Cr combination in the YCbCr color space [23,24]. Additionally, transforming the RGB color space to the Lab color space and applying prior smoothing methods has been shown to improve HR estimation accuracy [25].
Regarding the region of interest, both the full-face region (ROI 8) and the fusion region representing the five organs (ROI 9) outperformed the individual sub-regions. Among the individual sub-regions, the forehead and cheeks showed better results, in line with the findings of Nakhaei-Rad et al. [26]. Further investigation is needed to determine the impact of removing the lips, jaw, and eye regions by employing facial skin segmentation models.
Expanding the time window improves the accuracy of HR estimation from facial videos, with windows of 10-30 seconds proving most appropriate. The Chrom and POS algorithms outperformed the ICA and Single Channel algorithms in extracting BVP signals and estimating HR, as they exhibit better resistance to motion artifacts and changes in the lighting environment. Sliding an SNR threshold to screen high-quality signals according to application needs has not been widely reported but can enhance the effectiveness of HR estimation. Signal enhancement of the BVP signals can improve their quality and the estimation of instantaneous HR and HRV. PulseGAN has been proposed to optimize BVP signals using a generative adversarial network [27], and Zijie Yue et al. proposed a local remote PPG expert aggregation module that estimates remote PPG signals from augmented samples, capturing pulsation information from various face regions and combining it into a single remote PPG prediction [28]. All of these research strategies optimize the remote PPG or BVP signal extracted from facial video and improve the prediction of vital signs.
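The sliding SNR screening idea can be sketched as follows, using a de Haan-style SNR (energy near the dominant in-band frequency and its first harmonic versus the remaining in-band energy). The window length, bandwidths, and dB threshold here are illustrative assumptions, not the paper's settings (its reported threshold of 0.3 is on its own scale):

```python
import numpy as np

def snr_db(bvp, fs, lo=0.7, hi=3.0, half_bw=0.1):
    """SNR in the spirit of de Haan: power in narrow bands around
    the dominant in-band frequency and its first harmonic, over
    the remaining in-band power, in dB."""
    spec = np.abs(np.fft.rfft(bvp * np.hanning(len(bvp)))) ** 2
    f = np.fft.rfftfreq(len(bvp), 1 / fs)
    band = (f >= lo) & (f <= hi)
    f0 = f[band][np.argmax(spec[band])]
    sig = band & ((np.abs(f - f0) <= half_bw) | (np.abs(f - 2 * f0) <= half_bw))
    noise = band & ~sig
    return 10 * np.log10(spec[sig].sum() / spec[noise].sum())

def keep_windows(bvp, fs, win_s=10.0, thresh_db=0.0):
    """Sliding-threshold screening: keep only the time windows
    whose SNR clears the threshold."""
    n = int(win_s * fs)
    wins = [bvp[i:i + n] for i in range(0, len(bvp) - n + 1, n)]
    return [w for w in wins if snr_db(w, fs) >= thresh_db]

# demo: a clean 72 bpm signal versus a heavily degraded copy
fs = 30.0
t = np.arange(0, 20, 1 / fs)
clean = np.sin(2 * np.pi * 1.2 * t)
rng = np.random.default_rng(1)
noisy = clean + 3.0 * rng.standard_normal(len(t))
```

Raising the threshold trades coverage (the retention proportion reported as SR in Table 2) against the quality of the windows that remain.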

Limitations
While this study provides a comprehensive evaluation and optimization of classical algorithms for HR estimation from facial videos, several limitations should be considered. Firstly, the study focuses primarily on classical algorithms and does not extensively explore the potential of deep learning in this field. Deep learning algorithms have shown promise in improving HR estimation, but their interpretability can be challenging; they also typically require large amounts of training data, and the limited number of participants in the datasets used here may limit such a model's performance in real-world scenarios. Secondly, individual differences among participants can significantly influence the accuracy of the model. To overcome these limitations, future research should consider larger sample sizes and advanced algorithms, including deep learning approaches, to enhance generality and expand the application scenarios of HR estimation from facial videos. Finally, the optimization of the waveform signal through the inverse Fourier transform is useful for dealing with original signals that contain interference. Because the main frequency spectrum is amplified before the inverse transform, the optimization effect is closely linked to the amplification amplitude of the main frequency spectrum. While the prediction of HR may not be significantly affected, the loss of detailed waveform characteristics poses considerable challenges for predicting physiological indicators such as blood pressure and blood oxygen saturation. It is therefore necessary to select an appropriate amplification amplitude for the main frequency spectrum; in general, an amplification of 2-3 times is reasonable, as it optimizes the waveform while retaining the characteristics of the original signal.

Conclusions
In conclusion, this article presents a thorough investigation of the factors influencing HR estimation from facial videos. It offers practical recommendations for improving accuracy, such as using the Lab color space, FFT-based HR calculation, longer time windows, and signal enhancement by invFT. Future research can build on these findings by exploring advanced algorithms and larger datasets, leading to further advancements in HR estimation from facial videos.

Figure 2
Figure 2 Framework for estimating HR from facial videos based on classical methods. ROI, region of interest; HR, heart rate; BVP, blood volume pulse; FFT, fast Fourier transform.

Figure 3
Figure 3 Overall effect of estimating HR from facial videos with different factors. MSSD, Multi-Scene Sign Dataset; PCC, Pearson correlation coefficient; TPR, true prediction rate; MAE, mean absolute error; COH, COHFACE; PURE, Pulse Rate Detection Dataset; HR, heart rate.

Table 2 Comparison of different datasets on HR prediction performance from facial videos
Based on the Lab color space, the Chrom algorithm, a time window of 30 s, ROI 9, and FFT; SR, the proportion of retained videos at an SNR threshold of 0.3. The reference signal of the MCI dataset is ECG, and the reference HR was calculated with a peak detection algorithm. MAE, mean absolute error; PCC, Pearson correlation coefficient; TPR, true prediction rate; HR, heart rate; N, number; SNR, signal-to-noise ratio; MSSD, Multi-Scene Sign Dataset; PURE, Pulse Rate Detection Dataset; MCI, hci-tagging database; DEAP, DEAP dataset; COH, COHFACE.

Table 3 Effects of different test scenarios on the estimated HR from facial videos
The MCI dataset, which focuses on emotion analysis, has issues with time synchronization for HR estimation. Moreover, estimating HR from facial videos in movie-watching scenarios is challenging due to complex ambient lighting conditions. Although the COH dataset is dedicated to HR estimation from facial videos, some of its videos show strong contrast in facial lighting, which negatively impacts the results. In contrast, our MSSD dataset proves effective, offering a larger sample size and multiple test scenarios.