Reducing the effects of parallax in camera-based pulse-oximetry.

Camera-based pulse-oximetry enables contactless estimation of peripheral oxygen saturation (SpO2). Because single-optics multi-spectral cameras are not readily available and affordable, custom-made multi-camera setups with different optical filters are currently mostly used. The parallax introduced by these cameras could, however, jeopardise the SpO2 algorithm assumptions, especially during subject movement. In this paper we investigate the effect of parallax quantitatively by creating a large dataset consisting of 150 videos with three different parallax settings and with realistic and challenging motion scenarios. We estimate oxygen saturation values with a previously used global frame registration method and with a newly proposed adaptive local registration method to further reduce the parallax-induced image misalignment. We found that the amount of parallax has an important effect on the accuracy of the SpO2 measurement during movement and that the proposed local image registration reduces the error by more than a factor of 2 for the most common motion scenarios during screening. Extrapolation of the results suggests that the error during the most challenging motion scenario can be reduced to approximately 2 percentage points when using a parallax-free single-optics camera. This study provides important insights on the possible applications and use cases of remote pulse-oximetry with current affordable and readily available cameras.


Introduction
Monitoring of physiological parameters without touching the skin is highly attractive, e.g., for people with burned or fragile skin, when there is a serious risk of cross-infection transmission such as with COVID-19, or where the conventional contact-based measurement causes discomfort which could affect its diagnostic value. Over the years, various contactless methods for physiological monitoring have been proposed including radar [1], sonar [2], thermal imaging [3], WiFi [4] and regular cameras. Using consumer-grade cameras is particularly interesting as it allows a cost-efficient and convenient measurement of physiological information, movements and context information simultaneously. Furthermore, compared to RF-based techniques (e.g., radar, WiFi) that only measure the motion signal induced by breathing and beating of the heart, remote photoplethysmography (rPPG) is based on blood absorption and therefore can be used to measure the blood oxygenation level.
So far, most attention in the field of rPPG has been given to heart (pulse) rate and derived features [5,6], and respiration [7]. For (preterm) infants [8], COPD patients [9] and for the detection of sleep apnea [10] it is important to also monitor the oxygen content of arterial blood, i.e. the fraction of hemoglobin saturated with oxygen. Peripheral oxygen saturation (SpO 2 ) is the optical measurement of arterial oxygen saturation (SaO 2 ). In contrast to SaO 2 where arterial blood samples have to be taken, SpO 2 enables a non-invasive and continuous measurement using low-cost hardware, a pulse-oximeter, attached to a peripheral such as finger or earlobe. The cardiac-synchronous periodicity of the PPG waveform is used to isolate the absorption of arterial blood from the other absorbers. One of the challenges in rPPG is the very small signal because only a limited amount of the skin-reflected light has interacted with pulsatile blood. Compared to the conventional transmissive contact-based sensors, the modulation depth (PPG signal strength) is much smaller for the camera, up to 100 times [11], mostly because of the different source-detector geometry, while also anatomical location (finger vs forehead) may play a role.
To extract physiological information from the often noisy and motion-corrupted PPG waveforms, methods have been proposed which exploit the fact that the spectral signature of blood is different than those of the disturbances. By capturing the scene at different wavelengths, suppression of distortions within the heart rate band, e.g., 40 − 240 beats per minute, is possible. Compared to other physiological parameters, the validity of the assumptions on the direction of the physiological and disturbance information are more critical for SpO 2 . For pulse, violation of these assumptions typically leads to a worsened signal-to-noise ratio which would often still allow determination of the correct pulse rate based on frequency characteristics. For SpO 2 , however, the measurement based on amplitude characteristics is directly affected. Part of the algorithm assumption violation is caused by the hardware because a cost-efficient equivalent of an RGB camera, i.e. a camera with a color filter array placed in front of the image sensor, is not readily available in the red near-infrared range, the 'optical window' where the sensitivity to SpO 2 variations is largest and the light is able to reach the deeper layers of the skin. The available single-optics multi-spectral options: 1) do not have the desired spectral characteristics for SpO 2 , 2) are expensive and bulky due to the concept, e.g. a prism to project the incident light to multiple image sensors with different spectral filters, 3) focus on DC spectroscopy applications, e.g. agriculture, and do not offer the required signal-to-noise ratio for rPPG, or 4) suffer from color inhomogeneities and crosstalk between channels. As a consequence, multiple monochrome cameras with different optical filters are typically used to obtain spectral selectivity for the SpO 2 measurement. These cameras however have slightly different viewpoints which introduces parallax. 
Parallax is a difference in the apparent position of an object viewed along two different lines of sight. The algorithms assume the wavelengths, i.e. the cameras or color channels, to capture the same skin area. A systematic investigation on how much violation of this assumption is acceptable for SpO2 measurements has not been performed yet.
In this paper we investigate and quantify the effects of parallax by a large-scale experiment. We construct a dedicated dataset with three different parallax settings. For each of these settings subjects were asked to follow a protocol which includes scenarios ranging from suppressed motion, to realistic motion, to challenging head motions, to simulate different use cases. We estimate the oxygenation levels with our earlier published SpO 2 method ("APBV" [12]) for a global image registration approach and compare this to a proposed adaptive local registration method to further reduce the parallax-induced image misalignment. The outcomes of this investigation are an important indicator of the practicality of camera-based pulse-oximetry for different clinical use cases such as sleep, screening for infectious diseases such as COVID-19, or emergency department (ED) triage.
The paper is organized as follows: in the next section image registration methods, the processing framework, the experimental setup, dataset and evaluation metrics are described. Next, the results are presented including a discussion. Finally, the conclusions are drawn in the last section.

Materials and methods
In this section we first describe the problems of parallax and introduce global and local image registration approaches. Next, we describe the processing framework for SpO2 estimation, the experimental setup, the created dataset and the metrics used for evaluation.

Image registration
To capture the desired wavelengths for SpO2, multi-camera setups are currently typically used [11,13-15]. However, a significant challenge of such multi-camera setups is the parallax between cameras. If the image planes of different cameras cannot be well aligned due to parallax, the pixels across multiple channels will not measure the same optical information, which may jeopardize the measurement of SpO2, which is highly sensitive to the relative color changes between wavelength channels. A hardware solution to reduce parallax is to increase the distance between subject and camera. This option is limited in specific scenarios with short subject-to-camera distances, such as a Neonatal Intensive Care Unit (NICU), patient monitoring room, sleeping room or in-car setting. An existing algorithmic solution to reduce parallax is image registration.
Conventional image registration methods [16] transform image pixels via a global model estimated from the epipolar geometry of at least two cameras (called "global registration"). The procedure has two steps: (i) calibration: in the first few frames of a video sequence, a global linear transformation model between the image planes of two cameras is estimated. The model can have different degrees of freedom (see Fig. 1), such as translation, rotation, scaling, Euclidean, affine and homography, but the underlying transformations are all assumed linear; (ii) transformation: for the subsequent frames (i.e. assuming that the camera setup is fixed during the recording), the estimated model is used to transform (warp) the images from one camera to the reference camera to obtain a single alignment. Such image registration has two apparent limitations: (i) the transformation is linear, so it cannot cope with scenes that contain clear depth variation or objects with a clear 3D geometry, i.e. a linear transformation is essentially only effective for a 2D plane; (ii) the registration model is not adapted to the video content and is therefore only valid for a fixed subject-to-camera distance. If body motion changes the subject's distance to the camera during the measurement, registration with a fixed model will fail. Poor image registration leads to misaligned image planes and introduces color gradients/artifacts in the multi-channel images, which may corrupt the SpO2 measurement, since it relies on accurate color amplitude information between the wavelength channels.
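The second limitation can be illustrated with a minimal sketch. It assumes an affine model fitted by least squares to matched point pairs; the synthetic points, the model `true_M` and the 4-pixel disparity change are hypothetical illustration values, not data from this study, and the function names are ours.

```python
import numpy as np

def fit_affine(src, dst):
    """Least-squares affine model (2x3) mapping src points (N, 2) onto dst points (N, 2)."""
    A = np.hstack([src, np.ones((src.shape[0], 1))])  # homogeneous coordinates, (N, 3)
    M, *_ = np.linalg.lstsq(A, dst, rcond=None)       # solution has shape (3, 2)
    return M.T

def apply_affine(M, pts):
    """Apply a 2x3 affine model to points of shape (N, 2)."""
    return pts @ M[:, :2].T + M[:, 2]

# Calibration on the first frames: matched points seen by two cameras (synthetic).
rng = np.random.default_rng(0)
src = rng.uniform(0, 100, size=(20, 2))
true_M = np.array([[1.0, 0.02, 5.0], [-0.02, 1.0, -3.0]])  # hypothetical camera relation
dst = apply_affine(true_M, src)
M = fit_affine(src, dst)
calib_error = np.abs(apply_affine(M, src) - dst).max()

# Later frames: the subject moves towards the cameras, so the horizontal
# disparity grows (here by a hypothetical 4 pixels) while the model stays fixed.
dst_moved = dst + np.array([4.0, 0.0])
residual_after_motion = np.abs(apply_affine(M, src) - dst_moved).max()
```

In this toy case the calibrated model is exact for the calibration frames but leaves the full 4-pixel misalignment once the disparity changes. A real face additionally has depth-dependent disparity, which no single linear model can capture.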
To address the limitations of existing registration approaches, we propose a non-linear adaptive camera registration method. In general, the proposed method has two essential steps: (i) estimate the displacement of local image patches between different image planes that correspond to the same object; (ii) transform one image with respect to the other using the local displacements measured per image patch, which is a highly non-linear transformation, i.e. there is no assumption posed on epipolar geometric constraints for this step (see the illustration in Fig. 2).
More specifically, the proposed registration method is a patch-to-patch alignment approach, depending on the unit of pixel representation, where each image patch across multiple cameras has its own interpolation in the new image plane. A typical way is to first select a reference camera. Assuming a use case with three cameras, the camera in the central position is selected as the reference because it has the shortest distance to the other cameras. We measure the patch-to-patch displacement between the reference image and each non-reference image. The local displacement can be measured by dense optical flow, such as Lucas-Kanade flow [17], Farneback flow [18], Horn-Schunck flow [19], block-matching flow [20], deep-nets flow [21] or 3DRS [22]. This step can be expressed as:
\[ D = \mathrm{DOF}(I_{ref}, I_{nonref}), \]
where DOF(·) denotes the dense optical flow; $I_{ref}$ and $I_{nonref}$ denote the reference and non-reference images, respectively; $D$ denotes a 2-channel image representing the displacement of pixels, i.e. one channel contains the horizontal shift and the other the vertical shift. Note that (dense) optical flow is conventionally used to measure pixel movement between time-sequential images. Here we use it to measure the parallax-induced displacement of image patches between images from different cameras sampled at the same time. Next, we use $D$ to interpolate the non-reference images such that the image patches are aligned with those in the reference image:
\[ I_{reg} = \mathrm{Interpolate}(I_{nonref}, D), \]
where Interpolate(·) denotes the pixel interpolation and $I_{reg}$ denotes the registered image. The image patch-based interpolation is a highly non-linear image transformation. These two steps, i.e. the measurement of the local image patch displacement and the image patch interpolation, are performed for each individual frame. Therefore, our image registration is adaptive to real-time video contents/events, i.e. it is robust to scene depth changes or object position changes (e.g., a change of the distance between subject and camera) during the monitoring (see Fig. 3).
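The two steps can be sketched as follows. This is a simplified stand-in, not the implementation used in the study: the displacement field is obtained with exhaustive block matching (sum of absolute differences) instead of one of the dense optical flow methods listed above, and the interpolation is nearest-neighbour sampling. The block size, search range and synthetic test images are assumptions.

```python
import numpy as np

def block_displacement(ref, nonref, block=8, search=4):
    """Per-block (dy, dx) such that nonref[y + dy, x + dx] best matches ref[y, x]."""
    h, w = ref.shape
    D = np.zeros((h, w, 2), dtype=int)
    for y0 in range(0, h - block + 1, block):
        for x0 in range(0, w - block + 1, block):
            patch = ref[y0:y0 + block, x0:x0 + block]
            best, best_d = np.inf, (0, 0)
            for dy in range(-search, search + 1):
                for dx in range(-search, search + 1):
                    y1, x1 = y0 + dy, x0 + dx
                    if y1 < 0 or x1 < 0 or y1 + block > h or x1 + block > w:
                        continue  # candidate block falls outside the image
                    sad = np.abs(patch - nonref[y1:y1 + block, x1:x1 + block]).sum()
                    if sad < best:
                        best, best_d = sad, (dy, dx)
            D[y0:y0 + block, x0:x0 + block] = best_d
    return D

def warp(nonref, D):
    """Sample the non-reference image at the displaced positions (nearest neighbour)."""
    h, w = nonref.shape
    yy, xx = np.mgrid[0:h, 0:w]
    ys = np.clip(yy + D[..., 0], 0, h - 1)
    xs = np.clip(xx + D[..., 1], 0, w - 1)
    return nonref[ys, xs]

# Synthetic scene with two "depth layers": the upper half is displaced by
# +2 pixels and the lower half by -1 pixel between the two views.
rng = np.random.default_rng(0)
ref = rng.random((32, 32))
nonref = np.empty_like(ref)
nonref[:16] = np.roll(ref[:16], 2, axis=1)
nonref[16:] = np.roll(ref[16:], -1, axis=1)

D = block_displacement(ref, nonref)
registered = warp(nonref, D)

# A single global shift, correct for the upper half only, for comparison.
D_global = np.zeros_like(D)
D_global[..., 1] = 2
registered_global = warp(nonref, D_global)
```

Away from the image border, the locally registered image matches the reference in both halves, whereas the single global shift can only align one of the two layers. Because the displacement is re-estimated for every frame, the same code adapts when the layers move.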
The strength of our method is that it allows non-linear registration between different cameras and such registration is adaptive to video contents. Thus, it can effectively reduce the color artifacts due to imperfect image alignment that are common in the existing registration methods that assume a global linear transformation model and the model is not adaptively estimated in real-time. In the following section, we shall benchmark its performance against the existing registration approaches.

Processing for SpO2 extraction
After registration, 13 facial skin patches are defined and tracked using a Histogram of Oriented Gradients (HOG)-based facial landmark detector, as visualized in Fig. 4. The pixel intensities within the skin patches are spatially averaged for the three wavelengths and concatenated over time to generate PPG traces. We divided the raw PPG signal for each wavelength ($C_i$) by its quasi-DC signal obtained by low-pass filtering (LPF), and bandpass filtered (BPF) the resulting signal to obtain the DC-normalized PPG waveforms:
\[ \tilde{C}_{i,s} = \mathrm{BPF}\left( \frac{C_{i,s}}{\mathrm{LPF}(C_{i,s})} \right), \]
where $\tilde{C}_{i,s}$ denotes the DC-normalized signal for wavelength $i$ and skin patch $s$. LPF(·) is a first-order Butterworth filter with a cut-off frequency of 0.7 Hz, and BPF(·) is a fourth-order zero-phase Butterworth filter with a passband of 0.7-4 Hz, corresponding to 42-240 beats per minute (bpm), the typical range of pulse rates for healthy adults. The DC-normalized color signals $\tilde{C}_s$ are the input for the APBV method [12], which can be mathematically summarized as:
\[ P_s = \vec{W}_{PBV}\,\tilde{C}_s, \qquad \vec{W}_{PBV} = k\,\vec{P}_{bv}\,[\tilde{C}_s \tilde{C}_s^T]^{-1}, \]
where $P_s$ is the pulse signal and the scalar $k$ is chosen such that $\vec{W}_{PBV}$ has unit length. The calculation of the weights for extraction of the pulse signal, $\vec{W}_{PBV}$, is formulated as a least squares problem using pulse signatures $\vec{P}_{bv}$ as representation for the different SpO2 values, where we use the pulse signatures from our earlier calibration study [23] because of the same experimental setup. Although an SpO2 measurement requires only two wavelengths, adding dimensionality by a third 'redundant' wavelength allows distortions to be suppressed, as we showed earlier [12]. We sample pulse signatures in the range 70-105 % SpO2 with a resolution of 0.1 %. For each skin patch, the candidate signature that yields the pulse signal with the highest quality index determines the individual estimate $O_s$. The SpO2 estimate is obtained by quadratic weighting of the individual estimates, $O_s$, by the corresponding pulse quality indices, $Q_s$:
\[ \mathrm{SpO_2} = \frac{\sum_s Q_s^2\, O_s}{\sum_s Q_s^2}. \]
The quality index $Q$ is calculated as the skewness in the frequency domain.
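The pulse extraction and the quality-weighted fusion can be sketched as below. This is a minimal illustration under stated assumptions, not the published APBV implementation: the pulse signature `a`, the alternative candidate and the SpO2 labels are hypothetical placeholders (the study uses signatures from a calibration study), the input is synthetic, and `peak_ratio` is a simple stand-in for the frequency-domain skewness used as quality index.

```python
import numpy as np

def pbv_pulse(C, p_bv):
    """Return (P, W) with P = W C and W = k * p_bv [C C^T]^-1, scaled so that ||W|| = 1.

    C: (n_wavelengths, T) DC-normalized signals of one skin patch.
    p_bv: (n_wavelengths,) candidate pulse signature.
    """
    Q = C @ C.T
    W = np.linalg.solve(Q, p_bv)  # Q is symmetric, so this equals p_bv Q^-1
    W = W / np.linalg.norm(W)     # the scalar k
    return W @ C, W

def peak_ratio(P):
    """Hypothetical quality index: fraction of spectral power in the strongest bin."""
    power = np.abs(np.fft.rfft(P - P.mean())) ** 2
    return float(power.max() / power.sum())

def patch_estimate(C, spo2_grid, signatures, quality=peak_ratio):
    """Select the candidate SpO2 whose signature yields the highest-quality pulse."""
    scores = [quality(pbv_pulse(C, p)[0]) for p in signatures]
    best = int(np.argmax(scores))
    return spo2_grid[best], scores[best]

def fuse(estimates, qualities):
    """Quadratic quality weighting of the per-patch estimates O_s."""
    w = np.asarray(qualities, float) ** 2
    return float(np.sum(w * np.asarray(estimates, float)) / np.sum(w))

# Synthetic 10 s window at 15 Hz: a 1.2 Hz pulse with signature `a` plus noise.
rng = np.random.default_rng(1)
t = np.arange(150) / 15.0
a = np.array([0.4, 0.5, 0.7])  # hypothetical signature, not a calibrated value
C = np.outer(a, np.sin(2 * np.pi * 1.2 * t)) + 0.05 * rng.standard_normal((3, 150))

P, W = pbv_pulse(C, a)
estimate, q = patch_estimate(C, [80.0, 97.0], [np.array([0.7, 0.5, 0.4]), a])
fused = fuse([96.0, 98.0], [1.0, 3.0])
```

A useful property to check is that the correlation of the extracted pulse with the input channels, `C @ P`, is parallel to the signature that was used. This is the least squares condition that defines the weights.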
We calculated the SpO 2 value every second using processing windows of 10 s and used a 3 s moving average filter as post-processing to arrive at the final estimate.
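The window bookkeeping can be made explicit with a short sketch. It assumes the 15 Hz frame rate of the setup and interprets the 3 s moving average as a causal average over three consecutive 1 s estimates; whether the original filter is causal or centered is not stated, so that choice is an assumption.

```python
import numpy as np

FS = 15  # Hz, camera frame rate of the experimental setup

def window_starts(n_samples, fs=FS, win_s=10, step_s=1):
    """Start indices of 10 s processing windows advanced in 1 s steps."""
    win, step = win_s * fs, step_s * fs
    return list(range(0, n_samples - win + 1, step))

def moving_average(x, n=3):
    """Causal n-point moving average; the first n - 1 outputs average fewer samples."""
    x = np.asarray(x, float)
    c = np.cumsum(np.insert(x, 0, 0.0))
    out = np.empty_like(x)
    for i in range(len(x)):
        lo = max(0, i - n + 1)
        out[i] = (c[i + 1] - c[lo]) / (i + 1 - lo)
    return out

starts = window_starts(90 * FS)                     # one 90 s scenario
smoothed = moving_average([96.0, 97.0, 98.0, 99.0])  # example per-second estimates
```

With these settings a single 90 s scenario yields 81 estimates, since the first estimate is only available once a full 10 s window has been recorded.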

Experimental setup
The experimental setup consists of three monochrome CCD cameras (AVT Manta G-283B, Allied Vision GmbH, Stadtroda, Germany) placed on an optical table. All cameras were equipped with identical 150 mm lenses (Schneider-Kreuznach 7805791, Bad Kreuznach, Germany). To obtain spectral selectivity, optical fluorescence single-band bandpass filters with center wavelengths (CWLs) of 760, 800 and 905 nm and full-width at half-maximums (FWHMs) of 20, 20 and 43 nm were used. The cameras were externally triggered at a stable frame rate of 15 Hz and were horizontally spaced by 8 cm (small parallax), 24 cm (medium parallax) and 50 cm (large parallax), as visualized in Fig. 4. All cameras had an exposure time of 20 ms, where we adjusted the apertures such that the maximum intensities of the pixels within the region-of-interest (face) are at 80% of the dynamic range. The image data was captured at a resolution of 968 × 728 pixels with 2 × 2 binning and 8-bits pixel depth. Homogeneous and diffuse illumination was provided by

Dataset
In order to quantify the effects of parallax we created a dataset consisting of 150 videos with a total duration of 1125 minutes. Five subjects (4 male) were enrolled, and each subject was asked to follow the same recording protocol 10 times for each of the three parallax settings to reduce the effects of random errors such that solid conclusions can be drawn. A summary of the dataset is listed in Table 1; the distributions of the physiological values present in the dataset are visualized in Fig. 5. The subjects were asked to sit on a chair in an upright position at a distance of approximately 8 meters from the cameras. The recording of the reference data was started 20 seconds prior to, and was stopped 20 seconds after, the protocol to allow synchronization of the camera and contact data, which is needed because of the processing and physiological delays. The protocol consisted of five scenarios of 90 s each and has a duration of 7.5 min, as visualized in Fig. 6.
• Motion I (180-270 s): The subject was instructed to make small movements and have an irregular respiration including short breath-hold events. This scenario was included to simulate a screening setting, e.g. for COVID-19.
• Motion II (270-360 s): Using an auditory stimulus, the subject was instructed to move their head from the central position to the head support (distance: 15 cm) at a frequency of 25 BPM. This scenario was included to simulate a non-cooperative, restless patient.
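The dataset figures quoted in this section follow directly from the protocol. The short check below uses only numbers stated in the text; the variable names are ours.

```python
subjects, repeats, parallax_settings = 5, 10, 3
scenario_s, n_scenarios = 90, 5

videos = subjects * repeats * parallax_settings   # recordings in the dataset
protocol_min = scenario_s * n_scenarios / 60      # duration of one recording
total_min = videos * protocol_min                 # total duration in minutes
total_h = total_min / 60                          # total duration in hours

# First 90 s scenario (head supported), summed over all recordings of one subject.
supported_min_per_subject = repeats * parallax_settings * scenario_s / 60
```

This gives 150 videos of 7.5 min each, i.e. 1125 min or 18.75 h in total, and 45 min of head-supported data per subject.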

Evaluation metrics
To evaluate the performance of the camera-based SpO2 estimates we computed the mean absolute error (MAE) and the bias (mean difference), which are calculated as:
\[ \mathrm{MAE} = \frac{1}{L} \sum_{n=1}^{L} \left| \mathrm{SpO_2^{cam}}[n] - \mathrm{SpO_2^{ref}}[n] \right|, \qquad \mathrm{bias} = \frac{1}{L} \sum_{n=1}^{L} \left( \mathrm{SpO_2^{cam}}[n] - \mathrm{SpO_2^{ref}}[n] \right), \]
where $\mathrm{SpO_2^{cam}}$ is the camera-based estimate, $\mathrm{SpO_2^{ref}}$ is the median of all four contact probes, used to improve the reliability of the reference, and $L$ is the number of samples. Before calculating the metrics it is important to compensate for processing and physiological delays between contact probes and between the contact probes and camera. The delay between contact probes is mostly caused by processing and was determined in our earlier study [11]. In this study we applied the same delay before calculating the sample-wise median. Similarly, we applied a delay of 20 s to compensate for the time offset between the camera and reference measurements.
Fig. 6. To investigate the effects of parallax for realistic motion and physiological challenges, a protocol consisting of four different scenarios was used: 1) with head support, 2) without head support, 3) small movements with irregular respiration, and 4) large, periodic movements.
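The two metrics and the median reference can be sketched as follows. The traces are assumed to be already delay-compensated and sampled on a common time base, and all numbers are made-up illustration values, not measurements from the dataset.

```python
import numpy as np

def reference_trace(probes):
    """Sample-wise median over the contact probes; probes has shape (n_probes, L)."""
    return np.median(np.asarray(probes, float), axis=0)

def mae_and_bias(camera, reference):
    """Mean absolute error and bias (camera minus reference), in percentage points."""
    err = np.asarray(camera, float) - np.asarray(reference, float)
    return float(np.mean(np.abs(err))), float(np.mean(err))

# Four hypothetical contact-probe traces and one camera trace (three samples each).
probes = [[97, 98, 99],
          [98, 98, 98],
          [96, 99, 100],
          [97, 97, 99]]
camera = [96.0, 98.0, 97.5]

ref = reference_trace(probes)
mae, bias = mae_and_bias(camera, ref)
```

With this sign convention a negative bias means that the camera underestimates the reference.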

Results
Examples of raw PPG signals with corresponding relative amplitudes from all subjects are visualized in Fig. 7. For the estimation of the amplitudes, all segments where the head was supported were used, i.e. the first 90 s of the protocol, a total of 45 min for each subject. The amplitudes are estimated for the camera channel with the strongest PPG signal, 905 nm, using peak-valley detection in combination with outlier rejection. It can be observed that Subjects C and D have a weaker pulse signal compared to the other three subjects. The spread per individual could partly be explained by the three-week period over which the data was collected, with associated variations in temperature, metabolism and physiology.
The evaluation metrics calculated for the dataset are visualized in Fig. 8. Figure 9 shows  Furthermore, there is a dominant negative bias in the error. This can partly be explained by the direction of the motion-induced intensity variations. Under color homogenous illumination conditions these variations are equal in all channels in the DC-normalized space, corresponding to a pulse signature of 1, similar to low saturation values. Another reason for the negative bias could be the asymmetry of the sampled pulse signature vectors for oxygen saturation values in this dataset, i.e. sampled vectors in the range 70 − 105 % SpO 2 for normal blood oxygen levels in the dataset in the range 95 − 100 %. For screening applications a negative bias leading to false positives is much preferred over a positive bias where critical conditions of patients could be missed. The ISO standard for pulse-oximeters is relatively lenient (±4%) [24] and for some adult applications the accuracy might be acceptable since there is typically no adverse health effect associated with providing more oxygen (higher FiO 2 levels). For premature infants, however, inaccuracies larger than 2 % may be unacceptable given the relatively narrow target range [25]. In these patients, too high SpO 2 levels are associated with adverse health effects including retinopathy of prematurity.  It can be observed that the proposed adaptive local registration method reduced the error compared to global image registration. On average the error is reduced by 0.47 pp, where the largest gain is obtained for the large parallax setting with an error reduction of 0.83 pp. For this setting the error is reduced by more than a factor of 2 for the most realistic screening scenarios, i.e. excluding 'Motion II'. For large parallax the improvement from local registration is significant (p<0.05) with p = 0.0217, one-tailed, paired t-test. The p-values for medium and small parallax are 0.0757 and 0.3008, respectively. 
Linear extrapolation of the results for the three parallax settings suggests that the error during the most challenging 'Motion II' scenario can be reduced to approximately 2 pp when the parallax is reduced to zero. With a parallax-free system individual calibration errors relative to the population will then soon be dominant in addition to factors such as face tracking inaccuracies and temperature-dependent vignetting that violate the assumptions of the SpO 2 method.
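The extrapolation step itself is a straight-line fit evaluated at zero parallax. The sketch below assumes the camera spacing is used as the independent variable. The error values are synthetic placeholders chosen to lie exactly on a line; they are not results from this study.

```python
import numpy as np

# Camera spacings of the three parallax settings, in cm (from the experimental setup).
spacing_cm = np.array([8.0, 24.0, 50.0])

# SYNTHETIC placeholder errors in pp, constructed as 1.0 + 0.05 * spacing.
error_pp = np.array([1.4, 2.2, 3.5])

# First-order least-squares fit; the intercept is the extrapolated
# error for a parallax-free system (spacing of 0 cm).
slope, intercept = np.polyfit(spacing_cm, error_pp, 1)
```

With only three operating points the fit has a single degree of freedom left, so the extrapolated value should be read as an indication rather than a precise prediction.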
Representative examples of the pulse and SpO2 signals from all subjects are visualized in Fig. 10. Here the pulse signal is constructed by concatenation of the pulse segments from the selected pulse signatures of the SpO2 estimation. Clear variations in pulse rate can be observed during the 'Motion' scenario, where Subjects A and C also show a variation in SpO2, visible in both the camera and the reference.
It is worth mentioning that the parallax is defined as the ratio between the distance between the cameras and the distance between the cameras and the subject. In our setup we used rather large cameras and lenses and consequently had to place the setup at a large distance from the subject to get a small parallax. If the distance between cameras can be reduced by a factor of 4, e.g. as in smartphones or cameras with S-mount lenses, the 'small' parallax setting can easily be realized at a more practical working distance of 2 m.
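The parallax definition can be turned into a small worked example. The spacings and the 8 m distance are taken from the setup and dataset descriptions; the function and variable names are ours.

```python
import math

def parallax_ratio(camera_spacing_m, subject_distance_m):
    """Parallax as the ratio of camera spacing to camera-subject distance."""
    return camera_spacing_m / subject_distance_m

SUBJECT_DISTANCE_M = 8.0
spacings_m = {"small": 0.08, "medium": 0.24, "large": 0.50}
ratios = {name: parallax_ratio(s, SUBJECT_DISTANCE_M) for name, s in spacings_m.items()}

# Compact system: spacing reduced by a factor of 4, used at a 2 m working distance.
compact_ratio = parallax_ratio(spacings_m["small"] / 4, 2.0)
```

The three settings correspond to ratios of 0.01, 0.03 and 0.0625, and the compact system at 2 m reproduces the 0.01 of the 'small' setting.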

Conclusion
In this study we systematically investigated the impact of parallax on the accuracy of camera-based pulse-oximetry by means of large scale experiments. A dataset consisting of recordings with a total duration of almost 19 hours was created where subjects were asked to perform realistic and challenging head movements to simulate possible use cases. Three different parallax settings were evaluated with three identical monochrome cameras with different bandpass filters in near-infrared. Oxygen saturation values were estimated with the motion-tolerant APBV method, where the performance of global image registration was compared to that of a newly proposed adaptive local image registration method to further reduce the image misalignment.
The results showed a clear impact of parallax on the accuracy of the measurement, especially during the most challenging motion scenarios, with errors of 2.53 pp and 0.99 pp for the largest and smallest evaluated parallax setting, respectively. The proposed adaptive local registration method reduced the error by more than a factor of 2 for the most common motion scenarios in screening settings, e.g. for COVID-19. This study gives important insights on the possible applications and use cases of remote pulse-oximetry with current affordable and readily available cameras. Furthermore, extrapolation of the results suggests that the error during the most challenging motion scenario can be reduced to approximately 2 pp when using a parallax-free single-optics camera.