Bayesian fusion of physiological measurements using a signal quality extension

Objective: The fusion of multiple noisy labels for biomedical data (such as ECG annotations, which may be obtained from human experts or from automated systems) into a single robust annotation has many applications in physiologic monitoring. Directly modelling the difficulty of the task has the potential to improve the fusion of such labels. This paper proposes a means of incorporating task difficulty, as quantified by ‘signal quality’, into the fusion process. Approach: We propose a Bayesian fusion model to infer a consensus through aggregating labels, where the labels are provided by multiple imperfect automated algorithms (or ‘annotators’). Our model incorporates the signal quality of the underlying recording when fusing labels. We compare our proposed model with previously published approaches. Two publicly available datasets were used to demonstrate the feasibility of our proposed model: one focused on QT interval estimation in the ECG and the other focused on respiratory rate (RR) estimation from the photoplethysmogram (PPG). We inferred the hyperparameters of our model using maximum a posteriori inference and Gibbs sampling. Main results: For the QT dataset, our model significantly outperformed the previously published models (root-mean-square error of ms for our model versus ms from the best existing model) when fusing labels from only three annotators. For the RR dataset, no improvement was observed over the same model without signal quality modelling, although our model outperformed existing models (mean-absolute error of bpm for our model versus bpm from the best existing model). We conclude that our approach demonstrates the feasibility of using a signal quality metric as a confidence measure to improve label fusion. Significance: Our Bayesian learning model provides an extension over existing work to incorporate signal quality as a confidence measure to improve the reliability of fusing labels from biomedical datasets.

addressed this by jointly inferring the ground truth and estimating the bias and precision of annotators in an unsupervised manner. However, our model assumed that the precision of each annotator was unaffected by the underlying task difficulty, an assumption which rarely holds in practice, particularly in medical applications. Large amounts of noise can increase the false-positive alarm rate for automated annotators in electrocardiogram (ECG) arrhythmia analysis (Aboukhalil et al 2008) or photoplethysmogram analysis (Petterson et al 2007).
The level of noise in a signal can be quantified by one or more derived measures referred to as signal quality indices (SQIs). Signal quality, as quantified by SQIs, can be seen as a measure of task difficulty: noisy records are harder to label due to noise contamination of the signals, while clean records are easier to label. Most studies have used signal quality as a filter for removing artefactual segments in the pre-processing stage (Karlen et al 2013), or considered it as a feature component for automated prediction (Behar et al 2013, Zhu et al 2014). Other studies have used the signal quality in a similar context to estimate confidence for a given segment of medical data (Li et al 2008, Tsanas et al 2014, Vollmer 2014, Johnson et al 2015, Pimentel et al 2015).
Instead of using the signal quality as an indicator for removing noisy segments which might cause large sections of data to be discarded, we propose to incorporate signal quality as an independent variable in a Bayesian model. Expert annotators are able to 'filter' noise to some degree and provide consistently reliable annotations across data of differing noise levels, whereas non-experts might mistake noise for intrinsic features of the signals. Our proposed model incorporates a probabilistic measure of signal quality to capture task difficulty and thus better infer the underlying ground truth of a physiological signal.

Novelty
This article introduces the use of the signal quality of physiological measurements to describe the uncertainty of the labels provided by automated algorithms. Our proposed Bayesian unsupervised model, BCLAs, fuses the labels predicted by the algorithms while incorporating the underlying quality of the physiological signal, in order to further improve the estimated consensus and to ensure that the output is better than that of any individual algorithm considered independently.

Methods
Suppose that there are N recordings of time-series data annotated by R annotators. Let D = {x_i, y_i^1, ..., y_i^R}_{i=1}^N, where x_i is a column feature vector for the ith record containing d features, and y_i^j corresponds to the annotation provided by the jth annotator for the ith record. A novel incorporation of the signal quality is proposed in our model: we assume that y_i^j is a noisy version of z_i, the unknown underlying ground truth for the ith record, and that y_i^j is drawn from a Gaussian distribution as N(z_i + φ_j, (t_i^j λ_j)^{-1}). The bias for annotator j, denoted φ_j, is defined as the average estimation error, and λ_j is the precision of the jth annotator, defined as the estimated inverse variance of annotator j. The signal quality for the ith record from the jth annotator is denoted t_i^j (note the dependence on j, as not all annotators provide labels on the same segment). The signal quality is assumed to lie within the range (0, 1], where 1 indicates good signal quality (the segment is clean) while values close to 0 indicate a noisy segment. Essentially, t_i^j acts as a scaling factor for λ_j: as t_i^j → 0, annotator j becomes less precise in labelling the ith recording because it is of noisy quality. Setting t_i^j = 0 would imply zero precision (or infinite variance) for each annotator in the model, and thereby a dataset-specific lower bound on t_i^j needs to be obtained to define the range of 'good quality' data. The bias of the jth annotator, φ_j, is also assumed to be drawn from a Gaussian distribution as N(µ_φ, 1/α_φ). Furthermore, the ground truth, z_i, has its own distribution, N(a, 1/b), where a can be expressed as a linear regression function f(w, x), with w being the coefficients of the regression as previously described in the BCLA model (Zhu et al 2015). An intercept that models the overall offset predicted in the regression is included, and is distinct from the annotator-specific bias in this model.
The graphical representation of BCLAs is illustrated in figure 1.
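As a concrete illustration, the generative assumptions above can be sketched in a few lines of numpy. The dimensions, hyperparameter values, and the linear choice f(w, x) = x·w are illustrative only, not taken from the paper's datasets:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions only (not from the paper's datasets)
N, R, d = 200, 5, 3

w = rng.normal(size=d)                      # regression coefficients for a = f(w, x)
x = rng.normal(size=(N, d))                 # feature vectors
b = 4.0                                     # precision of the ground-truth prior
z = rng.normal(x @ w, 1.0 / np.sqrt(b))     # z_i ~ N(f(w, x_i), 1/b)

phi = rng.normal(0.0, 0.5, size=R)          # annotator biases phi_j ~ N(mu_phi, 1/alpha_phi)
lam = rng.gamma(5.0, 1.0, size=R)           # annotator precisions lambda_j ~ Gamma
t = rng.uniform(0.2, 1.0, size=(N, R))      # signal quality t_i^j in (0, 1]

# y_i^j ~ N(z_i + phi_j, (t_i^j * lambda_j)^-1): lower quality inflates variance
y = rng.normal(z[:, None] + phi[None, :], 1.0 / np.sqrt(t * lam[None, :]))
```

Note how the quality term scales only the variance of the observation model: a low t_i^j makes annotator j's label for record i less informative without altering its expected bias.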
Under the assumption that the records are independent and that y_i^1, ..., y_i^R are conditionally independent given the feature vector x_i, the posterior probability density function with the signal quality extension, for the parameters θ = {w, λ, φ, α_φ, b, z_i} and a given dataset D, can be written using Bayes' theorem as

p(θ|D) ∝ ∏_{i=1}^{N} [ ∏_{j ∈ V_i} N(y_i^j | z_i + φ_j, (t_i^j λ_j)^{-1}) ] N(z_i | f(w, x_i), 1/b) ∏_{j=1}^{R} N(φ_j | µ_φ, 1/α_φ) Gamma(b) Gamma(α_φ) ∏_{j=1}^{R} Gamma(λ_j).    (1)

The maximum a posteriori (BCLAs-MAP) approach
Estimation of θ can be performed using the maximum a posteriori (MAP) approach, which maximises the log-posterior of the parameters, i.e. argmax_θ log p(θ|D). The parameters in θ can be derived by equating the gradient of the log-posterior to zero. In the case where labels are missing from some annotators, only the available annotations should be considered for inferring the ground truth. Let U_j be the set of records/segments with annotations provided by the jth annotator, and V_i be the set of annotators that provided annotations for the ith record. Then, equations for each parameter in θ can be derived accordingly (equations (2)-(7)).

Figure 1. Graphical representation of the BCLAs model: y_i^j corresponds to the annotation provided by the jth annotator for the ith record, and it is modelled by z_i (the unknown underlying ground truth), φ_j (bias), λ_j (precision), and the signal quality t_i^j. BCLAs is the BCLA model with an additional t_i^j component: z_i is drawn from a Gaussian distribution with mean a and variance 1/b, where a can be a function of the feature vector x_i. φ_j is modelled by a Gaussian distribution with mean µ_φ and variance 1/α_φ. The parameters b, λ, and α_φ are drawn from Gamma distributions (denoted Gamma) with parameters (k_b, ϑ_b), (k_λ, ϑ_λ), and (k_α, ϑ_α) respectively. Adapted from Zhu et al (2015) © Biomedical Engineering Society 2015. With permission of Springer.
The parameters can be solved using a two-step expectation-maximization (EM) algorithm: (i) the E-step estimates the expected true annotations for all records, ẑ, as a weighted sum of the provided annotations, using equation (6); (ii) the M-step is based on the current estimate of ẑ and the given dataset D, updating the model parameters w, φ, α_φ, b, and λ using equations (2)-(5) and (7) in sequential order until convergence.
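A minimal numpy sketch of this EM iteration follows, assuming flat priors for brevity (the full updates in equations (2)-(7) also involve the Gamma hyperparameters, which are omitted here). The E-step is the precision-weighted posterior mean of z_i implied by the model, with each annotation weighted by t_i^j λ_j:

```python
import numpy as np

def bclas_map_em(y, t, x, n_iter=50):
    """EM-style sketch of BCLAs-MAP with flat priors (illustration only).
    y: (N, R) annotations; t: (N, R) signal quality; x: (N, d) features."""
    N, R = y.shape
    phi = np.zeros(R)            # annotator biases
    lam = np.ones(R)             # annotator precisions
    b = 1.0                      # ground-truth precision
    w = np.zeros(x.shape[1])     # regression coefficients
    for _ in range(n_iter):
        a = x @ w
        # E-step: posterior mean of z_i, a precision-weighted combination of the
        # bias-corrected annotations and the regression prediction
        num = b * a + np.sum(t * lam * (y - phi), axis=1)
        den = b + np.sum(t * lam, axis=1)
        z = num / den
        # M-step: update parameters given the current estimate of z
        resid = y - z[:, None]
        phi = np.sum(t * resid, axis=0) / np.sum(t, axis=0)    # quality-weighted bias
        lam = 1.0 / np.mean(t * (resid - phi) ** 2, axis=0)    # inverse weighted variance
        w = np.linalg.lstsq(x, z, rcond=None)[0]               # regression of z on x
        b = 1.0 / np.var(z - x @ w)
    return z, phi, lam, w
```

With t_i^j = 1 everywhere this reduces to a BCLA-style update; lower quality values simply down-weight the corresponding annotations in both steps.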

The Gibbs sampling (BCLAs-Gibbs) approach
The Gibbs sampling approach can be considered a stochastic counterpart of the EM algorithm. Unlike the MAP approach, which provides a point estimate, the Gibbs sampler not only provides estimates but also quantifies the confidence in those estimates; this is a key advantage of fully-Bayesian inference. Random samples of each parameter in θ are drawn from its conditional posterior distribution while the remaining parameters are held fixed at their current values. Each parameter in θ described in equation (1) can therefore be updated by sampling from its posterior distribution with its hyperparameters (denoted by *). Note that z̄ and φ̄ are the averaged values of z and φ respectively. Gamma distributions (as conjugate priors) are used as the prior probabilities for the parameter set θ. After M iterations of the Gibbs sampler, the expectation of each parameter can be approximated by the mean across samples after the burn-in period (i.e. the first M/2 samples are discarded).
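The sampler can be sketched as follows for the conditionals of z and φ only, holding λ, b, and α_φ fixed; this is an illustrative reduction (the full BCLAs-Gibbs sampler also draws λ, b, and α_φ from their Gamma conditionals):

```python
import numpy as np

def bclas_gibbs(y, t, a, lam, b, alpha_phi=1.0, mu_phi=0.0, M=2000, seed=0):
    """Reduced Gibbs sketch for BCLAs: samples z and phi only.
    y: (N, R) annotations; t: (N, R) signal quality; a: (N,) prior means of z."""
    rng = np.random.default_rng(seed)
    N, R = y.shape
    z = y.mean(axis=1)
    phi = np.zeros(R)
    z_samples = []
    for _ in range(M):
        # Conditional of z_i: Gaussian with precision b + sum_j t_i^j lambda_j
        prec_z = b + np.sum(t * lam, axis=1)
        mean_z = (b * a + np.sum(t * lam * (y - phi), axis=1)) / prec_z
        z = rng.normal(mean_z, 1.0 / np.sqrt(prec_z))
        # Conditional of phi_j: Gaussian combining the prior and the
        # quality-weighted residuals of annotator j
        prec_phi = alpha_phi + lam * np.sum(t, axis=0)
        mean_phi = (alpha_phi * mu_phi
                    + lam * np.sum(t * (y - z[:, None]), axis=0)) / prec_phi
        phi = rng.normal(mean_phi, 1.0 / np.sqrt(prec_phi))
        z_samples.append(z)
    # Posterior mean after discarding the burn-in (first half of the chain)
    return np.mean(z_samples[M // 2:], axis=0)
```

The spread of `z_samples` after burn-in is what provides the confidence measure that the MAP point estimate lacks.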

Capnobase RR dataset
The Capnobase dataset (Karlen et al 2010) was collected during elective surgery and routine anaesthesia of 59 children (median age: 9, range: 1-17 years) and 35 adults (median age: 52, range: 26-76 years). Photoplethysmogram (PPG) recordings and reference capnometry data were collected at a sampling frequency of 300 Hz. The dataset is useful for the development of algorithms to estimate respiratory rate from the PPG, as the simultaneous capnometry data can be used as a ground truth. Karlen et al (2013) defined a test set of 42 recordings, lasting 8 min each, from 42 distinct subjects for estimating respiratory rate from the PPG. All experiments performed in this paper use this set. Three respiratory-induced modulation signals are derived from the PPG: amplitude modulation (AM), baseline wander (BW), and frequency modulation (FM). These modulation signals are created using 32 s windows, with successive windows having 29 s overlap. To extract these modulations, PPG beat detection was performed using a segmentation algorithm (Li et al 2010), which produced a series of maximum and minimum intensities for each PPG pulse. The series of maximum intensities was used to extract the BW signal; the (max-min) amplitude was used to derive the AM signal; and the intervals between successive beats were used to extract the FM signal. Respiratory rate (RR) was then estimated using two different spectral approaches: Fourier analysis (FFT) and autoregressive (AR) modelling. Spectral analysis requires evenly-sampled data, so each time-series (corresponding to BW, AM, and FM) was first re-sampled at 4 Hz using linear interpolation. For the FFT method, the frequency at which the maximum intensity of each spectrum occurred within the frequency range of interest (corresponding to 3-60 breaths per minute (denoted bpm)) was used to derive RR. For the AR method (Orphanidou et al 2009), an AR model of order 7 was fitted to each modulation signal.
The ideal respiratory frequency was identified as that corresponding to the pole with the greatest magnitude within the plausible range of frequencies to derive RR. A maximum of 900 RR estimates were extracted for each subject, corresponding to 150 windows of data; each window has six RR estimates, calculated using the FFT- and AR-based methods applied to the three modulation signals.
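The FFT-based step can be sketched as follows. This is a simplified illustration of the described procedure (linear resampling to 4 Hz, then taking the spectral peak within 3-60 bpm), not the exact implementation used in the paper:

```python
import numpy as np

def rr_fft(signal, fs_in, fs=4.0, lo=3.0, hi=60.0):
    """Estimate respiratory rate (bpm) from one modulation signal by taking
    the frequency of the FFT magnitude peak within 3-60 breaths per minute,
    after linear resampling to 4 Hz."""
    t_old = np.arange(len(signal)) / fs_in
    t_new = np.arange(0, t_old[-1], 1.0 / fs)
    x = np.interp(t_new, t_old, signal)      # linear interpolation to 4 Hz
    x = x - x.mean()                         # remove DC so it cannot win the peak
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs) * 60.0   # frequency axis in bpm
    spec = np.abs(np.fft.rfft(x))
    band = (freqs >= lo) & (freqs <= hi)     # plausible respiratory range
    return freqs[band][np.argmax(spec[band])]
```

For a 32 s window resampled at 4 Hz, the spectral resolution is 1/32 Hz, i.e. about 1.9 bpm per bin.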

Signal quality in the Capnobase RR dataset
A pulse signal quality index (denoted pSQI) was adapted from Pimentel et al (2015) to measure signal quality. pSQI is defined as the percentage agreement between two PPG peak detectors, wabp and PUD, calculated using 5 s windows with 50% overlap. To derive a single pSQI for each 32 s window described earlier, we averaged all the pSQI values calculated within the 32 s window. A total of 150 pSQI values were obtained for the 8 min recording of each subject. The PPG peak detectors are outlined below: (i) wabp is available on PhysioNet as a Matlab function of the 'wfdb' toolbox designed by Zong et al (2003). The detector locates the onset of PPG pulses that have been re-sampled at 125 Hz, by first converting each waveform into a cumulative first-derivative signal, and then performing adaptive thresholding and local-search strategies to identify pulse onsets. The PPG peak was subsequently found by determining the maximum value in a 150 ms window after each onset location. (ii) PUD is a pulse waveform delineator, adapted from Li et al (2010), where the PPG signal was again re-sampled to 125 Hz. The onset, peak, and dicrotic notch of each PPG pulse were detected using PUD. The final peak location of each pulse was determined by searching for the maximum value in the pulsatile PPG waveform within a 150 ms window centred on a detected peak (if a dicrotic notch was not found).
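An agreement-based SQI of this kind can be sketched as below. The 150 ms matching tolerance and the exact matching rule are assumptions for illustration, and the two detector outputs are represented simply as arrays of peak times in seconds:

```python
import numpy as np

def sqi_agreement(peaks_a, peaks_b, win_len=5.0, step=2.5, total=32.0, tol=0.15):
    """Sketch of an agreement-based SQI: the mean, over 5 s windows with 50%
    overlap, of the fraction of peaks on which two detectors agree within a
    tolerance (tol, in seconds, is an illustrative assumption)."""
    peaks_a, peaks_b = np.asarray(peaks_a), np.asarray(peaks_b)
    scores = []
    start = 0.0
    while start + win_len <= total:
        a = peaks_a[(peaks_a >= start) & (peaks_a < start + win_len)]
        b = peaks_b[(peaks_b >= start) & (peaks_b < start + win_len)]
        if len(a) == 0 and len(b) == 0:
            scores.append(1.0)            # nothing to disagree about
        else:
            # a peak "agrees" if the other detector has one within tol seconds
            matched = sum(np.any(np.abs(b - p) <= tol) for p in a)
            scores.append(matched / max(len(a), len(b)))
        start += step
    return float(np.mean(scores))         # single SQI for the whole window
```

Identical peak trains give an SQI of 1, while systematic disagreement between the two detectors drives the score towards 0, matching the interpretation of pSQI above.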

PCinC QT dataset
The 2006 PhysioNet/Computing in Cardiology (PCinC) Challenge QT dataset (Moody et al 2006) is publicly available and provides ECG QT interval annotations from both human and algorithmic annotators. Each participant in the Challenge was required to submit a Q onset with an accompanying T offset for one 'representative' beat in lead II of each of the 548 recordings in the Physikalisch-Technische Bundesanstalt Diagnostic ECG Database (PTBDB) (Bousseljot 1995). The QT length estimated by each participant represents the QT interval of a particular recording. The records were obtained from 290 subjects (209 men with a mean age of 55.5 years and 81 women with a mean age of 61.6 years) with a variety of ECG morphologies having different QT intervals, ranging from 256 ms to 529 ms, each represented by between one and five recordings. About 20% of the subjects were healthy controls. A total of 38 621 annotations were sourced from: 20 human annotators (grouped into Division 1 and used to derive the reference annotations); 48 automated algorithms (grouped into Division 2, as these algorithms were closed-source); and 21 automated algorithms (grouped into Division 3, as these algorithms were open-source). An additional division, Division 4, was created to combine all automated algorithms from Divisions 2 and 3, and to infer a potentially better estimation of the QT intervals. The competition score for each entry was calculated from the root-mean-square error (RMSE) between the submitted and the reference QT intervals. To reduce any possible inter-beat variation, the first 5 s segment of each record was considered as a short segment in which the QT interval was not changing dramatically (with respect to any particular beat chosen by an annotator), while retaining the highest number of annotations. Annotations that fell outside this segment were considered to be missing information and were discarded in the process of QT estimation.

Signal quality in the QT dataset
In the context of the QT dataset considered by this study, it is known that the underlying true QT interval can be modulated by heart rate variation (International Conference on Harmonization of Technical Requirements for Registration of Pharmaceuticals for Human Use 2014) and that signal quality is recording-dependent. As a proof-of-concept, a beat signal quality index (denoted bSQI), as described by Johnson et al (2015), was used in this study. The bSQI variable measures the percentage agreement of two R-peak detectors, namely gqrs and jqrs, within a 5 s segment, evaluated every 1 s. bSQI = 1 indicates 100% agreement between the two detectors, and serves as a measure that the segment is of good quality, free from noise and artefact. Any segment with bSQI < 1 indicates disagreement between the detectors, which is assumed to be due to the presence of noise and artefact. Each detector is described as follows (Johnson et al 2015): (i) gqrs is available on PhysioNet as a Matlab function of the 'wfdb' toolbox. The detector consists of a QRS matched filter with a custom-built set of heuristics. (ii) jqrs consists of a window-based peak energy detector described in Behar et al (2013), modified as follows: (1) the original band-pass filter was replaced with 'Mexican hat' wavelet filtering of the QRS complexes; (2) a heuristic was applied to ensure no detections were made during periods of flat-line; (3) a search-back procedure was included for the case of suspected missed beats.

Capnobase RR dataset
The BCLAs model was applied to fuse the RR estimates for each of the 42 subjects individually. It was evaluated with the same parameter settings as the BCLA model (see table 1). The mean of the mean-absolute error (MAE) of the inferred RR estimates from the PPG, and its standard error across all subjects, were compared to the reference RR values from capnography. The MAE was chosen as the method of comparison as it allows us to compare our results with those published in the literature. BCLAs was applied to the RR estimates extracted using the FFT-based, AR-based, and all (FFT + AR) algorithms. In addition, BCLAs was compared to BCLA, 'smart fusion' (i.e. the benchmarking algorithm proposed by the authors of the dataset (Karlen et al 2013)), the best-performing (lowest-MAE) single algorithm (denoted 'Best'), the EM algorithm proposed by Raykar et al (2010) (denoted EM-R), the scalar Simultaneous Truth and Performance Level Estimation (denoted sSTAPLE) proposed by Warfield et al (2008), and also the mean and median voting approaches.

PCinC QT dataset
As mentioned previously, the value of the lower-bound threshold (denoted T) of bSQI needs to be initialised to be greater than zero to avoid assigning zero precision to each annotator (which would result in the variance of their annotations being infinite). The evaluation of T is dataset-specific, and can be performed as follows: T is varied within the range [0.01, 1], where with T = 1, BCLAs is equivalent to the BCLA model, as all values of bSQI ⩽ 1 are mapped to 1. For each value of T, the RMSE of the inferred ground truth is computed, and the optimal threshold for the lower bound of bSQI (denoted T_min), corresponding to the minimum RMSE, can be obtained.
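This grid search can be sketched as follows, with `fuse` standing in as a hypothetical placeholder for the fitted fusion model (any function mapping annotations and quality values to an inferred ground truth):

```python
import numpy as np

def optimise_lower_bound(y, sqi, ref, fuse, grid=None):
    """Grid search for the lower bound T of the SQI: clip quality values
    below T up to T, fuse the labels, and keep the T that minimises the
    RMSE against the reference. `fuse` is a placeholder for this sketch."""
    if grid is None:
        grid = np.round(np.arange(0.01, 1.001, 0.01), 2)   # T in [0.01, 1]
    rmses = []
    for T in grid:
        t = np.maximum(sqi, T)        # with T = 1 every t becomes 1 (plain BCLA)
        z_hat = fuse(y, t)
        rmses.append(np.sqrt(np.mean((z_hat - ref) ** 2)))
    best = int(np.argmin(rmses))
    return grid[best], rmses[best]
```

Note that a supervised reference is only needed to tune T; once T_min is fixed, the fusion itself remains unsupervised.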
To test the feasibility of introducing bSQI into the BCLAs model, the same parameter settings (see table 1) were used as in the BCLA model for later comparison. The feature matrix consisted of the beat-specific heart rate (bHR) for the ith record, computed as bHR_i = 60/R̄R_i, where R̄R_i is the median of the R-R intervals (in seconds) within the segment.
In assessing the performance of BCLAs as a function of the number of annotators, a random subset of annotators was selected 100 times with replacement. This was repeated with the number of annotators varied from three to the maximum number of annotators in Division 4. RMSEs using BCLAs with the bHR feature were computed and compared to: (1) BCLA with no features (denoted NF); (2) BCLA with the bHR feature; (3) EM-R with the beat-specific heart rate and signal quality features described in previous work (Zhu et al 2014) (denoted EM-R with bHRSQIs); and (4) sSTAPLE. The RMSE was chosen as the method of comparison for this dataset as it allows us to compare our results with those published in the 2006 PCinC Challenge.
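The sub-sampling protocol above can be sketched as follows, again with `fuse` as a hypothetical placeholder for any of the fusion models being compared:

```python
import numpy as np

def subsample_rmse(y, ref, fuse, n_annotators, n_repeats=100, seed=0):
    """Repeatedly draw annotators with replacement, fuse their labels, and
    collect one RMSE per draw. y: (N, R) annotations; ref: (N,) reference."""
    rng = np.random.default_rng(seed)
    R = y.shape[1]
    rmses = []
    for _ in range(n_repeats):
        cols = rng.integers(0, R, size=n_annotators)   # sampling with replacement
        z_hat = fuse(y[:, cols])
        rmses.append(np.sqrt(np.mean((z_hat - ref) ** 2)))
    return np.array(rmses)
```

The distribution of the 100 RMSEs per setting, rather than a single value, is what allows the pairwise comparisons between models reported below.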

Capnobase RR dataset
MAEs of the RR estimates using BCLAs were similar to those using BCLA when fusing the FFT-based, AR-based, and all (FFT + AR) algorithms (see table 2). Note in figure 2(a) that pSQI ≈ 1 for most windows, with only 2.47% of windows across the 42 subjects having pSQI ⩽ 0.9, meaning that there was high agreement between the two peak detectors throughout most windows. As discussed previously, when pSQI ≈ 1 the BCLAs model is equivalent to the BCLA model, which explains why no improvement in MAE was observed. Furthermore, the reference labels were provided from the capnometry waveform, but extraction of RRs was performed on the modulation signals, making the pSQI values from the PPG less predictive of the underlying signal quality. Indeed, the pSQI values were not associated with the MAEs of the RR estimates (see figure 2(b)). Although the BCLAs approaches did not improve upon the BCLA approaches, they outperformed the other voting methods as well as the best-performing single algorithm. In particular, the BCLAs approaches (using all windows available) had smaller MAEs than smart fusion, which discarded the 44.6% of windows that had a large variance in the estimates. Our proposed model is therefore deemed to have robust performance for noisy physiological data. Figure 3(a) shows a plot of MAEs across annotators against the reference QT intervals for all recordings provided in Division 4. As there is no direct correlation between the MAEs in QT interval estimation and the length of the QT intervals, the figure demonstrates that the errors in annotations were little associated with the physical length of the reference QT intervals. A similar observation is shown in figure 3(b), where there is no linear relationship between the MAEs across annotators and the averaged bSQI extracted from all recordings. (Table values marked with e are derived from Malik et al (2002) and Goldenberg et al (2006).) The
cumulative distribution function of bSQI values for Division 4 is shown in figure 3(c). 1.08% of the annotations had bSQI ⩽ 0.3125, amounting to 260, 103, and 363 annotations in Divisions 2, 3, and 4, respectively. However, only 10.90% of annotations in Division 4 had bSQI = 1; the remainder were distributed across different values of bSQI (see figure 3(d)). Hence a lower bound on bSQI should be obtained to define the range of 'good quality' data and to avoid discarding a large number of annotations with bSQI < 1.

Lower bound optimisation of bSQI
As described earlier, the bSQI variable introduced in the BCLAs model takes values within the range (0, 1], where a bSQI value approaching 1 indicates that the annotated segment has no obvious noise or artefact, and hence the task of estimating the QT interval may be assumed to be easier than for data with lower bSQI values. Table 3 shows the optimised lower bound T of the bSQI values for the different divisions of the QT dataset. Across all divisions, the BCLAs approaches had smaller RMSEs than the BCLA approaches. Note that a range of T values (see T range in table 3) produced the same minimum RMSE (rounded to two decimal places). As the algorithms considered in Division 3 were different from those in Division 2, it was not trivial to perform a direct comparison between Divisions 2 and 3 concerning the lower bound on bSQI. Nevertheless, there was a range of T values that overlapped between the two divisions and resulted in the smallest RMSE: [0.35, 0.45] for the BCLAs-MAP approach and [0.9, 0.94] for the BCLAs-Gibbs approach. Moreover, it is interesting to note that although different T values affected the RMSEs for Divisions 2 and 3, few changes were observed in Division 4, even though it is the union of Divisions 2 and 3. This could be because, as more data are included in the model to infer the ground truth, BCLAs becomes less reliant on the information encoded by bSQI, and therefore little improvement in RMSE was observed. For the remaining analysis, the lower bound on bSQI was chosen from Division 4 to provide a generalised approach for testing the feasibility of the BCLAs methods.

Comparison of results
A first analysis was conducted to investigate the effect of introducing the bHR feature into the BCLA and BCLAs models as a function of the number of automated annotators for Division 4 (see figure 4). 100 RMSEs were computed by sub-sampling the annotators 100 times, with the number of annotators ranging from three to 69. For each selected number of annotators (e.g. three), the mean of the 100 RMSEs estimated using BCLA with NF was plotted against those estimated using BCLA with the bHR feature and BCLAs with the bHR feature. Figure 4(a) shows the results using the MAP approach, and those using the Gibbs sampling approach are shown in figure 4(b). In the MAP results, BCLA-MAP with the bHR feature showed a slight improvement (i.e. it produced smaller RMSEs at least 50% of the time out of the 100 sub-sampled RMSEs) over BCLA without features, except when the number of annotators was five, 15, or 61 (see the middle bar plot in figure 4(a)). The RMSE improvement of BCLA-Gibbs was also relatively small when the bHR feature was included: less than 60% RMSE improvement was observed when selecting different numbers of annotators (see the middle bar plot in figure 4(b)). When the BCLAs model with bHR was used, both the MAP and Gibbs sampling approaches showed small but consistent improvements over their corresponding BCLA models with bHR for any number of annotators (see the bottom bar plots in figure 4). This demonstrates the feasibility of incorporating the signal quality extension, which further reduced the overall RMSE for any number of annotators. A comparison of the performance of BCLAs with state-of-the-art benchmarking approaches such as EM-R and sSTAPLE is shown in figure 5. The figure shows that using physiological features such as bHR improves the estimate of the inferred ground truth (i.e.
produced smaller RMSEs), whereas an algorithm like sSTAPLE, which operates with no features, performed worst across different numbers of annotators. Furthermore, the results of BCLAs with the bHR feature consistently outperformed those of BCLA with the bHR feature, particularly for small numbers of annotators. For numbers of annotators varying from three to 11, the RMSE values are shown in table 4. These numbers of annotators were chosen to provide a realistic comparison, reflecting the number of automated algorithms that would be available in practice. There was no significant difference between BCLA-MAP with the bHR feature and EM-R with the bHRSQIs features. In comparison, BCLAs-MAP slightly outperformed BCLA-MAP with smaller RMSEs, and showed a significant reduction in RMSE compared to EM-R when considering 11 annotators. The BCLA-Gibbs model showed a significant improvement over EM-R when five annotators were considered. Through incorporating the signal quality extension into BCLA (i.e. BCLAs), we conclude that it provides an additional and significant improvement for BCLAs-Gibbs when three, five, and 11 annotators were considered. Overall, BCLAs-Gibbs with bHR provided the best estimation of the inferred ground truth, with the lowest RMSE compared to the other methods. Although not every improvement reached significance, figure 4 demonstrates that the BCLAs approaches consistently outperform the BCLA approaches. Moreover, the nature of the PCinC QT dataset is that the recordings are of high quality with very little noise. BCLAs is therefore expected to provide an even larger benefit for noisier datasets.

Conclusion
In this study, a signal quality extension was introduced into the Bayesian continuous-valued label aggregator (i.e. BCLAs) to reduce the error of the inferred ground truth when aggregating predictions from algorithms. In the PPG application, the record-specific pSQI was applied as a signal quality metric for respiratory rate estimation. Although no improvement was observed, as the dataset contained little noise, BCLAs still performed better than the three benchmarking algorithms (i.e. EM-R, sSTAPLE, and smart fusion), the best-performing single algorithm, and the mean and median voting approaches. For the ECG application, the results of BCLAs demonstrated that performance can be improved by incorporating record- and beat-specific signal quality (i.e. bSQI) and heart rate (i.e. bHR), extracted from the ECG independently of the estimates. The fully-Bayesian approach (i.e. BCLAs-Gibbs) was shown to be more robust across the two datasets and produced the smallest errors. A significant reduction in error using BCLAs-Gibbs was also observed when only three algorithms were selected. In summary, incorporating signal quality in the BCLAs model better informs its belief in the annotations provided and hence further improves the estimation of the inferred ground truth.