Diagnostic Quality Assessment for Low-Dimensional ECG Representations

There have been several attempts to quantify the diagnostic distortion caused by algorithms that perform low-dimensional electrocardiogram (ECG) representation. However, there is no universally accepted quantitative measure that allows the diagnostic distortion arising from denoising, compression, and ECG beat representation algorithms to be determined. Hence, the main objective of this work was to develop a framework that enables biomedical engineers to efficiently and reliably assess diagnostic distortion resulting from ECG processing algorithms. We propose a semiautomatic framework for quantifying the diagnostic resemblance between original and denoised/reconstructed ECGs. Evaluation of the ECG must be done manually, but is kept simple and does not require medical training. In a case study, we quantified the agreement between raw and reconstructed (denoised) ECG recordings by means of kappa-based statistical tests. The proposed methodology takes into account that the observers may agree by chance alone. Consequently, for the case study, our statistical analysis reports the "true", beyond-chance agreement, in contrast to other, less robust measures such as simple percent agreement calculations. Our framework allows efficient assessment of clinically important diagnostic distortion, a potential side effect of ECG (pre-)processing algorithms. Accurate quantification of a possible diagnostic loss is critical to any subsequent ECG signal analysis, for instance, the detection of ischemic ST episodes in long-term ECG recordings.


Introduction
In medical applications, expert systems require both the extraction of reliable information from biomedical signals and efficient representation of this knowledge. Extracting relevant features is essential, since these are usually either fed to rule-based expert systems or machine learning methods or used directly by medical experts to derive a diagnosis. Clearly, the main objective of biomedical signal processing methods is therefore to efficiently represent important medical knowledge, eliminating noisy and redundant signal features while keeping the informative ones. However, applying denoising or feature reduction algorithms may lead to unintentional removal of diagnostic information. Common metrics in signal processing, such as the signal-to-noise ratio (SNR), mean squared error (MSE), and standard error (STDERR), quantify the numerical error, but not the loss of diagnostic information. For instance, in the case of evaluating an electrocardiogram (ECG) that is superimposed with noise (e.g., power-line interference or baseline wander), the values of the previously mentioned objective measures may improve significantly if denoising algorithms are applied or the signal is represented in a low-dimensional space. However, these measures do not consider possible diagnostic distortions that may change the interpretation of the ECG curves and, as a consequence, the diagnosis. Signal-quality indices (SQIs) form another class of metrics that lies between quantitative and qualitative distortion measures. These application-oriented metrics are intended to quantify the suitability of ECG signals for deriving reliable estimates of particular medical features, such as the heart rate [1,2]. Note that the use cases of SQIs are limited to the application in question, and thus they cannot be used to perform a thorough agreement analysis between the diagnostic features of the raw ECG and its low-dimensional representations.
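To make this limitation concrete, the following minimal sketch (using a synthetic, hypothetical waveform standing in for an ECG) computes MSE and SNR. An idealized filter output scores numerically perfect on both metrics, regardless of whether diagnostically relevant detail survived the filtering:

```python
import numpy as np

def mse(x, y):
    """Mean squared error between two equally sampled signals."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return float(np.mean((x - y) ** 2))

def snr_db(x, y):
    """Signal-to-noise ratio in dB, treating the residual x - y as noise."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return float(10.0 * np.log10(np.sum(x ** 2) / np.sum((x - y) ** 2)))

# Hypothetical waveforms: a clean oscillation plus slow baseline wander.
t = np.linspace(0.0, 1.0, 500)
clean = np.sin(2 * np.pi * 5 * t)
noisy = clean + 0.3 * np.sin(2 * np.pi * 0.7 * t)
denoised = clean  # idealized filter output

# mse(clean, denoised) is 0: numerically "perfect", yet the metric would be
# equally perfect even if a subtle, diagnostically relevant ST shift had been
# removed along with the noise.
```

The point of the sketch is purely illustrative: both quantities are functions of the numerical residual only, so they cannot distinguish removed noise from removed diagnostic content.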
In fact, to date there is no universally accepted quantitative measure that captures the diagnostic distortion of such biomedical signal processing methodologies.
In this paper, we propose a testing framework that is suitable for measuring the diagnostic resemblance between preprocessed ECG signals and original recordings. This enables medical validation of ECG processing algorithms, such as filtering and data compression, and extraction/monitoring of clinical features. The latter has recently grown in importance, mainly due to the increasing amounts of data recordable (e.g., long-term ECG recordings), which require automated information extraction in order to be manageable by medical experts. In addition to standard clinical features (e.g., QT interval, QRS duration), the shapes of individual waves and characteristic shape changes over time are of high diagnostic interest and have been shown to carry important information [3]. However, it is exactly this information that might get lost when applying algorithms for low-dimensional information representation. Eliminating noisy signal features, such as the baseline wander, may inadvertently lead to the removal of ischemic ST episodes [4], which has a direct influence on diagnosis. Further, wave-shape feature interpretation might change: for instance, an originally positive wave may then be identified as a biphasic one. Consequently, a methodology is needed that enables quantitative representation of such potentially diagnostically relevant changes. However, reliable evaluation of whether diagnostic information has changed due to algorithmic processing will always require human expertise. Our test design minimizes the workload and does not require the evaluating person to be medically qualified, since the test involves deciding between distinct wave shape types rather than making a specific diagnosis. Biomedical engineers can therefore easily carry out this evaluation themselves and can improve their algorithms more rapidly and efficiently, as they do not have to wait for feedback from medical experts who are able to diagnose a possibly very specific pathology.
We demonstrate this by comparing original ECG recordings (selected from [5]) and their corresponding low-dimensional representations, which were obtained by applying an approach we have previously developed [6]. Our framework is intended to help biomedical engineers evaluate the diagnostic distortion of an ECG processing algorithm for real-world recordings in the early stages of development. In contrast to [6], where we evaluated diagnostic distortion based on synthetic data only, in this work we elaborate on real ECG recordings as they appear in daily clinical practice. Although synthetic data are crucial in biomedical signal processing, since they provide the ground truth (clean signal), they do not capture the wide variety of possible ECG morphologies and noise. It should be emphasized, however, that alongside the proposed diagnostic distortion analysis, a complete performance report must assess the robustness of the tested algorithms with respect to the noise level [7], the perturbation of the input parameters [8], etc.
The methodology we propose for agreement analysis is inspired by former work on diagnostic distortion measures, which we review in Section 2. The design concepts of our testing framework are discussed in Section 3. Section 4 briefly describes our previous work on low-dimensional ECG representations [9], followed by an analysis of the agreement between the diagnostic features of original ECGs and their approximations. Finally, Sections 5 and 6 discuss the results and conclude the paper, respectively.

Related work
The approaches most closely related to ours seek to quantify diagnostic relevance in ECG signal compression. Compression is very similar to feature extraction in that ECG data samples are represented in a low-dimensional (feature) space from which the original ECG signal can be restored. In order to prove that the reconstruction preserves diagnostic information, the quality of the restored ECG signal must be evaluated.
Diagnostic distortion can be measured by objective methods that are based on mathematical models. The most commonly used objective evaluation method uses the percent root-mean-square difference (PRD), measuring the squared error between the original and reconstructed ECG signals. The PRD is a numerical quantity that assumes equal error contribution over the whole ECG. This is not appropriate for evaluating diagnostic distortion, since a numerical error of the same degree, for instance, in the approximations of the QRS complex and the P wave does not imply the same level of diagnostic error. The weighted diagnostic distortion (WDD) measure was an early attempt to tackle this problem and to quantify the diagnostic error of compressed ECG signals [10]. It compares amplitude, duration, and shape features of original and reconstructed ECGs, and assigns weights to the corresponding error terms based on their diagnostic relevance. These features are, however, extracted automatically, and thus the procedure is prone to inaccuracies of the ECG delineation algorithms. Later, Al-Fahoum [11] introduced the wavelet-based weighted PRD (WWPRD), which quantifies the diagnostic error in wavelet space: In each wavelet subband, the PRD between the wavelet coefficients of the original and the reconstructed ECG is calculated, and then the error contribution of each subband is weighted based on its diagnostic significance. Although the WWPRD can be easily calculated and correlates very well with clinically evaluated results, it is heavily influenced by the presence of noise in the relevant subbands. Manikandan et al. [12] therefore proposed the wavelet-energy-based diagnostic distortion (WEDD) measure, a reweighted variant of the WWPRD. In WEDD, the weights are equal to the relative energy between the overall signal and the corresponding subbands. In this way, low-energy, high-frequency noise can be suppressed in the PRD calculation.
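The PRD itself is straightforward to compute. The sketch below (a crude synthetic spike serves as a hypothetical stand-in for a beat) shows how a pure baseline drift, which leaves the diagnostic content untouched, nonetheless inflates the PRD considerably:

```python
import numpy as np

def prd(x, x_hat):
    """Percent root-mean-square difference between original x and its
    reconstruction x_hat (variant without mean removal)."""
    x, x_hat = np.asarray(x, float), np.asarray(x_hat, float)
    return float(100.0 * np.sqrt(np.sum((x - x_hat) ** 2) / np.sum(x ** 2)))

# Hypothetical beat: a narrow Gaussian spike as a crude QRS stand-in.
t = np.linspace(0.0, 1.0, 1000)
beat = np.exp(-((t - 0.5) ** 2) / 0.002)
drifted = beat + 0.2 * np.sin(np.pi * t)  # pure baseline drift added

# Identical diagnostic content, yet the PRD reports a large "error".
distortion = prd(beat, drifted)
```

Weighted variants such as WWPRD and WEDD change only how the per-subband residuals are weighted; they inherit the same sensitivity whenever the distortion energy falls into diagnostically weighted bands.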
However, low-frequency, high-energy noise that overlaps with the diagnostic content of the ECG is still counted in the WEDD. Fig. 1 illustrates this phenomenon for an ECG distorted by a sinusoidal baseline drift. Even though the diagnostic content remained unchanged, the WEDD, WWPRD, and PRD measures suggest a significant distortion of the ECG curve. All these metrics have their limitations in medical validation, as this baseline removal example demonstrates.
Figure 1: Analyzing the diagnostic distortion after perfect baseline removal, using objective measures: PRD = 20.4%, WWPRD = 31.2%, WEDD = 26.0%, which corresponds to the quality groups "not bad" and "bad" according to Tab. 8 in [12].

Clearly, the application of such objective evaluation methods does not require feedback from cardiologists, and thus the test results are not influenced by intra- and inter-observer variability. Despite their advantages, objective distortion measures are not fully accepted in the ECG signal processing community due to their lack of medical validation [13]. Subjective methods, such as the mean opinion score (MOS), are therefore considered to be the gold standard [13]. In this case, quality is assessed by cardiologists, who check whether original and reconstructed signals imply the same diagnosis. The results of these tests are then converted into a single measure called the MOS error value, which quantifies the diagnostic distortion of the compression/feature extraction technique under study. The MOS test introduced by Zigel et al. [10] and its variant [14] remain in use for evaluating ECG processing algorithms. In these studies, cardiologists rated the general quality of the ECG recordings, and investigated whether the same interpretation would result from the medical features of the original and the processed ECGs. To this end, a set of ECG features was considered which included both simple shape features, such as positivity/negativity of the T wave, and more complex morphological characteristics, such as left/right bundle branch block (LBBB/RBBB) and premature ventricular contraction (PVC). However, some of these features are ambiguous (e.g., waveform symmetry, which is difficult to distinguish from slightly asymmetric cases), and thus increase intra- and inter-observer variability. Other morphological characteristics are simply too complex to identify without the help of clinicians. Another drawback of previous MOS tests is that they consider only the overall percentage of agreement and ignore the possibility that the observers may agree by chance alone. For instance, the presence of a delta wave in the QRS complex is relatively rare [15], and therefore in most cases cardiologists will simply exclude this diagnosis.
We developed a carefully designed MOS test for evaluating the diagnostic relevance of ECG preprocessing algorithms; it uses only simple shape features that can be recognized by biomedical experts and that reduce intra- and inter-observer variability. The results of our test support design decisions and speed up the development of ECG processing algorithms, since input from medical experts is then only needed in the final testing phase. Note that current approaches consider the proportion of observed agreement alone as an index of concordance. We, in contrast, suggest the use of Cohen's kappa for assessing the performance of ECG preprocessing algorithms, as it takes agreement by chance into account [16,17,18].

Table 1: Recordings selected for evaluating the ECG beat representation algorithm [5].

Test design
The main objective of the test design was to allow fast and reliable evaluation of the diagnostic distortion caused by low-dimensional ECG signal representation. Together with a medical expert, we selected 26 recordings from the Massachusetts General Hospital/Marquette Foundation (MGH/MF) database [5,19]. This database has a detailed patient guide, which allows recordings to be chosen according to the occurrence of various pathologies and waveforms. An overview of the selected recordings is given in Table 1; for more detailed information we refer to the patient guide provided in [19]. Our main inclusion criterion for selecting the recordings was the occurrence of various (abnormal) wave morphologies, for instance, positive/negative, biphasic, and flattened waves. Further, different manifestations of the QRS complex (R, Rs, RS, etc.) and the ST segment (e.g., elevated or depressed) were important criteria, since these are key diagnostic features which should be preserved by ECG compression/denoising algorithms. A medical expert chose the recordings based on the patient guide and visual judgement of the ECG strips.
Subsequently, as described in Sec. 4.1, these recordings were transformed into a low-dimensional representation using Hermite and sigmoidal functions combined with spline interpolation [6]. These functions have been extensively studied in several ECG-related medical applications, such as data compression [20], heartbeat clustering [21], and myocardial infarction detection [22]. In order to investigate the diagnostic distortion of Hermite-based ECG decomposition, a test set was built that included 32 original 3-lead ECG recordings and 32 reconstructed 3-lead ECG recordings, 12 of which (6 original and 6 reconstructed) occurred twice. This was done to allow assessment of self-consistency (within-observer agreement). In total, 64 3-lead ECG recordings (i.e., 192 ECG strips) were therefore evaluated by experts. The test set was split into 4 subsets, each of which was to be processed on a different day to avoid exhaustion and possible resulting inaccuracies that would bias the results. The recordings were arranged in a pseudo-randomised order with the restriction that original and reconstructed ECG recordings were not allowed to occur in the same subset. To simulate daily clinical practice, the ECG recordings were presented on a standard ECG grid (10 mm/mV, 25 mm/s). An example recording is shown in Fig. 2 (note that this recording is scaled for better visibility). The questionnaire for assessing diagnostic distortion was then designed based mainly on two factors: First, in our preliminary experiments, we realized that even highly experienced physicians were not able to identify with confidence more complex pathologies such as a left bundle branch block (LBBB) or a right bundle branch block (RBBB) based on the ECG recordings alone. This is because additional laboratory tests or additional ECG leads would be needed for a sufficiently accurate diagnosis.
Therefore, if the questionnaire offers options such as LBBB and RBBB, experts tend not to tick these boxes unless it is a very clear case, which could of course bias the evaluation significantly. The results may lead one to believe that the reconstructed (low-dimensional) ECG still retains the complete diagnostic information, but this might just be due to the ECG always being labeled as normal by the expert.
Second, development of a signal representation algorithm should require medical expertise only in the final testing phase. We therefore included only simple evaluation criteria for judging the diagnostic distortion of the P and T waves, the QRS complex, and the ST segment, as illustrated in Fig. 2.
Specifically, this means that in a first step the ECG wave segments of all available leads are to be evaluated according to their general shape features, for instance, insignificant, positive, negative, and biphasic in the case of the P wave. Clearly, one of these options must be selected. Additionally, depending on the segment investigated, the evaluating person may tick an optional box which flags a (general) pathology suggested by the wave.
Subsequently, the quality of the single leads is to be rated, where the experts are asked to focus on the signal clarity of the wave. This allows assessment of whether the low-dimensional representation degrades, retains, or even increases the quality of the ECG recording. A quality improvement would imply that noisy signal features were successfully eliminated while important diagnostic features were retained. Finally, a main diagnosis is to be given as free text. It should be mentioned, however, that this part is considered optional and should only be answered by medical experts (in the final testing stage). This allows assessing whether the main diagnosis changed between the ECG recording and its low-dimensional representation, and it serves as an additional source for identifying a possible diagnostic distortion.
For our case study a total of 3 physicians were briefed with the information above and with additional instructions in written (see supplementary material) and oral form.

Case study: low-dimensional ECG representation
In the best case, low-dimensional ECG representation preserves important diagnostic/morphological features, while redundant and noisy signal features are mostly eliminated. This is, however, a difficult task, since the frequency spectra of the ECG and possible noise (e.g., baseline wander) overlap in most cases [23]. Therefore, the diagnostic distortion of these preprocessing techniques must be investigated before they are applied to real-world problems. In this study, we tested the reliability of the approach from our previous work (Sec. 4.1) by assessing its between-method agreement. More specifically, by means of statistical tests we checked whether the two measurements (i.e., the original signal and its reconstructed low-dimensional representation) produce the same diagnostic features defined in our MOS test.
In order to estimate the consensus between original and reconstructed ECGs, we computed the proportion of observed agreement p_o and the κ coefficients between each (original and reconstructed) feature pair. In the case of dichotomous features, these quantities can be calculated as follows:

    p_o = (a + d)/n,   p_e = (f_1 g_1 + f_2 g_2)/n²,   κ = (p_o − p_e)/(1 − p_e),   (1)

where p_e denotes the chance agreement, f_i, g_i are the marginal totals, and a and d are the numbers of agreements on present and absent values of the corresponding feature (see Tab. 2). Note that, in the case of morphological features, we defined more than two mutually exclusive categories, for which the calculations in Eq. (1) can be generalized according to [24,25]. Cohen's kappa is widely used in reliability studies in clinical research [16,17,18]. The kappa values range from −1 to 1; κ = 0 suggests that the observed agreement is no better than would be expected by chance alone, while κ = 1 implies perfect agreement, and negative values indicate potential systematic disagreement between the observers. Other values of kappa can be interpreted based on Tab. 3, as proposed by Landis and Koch [26].
In our case, achieving perfect agreement (i.e., κ = 1) is unrealistic because the evaluating cardiologists typically have different levels of experience and mental fatigue. Taking this into account, we also report κ_max, which expresses the maximum attainable kappa provided that the marginal totals f_i, g_i are fixed. The value of κ_max can easily be calculated by substituting p_o with p_o,max = (min(f_1, g_1) + min(f_2, g_2))/n in Eq. (1). The difference κ_max − κ indicates the unachieved agreement beyond chance constrained by the marginal totals [16].
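Eq. (1) and κ_max can be sketched directly from a 2×2 agreement table; the counts in the example below are hypothetical:

```python
def kappa_2x2(a, b, c, d):
    """Observed agreement, chance agreement, Cohen's kappa, and the maximum
    attainable kappa for a dichotomous feature, mirroring Eq. (1).
    a and d count agreements on 'present' and 'absent'; b and c count
    the two kinds of disagreement."""
    n = a + b + c + d
    f1, f2 = a + b, c + d              # marginal totals of one rating
    g1, g2 = a + c, b + d              # marginal totals of the other
    p_o = (a + d) / n                  # proportion of observed agreement
    p_e = (f1 * g1 + f2 * g2) / n**2   # chance agreement
    kappa = (p_o - p_e) / (1 - p_e)
    p_o_max = (min(f1, g1) + min(f2, g2)) / n
    kappa_max = (p_o_max - p_e) / (1 - p_e)
    return p_o, p_e, kappa, kappa_max

# Hypothetical counts: 40 joint 'present', 50 joint 'absent', 10 disagreements.
p_o, p_e, k, k_max = kappa_2x2(40, 5, 5, 50)
```

With these counts the marginals are balanced, so κ_max = 1; skewed marginals would cap κ_max below 1, and the gap κ_max − κ quantifies the agreement left unachieved given those marginals.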
In ECG compression, it is common practice to provide p_o as a measure of diagnostic concordance [14]. However, p_o can be misleading, especially when most of the observations fall within a single category [27]. For instance, previous studies [10,14] considered the presence of delta waves in the QRS complex; however, compared to other QRS shape features, these occur relatively rarely [15]. In most cases, cardiologists would therefore agree on the absence of delta waves in both the original and the filtered signals. To avoid such effects, alongside the percent-agreement figures, we report the corresponding κ coefficients, their confidence intervals, and the maximum attainable kappa κ_max.

ECG beat representation algorithm
Among the classes of low-dimensional ECG representations, the Hermite-based decomposition has been widely studied, especially for extracting features prior to applying machine learning algorithms (see, e.g., Chapter 12 in Ref. [28]). These approaches utilize similarities between the shapes of Hermite functions and ECG waveforms. In a recent work, we extended the theoretical framework of Hermite-based ECG models by sigmoidal functions combined with piecewise polynomial interpolation [6]. One of the main goals of this work was to reduce noisy and redundant signal features while retaining diagnostically important waveform features. Hence, we sought not only to reduce dimensionality, but also to simultaneously denoise the signal and to segment the ECG into its fundamental waves (P-QRS-T). We used adaptive Hermite and sigmoidal functions to extract important ECG waveform information, while piecewise polynomial interpolation captured mainly the undesired baseline wander. Additionally, because of the (smooth) basis functions we selected, high-frequency noise was also reduced. Figure 3 shows an example ECG trace, which was segmented into its fundamental parts, that is, P wave, QRS complex, ST/T segment, T wave, and baseline estimation. As can be seen, the low-dimensional representation accurately describes the underlying ECG. This was achieved by developing a nonlinear least-squares model with an appropriate set of basis functions. This model was first tailored precisely to a single person by nonlinear global optimization, and then readjusted beat-by-beat by means of nonlinear local optimization. Optimization was carried out with respect to the translation and dilation of the basis functions used to represent the single waves (P-QRS-T). Thus, we created a person-specific model which allows tracking morphological changes in a low-dimensional space while retaining the characteristic shape information of the ECG trace.
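The core of such a decomposition is a least-squares fit with Hermite functions. The sketch below keeps dilation and translation fixed and solves only the linear coefficient problem; the full method in [6] additionally optimizes these nonlinear parameters, so this is an illustration of the basis-function idea rather than a reimplementation:

```python
import numpy as np
from numpy.polynomial.hermite import hermval
from math import factorial, pi

def hermite_fn(n, t):
    """Orthonormal Hermite function phi_n(t) = c_n * H_n(t) * exp(-t^2/2)."""
    coeffs = [0.0] * n + [1.0]  # select the physicists' polynomial H_n
    norm = 1.0 / np.sqrt(2.0 ** n * factorial(n) * np.sqrt(pi))
    return norm * hermval(t, coeffs) * np.exp(-t ** 2 / 2.0)

def fit_hermite(signal, t, order, dilation):
    """Linear least-squares fit of `signal` with the first `order` Hermite
    functions at a fixed dilation. Dilation/translation are kept fixed here
    for clarity, unlike the nonlinear optimization in the full method."""
    basis = np.column_stack([hermite_fn(n, t / dilation) for n in range(order)])
    coeffs, *_ = np.linalg.lstsq(basis, signal, rcond=None)
    return coeffs, basis @ coeffs

# Hypothetical beat: a Gaussian-like wave that phi_0 matches exactly.
t = np.linspace(-4.0, 4.0, 400)
beat = hermite_fn(0, t)
coeffs, recon = fit_hermite(beat, t, order=6, dilation=1.0)
```

Because the Hermite functions are smooth and decay rapidly, truncating the expansion at a low order acts as the high-frequency noise suppression described above.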

Experiments and results
Following the recommendations of Watson and Petrie [17], we conducted three experiments to evaluate the reliability of our previously published approach to low-dimensional ECG representation [6]. To this end, three cardiologists (C1, C2, C3), who were not allowed to discuss their answers with each other, filled out the blind test described in Section 3 and illustrated in Figure 2. We partitioned the test records into four work packages to decrease the daily workload for the cardiologists and thus minimize factors that can negatively influence the evaluation, such as mental fatigue and lack of motivation.

Between-method agreement
In this experiment, we studied the diagnostic concordance between the features of the original (raw) ECG and of the reconstructed signal [6]. Fig. 4 shows that the proportion of observed agreements is about 80% for all cardiologists and all ECG features except for the P wave, where cardiologist C1 achieved only 60% agreement. However, note that the self-consistency of C1 was also low for the P wave (see, e.g., Fig. 7). This is due to the P wave being a low-amplitude ECG component which has more ambiguous characteristics in the presence of noise than the QRS and the T waves. This also explains why the between-method concordances in Fig. 4 vary considerably between cardiologists in the case of the P wave. We also evaluated the diagnostic concordance between the pathological wave shapes of the original and the reconstructed ECG signals. According to Fig. 5, the observed agreement was greater than 80% in most cases, and close to 100% for the P wave and for the QRS complex. Consequently, the low-dimensional beat representation investigated did not significantly increase ambiguity in terms of pathological versus non-pathological waveshape class.
We used kappa statistics to analyze the chance-corrected observed agreement between the features of the original and the filtered ECG (see Tab. 4). The highest κ, with the narrowest confidence intervals and the value closest to κ_max, was achieved for the QRS complex. Second best was the T wave, with substantial agreement between the original ECG and the reconstructed signal (cf. Table 3). The kappa scores of the ST segment morphology are lower, which indicates fair agreement between the features. However, it seems that the low-dimensional representation preserved the ST depression and elevation, as indicated by high kappa scores. Furthermore, the maximum attainable kappa was observed for C2 and C3 in the case of ST depression. This corroborates our previous claims for our joint Hermite-sigmoid model [6], namely, that the sigmoid functions perform well in modeling the on/offset shifts of the QRS complex and the ST elevation/depression. As with the diagnostic concordance in Fig. 4, the kappa scores of the P wave morphology are not as consistent as the scores of the previously mentioned features. In this case, three different levels of agreement (i.e., fair, moderate, and substantial) can be observed between the original and the filtered ECG features. Note that none of the confidence intervals includes 0; we would thus reject the null hypothesis of κ = 0, meaning that the observed agreement cannot be attributed to chance alone.

Table 4: Using κ statistics to analyze the between-method agreement between original and filtered ECG features.
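Such a confidence interval can be sketched with a common large-sample approximation of the standard error of κ; the paper does not specify which interval estimator was used, so treat this particular choice as an assumption:

```python
import math

def kappa_ci(p_o, p_e, n, z=1.96):
    """Cohen's kappa with an approximate 95% confidence interval based on the
    simple large-sample standard error sqrt(p_o(1-p_o) / (n(1-p_e)^2)).
    The estimator choice is an assumption, not taken from the paper."""
    kappa = (p_o - p_e) / (1.0 - p_e)
    se = math.sqrt(p_o * (1.0 - p_o) / (n * (1.0 - p_e) ** 2))
    return kappa, kappa - z * se, kappa + z * se

# Hypothetical values: p_o = 0.9, p_e = 0.505, n = 100 rated items.
k, lo, hi = kappa_ci(0.9, 0.505, 100)
```

An interval that excludes 0, as in this example, corresponds to rejecting the null hypothesis κ = 0 at the 5% level.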

Inter-rater agreement
We also evaluated the reproducibility of our tests by analyzing the inter-rater agreement. Fig. 6 shows the inter-rater diagnostic concordance. As with the between-method concordance, the percent agreement is around 80%, except for the QRS complex, for which the value is considerably lower than for the other features. The results show that the interpretation of these features can vary between cardiologists. For instance, a qRs-type QRS complex with very low-amplitude q and s waves can easily be misclassified as an R, qR, or Rs wave depending on the stringency of the examiner. This explains the relatively low inter-rater concordance, which is also supported by the kappa statistics in Tab. 5. The kappa values related to the QRS morphology are lower than those for the T wave, but the uncertainty of these estimates of κ is also higher. Overall, in most cases there was moderate agreement between the cardiologists on the morphological features, except for the T wave, where the level of agreement was substantial. The discrepancies were probably caused by the different levels of medical experience and by the fact that the clinical standards and decision rules applied can vary between cardiologists.

Within-observer agreement
To assess the repeatability of our test, we studied the within-observer concordance (Fig. 7) by using 12 repeated records including three leads. The agreement observed was much higher than in the between-method and inter-rater cases. This was to be expected, since an individual cardiologist's interpretation of ECG features should not vary much. The results indicate high self-consistency among the medical experts, which is also supported by the kappa statistics. In fact, κ is very close to κ_max and suggests almost perfect agreement for the P wave and the QRS complex, and substantial agreement for the T wave and the ST segment.

Table 5: Using κ statistics to analyze the inter-rater agreement for the ECG features.
The kappa values are very low for C2 and C3 in the cases of general ST morphology and ST elevation, respectively. This is due to the first paradox of κ, which is caused by the high prevalence of the corresponding ST features. For instance, C3 considered ST elevation to be absent in almost all the repeated records, and agreed with himself in 35 of the overall 36 cases including the three leads. Although we would expect almost perfect agreement, the high prevalence reduces the value of kappa, since p_e ≈ p_o (see, e.g., [29,30]). For the same reason, κ is not applicable (n/a) in the case of elevated ST for C2, who achieved perfect agreement with himself on the absence of elevated ST in 36 out of 36 cases. This implies perfect agreement, but the denominator in Eq. (1) becomes zero due to p_e = 1. In summary, we observed a high level of self-consistency among the cardiologists, which demonstrates the robustness of this study.
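The paradox can be reproduced numerically: two hypothetical 2×2 tables with identical observed agreement yield very different κ once one category dominates, and in the extreme case κ collapses to 0 despite 97% agreement:

```python
def kappa_2x2(a, b, c, d):
    """Cohen's kappa from a 2x2 table (a, d: agreements; b, c: disagreements)."""
    n = a + b + c + d
    p_o = (a + d) / n
    p_e = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2
    return (p_o - p_e) / (1 - p_e)

# Both hypothetical ratings agree in 34 of 36 cases, but the prevalence of
# the 'present' category differs drastically between the two tables:
k_balanced = kappa_2x2(17, 1, 1, 17)  # categories balanced
k_skewed = kappa_2x2(1, 1, 1, 33)     # 'absent' dominates

# Extreme hypothetical case: 35 of 36 self-agreements, nearly all on 'absent';
# p_e equals p_o and kappa collapses to 0. (With 36/36 agreements on a single
# category, 1 - p_e would be 0 and kappa would be undefined, i.e., n/a.)
k_extreme = kappa_2x2(0, 1, 0, 35)
```

This is exactly why the percent-agreement figures are reported alongside κ: with high prevalence, a low or undefined κ can coexist with near-perfect observed agreement.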

Discussion
Although κ is the most commonly used agreement measure in the literature, it is often criticized as being somewhat difficult to interpret in particular situations [16]. For instance, the prevalence of the attributes affects the magnitude of κ, which was in fact the case with the ST morphologies in Tab. 6. This effect becomes apparent when the proportion of agreements on one attribute differs significantly from those for the others. High prevalence causes high chance agreement p_e, which reduces the value of κ accordingly. Regarding our experiments on the various types of concordance, we found that the attributes of positive P and T waves and the horizontal ST segment had the highest prevalence. This was to be expected, since these are the most common waveforms in the ECG. In these cases, low values of κ do not necessarily imply low rates of overall agreement. Therefore, alongside the value of κ, we also reported the diagnostic concordance for each experiment in Figs. 4-7. Note that there is greater potential for disagreement in the case of a large number of optional categories, as is the case for the QRS complex. Thus, the high values of κ indicate very strong agreement especially for the QRS morphologies, where we used 10 different shape categories. Generally, in order to counteract high prevalence and the resulting low κ values, this approach could be extended with interesting/rare ECG recordings, which would result in a more heterogeneous set of wave shapes.
We also evaluated the general quality scores for the original signal (Q_o) and for the reconstructed ECG (Q_r). Fig. 8 plots the differences Q_o − Q_r for each cardiologist and for the mean. Excluding the outliers, the differences have negative signs, which indicates improved quality in the case of the reconstructed signal [6]. Although this visual enhancement is expected, filtering does not necessarily result in better diagnostic quality. For instance, FIR and IIR filters can remove the noise in the targeted frequency band, but they may also introduce ringing artifacts due to the well-known Gibbs phenomenon [28]. However, the between-method agreement study and the quality score differences show that our joint Hermite-sigmoid model [6] represents the ECG in a low-dimensional space without diagnostic distortion, and even enhances the visual quality of the ECG in most cases.
Hence, compared to the objective measures PRD, WWPRD [11], and WEDD [12], we obtain a more reliable assessment of the diagnostic distortion and signal quality of the reconstructed signal. Fig. 9 illustrates that the wavelet-based objective measures in particular perform well for very low- and very high-frequency noise that does not overlap with the relevant ECG subbands (record index 8, mgh056). Fig. 10, in contrast, reveals their weaknesses (record index 20, mgh184): here the ECG is superimposed with baseline wander that overlaps with relevant ECG subbands, and the objective measures therefore indicate a high diagnostic distortion. In fact, the high values (PRD=67.5%, WWPRD=29.0%, WEDD=59.7%) all correspond to low quality groups according to Tab. 8 in [12]. However, the signal quality is significantly improved by the low-dimensional representation (Fig. 10); hence, the objective measures are misleading for this recording. Indeed, the visual inspection of the original and the filtered ECGs conducted by three cardiologists confirmed the quality improvement for this record (see Fig. 8) and did not indicate any important diagnostic loss. Clearly, the objective measures are helpful and reliable in many cases (e.g., Fig. 9); nevertheless, they are of limited use for noise overlapping with the relevant ECG subbands, which demands frameworks, such as the one suggested in this work, that overcome this limitation.
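The weakness of the numerical measures can be sketched with PRD; the mean-removed normalization and the toy signals below are assumptions for illustration, not the exact implementation of [11,12]. The example mirrors the mgh184 case: because PRD only compares the output with the (noisy) original, successfully removing baseline wander registers as a large "distortion".

```python
import numpy as np

def prd(x, x_rec, remove_mean=True):
    """Percentage RMS difference (PRD) between an original signal x and
    its reconstruction x_rec; remove_mean subtracts the DC offset of x
    from the normalization term, a variant common in ECG compression."""
    x, x_rec = np.asarray(x, float), np.asarray(x_rec, float)
    ref = x - x.mean() if remove_mean else x
    return 100.0 * np.sqrt(np.sum((x - x_rec) ** 2) / np.sum(ref ** 2))

t = np.linspace(0.0, 1.0, 500)
clean = np.sin(2 * np.pi * 5 * t)            # stand-in for the ECG content
wander = 0.8 * np.sin(2 * np.pi * 0.5 * t)   # low-frequency baseline drift
noisy = clean + wander                       # the recorded "original"

# Perfect denoising recovers `clean`, yet PRD against the noisy original
# is large: the numerical measure penalizes the improvement itself.
print(round(prd(noisy, clean), 1))
```

An agreement-based framework such as the one proposed here sidesteps this problem, since it compares diagnostic interpretations rather than sample-wise amplitudes.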
Furthermore, as illustrated in Figure 2, the questionnaire also offers the option of proposing a main diagnosis for the three ECG leads. This question is intended to be answered by medical experts only, who are briefed that they do not have to give a definitive answer, but rather their best guess based on the three leads. It provides an additional source for identifying diagnostic distortions that might not be covered by the standard questionnaire (in case the diagnosis differs significantly between raw and reconstructed signals). Clearly, this question is challenging and should therefore, if it is to bring value, only be answered in the final stage of algorithm evaluation by experienced, meticulous experts. In our case, the three experts gave a main diagnosis, showing good agreement in most of the cases, that is, 73 %, 88 % and 92 %, respectively. A more detailed analysis by a medical expert, who subsequently investigated why the raw and reconstructed ECG recordings led to different diagnoses, uncovered a loss of diagnostic information in the ST segment. On this basis, our previously published algorithm [6] can be further improved in the future. Alesanco et al. [14] argued that this third question should be presented in a semiblind way, that is, showing both the raw and the reconstructed ECGs to the medical expert at the same time and asking whether they would evaluate them differently. This could potentially reduce intra-subject variability caused, for instance, by varying levels of attention when judging the raw and the reconstructed recordings. However, the drawback of this approach is that, even unintentionally, one typically tends to find the same characteristics in two recordings presented at the same time. Consequently, this may lead to misjudging an important diagnostic loss, and we therefore believe that the blind test is more appropriate in this case.

Conclusion
We have proposed a methodology for quantifying the diagnostic distortion caused by ECG signal processing algorithms, for instance, filtering, segmentation, and data compression. The resulting low-dimensional ECG representations may distort the diagnostic information contained in the ECG, which we quantified using Cohen's kappa. Note that κ is affected by prevalence, and the corresponding quality scores are thus not meant for direct comparisons between ECG processing algorithms of different studies.
Instead, our goal was to design a testing framework that includes a questionnaire which minimizes the time needed to train non-medical staff to evaluate the diagnostic distortion of ECG decompositions. To this end, we chose scoring rubrics such that the interpretation of original and reconstructed ECGs is clear and some level of objectivity is imposed on the rating scale. The proposed test is therefore free of ambiguous medical features, such as symmetry and notches. In a case study, we considered Hermite-based characterization of ECG waveforms, which is a very popular approach in this field. In particular, we showed that low-dimensional heartbeat representation by means of Hermite and sigmoidal functions preserves diagnostically relevant features of the ECG.