Explaining Deep Learning for ECG Analysis: Building Blocks for Auditing and Knowledge Discovery

Deep neural networks have become increasingly popular for analyzing ECG data because of their ability to accurately identify cardiac conditions and hidden clinical factors. However, the lack of transparency due to the black box nature of these models is a common concern. To address this issue, explainable AI (XAI) methods can be employed. In this study, we present a comprehensive analysis of post-hoc XAI methods, investigating the local (attributions per sample) and global (based on domain expert concepts) perspectives. We have established a set of sanity checks to identify sensible attribution methods, and we provide quantitative evidence in accordance with expert rules. This dataset-wide analysis goes beyond anecdotal evidence by aggregating data across patient subgroups. Furthermore, we demonstrate how these XAI techniques can be utilized for knowledge discovery, such as identifying subtypes of myocardial infarction. We believe that these proposed methods can serve as building blocks for a complementary assessment of the internal validity during a certification process, as well as for knowledge discovery in the field of ECG analysis.


Introduction
AI-enhanced ECG The electrocardiogram (ECG) is one of the most frequently performed diagnostic procedures [1] and plays a unique role in the first-in-line assessment of a patient's cardiac state. To this day, most ECGs are assessed manually with only limited support through rule-based algorithms implemented in ECG devices, which are subject to well-known limitations [2]. In the last few years, this analysis paradigm has started to change drastically as a consequence of evidence from an increasing number of studies that demonstrate the enormous diagnostic potential of the ECG in combination with modern deep learning methods, see [3] for a recent perspective. Most remarkably, this applies not only to the prediction of diagnostic statements as routinely assessed by cardiologists, from myocardial infarctions [4] over comprehensive prediction of ECG statements [5,6] and rhythm abnormalities [7] to challenging conditions such as hypertrophic cardiomyopathy [8], but also to conditions that are hard or even impossible for human experts to infer from an ECG. For example, a number of recent works demonstrated the ability to infer age and sex [9], ejection fraction [10], atrial fibrillation during sinus rhythm [11], anemia [12] or even non-cardiac conditions such as diabetes [13] or cirrhosis [14]. These very promising findings will eventually transform the role the ECG plays in routine diagnostics, with enormous potential for cost savings and improved patient journeys through more precise decisions for follow-up diagnostics or treatments.
Need for XAI Modern deep learning models owe their superior performance to their large number of parameters, ranging from millions to billions, which allow them to capture complex non-linear dependencies of input features. However, this flexibility comes at the price of making it impossible to interpret such models on a parameter basis, which is why they are often perceived as black boxes. This led to the emergence of explainable AI (XAI) as a subfield of machine learning, which tries to shed light on the decision process implemented by neural networks, see [15,16,17,18] for general reviews. Here it is important to keep in mind that different use cases pose different requirements for the combined ML+XAI system (see Fig. 1 and Fig. 2 for an overview). These range from (1) providing side information to medical experts, over (2) auditing of the ML system before deployment, e.g., to ensure that the models avoid excessive exploitation of spurious correlations or reliance on undesired features/principles, see [19,20] for a recent application, to (3) scientific discovery through using the ML model as a proxy for the relations between input and output in the data. While the first use case is very important, it can eventually only be assessed within user studies with cardiologists. In particular, we do not attempt to assess whether single local explanations are insightful as side information for human experts as part of a clinical decision support system. Instead, we focus primarily on the last two aspects, building on attributions as measures for feature relevance, and provide building blocks that can be employed in these contexts. Both topics relate to the frequent question of the alignment of neural network decisions and cardiologists' decision rules. Here, we present two methods that provide quantitative evidence of whether a neural network systematically exploits specific decision rules and which segments of the signal are most relevant for the classification decision across all samples with a particular pathology across an entire dataset. In both cases, we find good agreement across a diverse set of four pathologies.

Figure 1: Conceptual summary of the XAI study for ECG: We discuss two different ways of investigating consistent model behavior (1) through aggregation of local attribution maps across entire patient groups in the form of so-called glocal attribution maps, which can also be effectively used for knowledge discovery and (2) by using the global XAI method to verify if cardiologists' expert concepts are consistently exploited.
XAI for ECG Various XAI methods have been applied to deep learning models trained on ECG data, see [21] for a recent review, mostly in the form of ad-hoc adaptations of existing methods applied in computer vision. Most studies applied post-hoc XAI techniques to trained models, with the exception of [22] and [23], which relied on the model's inherent attention weights. Gradient-weighted Class Activation Mapping (GradCAM) [24] is considered the most popular approach [25,12,26,27,28,29], followed in popularity by saliency maps [30], see [31,32,33,34]. Other choices include Local Interpretable Model-Agnostic Explanations (LIME) [35]. These techniques provide visually appealing attribution maps that are in most cases used to argue whether the decision criteria of deep neural networks align with human expert knowledge.

Main contributions
The main contributions of our work align with four subtopics, which we describe in detail in the following paragraphs: 1. Sanity checks Prior approaches rarely provided arguments why a particular XAI method was chosen or whether it was appropriate for this purpose, which is, however, a crucial prerequisite for the application of XAI methods in the first place. We argue that the ECG is particularly suited for a structured evaluation due to its periodic structure and its well-defined signal features. This makes it possible to set up precise sanity checks for XAI methods, allowing us to assess whether they attribute consistently, both temporally and spatially. As one exemplary finding, we demonstrate that most existing approaches, including GradCAM but with the exception of saliency maps, fail to attribute in a temporally focused manner, see Fig. 3, which clearly puts into question existing approaches in the field.
2. "Glocal" XAI As second main contribution, we strongly argue for the automated analysis of attributions in a beat-aggregated form, in line with earlier proposals in the literature [32,26,36]. In an ECG, a "beat" refers to one complete cardiac cycle, including the P-wave, QRS complex, and T-wave, representing the heart's electrical and mechanical activity. We exploit the periodic structure of the signal to provide attribution maps on the level of median beats or segments, see Fig. 2 for a graphical overview. These can be compared in a meaningful way across samples and patients. In this way, attribution maps can be aggregated across patients (with similar pathologies) to derive dataset-wide patterns from sample-specific attribution maps. We showcase this approach for three major diagnostic classes and provide the first quantitative evidence for alignment of model behavior with cardiologists' decision rules on the level of ECG segments and leads, see Fig. 4.

3. Global XAI As third contribution, we aim to stress opportunities for XAI methods beyond attribution maps. To this end, we apply Testing Concept Activation Vectors (TCAV) [37], which allows us to assess whether a model exploits a given concept, defined via examples. Again, we argue that the ECG domain represents an ideal application domain for such techniques due to the availability of high-quality ECG features and concepts as defined via cardiologists' decision rules. This represents the first comprehensive assessment of diagnostic concepts in the context of global XAI for ECG. Indeed, for all considered pathologies, we find that expert rules are consistently exploited by the ML models, see Fig. 5.

Data & Models
Data We base our experiments on the PTB-XL dataset [38,39,40], which covers 21799 ECGs from 18869 patients annotated with ECG statements from a broad set of 71 statements covering diagnostic statements as well as form- and rhythm-related statements. We focus on the 23 statements at the subdiagnostic level to allow a sufficiently fine-grained analysis without adding the complexity of the full set of 44 diagnostic labels, some of which are only sparsely populated. We follow the benchmarking protocol for PTB-XL proposed in [5], which uses eight of the ten stratified folds for training and the remaining two for validation (model selection) and testing. The ECG features that are used for the concept-based explainability methods (described in section 2.6 and shown in section 3.3) are extracted by the University of Glasgow ECG analysis program [41]. These features were made available as part of the PTB-XL+ dataset [42].
Models In terms of model architectures, we work with convolutional neural networks as they are to date the predominantly used model architecture in the field, see e.g., [7,43,10], even though recent works suggest that recurrent architectures [44] or structured state space models [45] might further improve over the convolutional state-of-the-art, most likely due to inductive biases of these models that are better suited to capture the sequential nature of ECG data. In this work, we consider both a shallow model inspired by the LeNet architecture [46] and a ResNet-based [47] architecture, which was shown to lead to competitive results [5] for ECG classification tasks on PTB-XL. The two model architectures are described in more detail below:
LeNet: shallow model inspired by the LeNet architecture [46], which is composed of 3 one-dimensional convolutional layers (with kernel size 5, stride 2 and output channels 32, 64 and 128, respectively), interleaved with BatchNorm, ReLU and pooling layers (the first two pooling layers are MaxPool and the last one AvgPool). This is followed by two fully-connected layers, again interleaved with ReLU as activation.
XResNet: ResNet-based [47] architecture, which was shown to lead to competitive results [5] for ECG classification tasks on PTB-XL. More specifically, we use a one-dimensional adaptation of the XResNet architecture [48], which is described in detail in [5]. In our experiments, we use an xresnet1d50 architecture.
Prediction task The task is framed as a multi-label classification task and consequently, we use class-wise sigmoid activations and binary cross-entropy as optimization objective. We work at a sampling frequency of 100 Hz and use randomly cropped input sizes of T = 250 tokens, corresponding to 2.5 seconds, which were shown to lead to the best results for the given task [45].
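The prediction setup can be illustrated with a minimal NumPy sketch. It is not taken from the released code: the helper names `random_crop` and `multilabel_bce` are hypothetical, and the loss is written in the usual numerically stable form of sigmoid followed by binary cross-entropy.

```python
import numpy as np

def random_crop(signal, length=250, rng=None):
    """Crop a (T, 12) signal to a random window of `length` samples (2.5 s at 100 Hz)."""
    rng = np.random.default_rng() if rng is None else rng
    start = rng.integers(0, signal.shape[0] - length + 1)
    return signal[start:start + length]

def multilabel_bce(logits, targets):
    """Mean binary cross-entropy over independent class-wise sigmoid outputs."""
    # Stable form of -[y*log(sigmoid(z)) + (1-y)*log(1-sigmoid(z))]
    loss = np.maximum(logits, 0) - logits * targets + np.log1p(np.exp(-np.abs(logits)))
    return float(loss.mean())

ecg = np.zeros((1000, 12))                      # placeholder: 10 s recording, 12 leads
crop = random_crop(ecg, rng=np.random.default_rng(0))
loss = multilabel_bce(np.zeros(23), np.ones(23))  # 23 sub-diagnostic labels
```

Because each label has its own sigmoid, several statements can be active for the same recording, which is what distinguishes this setup from a softmax-based multi-class objective.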

Post-hoc local XAI with attribution maps
Local XAI The XAI community has put forward a range of post-hoc interpretability methods, which can be used to assess the attribution of input features. These methods commonly provide an attribution (map), which shares the shape of the input and indicates the relevance of particular parts of the input sequence for a classification decision at hand. This allows us to temporally resolve the most relevant parts of the sequence but also to identify the most discriminative leads for a particular condition [12,27,28,29,31,33,34,35]. We consider and compare four different XAI methods that are either predominantly used in the ECG community or are popular choices in the XAI community. More specifically, we consider GradCAM [24], saliency maps [30], integrated gradients (IG) [49] and layer-wise relevance propagation (LRP) [50]. Saliency maps explain model predictions by using the norm of their respective input gradients. GradCAM attribution maps consist of a (feature-dependent) weighting of the activation gradient of the respective model prediction. Integrated Gradients (IG) creates attribution maps by integrating the input gradient (of a chosen output neuron) from a predefined baseline input to the sample under consideration. Layer-wise Relevance Propagation (LRP) propagates the model prediction from the output neuron back to the input. In every layer, each neuron is assigned an attribution value using the LRP rules, adhering to the conservation rule that ensures that the sum of attributions is (approximately) maintained across layers.
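To make the two gradient-based definitions concrete, the following sketch applies them to a toy differentiable function with a known gradient. It is an illustration only: the toy model, the midpoint-rule approximation of the path integral, the zero baseline, and the function names are assumptions and do not reproduce the captum or zennit implementations used in this work.

```python
import numpy as np

def saliency(grad_fn, x):
    """Element-wise magnitude of the input gradient."""
    return np.abs(grad_fn(x))

def integrated_gradients(grad_fn, x, baseline=None, steps=50):
    """Approximate IG by averaging gradients along the straight path baseline -> x."""
    baseline = np.zeros_like(x) if baseline is None else baseline
    alphas = (np.arange(steps) + 0.5) / steps          # midpoint rule on [0, 1]
    grads = np.stack([grad_fn(baseline + a * (x - baseline)) for a in alphas])
    return (x - baseline) * grads.mean(axis=0)

# Toy "model": f(x) = sum(w * x^2) with analytic gradient 2 * w * x (hypothetical)
w = np.array([1.0, -2.0, 0.5])
f = lambda x: float(np.sum(w * x ** 2))
grad_f = lambda x: 2.0 * w * x

x = np.array([1.0, 2.0, 3.0])
ig = integrated_gradients(grad_f, x)
sal = saliency(grad_f, x)
```

For this toy function the IG attributions sum to f(x) minus f(baseline), which is the completeness property of the method; saliency carries no such guarantee and only reports local sensitivity.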

Sanity checks for attribution methods
In the field of Explainable Artificial Intelligence (XAI), the incorporation of sanity checks is paramount to maintaining the trustworthiness and authenticity of explanations of AI models. These sanity checks can also be described as XAI metrics [51] designed to ensure the quality of explanations. In [52], for example, the authors propose six metrics to evaluate popular local additive explanation methods. Such assessments play a central role in validating that the explanations generated by XAI accurately represent the model's decision-making process. This validation is critical because inaccuracies in the explanations could potentially mislead users and lead to misunderstandings about the operating principles of the model. In addition, sanity checks are essential to identify any biases within an explanation model, particularly those that may favor certain attributes or characteristics.
Proposed experiment The sanity check in this study relies on ECG parameter regression. The intuition behind this experiment is the expectation that a model that was trained to regress a particular amplitude in a particular lead from the raw signal should focus spatially on this particular lead and temporally on the corresponding segment. To implement this experiment, we fit a regression model $R: \mathbb{R}^{T \times 12} \to \mathbb{R}^{12}$ to a feature present in each lead (P-wave, R-wave and T-wave amplitude), i.e., a model with 12 output neurons, one for each lead feature. For this, we used the same model architectures introduced above but with linear activation and the mean squared error as optimization objective. As regression targets we use ECG parameters extracted by the University of Glasgow ECG analysis program [41], a commercial ECG analysis software, as available from PTB-XL+ [42].
In order to compare across different attribution methods, we compute attribution maps for all samples and all output leads and analyze them both spatially as well as temporally.
That is, for a given sample $x \in \mathbb{R}^{T \times 12}$ and a given output lead $l \in \{1, \ldots, 12\}$ we compute the respective attribution map $a^l \in \mathbb{R}^{T \times 12}$. Based on this, we consider two complementary aspects referred to as spatial and temporal specificity.
Spatial specificity We define spatial specificity $s^l \in [0, 1]$ for a given attribution map $a^l$ as the ratio $s^l = \tilde{a}^l_l / \sum_{k=1}^{12} \tilde{a}^l_k$, where $\tilde{a}^l = (\lVert a^l_1 \rVert, \ldots, \lVert a^l_{12} \rVert) \in \mathbb{R}^{12}$ is the Euclidean norm of $a^l$ along the temporal axis, i.e., as the ratio between temporally aggregated attribution in the target lead and all leads. This ratio is close to one if the attribution is specific to the lead in question. We expect a sensible attribution method to be spatially specific. Suppose the model is trained to predict the amplitude of the R-peak in a specific lead. A spatially specific attribution would mean that the model attributes the majority of the prediction to that particular lead. We compute the median and quantiles over all samples and obtain in this way a spatial specificity statistic for each lead, which can be visualized in terms of boxplots as shown in Fig. 3.
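The spatial specificity of a single attribution map can be sketched in a few lines of NumPy. The function name and the zero-based lead index are illustrative choices, and returning 0 for an all-zero map is an assumption not discussed in the text.

```python
import numpy as np

def spatial_specificity(attribution, target_lead):
    """attribution: (T, 12) map a^l; target_lead: zero-based index of lead l."""
    lead_norms = np.linalg.norm(attribution, axis=0)   # Euclidean norm over time, shape (12,)
    total = lead_norms.sum()
    # Assumption: an all-zero attribution map gets specificity 0
    return float(lead_norms[target_lead] / total) if total > 0 else 0.0

focused = np.zeros((250, 12))
focused[100:110, 3] = 1.0                 # all attribution mass in one lead
uniform = np.ones((250, 12))              # attribution spread evenly over leads
```

A map that attributes only to the target lead yields a value of one, while a map that spreads attribution evenly over all twelve leads yields 1/12, which gives a natural reference level when reading the boxplots.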
Temporal specificity Complementarily, we also compute the temporal specificity $t \in \mathbb{R}^{80}$ as the spatial mean over all lead-specific temporal means, $t = \frac{1}{12} \sum_{l=1}^{12} t^l$, where the lead-specific temporal mean $t^l$ is computed as follows:
1. For each sample- and lead-specific attribution $a^l$, we crop all $B$ beats from 300 ms before to 500 ms after each R-peak, yielding $\hat{a}^l \in \mathbb{R}^{B \times 80 \times 12}$.
2. We remove the spatial information by computing the Euclidean norm along the spatial axis (i.e., the leads) for each beat, yielding $\lVert \hat{a}^l \rVert_L \in \mathbb{R}^{B \times 80}$.
3. Finally, we arrive at $t^l$ by computing the mean attribution across beats, $t^l = \frac{1}{B} \sum_{i=1}^{B} \lVert \hat{a}^l_i \rVert_L$.
This temporal specificity $t$ is computed for all samples entering the analysis and is visualized as line plots with the median as solid line and the interquartile range (25th to 75th percentile) as transparent area, as in Fig. 3. We expect a sensible attribution method to be temporally specific with respect to the regressed segment in question, even though context from other segments might be relevant. For instance, if the model is predicting T-wave amplitude, temporal specificity would mean that the attribution is concentrated around the relevant time segment of the T-wave.
In summary, spatial specificity ensures that the attribution is focused on the correct lead, while temporal specificity ensures that the attribution aligns with the relevant temporal segment associated with the regressed parameter in the ECG signal.
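The three steps of the temporal specificity computation can be sketched as follows. The function names are hypothetical, R-peak positions are assumed to be given as sample indices at 100 Hz, and dropping beats that do not fit completely into the recording is an assumption made here for simplicity.

```python
import numpy as np

PRE, POST = 30, 50   # 300 ms before and 500 ms after the R-peak at 100 Hz

def lead_temporal_mean(attribution, r_peaks):
    """attribution: (T, 12) map a^l; returns t^l with shape (80,)."""
    T = attribution.shape[0]
    # Step 1: crop beats (assumption: incomplete beats at the borders are skipped)
    beats = [attribution[r - PRE:r + POST] for r in r_peaks
             if r - PRE >= 0 and r + POST <= T]
    if not beats:
        raise ValueError("no complete beat in this sample")
    beats = np.stack(beats)                    # (B, 80, 12)
    norms = np.linalg.norm(beats, axis=2)      # Step 2: norm over leads, (B, 80)
    return norms.mean(axis=0)                  # Step 3: mean over beats, (80,)

def temporal_specificity(attributions, r_peaks):
    """attributions: (12, T, 12), one attribution map per output lead l."""
    return np.mean([lead_temporal_mean(a, r_peaks) for a in attributions], axis=0)

attr = np.zeros((12, 250, 12))
attr[:, [60, 150], 0] = 1.0                    # attribution exactly at two R-peaks
t = temporal_specificity(attr, r_peaks=[60, 150])
```

In this toy case the resulting curve is zero everywhere except at index 30, the position of the R-peak inside the 80-sample window, which is the sharply peaked profile one would hope to see for an R-peak amplitude regression.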

From local to glocal XAI
ECG delineation We aim to align attribution maps based on beats, i.e., R-peaks, or alternatively based on ECG segments. Both require an ECG delineation as a first step. For this, we trained a model capable of segmenting 12-lead ECG samples into 24 different segments. For our purpose, we trained a 2d U-Net [53] with convolutional kernels that span the entire feature axis. The crucial advantage of this approach compared to conventional ECG delineation models is the ability to exploit the consistency of segmentations across several leads as opposed to segmenting all leads individually. As labels we used fiducial points extracted from ECGDeli [54] for PTB-XL (as segments of length one) as well as the segments between them, see Fig. 12, leading to a total of 24 different segments. In general, given a 12-lead ECG sample $x \in \mathbb{R}^{T \times 12}$ of length $T$, our segmentation model $S: \mathbb{R}^{T \times 12} \to \mathbb{R}^{T \times 12 \times M}$ computes for each timestamp a soft assignment score for all $M$ segment classes (which can be interpreted as a probability distribution over all $M$ segment classes). The model is trained with random patches minimizing the categorical cross-entropy with the Adam optimizer. The model exhibited strong performance and displayed notable alignment with ground-truth annotations. Furthermore, during manual inspection, it demonstrated more reliable segmentations, especially in cases of disagreement with ECGDeli (as shown in Fig. 10c via direct comparison). A detailed quantitative and qualitative performance analysis of this model is given in Appendix A.1. The ECG features for the concept-based analysis were taken from the PTB-XL+ dataset [42]. We make the soft segmentation maps [55] for the entire PTB-XL dataset underlying our analysis publicly available.
Beat-aligned attributions We perform a robust R-peak detection based on the soft predictions for the R-peak class in the segmentation (minimum distance of 30 timestamps between two beats and minimum output probability of 0.25) based on the spatial maximum of V1-V4. The R-peaks identified in this way are part of the data repository released with this paper. We then extract beats from the signal by cropping from 300 ms before until 500 ms after each identified R-peak (similar to previous approaches [32,26,36]). We proceed similarly for the attribution maps to extract corresponding beat-centered attribution maps. We can then aggregate the signal to derive representative median beats across several beats within a given sample but also across entire subsets of patients who share a common pathology.
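A minimal sketch of this beat alignment is given below. The thresholds (30 timestamps, probability 0.25, spatial maximum over V1-V4, 300 ms/500 ms window) come from the text; the greedy strongest-first peak selection, the function names, and the skipping of incomplete beats are assumptions and not necessarily the procedure used to produce the released R-peaks.

```python
import numpy as np

def detect_r_peaks(r_prob, min_distance=30, min_prob=0.25):
    """r_prob: (T, 4) soft R-peak scores for leads V1-V4; returns sorted peak indices."""
    score = r_prob.max(axis=1)                              # spatial maximum over V1-V4
    candidates = np.flatnonzero(score >= min_prob)
    order = candidates[np.argsort(-score[candidates], kind="stable")]
    peaks = []
    for idx in order:                                       # strongest candidates first
        if all(abs(idx - p) >= min_distance for p in peaks):
            peaks.append(int(idx))
    return sorted(peaks)

def median_beat(signal, peaks, pre=30, post=50):
    """Median over beats cropped 300 ms before to 500 ms after each R-peak (100 Hz)."""
    beats = [signal[p - pre:p + post] for p in peaks
             if p - pre >= 0 and p + post <= signal.shape[0]]
    return np.median(np.stack(beats), axis=0)               # (80, n_leads)

r_prob = np.zeros((250, 4))
r_prob[50, 1], r_prob[52, 0] = 0.9, 0.5     # two nearby candidates, one beat
r_prob[150, 2], r_prob[200, 3] = 0.6, 0.1   # one valid peak, one below threshold
peaks = detect_r_peaks(r_prob)
beat = median_beat(np.ones((250, 12)), peaks)
```

The same `median_beat` call can be applied to an attribution map instead of the signal, which yields the beat-centered attribution maps that are then aggregated across samples.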
For Fig. 4 and Fig. 13, we perform experiments on entire subgroups of pathological samples by filtering the top 100 model predictions per class as these most clearly express the patterns exploited by the model.
Segment-aligned attributions For the segment-specific parts of Fig. 4 and Fig. 13, the idea is to aggregate attributions using the soft segmentation outputs as weighting factors. We opt for soft segmentations (output probabilities), because they provide robustness in case of small segments (i.e., points at peaks in the signal), as opposed to hard segmentations, where the argmax sometimes fails to recover these small segments. To this end, we computed the aggregated maps $m \in \mathbb{R}^{12 \times M}$ of the raw signal $x \in \mathbb{R}^{T \times 12}$ and the segmentations $s \in \mathbb{R}^{T \times 12 \times M}$ (similarly for the attributions) as $m_i = x_{:,i}^{\top} s_{:,i,:}$ for each $i \in \{1, \ldots, 12\}$, where the $m_i \in \mathbb{R}^{M}$ are concatenated to $m = [m_1, \ldots, m_{12}]$ and normalized by the temporal sum $\bar{s} = \sum_{t=1}^{T} s_{t,:,:}$. In other words, to compute the entry of the $i$-th channel and the $m$-th segment of the resulting aggregated map, we calculate the dot product of the attributions of the $i$-th channel with the segmentation outputs (normalized by their sum, so that they sum up to 1) corresponding to the $m$-th segment. The two proposed aggregation approaches at the beat and segment level, which facilitate ECG beat interpretation based on ECG analysis, are graphically summarized in Fig. 2.
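The segment-level aggregation amounts to a weighted average per lead and segment and can be sketched with a single `einsum`. The function name and the small `eps` that guards against segments with zero total probability are assumptions added for the illustration.

```python
import numpy as np

def segment_aggregate(x, seg, eps=1e-8):
    """x: (T, 12) signal or attribution map; seg: (T, 12, M) soft segmentation."""
    weighted = np.einsum("ti,tim->im", x, seg)   # dot product over time per lead and segment
    mass = seg.sum(axis=0)                       # temporal sum of segment probabilities, (12, M)
    return weighted / (mass + eps)               # aggregated map, (12, M)

# Toy example with T = 4 time steps and M = 2 segments
seg = np.zeros((4, 12, 2))
seg[:2, :, 0] = 1.0                              # first half belongs to segment 0
seg[2:, :, 1] = 1.0                              # second half belongs to segment 1
x = np.tile(np.array([1.0, 3.0, 5.0, 7.0])[:, None], (1, 12))
m = segment_aggregate(x, seg)
```

With hard (one-hot) segmentations as in this toy case the result is simply the per-segment mean, here 2 and 6; with soft segmentations each time step contributes in proportion to its segment probability, which is what keeps very short segments such as peaks from vanishing.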

Knowledge discovery
Knowledge discovery using XAI enables a deeper understanding of the patterns underlying model decisions [56,29,8,28] but also insights into the data itself. In this study, we propose an explorative unsupervised approach by clustering aligned attributions to uncover patterns in the data. To evaluate the effectiveness of attributions, we compare them with baseline representations, providing evidence of their potential insights.
Representations We complement aligned median beats and attributions by hidden feature representations. Here, we use features from layers after the global pooling operation (present in all our considered models) to ensure translation invariance.
Clustering procedure and metrics In order to avoid issues with dimensionality for clustering, all representations were dimensionally reduced by projecting onto the first k principal components that capture 75% of the total variance. Since each representation differs in multiple unknown properties, there is no single clustering model that is best suited for all representations. Instead, we carefully selected models and hyperparameters for each representation, maximizing the clustering scores via grid search over models (k-means, Gaussian Mixture Models and Hierarchical Clustering) and projections (with or without whitening). As performance metrics we report (1) standard accuracy (ACC) after assigning the clusters to classes and (2) the adjusted Rand score (ARS), which is suited for comparing similarities among clusterings.
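The dimensionality reduction step can be sketched with a plain SVD. This is an illustration under assumptions: the function name is hypothetical, "capture 75%" is read as the smallest k whose cumulative explained variance reaches at least 75%, and the whitening variant as well as the clustering models themselves are omitted.

```python
import numpy as np

def pca_reduce(features, variance=0.75):
    """Project (n_samples, n_features) data onto the first k principal components."""
    centered = features - features.mean(axis=0)
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    ratio = np.cumsum(s ** 2) / np.sum(s ** 2)       # cumulative explained variance
    k = int(np.searchsorted(ratio, variance) + 1)    # smallest k reaching the threshold
    return centered @ vt[:k].T, k

# Toy data whose three axes explain 2/3, 1/6 and 1/6 of the variance
toy = np.array([[2.0, 0, 0], [-2.0, 0, 0],
                [0, 1.0, 0], [0, -1.0, 0],
                [0, 0, 1.0], [0, 0, -1.0]])
projected, k = pca_reduce(toy)
```

In the toy example the first component alone explains about 67% of the variance, so two components are kept; the projected data would then be passed to the clustering models compared in the grid search.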
In this sense, our result is a confirmative study indicating the best possible score (within the range of considered clustering algorithms and hyperparameters) that can be achieved given the ground truth label assignments. The best clustering results were achieved for (1) attributions and (3) hidden features using Gaussian Mixture Models (GMMs), and for (2) inputs using k-means.
Cluster insights We demonstrate a structured procedure to gain further insights into the identified clusters. To this end, we plot the (aligned) mean attributions of both clusters and identify temporally as well as spatially localized regions where both deviate most strongly. In a second step, we visualize these regions but this time superimposed on the cluster means in terms of the original signals, see Fig. 7 for clustering anterior and inferior myocardial infarction (AMI and IMI) as known sub-classes within general myocardial infarction (MI) or Fig. 8c/Fig. 8d for clustering novel and unknown subconditions within anteroseptal myocardial infarction (ASMI).

Global XAI with concepts
TCAV We base our approach on TCAV [37], which allows testing a trained model for alignment with human-comprehensible concepts. Concepts are defined through examples and can therefore also include abstract concepts. First, a concept activation vector (CAV) encoding the meaning of the concept is trained using the collected examples. Second, the gradient of pathology predictions in the direction of the concept vector is calculated to assess whether the trained model uses the concept for the prediction of those pathologies. The definition of a concept is in this case always tied to a chosen hidden feature layer in the model architecture, where one expects that low-level concepts that directly relate to signal features are most clearly expressed in feature layers close to the input layers, whereas more abstract concepts at a higher semantic level are expected to be found in higher feature layers. Therefore, we always include a range of different feature layers in our analysis.
Experimental setup First, we create a binary dataset per concept, as in [57], each consisting of positive examples of the concept and randomly drawn data points that do not contain the concept. Second, we use these datasets to train a linear classifier in the feature space of a hidden layer for each concept. We interpret the accuracy of the linear classifier as a measure of how well the concept can be defined in said feature space. The classifier then yields the concept activation vector (CAV), which we use to compute the TCAV scores via directional derivatives (see the technical description in App. A.5). Although we computed the TCAV scores for each layer of each model, for visualization and comparison of both models, we picked six layers from each model (ranging from input layer to output layer).
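The two steps can be sketched on precomputed hidden activations and gradients. Several parts are assumptions made to keep the example dependency-free: a least-squares fit stands in for the linear classifier, the reported accuracy is the training accuracy, the function names are hypothetical, and the gradients of the class output with respect to the hidden layer are assumed to be available as an array.

```python
import numpy as np

def fit_cav(pos_acts, neg_acts):
    """Fit a linear separator on hidden activations; return unit-norm CAV and accuracy."""
    X = np.vstack([pos_acts, neg_acts])
    y = np.concatenate([np.ones(len(pos_acts)), -np.ones(len(neg_acts))])
    Xb = np.hstack([X, np.ones((len(X), 1))])          # append a bias column
    w, *_ = np.linalg.lstsq(Xb, y, rcond=None)         # stand-in for the linear classifier
    accuracy = float(np.mean(np.sign(Xb @ w) == y))    # training accuracy (assumption)
    cav = w[:-1] / np.linalg.norm(w[:-1])              # normal vector pointing to the concept
    return cav, accuracy

def tcav_score(class_gradients, cav):
    """Fraction of samples whose directional derivative along the CAV is positive."""
    return float(np.mean(class_gradients @ cav > 0))

pos = np.array([[2.0, 0.0], [3.0, 0.5], [2.5, -0.5]])      # activations with the concept
neg = np.array([[-2.0, 0.0], [-3.0, 0.5], [-2.5, -0.5]])   # random counterexamples
cav, acc = fit_cav(pos, neg)
grads = np.array([[1.0, 0.0], [0.5, 1.0], [-1.0, 0.0], [2.0, -3.0]])
score = tcav_score(grads, cav)
```

A score near 1 indicates that moving the hidden representation toward the concept increases the class output for most samples, a score near 0 indicates the opposite, and a score around 0.5 indicates no consistent use of the concept.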
Statistical significance To avoid accidental findings, the process is carried out for 10 CAVs for each of 10 trained models with different initializations. It is worth mentioning that the negative samples are resampled at each CAV fitting. Based on the statistics of the resulting 100 TCAV scores per model, layer, concept and output class, we impose the following conditions to single out consistent model behavior:
1. The first condition relies on the linear separability of positive and negative samples in the feature space of a hidden layer, which we assess via the accuracy of the linear classifier on which the CAV is based. We consider a layer-concept pair only if the accuracy of the linear classifier is above 70%. Otherwise, we omit any interpretation and highlight this condition as gray rows in Fig. 5.
2. The second condition relates to the fluctuations of the TCAV scores. The corresponding confidence interval, i.e., the interquartile range (IQR) between the 25th and 75th percentiles, should not be too wide and should be bounded away from 0.5 to rule out coincidental findings. We therefore require the IQR to be less than 0.25 and that 0.5 is not contained in the confidence interval. We mark those combinations of model, layer, concept and class with a star symbol in Fig. 5 if, and only if, all conditions are met.
These conditions allow us to interpret consistent model behavior in terms of whether a concept is consistently used in a positive or negative way. While the first condition tells us whether a concept can be defined in a meaningful way in the respective feature space, the additional conditions tell us that the concept is exploited for a specific model prediction in a consistent way. We base our discussion on which concepts are consistently used in particular models on those concepts that fulfill all three prerequisites.
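The filtering criteria can be written as a small predicate over the 100 TCAV scores of one combination of model, layer, concept and class. The function name is hypothetical, and summarizing the classifier accuracy by a single number (for example its mean over the fitted CAVs) is an assumption.

```python
import numpy as np

def is_consistent(tcav_scores, cav_accuracy, min_acc=0.70, max_iqr=0.25):
    """Check separability, a narrow IQR, and an IQR that excludes 0.5."""
    q25, q75 = np.percentile(tcav_scores, [25, 75])
    separable = cav_accuracy > min_acc              # concept is linearly decodable
    narrow = (q75 - q25) < max_iqr                  # scores do not fluctuate too much
    away_from_chance = not (q25 <= 0.5 <= q75)      # 0.5 lies outside the interval
    return bool(separable and narrow and away_from_chance)

stable_high = np.full(100, 0.9)                     # consistently positive use
around_chance = np.linspace(0.4, 0.6, 100)          # interval contains 0.5
too_wide = np.linspace(0.55, 1.0, 100)              # IQR of about 0.23 passes, see tests
```

Only combinations for which this predicate holds would receive a star in a figure such as Fig. 5; the sign of the deviation from 0.5 then indicates whether the concept is used in a positive or negative way.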

Code and data availability
The PTB-XL dataset underlying this work is publicly available [38,39]. We provide soft segmentation maps [55] for the entire PTB-XL dataset as computed by our custom segmentation algorithm. We provide the source code to reproduce the main results of our investigations as part of the supplementary material under https://github.com/hhi-aml/xai4ecg. The code builds on the XAI software packages captum [58] and zennit [59].

Results
For most of our analysis, we consider a set of five conditions taken from the sub-diagnostic level (consisting of 23 labels) of the PTB-XL dataset, which covers the NORM class (representing healthy subjects) and a broad range of pathologies. These are prior/chronic, as opposed to acute [60], anterior and inferior myocardial infarction (AMI/IMI), which represent two localizations of the myocardial infarction, left ventricular hypertrophy (LVH), which describes the pathological increase in left ventricular mass, and complete left bundle branch block (CLBBB) as a form of conduction disturbance, which causes a delayed activation of the left ventricle. The corresponding model performances can be found in the supplementary material.

XAI methods
The XAI community distinguishes between inherently interpretable models [61] and post-hoc attributions.

Figure 3: Results of the three experiments described in Section 2.3: P-wave amplitude (Fig. 3a, Fig. 3b), R-peak amplitude (Fig. 3c, Fig. 3d) and T-wave amplitude (Fig. 3f) for LeNet and XResNet, respectively. Each subplot is organized in the same way: On the left, we show spatial specificities with different attribution methods color-coded. In the lower right plot, we show temporal specificities with different attribution methods color-coded. These properties are computed for all samples and their attributions are concatenated to allow for analysis across the whole dataset. For spatial specificity, we consider boxplots, where the leads are on the x-axis and the specificity is on the y-axis. For temporal specificity, we consider continuous line plots, where time is on the x-axis and the temporal specificity is on the y-axis. In the upper right plot, we provide a median beat as a reference for better localization of time steps in the temporal specificity plot. We compute the median beat attributions across the whole dataset and scale them to have a norm of one. If only the segment in the lead under consideration was relevant, the spatial specificity would be one, and the temporal specificity strongly peaked around the corresponding segment. In terms of spatial specificity, saliency shows the highest specificity among the limb leads with a comparably small variance. For temporal specificity, all methods attribute more relevance to the QRS complex (comprising the Q-, R-, and S-peaks) rather than to the interval in question, which questions their validity.
There is an ongoing debate in the community about whether post-hoc attributions represent explanations from the perspective of human end-users. We refrain from addressing this question and focus on the latter category, first, due to its widespread use and broad applicability to commonly used models in the field and, second, with an auditing scenario in mind, where one cannot enforce the exclusive use of inherently interpretable model architectures. In particular, we consider four different post-hoc attribution methods that are widely used either in the ECG domain or in the broader XAI community, in particular in the context of image data: Layer-wise Relevance Propagation (LRP) [50], Integrated Gradients (IG) [49], Gradient-weighted Class Activation Mapping (GradCAM) [24], and saliency maps [30]. Interestingly, applying them to an identical trained model leads to qualitatively different attribution maps, which is a well-known issue in the XAI community [62]. A central requirement for attribution methods to be used for auditing or knowledge discovery is that the attribution method is faithful, i.e., it should accurately reflect the model behavior. However, it is non-trivial to assess the accuracy of the model's attributions since the features it considers important are not visible to us. This opacity is commonly referred to as the 'black-box nature' of deep neural networks. Consequently, we are compelled to rely on indirect methods (such as sanity checks) linked to specific input features to evaluate these attribution methods.

Sanity checks
We use ECG parameter regression tasks as sanity checks, i.e., we train regression models for both model architectures to predict P-, R- or T-wave amplitudes from the raw signal. This does not preclude that the model could exploit information from correlated leads, but the expectation behind this task is that the attributions corresponding to this model should show high temporal and spatial specificity. For example, the prediction of the P-wave amplitude in lead V1 should be strongly influenced by the P-wave in V1, which should be reflected in the corresponding attribution map. We define two measures, the spatial specificity and the temporal specificity, to quantify how effectively the method prioritizes the correct lead or temporal segment in the ECG. The results of the sanity check are compiled in Fig. 3.
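The two measures can be illustrated with a minimal sketch. It assumes that spatial specificity is the share of the total absolute attribution that falls on the target lead and that temporal specificity is the share of absolute attribution per time step within that lead; the exact definitions are those of Section 2.3, and the function names are hypothetical.

```python
def spatial_specificity(attr, lead):
    """Share of total absolute attribution that falls on `lead`.

    `attr` maps lead names to lists of per-time-step attributions.
    (Assumed definition, for illustration only.)
    """
    per_lead = {name: sum(abs(a) for a in values) for name, values in attr.items()}
    total = sum(per_lead.values())
    return per_lead[lead] / total if total > 0 else 0.0


def temporal_specificity(attr, lead):
    """Absolute attribution over time in `lead`, normalized to sum to one."""
    magnitudes = [abs(a) for a in attr[lead]]
    total = sum(magnitudes)
    return [m / total for m in magnitudes] if total > 0 else magnitudes


# Toy attribution map with two leads and four time steps.
attr = {"V1": [0.0, 3.0, 1.0, 0.0], "II": [0.5, 0.0, -0.5, 0.0]}
print(spatial_specificity(attr, "V1"))   # 0.8
print(temporal_specificity(attr, "V1"))  # [0.0, 0.75, 0.25, 0.0]
```

Under this reading, an attribution map that is non-zero only in the target lead yields a spatial specificity of one.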
Spatial specificity Comparing spatial specificities across all experiments, we observe a largely consistent ranking among methods and leads: while LRP and IG are most specific in the precordial leads (V1-V6) followed by saliency, in the limb leads (I, II, III, aVR, aVL, aVF) saliency is the most specific method. Furthermore, we observed significantly less variance in the attributions for saliency, particularly in comparison to LRP and IG. These observations provide a first argument in favor of saliency, as we prefer methods that provide reliable results across all samples rather than only on average. Generally, attributions exhibit a notably higher degree of specificity in the precordial leads than in the limb leads (especially in V1, V2 and V3). This is largely due to the distinct and unique data presented by the precordial leads. For the limb leads, on the other hand, only two of the six leads are independent, with the remaining four being calculated as linear combinations of these two. This leads to strong inter-lead correlations among the limb leads. In such cases, changes in one lead affect the others, which makes it difficult to isolate and distribute attributions independently. The analysis of the cross-correlations among leads in Fig. 9 in the supplementary material indicates a notably lower level of total cross-correlation (sum of off-diagonals) in leads V1 and V2 compared to others, which is also reflected in the overall largest absolute spatial specificities in Fig. 3. Without claiming universality, we observe higher spatial specificity for leads with lower total cross-correlations with all other leads.
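The linear dependence among the limb leads follows from Einthoven's law and the Goldberger definitions of the augmented leads. The sketch below takes leads I and II as the independent pair (a common convention and an assumption here) and derives the remaining four leads from them; the function name is hypothetical.

```python
def derive_limb_leads(lead_i, lead_ii):
    """Derive III, aVR, aVL and aVF from leads I and II (sample-wise)."""
    derived = {"III": [], "aVR": [], "aVL": [], "aVF": []}
    for i, ii in zip(lead_i, lead_ii):
        derived["III"].append(ii - i)         # Einthoven: III = II - I
        derived["aVR"].append(-(i + ii) / 2)  # aVR = -(I + II) / 2
        derived["aVL"].append(i - ii / 2)     # aVL = I - II / 2
        derived["aVF"].append(ii - i / 2)     # aVF = II - I / 2
    return derived


leads = derive_limb_leads([0.5, 1.0], [1.0, 1.5])
print(leads["III"])  # [0.5, 0.5]
```

Because every derived lead is a fixed combination of I and II, relevance attributed to one limb lead could equally be expressed through the others, which is one way to read the lower spatial specificity observed there.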
Temporal specificity In terms of temporal specificity, IG and LRP attribute the largest fraction of relevance to the R-peak even if the R-peak is not directly relevant for the given task (e.g., for predicting the P-wave or T-wave amplitude). Interestingly, a similar effect is observed for GradCAM. It is not completely implausible that the model could use the R-peak for localization within each beat and therefore could also attribute relevance to the R-peak. If we impose as a necessary condition for sensible attributions that, when predicting a particular amplitude, the region around the corresponding segment should show the overall largest attribution across the entire temporal attribution plot, saliency is the only XAI method that satisfies this condition in all six experiments.
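The necessary condition stated above can be expressed as a simple check on an aggregated temporal attribution curve. The following is a minimal sketch; the segment boundaries and toy values are hypothetical and only serve to illustrate the test.

```python
def peak_in_segment(temporal_attr, segment):
    """True if the largest attribution lies inside `segment`.

    `temporal_attr` is the aggregated attribution per time step and
    `segment` is a (start, end) index pair with `end` exclusive.
    """
    peak = max(range(len(temporal_attr)), key=lambda t: temporal_attr[t])
    start, end = segment
    return start <= peak < end


# Toy median-beat attributions: indices 2-4 stand in for the P-wave,
# indices 6-8 for the QRS complex (hypothetical segment boundaries).
p_wave = (2, 5)
saliency_like = [0.0, 0.1, 0.6, 0.9, 0.4, 0.1, 0.2, 0.3, 0.1, 0.0]
qrs_biased = [0.0, 0.1, 0.2, 0.3, 0.2, 0.1, 0.5, 0.9, 0.4, 0.0]
print(peak_in_segment(saliency_like, p_wave))  # True
print(peak_in_segment(qrs_biased, p_wave))     # False
```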
Summary of sanity checks To summarize the results of our sanity checks, spatial specificity, but in particular temporal specificity, clearly favors saliency over the other three competing attribution methods. Therefore, we consider saliency as the reference attribution method in the following experiments.

Glocal XAI: Insights from aggregated aligned attributions
Aggregated attribution maps In the XAI literature, one distinguishes local and global attributions, where the former are sample-specific and the latter characterize the entire model. It is worth stressing that while local attributions can potentially aid human decision-making, they are not directly applicable for auditing, which focuses on model behavior across entire patient groups. We propose to compute aggregated beat-aligned attributions across entire subgroups with common pathologies to infer global model insights and refer to this approach as glocal XAI. Fig. 4 summarizes the outcome of this analysis for the five subclasses under consideration in terms of median beats and attributions (top panel), average predictions (middle panel), and attributions aggregated on segment level for an XResNet model. A similar plot for the LeNet can be found in Fig. 13 in the supplementary material. In the following sections, we analyze the three main diagnostic conditions under consideration in detail.
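The idea of aggregating beat-aligned attributions can be sketched for a single lead as follows. As a simplification, the sketch aligns beats with fixed windows around given R-peak positions, whereas the analysis in this work relies on segmentation maps; the function name and window parameters are hypothetical.

```python
from statistics import median


def aggregate_beats(attribution, r_peaks, before, after):
    """Median attribution per time step over beat windows.

    Each window starts `before` samples ahead of an R-peak and spans
    `after` samples starting at the R-peak; incomplete windows are skipped.
    """
    windows = [
        attribution[r - before : r + after]
        for r in r_peaks
        if r - before >= 0 and r + after <= len(attribution)
    ]
    if not windows:
        return []
    return [median(column) for column in zip(*windows)]


# Toy single-lead attribution with three beats; R-peaks at 2, 6 and 10.
attribution = [0.0, 0.2, 1.0, 0.1, 0.0, 0.4, 0.8, 0.3, 0.0, 0.2, 0.9, 0.1]
print(aggregate_beats(attribution, [2, 6, 10], before=1, after=2))  # [0.2, 0.9, 0.1]
```

The same aggregation can be repeated across all samples of a patient subgroup, so that noise in individual beats averages out while consistent patterns remain.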

Left ventricular hypertrophy (LVH)
Condition The early sign of LVH (left ventricular hypertrophy) is an increase in R-amplitude, which is caused by the increased left ventricular mass. That is why the Sokolow-Lyon index [63] is a frequently used, but relatively unspecific diagnostic tool to assist in the diagnosis of LVH. The Sokolow-Lyon index is calculated as the sum of the R-amplitude in V5 and the S-amplitude in V1. The sum must exceed 3.5 mV to be considered an indicator of LVH.
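The rule as stated here translates directly into a small check. The sketch below assumes amplitudes in mV and takes the magnitude of the S-amplitude, since S-waves are often stored as negative deflections (an assumption about the feature encoding); the function names are hypothetical.

```python
def sokolow_lyon_index(r_amp_v5_mv, s_amp_v1_mv):
    """Sum of the R-amplitude in V5 and the S-wave depth in V1, in mV."""
    return r_amp_v5_mv + abs(s_amp_v1_mv)


def sli_lvh(r_amp_v5_mv, s_amp_v1_mv, threshold_mv=3.5):
    """True if the Sokolow-Lyon index exceeds the threshold."""
    return sokolow_lyon_index(r_amp_v5_mv, s_amp_v1_mv) > threshold_mv


print(sli_lvh(2.5, -1.5))  # True  (index 4.0 mV)
print(sli_lvh(2.0, -1.0))  # False (index 3.0 mV)
```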
Attributions Both models agree on assigning the highest attributions to the R-peaks in V5, V1, and V6 (in that order). They also attribute relevance to the ST-segment in V1. However, the two models differ in the rest of their attributions: the LeNet places slightly more relevance on the R-peak in V2, while the XResNet additionally focuses on the QR-interval and the beginning of the Q-wave (all in V1).
Discussion The strong emphasis on the R-peak in V5 and V6 by both models aligns well with its presence in various LVH concepts and rules, such as the Sokolow-Lyon-Index SLI-LVH. However, except for some attribution in V1, there are no indications of S-peak amplitudes, which are part of some LVH decision rules (e.g., S12-LVH, LI-LVH-).

Complete left bundle branch block (CLBBB)
Condition In the case of CLBBB, the dominant ECG finding, which is a consequence of the underlying conduction disturbance, is a widening of the QRS complex to at least 120 ms [64]. Further criteria include, for example, a QS-complex or rS-complex in V1 [65].
Attributions First of all, both models focus strongly on V1, with the R-peak in V2 as the only minor exception (for LeNet). Both models put most attribution on the R-peak in V1, in line with previous findings [36]. Further relevance is attributed to the second half of the P-wave and the PQ-segment, as well as the beginning of the ST-segment. The models differ in their emphasis: XResNet focuses more on the Q-wave, while LeNet places more emphasis on the ST-segment (all in V1).
Discussion One might speculate that the attribution maps reveal a focus on the smoother morphology of the onset of the QRS complex compared to normal samples. On the other hand, it is difficult to determine the significance of the QRS width in an attribution map (which is found to be exploited by both models, as shown in the global XAI analysis in Section 3.3). The strong focus on V1 is aligned with decision rules that are based on large QS or rS-complexes in this lead.

Anterior/inferior myocardial infarction (AMI/IMI)
Condition For the characterization of ECG changes related to prior myocardial infarction, we use the consensus definition [60], which focuses particularly on pathological Q-waves. For the manifestation of the localization of anterior vs. inferior myocardial infarction, we refer to [66], which suggests that AMIs are predominantly detected through QS waves in V1-V3 and IMIs through longer and deeper Q-waves in II, III and aVF.
Attributions For IMI and AMI, there is a strong similarity between the aggregated attribution maps of both models. In the case of IMI, both models focus on the Q-peak and the first half of the Q-wave in aVL, II, aVF and III with minor differences in terms of ranking. For AMI, both models focus almost exclusively on leads V1 and V2 and the area between the QRS-onset and R-peak. Both put most relevance on the R-peak and slightly less attribution on the QRS-onset in V1 and V2.
Discussion The attributions for both localizations of the prior MI align very well with the corresponding diagnostic criteria applied in the clinical setting, both in terms of spatial and temporal localization. The alignment of these patterns with domain knowledge across different models is convincing.

Global XAI: Does the model exploit cardiologists' expert concepts?
Shortcomings of attribution maps The aggregated attribution maps are an effective tool to highlight which parts of the signal are most relevant for certain decisions/predictions. However, it remains unclear whether, for example, low or high voltage amplitudes or the particular morphology at the location are decisive for the attribution of a signal, let alone how cardiologists' decision rules can be related to the model behavior. Nevertheless, prominent features in aggregated attribution maps, such as Q-waves for MI in Fig. 4, can provide guidance for the development of formalized rules, for which we can check if they are exploited by the model, as described below.
Concept-based XAI We draw on Testing with Concept Activation Vectors (TCAV) [37] as a complementary approach, which, in contrast to the attribution methods discussed so far, allows testing a trained model for its alignment with human-comprehensible concepts. Our aim is to use it to provide insights that generalize across different training runs for a given model architecture. This goes beyond the usual TCAV setting, where only a single model instance is considered. For the three diagnostic classes CLBBB, LVH and MI, we select a single discriminative concept for each class from a selection of well-established concepts. Specifically, we choose the concepts CLBBB-QRS for CLBBB, LVH-SLI for LVH, and MI-QWAVES for MI. These concepts are not just theoretical constructs; they are grounded in well-established clinical decision rules that practicing cardiologists use in diagnosing these conditions. For a concise overview of these concepts, we refer to Tab. 1 and for a discussion on the criteria for selecting the most discriminative concepts, see Fig. 11 in the supplementary material. We stress again that the approach is easily applicable to any concept that can be framed in terms of conventional ECG features. To demonstrate the versatility of our approach, we also investigate age and sex as two concepts that are not pathology-specific.
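The core of the TCAV computation can be sketched in a few lines. This is a simplified illustration and not the implementation used in this work: the concept activation vector is approximated by the difference of class means rather than by the normal of a trained linear classifier, the gradients of the class logit with respect to the layer activations are assumed to be given, and the statistical test across training runs is omitted. All names are hypothetical.

```python
def concept_activation_vector(concept_acts, random_acts):
    """Unit vector pointing from random to concept activations.

    Simplification: difference of class means instead of the normal of
    a trained linear classifier.
    """
    dim = len(concept_acts[0])
    mean_c = [sum(a[d] for a in concept_acts) / len(concept_acts) for d in range(dim)]
    mean_r = [sum(a[d] for a in random_acts) / len(random_acts) for d in range(dim)]
    diff = [c - r for c, r in zip(mean_c, mean_r)]
    norm = sum(x * x for x in diff) ** 0.5
    if norm == 0:
        raise ValueError("concept and random activations have identical means")
    return [x / norm for x in diff]


def tcav_score(gradients, cav):
    """Fraction of samples whose directional derivative along `cav` is positive.

    `gradients` holds, per sample, the gradient of the class logit with
    respect to the layer activations.
    """
    positive = sum(1 for g in gradients if sum(gi * ci for gi, ci in zip(g, cav)) > 0)
    return positive / len(gradients)


# Toy two-dimensional layer: the concept shifts activations along axis 0.
concept_acts = [[2.0, 0.0], [4.0, 0.0]]
random_acts = [[0.0, 0.0], [0.0, 0.0]]
cav = concept_activation_vector(concept_acts, random_acts)
gradients = [[1.0, 0.5], [0.5, -1.0], [-0.2, 0.3], [0.8, 0.0]]
print(cav)                         # [1.0, 0.0]
print(tcav_score(gradients, cav))  # 0.75
```

A score close to one indicates that moving activations toward the concept tends to increase the class logit, while a score close to zero indicates the opposite.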
Disease-specific concepts Fig. 5 illustrates the results of the concept-based analysis and shows that the pathology-specific concepts (CLBBB-QRS, LVH-SLI and MI-QWAVES) are consistently (i.e. marked with stars) and consequently (i.e. in all layers) used by both models and thus have a strong positive impact on the model prediction of their associated pathologies. This is indicated by the red squares marked with asterisks, which almost entirely fill the bold black rectangles that highlight the TCAV scores of the pathologies matching the corresponding concepts. Interestingly, both QRS-CLBBB and QWAVES-MI show a consistently and consequently negative effect (i.e. TCAV values close to zero) on the prediction of the normal class, which we consider as a confirmation of the validity of our analysis. In the case of SLI-LVH, there is no consequent negative effect on the prediction of the normal class, which aligns with a sizable number of normal samples satisfying the criterion [67].
Summarizing the investigation of the first three pathology-specific concepts, this experiment shows in a so far unseen, consistent fashion that across both model architectures, the concepts are exploited and have a positive effect on the classes they were designed for. The fact that concepts, in a few cases, are also exploited for other classes simply reflects the fact that the rules are not perfectly specific, see Fig. 11 in the supplementary material, and are additionally influenced by co-occurring diseases. For example, the negative impact of the concept SLI-LVH on MI predictions aligns with decreasing R-amplitudes for MIs, as opposed to increasing R-amplitudes for LVH.
Disease-unspecific concepts For LeNet, the AGE>75 concept exhibits, on one layer, a negative influence on the prediction of the normal class and positive influences for LVH, IMI and AMI. In contrast, for XResNet, the concept is exploited exclusively positively, on two layers, with effects on all the pathologies considered. This aligns with the expectation that age is an important covariate for pathologies and the models exploit this fact to a certain degree. Finally, the concept SEX=FEMALE has for both models a consistent positive effect for NORM. Fig. 11 shows that this can at least partially be explained by the different prevalence of the NORM class among men and women. The TCAV evaluation further substantiates that the model uses this relationship in its predictions. Further, for the LeNet, the concept SEX=FEMALE seems to have a strong negative influence on IMI and some negative influence on LVH. The positive effect of the concept for AMI aligns very well with potentially misdiagnosed female ASMI samples, see the discussion in Section 3.4.2. For the XResNet, restricting attention to significant influences, it only has some positive influence on CLBBB, which is most likely related to the different prevalence of pathologies in the PTB-XL dataset among men and women. These results highlight the ability to check if demographic attributes are implicitly exploited by the model, which is important in the context of fairness.

Knowledge discovery: Identifying subclasses using glocal XAI and unsupervised learning
In this final section, we revisit aggregated attribution maps and demonstrate the effectiveness of attributions in revealing substructures within a model, beyond the granularity of label information. To do this, we first present evidence in a controlled environment using the hierarchical labels provided by PTB-XL. Then, we use this framework to uncover intriguing insights into a shift in the interpretation of female ECGs and the possibility of ASMIs.

Confirmative analysis: Discovery of sub-diagnostic classes of myocardial infarction
Experiment In this section, we propose an experiment to discover sub-diagnostic classes within a super-diagnostic class using aggregated attribution maps as a source for knowledge discovery, instead of input or hidden feature representations. Building on the insights from the analysis on glocal XAI in Section 3.2, we focus on the differentiation of myocardial infarctions (MI) into anterior (AMI) and inferior (IMI) myocardial infarctions (see Section 3.2.3). For this, we train a model capable of classifying into five super-diagnostic classes (including MI). Once trained, we fit clustering models on all samples labeled as MI (specifically, we filter for samples that were labeled with MI as the exclusive super-diagnostic label and either uniquely AMI (n = 702) or IMI (n = 1250)). In order to provide evidence for the effectiveness of (1) attribution maps in discovering sub-diagnostics, we consider the following representations as baselines: (2) median input beats and (3) features from deeper layers of the trained model.

Clustering results In Fig. 6, we show the results of this experiment, where we observed the best clustering performance on attributions, i.e. an accuracy of 90% for attributions (Fig. 6c) compared to 75% for input (Fig. 6a) and only 61% for hidden features (Fig. 6b).
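Since cluster indices carry no inherent meaning, an accuracy for a clustering has to be computed after matching clusters to labels. The sketch below assumes that the accuracy is taken under the best one-to-one assignment of cluster indices to labels; this evaluation protocol is an assumption, the clustering algorithm itself is not shown, and the function name is hypothetical.

```python
from itertools import permutations


def clustering_accuracy(labels, clusters):
    """Best accuracy over all one-to-one mappings of cluster ids to labels."""
    label_names = sorted(set(labels))
    cluster_ids = sorted(set(clusters))
    best = 0.0
    for perm in permutations(label_names, len(cluster_ids)):
        mapping = dict(zip(cluster_ids, perm))
        correct = sum(mapping[c] == y for c, y in zip(clusters, labels))
        best = max(best, correct / len(labels))
    return best


labels = ["AMI", "AMI", "AMI", "IMI", "IMI"]
clusters = [0, 0, 1, 1, 1]
print(clustering_accuracy(labels, clusters))  # 0.8
```

Enumerating permutations is only practical for a handful of clusters, which is sufficient for the two-cluster setting considered here.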
Cluster visualization In Fig. 7, both clusters are visualized, with cluster 0 predominantly corresponding to AMI and cluster 1 predominantly to IMI. The median beats for both clusters were computed by aggregating the corresponding beats, based on the attribution clusters, along with the corresponding mean attributions on top. The resulting clusters agree with the glocal analysis of supervised models in Fig. 4.

Conclusion The fact that the most effective discovery capability is achieved through attributions provides supporting evidence of the usefulness of XAI for multiple applications, including debugging, interpretation and discovery. It is important to note that these observations cannot be made solely based on input data or feature activations of intermediate layers. This finding aligns very well with prior work [20], where clustered attribution maps were used to identify shortcuts exploited by the model.

Explorative analysis: Identifying subgroups within anteroseptal myocardial infarction
Experiment We go one step beyond confirmation and demonstrate how the proposed approach can be used to identify clinically relevant structures for pathologies where no further substructure is known. Here, we focus on ASMI, which is the most frequently annotated subclass of anterior myocardial infarction in PTB-XL. We deliberately chose a statement at the most fine-grained level of the diagnostic label hierarchy to avoid rediscovering already known subclasses.
Clustering results Using the method from the previous section, we cluster the aligned attribution maps for all ASMI samples and visualize the corresponding embeddings in Fig. 8a and Fig. 8b for XResNet and LeNet, respectively. In Fig. 8c and Fig. 8d we visualize the mean beats corresponding to the respective clusters, where we highlight the regions of interest via ellipses. Interestingly, both model types identify similar clusters highlighting the area before the vertical line in leads V1, V2, V3.
The main difference between the clusters is the presence or absence of an R-peak. The R-peak refers to the peak of the R-wave within the QRS complex, which represents ventricular depolarization. Its presence, size, and shape provide important information about heart function and health. In the case of the LeNet, see Fig. 8d, a remnant R-peak is visible in cluster 1 in all three leads V1, V2, V3, in the case of the XResNet only in V3 and V2. In both cases, cluster 0 shows signs of a transmural anteroseptal myocardial infarction, in particular, no visible R-peaks in leads V1, V2, V3. A transmural infarction represents a severe type of heart attack, characterized by involving the entire thickness of the heart muscle in the affected area. As the size of the R-peak is indicative of the extension of the scar, cluster 1 hence encompasses ASMIs of a less severe degree. We stress again that the two clusters were both comprised of samples that were labeled as ASMI and were only differentiated based on their respective attribution maps.

Figure 8: [...] (Fig. 8a and Fig. 8c) and LeNet model on the right (Fig. 8b and Fig. 8d). Top plots Fig. 8a and Fig. 8b: t-SNE embeddings (based on attributions) for ASMI colored by sex (left) and cluster assignment (right, as in Fig. 6c) reveal two distinct clusters. Bottom plots Fig. 8c and Fig. 8d: Corresponding cluster means and median beats for ASMI attribution clusters reveal distinct morphological differences, which can be associated with different degrees of severity of the myocardial infarction. Encircled regions in the plots indicate regions where the difference between the corresponding attributions, which are used for clustering, is significantly different from zero, i.e. the main regions where one expects to find differences between both clusters. These reveal a cluster of transmural ASMIs (cluster 0) and a cluster of less severe ASMIs and, according to modern diagnostic standards, even normal variants (cluster 1).
Clinical interpretation Going beyond a descriptive analysis, we aim to explore the potential clinical meaning of these results and identify sex as a covariate that discriminates between both clusters. Indeed, for women, certain ECG changes are in some cases even considered as normal variants [66]. This observation aligns with the high proportion of female samples in cluster 1 (XResNet: 55%, LeNet: 69%, all ASMIs: 53%). This is further evidenced by the mean predicted probabilities for the non-transmural cluster 1. Both models show a lower probability of AMI (compared to the set of all ASMI predictions) and an increased probability of NORM. This pattern is most clearly exposed in the case of the LeNet (AMI 0.27 and NORM 0.41 on cluster 1 as compared to AMI 0.72 and NORM 0.09 on all ASMI samples). A detailed analysis of 10 female samples from LeNet cluster 1 supports this finding. These cases would be diagnosed as normal variants according to modern standards, whereas they would have been diagnosed as ASMIs due to R-progression according to the common diagnostic standards at the time the PTB-XL dataset was created.

Discussion
Sanity checks We propose sanity checks based on an ECG parameter regression task, informed by methodologies previously considered in other domains [51,52], but, to the best of our knowledge, investigated for the first time in the context of ECG analysis. Surprisingly, three out of the four considered attribution methods do not show a sensible degree of temporal or spatial specificity and show a strong bias toward attributing relevance to the QRS-complex, which typically contains the highest signal amplitudes. In particular, GradCAM, the most widely used attribution method in the field of ECG analysis, fails to satisfy the sanity check. Saliency maps, the only method that passed the sanity check, show a high degree of temporal specificity. Gradient noise, a major drawback of saliency methods, can be very effectively reduced by aggregating attributions across beats and samples. The outcome of the sanity check not only calls into question results in the field of ECG analysis that were achieved using attribution methods that did not pass the sanity check, but also serves as a warning against blindly applying off-the-shelf attribution methods in any domain.
Glocal XAI Most of the existing applications of attribution methods in the field demonstrate insights based on single hand-picked examples. Our analysis provides strong arguments in favor of aligned attribution maps, in line with arguments and experimental evidence put forward in the literature [32,26,36,27]. Such maps can be defined on beat level or even on the level of individual ECG segments composing the beat, and can be aggregated across entire patient subgroups. To summarize, our analysis shows an unexpectedly strong similarity between the attribution maps of both model architectures, in particular also in terms of ranking of the most relevant segments. The quantified distribution of attributions onto segments and leads can be compared to cardiologists' decision rules [63,64,65,60,66]. We find good agreement for the most relevant parts both in terms of spatial and temporal localization. This technique reaches its limits when decision rules can no longer be directly related to single ECG features but only to more abstract concepts, which are the subject of the global XAI analysis. Prior studies [26,36] have leveraged visualizations and relevance score analyses to explore ECG features and disease classifications, emphasizing specific leads and features. However, to the best of our knowledge, our study is the first to quantitatively analyze these aspects across the entire dataset. We provide a more comprehensive breakdown by leads and segments, allowing for a deeper understanding and broader generalization beyond the anecdotal evidence typically based on isolated examples. This methodological advancement facilitates a more nuanced insight into ECG data segmentation and its clinical implications. This technology could serve as a complementary assessment of the internal validity during a certification process but can also be used for knowledge discovery.
Global XAI We see the ability to test if deep learning models exploit a certain (abstract) concept as a decisive advance in the global analysis of ML models. It provides crucial new insights into the model that are difficult to attain with conventional attribution methods (as suggested by [37,57,68]). The ECG domain is particularly well-suited for this purpose due to the availability of large rulebooks comprising decades of cardiologists' expert knowledge, which are mostly formulated in terms of ECG features that can be automatically extracted from the signal. To the best of our knowledge, we are the first to demonstrate the potential of this technology for selected concepts and pathologies and found a compelling alignment with expert knowledge. We envision that this approach could serve as a component to assess the internal validity of a machine learning algorithm during the certification process for automatic ECG analysis algorithms. On the one hand, it allows for the verification of certain concepts that are considered to be mandatory for every model deployed in clinical environments. On the other hand, it can also be used to check whether concepts related to sensitive attributes are systematically exploited. This provides a complementary perspective to the fairness literature, which mostly relies only on model performance, see e.g., [69], and touches on the fundamental question of whether the model should be allowed to exploit sensitive attributes for its decision.
Finally, we would like to emphasize the strong consistency of the glocal and global analysis. Our selection of concepts and the resulting TCAV values are consistent with the observations made in the glocal analysis, where we observed the same concepts per diagnostic class in the input using aggregated aligned attribution maps. In particular, compare Fig. 4 and Fig. 13: For LVH, the global analysis reveals that the SLI-LVH concept is clearly exploited and, correspondingly, the glocal analysis shows highest relevance on R-peaks in V1 and V5. For MI, the Q-wave concepts QWAVES-MI are exploited and the glocal analysis highlights Q-waves for AMI and IMI. This finding supports our claim that glocal methods can be used to arrive at global insights, which underlines their dual nature.
Knowledge discovery In the realm of deep learning applied to 12-lead ECG, previous studies [56,29,8,28] have demonstrated the capability of machine learning models to capture and interpret known clinical features and differences in ECG data. While these studies have enhanced our understanding of ECG interpretation, our research builds upon this foundation by introducing explorative clustering. This novel method not only analyzes decision processes but also identifies previously undetectable subclasses within the data, pushing the boundaries of clinical knowledge discovery in ECG data. We demonstrate that aligned attribution maps represent the most reliable way of identifying subclasses within a given superclass (compared to aligned raw signals or hidden features), which is a surprising observation. It crucially relies on the alignment, which is non-trivial to achieve in other data domains. However, similar techniques have been used in the literature to identify spatially localized artifacts compromising image classifiers [20]. Concept-based methods that directly relate to structures in the models' hidden representations [70] might be a way to overcome the limitations of the requirement of aligned samples and might provide a characterization on the level of individual segments rather than entire beats.
Applying the same approach to anteroseptal myocardial infarctions, we demonstrate that the method can distinguish transmural from non-transmural myocardial infarctions as clinically meaningful subgroups. This goes as far as providing hints on the internal consistency of diagnostic classes. We see the ability to question diagnostic knowledge through data-driven insights as a promising path to advance the field of ECG analysis, providing insights that even extend to diagnostic criteria underlying the different conditions. Here, the most accurate model does not necessarily allow the deepest insights. The more complex XResNet has sufficient model capacity to learn essentially arbitrary input-output patterns. However, the shallower LeNet, due to its smaller model capacity, reveals the tension between the NORM condition and the diagnostic criterion used to diagnose ASMI. Extending this approach to further pathologies to deepen the understanding of ECG signs through such a combination of data- and model-driven techniques represents a promising perspective for future research.
Limitations and future work While our study makes significant contributions to the field by applying both concept-based and glocal XAI to the domain of ECG, it is not without limitations. One notable constraint is the reliance on PTB-XL as the primary dataset, which may introduce biases and impact the generalizability of our findings due to its potential limitations in encompassing the full spectrum of demographic diversity, clinical conditions, and ECG signal variability.
While we identified saliency as the most reliable attribution method through a defined sanity check, comparing it against integrated gradients, LRP, and GradCAM, we acknowledge the limitations inherent in relying on a single attribution method for the proposed experiments. This reliance potentially narrows our perspective, as various attribution methods might provide different insights.
Furthermore, it is important to stress that the proposed glocal XAI approach is only able to identify consistent patterns that equally impact all beats, which is also the idea of basing ECG feature extraction on median beats. In particular, this approach will not be able to reveal temporal variations of features across several beats such as RR variability or spectral characteristics, even if the model can potentially leverage them internally. However, the global XAI approach with respective concepts can be applied here, emphasizing the importance of a comprehensive choice of views on this topic and highlighting the pros and cons of each approach. Additionally, the global XAI approach has inherent constraints, such as the need to translate domain expert concepts into formal conditions. This may be challenging in other domains, such as medical imaging, where commonly accepted features that can be extracted in an automated fashion and used as part of expert concept definitions are often not available.
Further investigations incorporating alternative methodologies for automatic concept discovery, as suggested by [71,70], could provide a complementary perspective on this topic. Recognizing these limitations, our work serves as a valuable step forward, offering insights that should be considered in conjunction with the broader discourse in the field.

Conclusion
In this work we provide evidence for (1) the need for sanity checks to ensure the faithfulness of post-hoc attribution methods (see Section 3.1), (2) the prospects of using segmentation maps to align and aggregate attribution maps both temporally (in terms of ECG segments) and spatially (in terms of leads) to explain model behavior across entire patient populations (see Section 3.2), (3) the power of using concept-based XAI methods to verify if expert concepts based on ECG features are consistently exploited by a model (see Section 3.3), and (4) the ability to discover previously unknown sub-conditions from aggregated attributions that remain hidden from raw signals or hidden model representations (see Section 3.4). Both the glocal and global analyses offer complementary yet consistent insights into the model's behavior, suggesting a quantitative alignment with cardiologists' decision rules. In particular, both methods agree that the model exploits the following pathology-concept pairs: (1) for LVH an increase in R-amplitude (Sokolow-Lyon index), (2) for CLBBB a widening of the QRS complex to at least 120 ms and (3) for AMI/IMI pathological Q-waves in specific leads (specific to the localization). We believe that our four main findings are of utmost importance for the future use of XAI methods in the ECG domain with auditing or knowledge discovery applications in mind.

A Supplementary Material

A.1 Data and Models
Lead correlations In Fig. 9, we provide cross-correlations among leads, highlighting how strongly the leads in 12-lead ECG signals correlate with each other. This correlation is most prominent in the limb leads, where four out of six leads are synthetic, i.e., linear combinations of the remaining two, linearly independent leads. This redundancy offers an explanation for the drop in performance of attribution methods in terms of spatial specificity in some leads.
Model performance on PTB-XL For the task of predicting the sub-diagnostic labels of PTB-XL, in terms of macro AUC, i.e., the mean across all individual label AUCs, we observe that shallow convolutional models like LeNet (macro AUC 0.9245 ± 0.0030) are almost competitive with deep models like XResNet (macro AUC 0.9292 ± 0.0043). To put this into perspective, we also report the performance on the more comprehensive set of all 71 labels at the most fine-grained level in PTB-XL, where we notice a bigger gap between the XResNet model (macro AUC 0.9286 ± 0.0028) and the LeNet model (macro AUC 0.9022 ± 0.0043). Both values should be compared with the performance of the best-performing models in [5] (sub-diagnostic: macro AUC 0.93 and all: macro AUC 0.925).
Segmentation Model Performance In evaluating our proposed segmentation model, we analyzed both soft predictions, as measured by ROC curves, and hard predictions obtained after applying argmax, using confusion matrices and classification reports (precision, recall, and F1 scores). The model achieves a macro AUC of 0.98 and an accuracy of 0.75. Notably, our evaluation focused on the 6 seconds around the center of each sample (disregarding 2 s at the beginning and the end of each sample), as ECGDeli, which provides the ground truth segmentation, exhibited limitations at signal borders. Fig. 10 summarizes the quantitative outcomes for hard predictions (Fig. 10a) and soft predictions (Fig. 10b), and provides a qualitative comparison that highlights potential weaknesses of ECGDeli (Fig. 10c): in the case of sinus tachycardia (STACH), i.e., more than 100 beats per minute, ECGDeli provides poor segmentations and fails to identify individual beats. Additionally, Tab. 2 offers detailed classification reports for each segment. These findings support the reliability of our proposed segmentation approach.
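To make the linear dependence among the limb leads concrete, the following minimal sketch derives leads III, aVR, aVL and aVF from leads I and II using the standard Einthoven and Goldberger relations and checks that the six limb leads span only a two-dimensional space. The random signals and all variable names are illustrative assumptions (they are not PTB-XL data), and the zero-lag Pearson correlation is used here as a simple stand-in for the cross-correlations shown in Fig. 9.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two independent toy "measured" leads (random stand-ins, not real ECG data).
lead_i = rng.standard_normal(1000)
lead_ii = rng.standard_normal(1000)

# Standard Einthoven/Goldberger relations for the remaining limb leads.
lead_iii = lead_ii - lead_i
avr = -(lead_i + lead_ii) / 2
avl = lead_i - lead_ii / 2
avf = lead_ii - lead_i / 2

limb = np.stack([lead_i, lead_ii, lead_iii, avr, avl, avf])  # shape (6, T)

rank = np.linalg.matrix_rank(limb)  # number of linearly independent limb leads
corr = np.corrcoef(limb)            # 6 x 6 zero-lag correlation matrix

print("rank of limb-lead matrix:", rank)
print(np.round(corr, 2))
```

Because every derived lead is a fixed linear combination of I and II, an attribution method cannot distinguish, from the data alone, whether a model relies on a derived lead or on the corresponding combination of the two independent leads.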

A.3 Global XAI
Concept definitions We summarize commonly used concepts/decision rules for the pathologies considered in this work in Tab. 4, along with formalized versions in terms of automatically extracted ECG parameters. In Fig. 11, we analyze the overlap between annotations for particular pathologies and specific concepts in terms of Matthews correlation coefficients (MCCs). As the most discriminative concept for each pathology, we select the concept with the highest MCC for that pathology. Accordingly, we picked SLI-LVH for LVH, QRS-CLBBB for CLBBB, and QWAVES-MI for IMI/AMI.
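The selection rule can be illustrated with a short sketch that computes the MCC between binary pathology labels and binary concept flags and picks the concept with the highest value. The toy arrays below are invented for illustration only and do not reflect PTB-XL annotations; the helper `mcc` is a hypothetical name, not part of the original code base.

```python
import numpy as np

def mcc(y_true, y_pred):
    """Matthews correlation coefficient for two binary (0/1) arrays."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    tp = float(np.sum(y_true & y_pred))
    tn = float(np.sum(~y_true & ~y_pred))
    fp = float(np.sum(~y_true & y_pred))
    fn = float(np.sum(y_true & ~y_pred))
    denom = np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when any marginal count is zero.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

# Hypothetical toy annotations: 1 = pathology present / concept fulfilled.
lvh_labels = np.array([1, 1, 1, 0, 0, 0, 0, 1])
concepts = {
    "SLI-LVH": np.array([1, 1, 1, 0, 0, 0, 1, 1]),
    "QRS-CLBBB": np.array([0, 1, 0, 0, 1, 0, 0, 0]),
    "QWAVES-MI": np.array([0, 0, 1, 1, 0, 0, 1, 0]),
}

# Score every concept against the pathology and keep the best one.
scores = {name: mcc(lvh_labels, flags) for name, flags in concepts.items()}
best_concept = max(scores, key=scores.get)
print(scores, best_concept)
```

Unlike accuracy, the MCC accounts for all four entries of the confusion matrix, which makes it a reasonable choice when pathologies and concepts are strongly imbalanced.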

A.4 Glocal XAI
ECG segments In Fig. 12, we visualize the ECG segments that were used to train the segmentation model.
Glocal XAI results for the LeNet model Fig. 13 shows the results of the glocal analysis (corresponding to Fig. 4 in the main text) for a LeNet instead of an XResNet model architecture.

A.5 XAI methods
In order to make the paper self-contained, we provide short technical descriptions of the XAI methods used in this study.
For more details, we refer the reader to the original publications.
TCAV With TCAV [37], it is possible to test a trained model against human-comprehensible concepts. These concepts are implicitly defined by data. In contrast to attribution maps, TCAV allows us to determine explicitly whether, e.g., a high magnitude at the R-peak of a signal is of high importance for the prediction of a certain pathology. In general, the TCAV approach works as follows. First, a binary dataset is compiled per concept, where the concept occurs in the positive samples and the negative samples represent a random composition of data points in which the concept does not occur. Second, for every dataset, we train a linear classifier that tries to separate the positive samples from the negative samples in feature space. The vector orthogonal to the decision boundary of the linear classifier is called the concept activation vector (CAV). It provides the direction in which the concept lies in the feature space, as depicted in Figure 14. Using the CAV, we can determine the TCAV score. First, we calculate for every data point $x_i$ belonging to class $k$ the sensitivity $S_{C,k,l}(x_i)$, which is the directional derivative of the output for class $k$ in the direction of the CAV $v_C^l$:
$$S_{C,k,l}(x_i) = \nabla h_{l,k}\big(f_l(x_i)\big) \cdot v_C^l,$$
where, following the notation of [37], $f_l$ maps the input to the activations at layer $l$ and $h_{l,k}$ maps these activations to the output for class $k$. In a second step, the TCAV score is computed as the fraction of data points $x_i$ belonging to class $k$ (collected in the set $X_k$) that have a positive sensitivity $S_{C,k,l}(x_i)$:
$$\mathrm{TCAV}_{C,k,l} = \frac{\left|\{x_i \in X_k : S_{C,k,l}(x_i) > 0\}\right|}{|X_k|}.$$
It provides a global score that allows assessing the importance of concept $C$ at layer $l$ of the model for the prediction of class $k$.
Saliency maps Saliency maps [30] infer attributions simply from input gradients of the corresponding class outputs. Denoting the model's output logit for class $k$ by $F_k(x)$, we define the saliency map for an input sample $x$ as [30]
$$S_k(x) = \frac{\partial F_k(x)}{\partial x}.$$
GradCAM GradCAM [24] takes a different approach in that it does not use the gradient information directly as attribution map, but rather to calculate weights for the activation feature maps. To this end, for each of the $K$ feature maps $A^k$ of a given convolutional layer, it computes the gradient $\partial y^c / \partial A^k$ of the output $y^c$ for the class $c$ we try to explain. Then, it calculates for every feature map $A^k$ a weight $\alpha_k$ by applying global mean pooling to the gradient:
$$\alpha_k = \frac{1}{Z} \sum_{i} \frac{\partial y^c}{\partial A^k_i},$$
where the sum runs over all entries $i$ of the feature map and $Z$ corresponds to the number of entries in the feature map. The attribution map is then given by the ReLU activation of the weighted sum of the feature maps:
$$L^c = \mathrm{ReLU}\Big(\sum_{k=1}^{K} \alpha_k A^k\Big).$$
Layer-wise relevance propagation Layer-wise relevance propagation (LRP) [50] propagates the model prediction $f(x)$ from output to input, assigning attributions to each neuron in the network, which are computed based on activations and weights of the same layer and attributions of connected neurons. A key idea of the method is that attributions are conserved across layers, i.e., the sum of the attributions $R_j^{(l)}$ over all neurons $j$ of each layer $l$ yields the model prediction $f(x)$:
$$\sum_j R_j^{(l)} = f(x).$$
The exact update procedure depends on the rule that is used. We use the ϵ-rule for all layers, which is formally equivalent to taking the product of input and input gradient (Gradient * Input) for models with only ReLU activations [74], such as those considered in this work:
$$R_j = \sum_k \frac{x_j w_{jk}}{\epsilon + \sum_{j'} x_{j'} w_{j'k}} R_k,$$
where $x_j$ denotes the activation of neuron $j$ in the lower layer, $w_{jk}$ the weight connecting it to neuron $k$ in the upper layer, and $R_k$ the attribution of neuron $k$. We also tested the Z-Plus rule for convolutional layers, which represents a default choice in the imaging domain:
$$R_j = \sum_k \frac{z^+_{jk}}{\sum_{j'} z^+_{j'k}} R_k,$$
with $z^+_{jk} = x_j w^+_{jk}$ and $w^+_{jk} = 1_{w_{jk} \geq 0}\, w_{jk}$. Due to worse results in the sanity check, we omit this rule and use the ϵ-rule throughout the model.
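The stated equivalence between the ϵ-rule and Gradient * Input can be checked numerically on a toy example. The sketch below is a minimal illustration under simplifying assumptions: a hand-specified, bias-free two-layer ReLU network with made-up weights, a very small ϵ, and a hypothetical helper `lrp_epsilon_dense`. It is not the implementation used in this study.

```python
import numpy as np

def lrp_epsilon_dense(a_in, weights, relevance_out, eps=1e-9):
    """Epsilon-rule for one bias-free dense layer; weights has shape (out, in)."""
    z = weights @ a_in                              # pre-activations z_k
    z_stab = z + eps * np.where(z >= 0, 1.0, -1.0)  # keep the denominator away from zero
    s = relevance_out / z_stab
    return a_in * (weights.T @ s)                   # R_j = a_j * sum_k w_jk * R_k / z_k

# Hand-specified toy network: f(x) = w2 . relu(W1 @ x), no biases.
W1 = np.array([[1.0, -2.0, 0.5],
               [0.5, 1.0, -1.0],
               [-1.0, 0.5, 2.0]])
w2 = np.array([1.0, -0.5, 2.0])
x = np.array([1.0, 0.5, 2.0])

pre_act = W1 @ x
hidden = np.maximum(pre_act, 0.0)
f = float(w2 @ hidden)

# LRP: start from the prediction and propagate back layer by layer.
r_hidden = lrp_epsilon_dense(hidden, w2[None, :], np.array([f]))
r_input = lrp_epsilon_dense(x, W1, r_hidden)

# Gradient * Input, using the analytic gradient of the ReLU network.
grad = W1.T @ (w2 * (pre_act > 0))
grad_times_input = x * grad

print(r_input, grad_times_input, f)
```

For this bias-free network, the input attributions also sum to the prediction `f`, which illustrates the conservation property described above.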
Integrated Gradients Integrated Gradients (IG) [49] is an explainability method applicable to any differentiable model that computes attribution maps by integrating over multiple gradients. It starts with a baseline $b$, which is chosen as an input with neutral prediction, e.g., an all-zero tensor, which would in our case represent an ECG with no electrical activity. To explain the prediction of class $k$ for a given input ECG $x \in \mathbb{R}^{m \times n}$, IG integrates the gradients of the model prediction for class $k$ with respect to all interpolated data points between the baseline $b$ and $x$:
$$\mathrm{IG}_{k,i,j}(x) = (x_{i,j} - b_{i,j}) \int_0^1 \frac{\partial F_k\big(b + \alpha (x - b)\big)}{\partial x_{i,j}} \, d\alpha,$$
where $F_k(x)$ is the logit for a data point $x$ for class $k$ and $\mathrm{IG}_{k,i,j}(x)$ the explanation of class $k$ for the $i$-th timestep (and $j$-th channel) in input $x$. The full explanation is then given by $\mathrm{IG}_k(x)$, which is also called attribution map. It assigns an attribution score to each timestep $i$ in the input $x$. By superimposing the attribution map on the ECG, we can mark the relevant areas in the ECG and thus simplify the interpretation of the diagnosis.
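In practice, the path integral is approximated by a finite sum over interpolation steps. The following sketch uses a midpoint Riemann sum and, as a stand-in for a trained network, a hypothetical quadratic "logit" with an analytic gradient; the function names, the input shape (timesteps x leads) and the number of steps are assumptions made for illustration.

```python
import numpy as np

def integrated_gradients(grad_fn, x, baseline, steps=50):
    """Midpoint Riemann-sum approximation of IG along the straight path baseline -> x."""
    alphas = (np.arange(steps) + 0.5) / steps
    avg_grad = np.zeros_like(x, dtype=float)
    for alpha in alphas:
        avg_grad += grad_fn(baseline + alpha * (x - baseline))
    avg_grad /= steps
    return (x - baseline) * avg_grad

# Hypothetical differentiable "logit" F_k(x) = sum(w * x**2) standing in for a model.
rng = np.random.default_rng(0)
w = rng.standard_normal((100, 12))  # timesteps x leads

def logit(x):
    return float(np.sum(w * x ** 2))

def logit_grad(x):
    return 2.0 * w * x  # analytic gradient of the toy logit

x = rng.standard_normal((100, 12))
baseline = np.zeros_like(x)  # all-zero baseline: "no electrical activity"

attributions = integrated_gradients(logit_grad, x, baseline)

# Completeness: attributions should sum to F_k(x) - F_k(baseline).
print(attributions.sum(), logit(x) - logit(baseline))
```

The completeness check at the end is a useful sanity test in its own right: if the sum of attributions deviates strongly from the difference in logits, the number of interpolation steps is likely too small.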

Figure 2: Graphical overview of the proposed aggregation process of local into glocal attributions. First column: Starting point is a deep learning model trained to predict diagnostic classes from raw ECG time series as input. Second column: Local attribution methods provide lead- and timestep-specific attributions for every input sample (bottom). A segmentation model provides probabilistic segmentation maps of the time series into well-established ECG segments. Third column: A first possibility to define aligned attributions across several beats and patients is to align signal and corresponding attributions based on fixed crops around the R-peak. Fourth column: A second possibility is to use probabilistic segmentation maps as weighting factors for attribution maps. Performing a dot product between attribution and probabilistic segmentation map along the temporal dimension yields (after appropriate normalization) an aggregated attribution map (bottom) that resolves relevance according to leads and ECG segments. This allows for unprecedented qualitative and quantitative insights into consistent model behavior after aggregation across entire patient subgroups.
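The segmentation-based aggregation in the fourth column can be sketched in a few lines. The example below assumes attribution maps of shape (leads, timesteps) and segmentation probabilities of shape (segments, timesteps), and it normalizes by the expected segment length; this normalization is one plausible reading of "appropriate normalization" and is an assumption, as are the function name and the toy data.

```python
import numpy as np

def aggregate_attributions(attribution, seg_probs):
    """Aggregate a local attribution map into a lead-by-segment relevance map.

    attribution: array of shape (leads, T)
    seg_probs:   array of shape (segments, T), segment probabilities per timestep
    returns:     array of shape (leads, segments), relevance per expected segment length
    """
    weighted = attribution @ seg_probs.T  # dot product along the temporal dimension
    seg_length = seg_probs.sum(axis=1)    # expected number of timesteps per segment
    return weighted / np.maximum(seg_length, 1e-12)

# Toy example: 2 leads, 6 timesteps, 2 segments with a hard (0/1) segmentation.
attribution = np.array([[1.0, 1.0, 0.0, 0.0, 0.0, 0.0],
                        [0.0, 0.0, 2.0, 2.0, 2.0, 2.0]])
seg_probs = np.array([[1.0, 1.0, 0.0, 0.0, 0.0, 0.0],
                      [0.0, 0.0, 1.0, 1.0, 1.0, 1.0]])

glocal = aggregate_attributions(attribution, seg_probs)
print(glocal)
```

Averaging such lead-by-segment maps over all samples of a patient subgroup then yields a dataset-wide relevance distribution of the kind described in the caption.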

Figure 4: Results of the glocal (dataset-wide) analysis for saliency as an attribution method for an XResNet model. Here, we consider five classes: (1) NORM: normal ECG as reference, (2) LVH: left ventricular hypertrophy, (3) CLBBB: complete left bundle branch block, (4) IMI: (prior) inferior myocardial infarction, and (5) AMI: (prior) anterior myocardial infarction. For each class, we aggregate a median beat for the top 100 predictions per class and also provide the mean of ground truth labels (gray bars) as compared to the prediction (red bars) below each plot (see LVH for interdependencies with _ISC). On top of each plot, we visualize the absolute attributions color-coded, where deep red indicates high attribution for the respective diagnosis (e.g., the QS-complex in V1 and V2 is highly relevant for AMI). At the bottom, we show the relevance distribution broken down according to ECG segments, with the top 7 segments with the highest relevance per segment length marked, which allows for quantitative statements about the relevance distribution. These show good agreement with the relevant segments used in decision rules from clinical literature.

Figure 5: Concept-based analysis: Investigating which of five concepts (from top to bottom: QRS-CLBBB, SLI-LVH, QWAVES-MI, AGE>75, SEX=FEMALE) are used for the prediction of a certain class in LeNet (left) vs. XResNet (right). Within a block (i.e., a particular concept-model combination to be tested), rows denote different layers in the model and columns represent the different output classes. Each block is color-coded according to the corresponding mean TCAV score, indicating whether the concept is used for/against the class under consideration. Stars indicate confidence intervals for the TCAV score that are sufficiently narrow and do not overlap with 0.5, i.e., they correspond to cases where concepts are consistently exploited (see the text description for details). The numbers in brackets are the respective CAV accuracies, which describe how well a concept can be linearly separated; blocks with insufficient accuracy are grayed out.
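How a single TCAV score and CAV accuracy of the kind reported in Fig. 5 come about can be illustrated with a toy computation. The sketch below is not the pipeline used in this study: the activations are random stand-ins, the linear classifier is fitted by ordinary least squares on +1/-1 targets (any linear classifier would serve), and the model head `w_k . relu(a)` is a hypothetical choice that makes the gradient available in closed form.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 4

# Hypothetical layer activations: concept examples are shifted along axis 0.
concept_acts = rng.standard_normal((200, dim)) + np.array([4.0, 0.0, 0.0, 0.0])
random_acts = rng.standard_normal((200, dim))

# Linear classifier via least squares on +1/-1 targets (assumed stand-in classifier).
features = np.vstack([concept_acts, random_acts])
targets = np.concatenate([np.ones(200), -np.ones(200)])
design = np.hstack([features, np.ones((400, 1))])  # append a bias column
coef, *_ = np.linalg.lstsq(design, targets, rcond=None)

cav = coef[:dim] / np.linalg.norm(coef[:dim])  # unit normal of the decision boundary
cav_accuracy = np.mean(np.sign(design @ coef) == targets)

# Hypothetical model head: logit_k(a) = w_k . relu(a), so its gradient is w_k * 1[a > 0].
w_k = np.array([1.0, 0.5, -0.5, 0.2])
class_acts = rng.standard_normal((300, dim)) + 1.0  # activations of samples from class k
grads = w_k * (class_acts > 0)

sensitivities = grads @ cav                # directional derivatives along the CAV
tcav_score = np.mean(sensitivities > 0)    # fraction of class-k samples with positive sensitivity

print("CAV accuracy:", cav_accuracy, "TCAV score:", tcav_score)
```

Repeating this computation with different random negative sets gives a distribution of TCAV scores, from which confidence intervals such as those indicated by the stars can be derived.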

Figure 6: Results of the experiments as described in Section 3.4.1. T-distributed stochastic neighbor embeddings (TSNE) (with default parameters each) of three different representations extracted from an XResNet model: Fig. 6a for the input median beats, Fig. 6b for the hidden features after global pooling, and Fig. 6c for the saliency attributions. In each subplot, we color-coded the ground truth labels on the left and the resulting clustering on the right. The plots reveal that (aligned) attributions are the most effective input representation for subclass discovery.

Figure 7: Analysis and visualization of the cluster means of the Gaussian Mixture Model fitted on attributions for super-class MI as described in Section 3.4 revealing two clusters corresponding to sub-classes AMI (cluster 0) and IMI (cluster 1).
(a) TSNE embeddings for the XResNet model. (b) TSNE embeddings for the LeNet model. Cluster means based on LeNet.

Figure 8: Explorative analysis of the substructure of ASMI (anteroseptal myocardial infarction) for the XResNet model on the left (Fig. 8a and Fig. 8c) and the LeNet model on the right (Fig. 8b and Fig. 8d). Top plots (Fig. 8a and Fig. 8b): TSNE embeddings (based on attributions) for ASMI colored by sex (left) and cluster assignment (right, as in Fig. 6c) reveal two distinct clusters. Bottom plots (Fig. 8c and Fig. 8d): Corresponding cluster means and median beats for ASMI attribution clusters reveal distinct morphological differences, which can be associated with different degrees of severity of the myocardial infarction. Encircled regions in the plots indicate where the difference between the corresponding attributions, which are used for clustering, is significantly different from zero, i.e., the main regions where one expects to find differences between both clusters. These reveal a cluster of transmural ASMIs (cluster 1) and a cluster of less severe ASMIs and, according to modern diagnostic standards, even normal variants (cluster 0).

Figure 9: Cross-Correlations among leads showing leads V2, V3 and III have less cross-correlation compared to other leads.

Figure 10: Evaluation results of our proposed segmentation model, which was trained on ECGDeli segmentations. Fig. 10a evaluates the hard predictions after argmax based on a confusion matrix. Fig. 10b evaluates soft predictions based on class-wise ROC curves. Fig. 10c shows a direct comparison of the delineation/segmentation between ECGDeli and our model as described in Section 2.4 and evaluated in Appendix A.1.

Figure 11: Matthews correlation coefficient (MCC) for all pairs of concepts and pathology.

Finally, through unsupervised learning, we demonstrate that attributions are more informative for subgroup discovery than the respective input signals or high-level model features, see Fig. 6, which highlights the greater amount of exploitable information in attribution maps. As an exploratory study, we use this insight to identify clinically meaningful subgroups of anteroseptal myocardial infarctions, see Fig. 8, which, ultimately, relates back to the diagnostic criteria used to an-

A.2 Sanity checks
Performance evaluation Tab. 3 reports the results of the regression experiment introduced in Section 2.3 in terms of mean absolute error and coefficient of determination, showing slight advantages of the LeNet models in two of the three experiments. Overall, each considered model performs reasonably well such that the analysis is not affected significantly by the choice of the model architecture.
Table 3: Results of regression experiments as described in Section 2.3, where we report mean and standard deviation (obtained by bootstrapping) of the mean absolute error (MAE) and coefficient of determination ($r^2$) for each pair of model (LeNet and XResNet) and task (P- and T-wave and R-peak amplitudes in all 12 leads).