Self-supervised representation learning from 12-lead ECG data

Clinical 12-lead electrocardiography (ECG) is one of the most widely encountered kinds of biosignals. Despite the increased availability of public ECG datasets, label scarcity remains a central challenge in the field. Self-supervised learning represents a promising way to alleviate this issue. In this work, we put forward the first comprehensive assessment of self-supervised representation learning from clinical 12-lead ECG data. To this end, we adapt state-of-the-art self-supervised methods based on instance discrimination and latent forecasting to the ECG domain. In a first step, we learn contrastive representations and evaluate their quality based on linear evaluation performance on a recently established, comprehensive, clinical ECG classification task. In a second step, we analyze the impact of self-supervised pretraining on finetuned ECG classifiers as compared to purely supervised performance. For the best-performing method, an adaptation of contrastive predictive coding, we find a linear evaluation performance only 0.5% below supervised performance. For the finetuned models, we find improvements in downstream performance of roughly 1% compared to supervised performance, as well as improved label efficiency and robustness against physiological noise. This work clearly establishes the feasibility of extracting discriminative representations from ECG data via self-supervised learning and the numerous advantages of finetuning such representations on downstream tasks as compared to purely supervised training. As the first comprehensive assessment of its kind in the ECG domain carried out exclusively on publicly available datasets, we hope that it establishes a first step towards reproducible progress in the rapidly evolving field of representation learning for biosignals.


Temesgen Mehari is with Physikalisch Technische Bundesanstalt, Berlin, Germany and Fraunhofer Heinrich Hertz Institute, Berlin, Germany, e-mail: temesgen.mehari@ptb.de. Nils Strodthoff is with Fraunhofer Heinrich Hertz Institute, Berlin, Germany, e-mail: nils.strodthoff@hhi.fraunhofer.de. Corresponding author: Nils Strodthoff (Permanent address: University of Oldenburg, Germany, e-mail: nils.strodthoff@uol.de). Both authors contributed equally to this work.

INTRODUCTION
The availability of datasets with high-quality labels is an omnipresent challenge in machine learning in general, but especially in the health domain, where the labeling process is particularly expensive and clinical ground truth is in many cases hard to define. However, the amount of unlabeled data often exceeds the amount of labeled data by several orders of magnitude, which represents a strong case for (self-supervised) representation learning from unlabeled data. During the past few years, self-supervised learning has made enormous advances in different domains ranging from natural language processing [1] over speech [2] to computer vision [3]. Self-supervised learning could be one component towards addressing the problem of data scarcity. It could help to train more accurate and potentially also more robust models given the same amount of labeled data, which is a desirable prospect for any application field. Of particular importance for the medical domain are improvements in label efficiency, which could make it possible to train models on more fine-grained and consequently less populated label hierarchies, or to include rare diseases that were out of reach with conventional training methods.
In this work, we investigate self-supervised representation learning in the context of clinical electrocardiography (ECG) data. The ECG is a non-invasive method for assessing the general cardiac condition of a patient. It is therefore an important first-line examination tool for the diagnosis of cardiovascular diseases, which rank among the diseases of highest mortality [4]. In particular, the (short) 12-lead ECG, which we focus on in this work, is the most commonly used type of ECG, with a very broad clinical applicability ranging from primary care centers to intensive care units. Even though the technology underlying the ECG is by now more than 100 years old and it is an extremely common procedure, which is ordered or provided during 5% of the office visits in the US [5], its interpretation is still performed mostly manually with only limited algorithmic support. Here, it is important to recognize that ECG interpretation is in some cases challenging even for cardiologists [6].
There are deep-learning-based ECG interpretation algorithms with exceptionally high performance [7], [8] that have been trained on large closed-source datasets. The sizes of publicly available datasets are smaller by several orders of magnitude, which provides the motivation to investigate if and how far self-supervised learning techniques can improve the performance of algorithms trained on these datasets. In addition, the question of label quality remains challenging even for the above large-scale datasets. Here, it is important to stress that even though self-supervised methods have been applied successfully in computer vision and speech, ECG records are time series (rather than one-dimensional images) and multivariate (unlike speech), with considerably different properties than speech. This means that the degree to which self-supervised methods work in this domain is not clear from the outset and deserves a thorough study. Even though the underlying methods were developed for other application domains, subtle adaptations, such as a careful choice of augmentation transformations or adaptations of the model architecture and training procedure, are required to make them actually work in the context of ECG data. Finally, just as in the case of supervised learning [9], measurable progress in the field of representation learning for ECG data requires benchmarking based on clearly defined evaluation criteria, in the ideal case with open software on publicly accessible datasets. With this work, we hope to establish a first step in this direction.
Going beyond a mere benchmarking of representation learning algorithms for the ECG domain, the real value lies in the potential benefits of self-supervised pretraining for finetuned downstream classifiers. These include improved data efficiency, improved quantitative performance, and improved robustness in a general sense as compared to models trained in a purely supervised fashion. In our experimental results, we present explicit evidence for these benefits. Putting these results into perspective, significant improvements in downstream performance through self-supervised pretraining should not be taken for granted, as the effects often remain small [10]. Also, improved robustness properties from self-supervised pretraining have rarely been demonstrated explicitly in other domains, not to mention the domain of ECG data.
Our key achievements can be summarized as follows:
• We present the first comprehensive assessment of self-supervised representation learning for 12-lead ECG data to foster measurable progress in the subfield of representation learning for biosignals.
• We adapt and directly compare instance-based self-supervised methods (Simple Framework for Contrastive Learning of Visual Representations (SimCLR), Bootstrap Your Own Latent (BYOL), Swapping Assignments between multiple Views (SwAV)) and contrastive latent forecasting methods (contrastive predictive coding (CPC)) and find compelling evidence for the feasibility of learning useful representations from ECG data through self-supervised learning.
• We propose and evaluate several modifications in the CPC architecture and training procedure that lead to considerable performance improvements.
• We evaluate different quality aspects of downstream classifiers finetuned from self-supervised models compared to training from scratch and find evidence for improved quantitative performance given the same downstream training set, improved label efficiency and improved robustness through self-supervised pretraining.

ECG analysis using deep learning
By now, the analysis of ECG data has developed into a very popular application domain for deep learning methods. As we are solely focusing on deep learning methods in this work, a brief discussion of state-of-the-art methods in the field is appropriate. For a detailed comparison of different approaches, we refer the reader to recent reviews [11], [12]. Until very recently, the merits of newly proposed methods in the field have been very difficult to assess due to a lack of appropriate, large, publicly available ECG datasets for training and evaluation of algorithms as well as due to a lack of clearly defined benchmarking criteria. The first issue was resolved very recently with the publication of several large clinical ECG datasets, see Section 2.3 for details. Relating to the second issue, we draw on a recent benchmarking study [9] on the PTB-XL dataset [13], the very dataset also used as downstream dataset in this study, where a number of different classification algorithms were assessed on different clinically relevant ECG classification tasks. The overall best-performing methods in this study turned out to be modern convolutional network architectures, namely resnet- or inception-based architectures. This is in line with the network architectures used in the literature [7], [8], [14], for which the authors report excellent performance on datasets with restricted access that are up to several orders of magnitude larger than the currently available public datasets.

Self-supervised representation learning for ECG data
Contrastive methods in computer vision have witnessed tremendous advances in the past few months [3], [15]-[18], which significantly improved the linear evaluation performance on ImageNet and demonstrated the usefulness of the learned features for other computer vision tasks. These methods can be adapted straightforwardly to learning representations from a large number of relatively short time series segments if one interprets the time series record as a one-dimensional multichannel image and adapts transformations appropriate for time series. The predominantly used approaches rely on instance discrimination as pretraining task and will be discussed in Section 2.2.1 in detail.
A second domain where self-supervised methods for non-discrete data have been implemented successfully is representation learning for speech, where predictive coding methods [2], [19] have been applied to conventional acoustic features [19]-[21] but also to raw waveform data [2], [22], [23]. The best-performing methods in this field rely on latent forecasting tasks and are discussed in Section 2.2.2.

Instance discrimination (SimCLR/BYOL/SwAV)
Current state-of-the-art contrastive methods from computer vision aim to learn representations based on multiple views of the same instance, see Figure 1. These views are created by applying stochastic transformations to the input data. This idea is implemented in the most straightforward way in SimCLR [3], where a noise contrastive loss is used to attract two (positive) copies originating from the same original instance and to repel them from all other (negative) instances in the batch. This approach typically relies on training with large batch sizes, which is less problematic in our case due to the reduced dimensionality of time series data as compared to image data. BYOL [17] does not explicitly rely on contrasting against negative samples in the same batch, but uses a moving average of the model itself, and reported slight improvements over SimCLR in the image domain. Finally, SwAV [18] relies on contrasting cluster assignments rather than individual instances and once again improved the linear evaluation scores on ImageNet.
In our case, we build on the implementations of all three frameworks in PyTorch Lightning Bolts [24].
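To make the instance-discrimination objective concrete, the following is a minimal sketch of a SimCLR-style normalized temperature-scaled cross-entropy loss in plain Python. It is an illustration only and not the PyTorch Lightning Bolts implementation used in this work; the function names, the cosine-similarity scoring and the temperature value of 0.5 are assumptions made for the example.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two non-zero embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nt_xent_loss(views_a, views_b, temperature=0.5):
    """SimCLR-style loss; views_a[i] and views_b[i] embed two views of record i.

    The temperature is a hypothetical value chosen for illustration.
    """
    embeddings = views_a + views_b  # 2N embeddings in total
    n = len(views_a)
    total = 0.0
    for i in range(2 * n):
        positive = (i + n) % (2 * n)  # index of the other view of the same record
        # Similarities to all other embeddings in the batch (positive and negatives)
        logits = [
            cosine_similarity(embeddings[i], embeddings[j]) / temperature
            for j in range(2 * n)
            if j != i
        ]
        positive_logit = cosine_similarity(embeddings[i], embeddings[positive]) / temperature
        log_denominator = math.log(sum(math.exp(value) for value in logits))
        total += log_denominator - positive_logit
    return total / (2 * n)
```

In this sketch, the loss is small when the two views of each record are embedded close to each other and far from the views of all other records in the batch, which is why larger batches provide more negatives per positive pair.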
As model architecture, we use convolutional neural networks of the xresnet1d family, one-dimensional adaptations of the popular xresnet architecture [25] from computer vision, which showed very good performance in a very recent ECG classification benchmarking study [9] that was carried out on the PTB-XL dataset, see Section 2.3, using the same evaluation scheme that is used in this work, see Section 2.4. We base our experiments on an xresnet1d50, whose performance is compatible with that of the best-performing xresnet1d101 from [9] within error bars. However, it is more parameter-efficient and showed a slightly superior performance for linear evaluation and finetuning. At this point, we stress again that even though we use an architecture that achieves state-of-the-art performance on PTB-XL, the main point of our study lies in the demonstration of relative improvements compared to supervised performance.
The transformations used to generate two semantically equivalent views of a given original record lie at the heart of the recent success of contrastive methods in computer vision. As demonstrated in [3], the quality of the learned representations depends crucially on the choice and proper combination of transformations. We therefore evaluated a number of transformations inspired by effective transformations in computer vision and transformations specific for time series, see Section A for a detailed description. Finally, we also evaluate representations obtained by using only prototypical physiological noise during pretraining.
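As an illustration of how two views of one record could be generated, the sketch below implements simplified versions of two time series transformations, time out and random resized crop. The precise definitions used in this work are given in Section A and are not reproduced here, so the parameter names, the default fractions and the linear interpolation used for resizing are assumptions.

```python
import random


def time_out(signal, max_fraction=0.5, rng=random):
    """Set a random contiguous segment to zero in all leads (signal[lead][t])."""
    length = len(signal[0])
    width = rng.randint(0, int(max_fraction * length))
    start = rng.randint(0, length - width)
    return [lead[:start] + [0.0] * width + lead[start + width:] for lead in signal]


def random_resized_crop(signal, min_fraction=0.5, rng=random):
    """Crop a random contiguous segment and stretch it back to the input length."""
    length = len(signal[0])
    crop_len = rng.randint(max(2, int(min_fraction * length)), length)
    start = rng.randint(0, length - crop_len)
    resized = []
    for lead in signal:
        crop = lead[start:start + crop_len]
        out = []
        for k in range(length):
            # Position of output sample k inside the crop (linear interpolation)
            pos = k * (crop_len - 1) / (length - 1)
            lo = int(pos)
            hi = min(lo + 1, crop_len - 1)
            frac = pos - lo
            out.append((1 - frac) * crop[lo] + frac * crop[hi])
        resized.append(out)
    return resized


def two_views(signal, rng=random):
    """Apply the same stochastic pipeline twice to obtain a positive pair."""
    def augment(x):
        return time_out(random_resized_crop(x, rng=rng), rng=rng)
    return augment(signal), augment(signal)
```

Both views keep the shape of the input record, here 12 leads with the same number of samples, so that they can be passed through the same encoder.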

Latent forecasting (CPC)
Contrastive Predictive Coding (CPC) [2] is also a contrastive approach, which, in contrast to the approaches described above, explicitly makes use of the sequential ordering of the data. The idea is to encode the input sequence by means of an encoder with strided convolutions or fully connected layers and to train a model to predict the latent representation of the sequence a fixed number of steps in the future given the encoded representation of the sequence in the past, again using a noise contrastive estimation approach, see Figure 2 for a graphical representation. As we work with data at 100 Hz, which is sampled rather coarsely compared to the typical sampling rates of 10 kHz in the audio domain, there is no need to drastically downsample the signal by means of strided convolutions. Instead, we use a fully connected encoder, in our case composed of four layers with 512 filters with batch normalization, as was also done in self-supervised representation learning from classical audio features [19]-[21]. We predict 12 timesteps, or equivalently 0.12 s, into the future and work with 128 negative samples that are drawn from the same sequence as the original record. For the prediction task, we use an LSTM model [26] with 2 layers and 512 hidden units. We propose and evaluate an enhanced version of the CPC architecture, with an additional hidden layer and non-linearity before the linear output layer of the LSTM. This modification was inspired by the additional multilayer perceptron in SimCLR, which was one of the key components that led to superior performance compared to previously used self-supervised approaches in computer vision.
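The contrastive forecasting objective can be sketched as follows. The snippet scores a predicted future latent against the true future latent and a set of negatives drawn from the same sequence, and returns the corresponding InfoNCE term. The dot-product score, the sampling with replacement and all function names are assumptions made for illustration; the prediction network itself is omitted.

```python
import math
import random


def sample_negatives(latents, target_index, n_negatives=128, rng=random):
    """Draw negatives (with replacement) from the same sequence, excluding the target step."""
    candidates = [i for i in range(len(latents)) if i != target_index]
    return [latents[rng.choice(candidates)] for _ in range(n_negatives)]


def info_nce(predicted, positive, negatives):
    """InfoNCE term: -log softmax of the positive score among all candidate scores."""
    def score(z):
        return sum(p * q for p, q in zip(predicted, z))

    scores = [score(positive)] + [score(z) for z in negatives]
    max_score = max(scores)  # subtract the maximum for numerical stability
    log_norm = max_score + math.log(sum(math.exp(s - max_score) for s in scores))
    return log_norm - scores[0]
```

If the prediction carries no information about the future, all scores coincide and the term equals the logarithm of the number of candidates; it approaches zero as the positive is scored increasingly higher than the negatives.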
When finetuning a classification model, we apply a concat-pooling layer [27], which concatenates the maximum of all LSTM outputs, the mean of all LSTM outputs, and the LSTM output corresponding to the final step, and a fully connected classification head with a single hidden layer with 512 units including batch normalization and dropout for regularization. To assess the linear evaluation performance, we use a single fully connected layer on top of the concat-pooling layer. The effects of the different modifications compared to standard CPC implementations and finetuning schedules are investigated in detail in Section C.
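The concat-pooling operation described above can be written down in a few lines. The sketch below operates on plain Python lists for a single record and is meant as an illustration of the pooling rule only; the function name is hypothetical.

```python
def concat_pool(outputs):
    """Concatenate max, mean and last-step features of outputs[t][h] (time step t, unit h)."""
    steps = len(outputs)
    hidden = len(outputs[0])
    max_pool = [max(outputs[t][h] for t in range(steps)) for h in range(hidden)]
    mean_pool = [sum(outputs[t][h] for t in range(steps)) / steps for h in range(hidden)]
    last = list(outputs[-1])
    return max_pool + mean_pool + last
```

With 512 hidden units, the pooled representation passed to the classification head therefore has 3 x 512 = 1536 features, independent of the sequence length.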

Self-supervised representation learning for physiological time series data
Self-supervised methods have also been used for representation learning from biomedical sequence data, including, most prominently, ECG [28]-[31] and electroencephalography (EEG) [28], [29], [32], [33] data. With the exception of [31], none of the existing works considered the case of representation learning from clinical 12-lead ECGs, the clinically most widely encountered type of ECG measurement. The authors of [31] also consider BYOL and SimCLR for pretraining but used a very shallow network architecture with only five layers. We believe that it is necessary to use larger models, which reach state-of-the-art performance on large, comprehensive ECG datasets such as PTB-XL and which consequently make it possible to learn richer representations, along with pretraining on larger datasets. In particular, it is not clear if pretraining advances in the small model regime carry over to larger models with competitive supervised performance. Also, from the methodological point of view, [31] deviates significantly from our approach in the sense that they aim to learn a lead-independent, universal single-lead encoder, as opposed to a joint 12-lead encoder in our case. The former is not directly applicable to downstream 12-lead ECG tasks. In addition, they propose new contrastive methods that can use 12-lead ECG data during pretraining but differ from our methods in that they are not suited for downstream 12-lead ECG tasks. This is because the proposed models do not process 12-lead data directly but exploit the fact that different leads from the same patient can be considered as positive pairs during pretraining. From the methodological point of view, [29] is also close to our contrastive approach, but their experimental results were limited to a small 2-lead dataset with fewer than 50 records. Without access to the original implementation, it is impossible to assess if their proposed approach would be competitive on 12-lead data and on large (pretraining) datasets, where self-supervised methods reveal their full potential. Earlier approaches such as [28] trained representations from 2-lead ECGs using skip-gram models. Finally, [30] use transformation recognition as a pretext task and proposed a framework specific to representation learning from single-lead ECGs.

ECG datasets
We use a collection of three datasets for pretraining, henceforth referred to as All, namely CinC2020 [34], Ribeiro [8] and Chapman [35], which together constitute a collection of the largest publicly available 12-lead ECG datasets with a total of 54,566 records. It is worth mentioning that CinC2020, the training dataset used for the Computing in Cardiology Challenge 2020, is by itself a compilation of five different datasets. In particular, it includes the PTB-XL dataset [13], [36] that we also use for evaluation in this study. At the most fine-grained level, which is used here, the PTB-XL dataset comes with 71 labels and the evaluation task is framed as a multi-label classification task. It is worth stressing that these labels cover a wide variety of diagnostic, form and rhythm statements and can be used for a comprehensive evaluation of ECG analysis algorithms. The 44 diagnostic statements can be categorized in terms of five super classes (normal/conduction disturbance/myocardial infarction/hypertrophy/ST-T change), the 19 form statements relate to mostly morphological changes in specific ECG segments such as an abnormal QRS complex, and the 12 rhythm statements comprise statements characterizing normal cardiac rhythms as well as arrhythmias. The dataset is organized into ten stratified, label-balanced folds, where the first eight are used as training set, the ninth is used as validation set and the tenth fold serves as test set [13]. All datasets are summarized in Table 1.
Table 1: Dataset summary: For pretraining, we use All (CinC2020, Chapman and Ribeiro) or PTB-XL. We evaluate on PTB-XL. Note that PTB-XL is a subset of CinC2020.
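The fold-based split of PTB-XL described above can be expressed as a small helper. The sketch assumes that each record is available as a pair of a record identifier and a fold number between 1 and 10; this input format and the function name are assumptions made for illustration.

```python
def split_by_fold(records):
    """Assign records to train (folds 1-8), validation (fold 9) and test (fold 10)."""
    splits = {"train": [], "val": [], "test": []}
    for record_id, fold in records:
        if 1 <= fold <= 8:
            splits["train"].append(record_id)
        elif fold == 9:
            splits["val"].append(record_id)
        elif fold == 10:
            splits["test"].append(record_id)
        else:
            raise ValueError(f"unexpected fold {fold}")
    return splits
```

Keeping the split tied to the published fold assignment, rather than drawing a new random split, is what makes results comparable across studies on the same dataset.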

Training and evaluation protocol
We restrict ourselves to ECG data at a sampling rate of 100 Hz in all cases. We pretrain CPC models on input sequences with a length of 10 seconds; all other models (including finetuned CPC models) are trained on input sequences with a length of 2.5 seconds. During training, subsequences are randomly cropped from the input record. At test time during finetuning, we use test-time augmentation: we crop subsequences with a length of 2.5 seconds (using a stride of 1.25 seconds) from the original record and take the mean of their respective output probabilities as final prediction, a method which considerably improved the model performance by approximately 0.01 in macro AUC as compared to a naive evaluation [9]. Both for finetuning and pretraining, we use the AdamW optimizer [37] with a weight decay of 0.001. During pretraining, we optimize the InfoNCE loss [2] using a constant learning rate schedule for CPC and the respective contrastive losses with a cosine learning rate schedule for SimCLR, BYOL, and SwAV as described in the original publications. During finetuning, we optimize binary cross-entropy, as appropriate for a multi-label classification task, with a constant learning rate schedule and evaluate the model performance based on macro AUC as in [9], computed from the 71 labels on the most fine-grained level in PTB-XL [13]. We perform model selection on the validation set and select the model with the lowest validation loss during pretraining and the highest macro AUC during finetuning. We report the respective test set score of the selected best model. The source code to reproduce all our experiments is publicly available [38].
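The test-time augmentation procedure described above can be sketched as follows: windows of 2.5 seconds are cropped with a stride of 1.25 seconds and the predicted probabilities are averaged. The model is represented by a hypothetical callable `predict_fn`, and the record layout `record[lead][t]` is an assumption of this example.

```python
def sliding_windows(n_samples, window, stride):
    """Start and end indices of all full windows that fit into the record."""
    return [(start, start + window) for start in range(0, n_samples - window + 1, stride)]


def predict_with_tta(record, predict_fn, fs=100, window_s=2.5, stride_s=1.25):
    """Average per-label probabilities over overlapping crops of record[lead][t].

    predict_fn is a placeholder for a trained model that maps a cropped record
    to a list of label probabilities.
    """
    window = int(window_s * fs)
    stride = int(stride_s * fs)
    n_samples = len(record[0])
    outputs = []
    for start, end in sliding_windows(n_samples, window, stride):
        crop = [lead[start:end] for lead in record]
        outputs.append(predict_fn(crop))
    n_labels = len(outputs[0])
    return [sum(out[k] for out in outputs) / len(outputs) for k in range(n_labels)]
```

For a 10-second record at 100 Hz, this yields 7 overlapping crops of 250 samples each, so every part of the record contributes to the final prediction.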
As conventionally done in self-supervised representation learning studies, we use two different evaluation procedures, linear evaluation and finetuning. The linear evaluation protocol aims to assess the quality of the learned representations through their linear separability. To this end, we replace the classification head by a single linear layer and freeze all other layers as well as batch normalization statistics. Within the finetuning protocol, we investigate the usefulness of these representations for downstream tasks, where we unfreeze the classification head as well as all layers of the pretrained model. For CPC, we found it beneficial to follow a two-step approach during finetuning: In a first step, we finetune just the classification head for 50 epochs while keeping the remaining pretrained weights fixed but still updating batch normalization statistics. We perform model selection using validation set scores and then finetune the entire model for 20 epochs at a reduced learning rate using discriminative, i.e. layer-dependent, learning rates to mitigate the danger of overwriting the information captured during pretraining. To this end, we typically divide models into head, body and stem/encoder and reduce the learning rate by a factor of 10 compared to the respective previous layer group. Also in this case, we select the final model based on validation set scores. In all other cases, we train models for 100 epochs using a constant learning rate.
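The two-step finetuning schedule with discriminative learning rates can be summarized in code. The sketch below only computes which layer groups are trained and with which learning rate; the group names follow the head/body/stem division described above, while the function names and the learning rate values passed in are hypothetical and not taken from this work.

```python
def discriminative_learning_rates(head_lr, groups=("head", "body", "stem"), factor=10.0):
    """Divide the learning rate by `factor` for each layer group further from the head."""
    return {name: head_lr / factor ** depth for depth, name in enumerate(groups)}


def finetuning_schedule(base_lr, reduced_lr, groups=("head", "body", "stem"), factor=10.0):
    """Step 1: train the head only for 50 epochs. Step 2: train all groups for 20 epochs."""
    step_one = {
        "epochs": 50,
        "trainable": [groups[0]],
        "learning_rates": {groups[0]: base_lr},
    }
    step_two = {
        "epochs": 20,
        "trainable": list(groups),
        "learning_rates": discriminative_learning_rates(reduced_lr, groups, factor),
    }
    return [step_one, step_two]
```

For example, a hypothetical reduced learning rate of 1e-4 for the head would translate into 1e-5 for the body and 1e-6 for the stem/encoder in the second step.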

EXPERIMENTS
The performance of self-supervised contrastive methods based on instance discrimination crucially depends on the choice of transformations that are used to create two semantically equivalent copies of the original input sequence.
To determine appropriate transformations, we carried out an experiment in which we pretrained a model using SimCLR and different combinations of transformations and evaluated the quality of the learned representations based on linear evaluation performance on PTB-XL, see Section B in the supplementary material for details. The results clearly identify time out (TO) in combination with random resized crop (RRC) as the most effective transformation pair, see Section A in the supplementary material for a description of all transformations under consideration. In addition, we consider physiological noise transformations that were designed to mimic typical physiological noise that might occur during ECG measurements, namely baseline wander, powerline noise, electromyographic noise and baseline shift. In a second step, we used a similar protocol to compare the three different contrastive learning frameworks SimCLR, BYOL and SwAV, now using the predetermined set of transformations. Whereas SimCLR clearly reached the best linear evaluation performance, finetuning from a BYOL representation led to a superior downstream performance, which is why we consider both methods in the following sections. We start by discussing the linear evaluation performance in Table 2, which should be put into perspective with the supervised performance achieved on PTB-XL. The best published result for this task on the same dataset with identical splits using purely supervised training was 0.925(07), also using an xresnet1d model [9]. Our supervised results remain compatible within error bars with this baseline result. The architecture used for CPC pretraining (denoted by 4FC+2LSTM+2FC) was not investigated in previous studies [9] and shows the strongest supervised performance reported on PTB-XL thus far.
The linear evaluation performances in Table 2 show that the pretrained representations are highly relevant for downstream classification tasks. Most notably, the linear evaluation performance of the CPC model only shows a performance gap of 0.5% compared to the same model architecture trained in a supervised manner and already exceeds the best result previously obtained using supervised training on the same dataset [9]. The contrastive methods from computer vision show a slightly weaker performance, but the best linear evaluation performance still reaches 95.5% of the respective supervised performance. The main point we are trying to convey here is how far one can push the linear evaluation performance not only in relative comparison to supervised performance, but also on an absolute scale. Whereas the former can also be demonstrated with simpler model architectures, the latter requires a certain model complexity. Based on these results, it is justified to claim that self-supervised representation learning is very effective in the ECG domain. To demonstrate the impact of the dataset size, we also report results for pretraining CPC just on PTB-XL, i.e. using only 40% of the original training dataset. As expected, increasing the size of the training dataset leads to improvements in the linear evaluation performance. Again, the reader is referred to Section C in the supplementary material for details on the impact of the modifications in the original CPC architecture and finetuning schedule.

Self-supervised pretraining improves downstream performance
In this section, we investigate whether finetuning from self-supervised representations can lead to improvements in downstream performance as compared to purely supervised training. The results are compiled in Table 2 and summarized graphically in Figure 3. As before, SimCLR reaches the best linear evaluation performance, whereas it is slightly outperformed by BYOL in terms of downstream performance. The considerably better linear evaluation performance of the CPC model as compared to BYOL and SimCLR directly translates into an improved downstream performance. Interestingly, the finetuned SimCLR and BYOL model performance when using physiological noise during training almost reaches the results from using (RRC, TO) transformations, while a sizable performance gap exists between them in terms of linear evaluation performance. In order to test to what extent the overlap between using PTB-XL for pretraining and for evaluation leads to an overestimation of the positive effects of pretraining on unseen data, we also investigate the performance of a model pretrained on CinC2020 w/o PTB-XL, i.e. CinC2020 with records from PTB-XL excluded, which is comparable in size to PTB-XL.
On the one hand, the linear evaluation performance after pretraining on CinC2020 w/o PTB-XL is lower and does not overlap with the result from pretraining on PTB-XL, which supports the initial hypothesis. On the other hand, the seemingly superior representations from pretraining on PTB-XL do not translate into a stronger classification performance after finetuning. In fact, finetuning from pretraining on CinC2020 w/o PTB-XL even leads to a higher point estimate, although both results remain consistent within error bars. Hence, these somewhat inconclusive results do not allow any decisive statements about the initial hypothesis.
In particular, it remains difficult to disentangle the effects of overlap of samples in pretraining and finetuning from differences in the distribution of the pretraining datasets, which are rather pronounced as shown in Figure E1. In all cases, the results from finetuning pretrained models improve over the corresponding supervised results (by 1.0% for CPC, by 0.2% for SimCLR, and by 0.5% for BYOL). Furthermore, it is noticeable that already after the first finetuning step, where just the batch norm statistics and the classification head were adjusted, the CPC model reaches performance values around 0.931, i.e. already roughly matches supervised performance. These results provide a clear case for self-supervised learning in the ECG domain.
At this point, we find it appropriate to comment on different sources of uncertainty impacting our results. In Table 2, we report uncertainties related to the inherent randomness of the training process, which we assess via multiple training runs. In addition to this systematic error, there is also an uncertainty in the final scores due to the finiteness and the particular sample distribution of the test set. As in [9], we assess this error via empirical bootstrapping on the test set using 1000 bootstrap iterations and evaluate 95% confidence intervals. For finetuning after self-supervised pretraining, we find confidence intervals of ±0.006 (±0.008 for linear evaluation) as opposed to ±0.008 for training from scratch, with only minor variations between different runs. This statistical error therefore represents the dominant source of uncertainty, which provides a strong argument for larger ECG evaluation datasets. As in [9], we check if the confidence interval for the difference between finetuning following pretraining and training from scratch encloses only positive values, which would indicate that the performance improvement is statistically significant. We investigated the improvements for every combination of the 10 models that were finetuned from the model pretrained on All and 10 models trained in a supervised fashion, combining both sources of uncertainty in a single analysis. In 90% of these 100 comparisons, pretraining led to a statistically significant improvement, underscoring the positive effects of self-supervised pretraining.
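The significance check described above can be sketched as a bootstrap over test samples. The snippet resamples the test set with replacement, computes the macro AUC difference between two models on each resample, and reports a confidence interval for that difference. It uses simple percentile intervals and skips labels without both classes in a resample; both choices, as well as all function names, are assumptions of this sketch and may differ from the procedure in [9].

```python
import random


def auc(labels, scores):
    """ROC AUC as the fraction of correctly ordered positive/negative pairs (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None  # undefined if only one class is present
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def macro_auc(label_matrix, score_matrix):
    """Mean AUC over label columns; labels with undefined AUC are skipped."""
    values = []
    for k in range(len(label_matrix[0])):
        value = auc([row[k] for row in label_matrix], [row[k] for row in score_matrix])
        if value is not None:
            values.append(value)
    return sum(values) / len(values) if values else None


def bootstrap_difference_ci(labels, scores_a, scores_b, n_iterations=1000, alpha=0.05, seed=0):
    """Percentile confidence interval for macro AUC(model a) - macro AUC(model b)."""
    rng = random.Random(seed)
    n = len(labels)
    diffs = []
    for _ in range(n_iterations):
        idx = [rng.randrange(n) for _ in range(n)]  # resample test records with replacement
        y = [labels[i] for i in idx]
        a = macro_auc(y, [scores_a[i] for i in idx])
        b = macro_auc(y, [scores_b[i] for i in idx])
        if a is None or b is None:
            continue  # skip degenerate resamples without any valid label
        diffs.append(a - b)
    diffs.sort()
    m = len(diffs)
    lower = diffs[int((alpha / 2) * m)]
    upper = diffs[max(int((1 - alpha / 2) * m) - 1, 0)]
    return lower, upper
```

Under these assumptions, an improvement of model a over model b would be regarded as statistically significant if the returned lower bound is positive.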
In Figure 4, we group the ECG statements by their respective super class (for diagnostic statements), or mark them as rhythm or form statements, and sort them according to the supervised performance. In terms of improvements through self-supervised pretraining, the largest average gains are observed for form and rhythm statements, see Figure D1, which supports the hypothesis that different pathologies tend to profit differently from self-supervised pretraining. This effect is superimposed by a different effect that is visible in Figure 4, which indicates that the gains through pretraining show a negative correlation with the performance level before pretraining, i.e. ECG statements with low supervised performance tend to profit more from self-supervised pretraining, see Figure D2 for an explicit demonstration. As a final remark, one has to consider the different sizes of the models under consideration. Whereas the CPC model during finetuning comprises 5.8M parameters, the xresnet1d50 only counts 930K parameters, which might suggest that the gap between CPC and BYOL/SimCLR can at least partially be attributed to a difference in model capacity. However, in preliminary experiments we saw no indications of strong performance increases with wider or deeper models. It remains to be seen how the performance of SimCLR and BYOL scales on larger datasets. In contrast, for the 4FC+2LSTM+2FC models, the model capacity is instrumental in reaching their very high performance, see Section C.
A serious disadvantage of CPC is the sequential nature of the LSTM, which leads to long training times. The training times are not directly comparable due to the different nature of the tasks, but give at least a hint. CPC models were pretrained for 200 epochs, which took approximately 6 days on a single Tesla V100 GPU. SimCLR and BYOL pretraining was performed for 2000 epochs using batch sizes of 8192, with approximate runtimes of 15h and 13h on a single Tesla V100 GPU, respectively. Finetuning the 4FC+2LSTM+2FC model for 50+20 epochs takes approximately 25 minutes on the mentioned hardware, and finetuning the xresnet1d50 model for 100 epochs takes approximately 10 minutes.

Self-supervised pretraining improves downstream data efficiency
Another potential advantage of self-supervised pretraining lies in an improved data efficiency when finetuning on a downstream task. This is particularly relevant for medical applications, where high-quality labels are hard to obtain. To investigate this claim in detail, we compare the performance of models finetuned from self-supervised representations to models trained from scratch in a purely supervised fashion while varying the number of training folds from the original 8 folds down to a single fold, see Figure 5. The gain in data efficiency can be read off, for example, from the number of training folds at which the pretrained model reaches the same performance as the supervised model trained on the full dataset. In the case of CPC, this point is reached at approximately 5 folds, or equivalently approximately 62% of the training data. The performance level of the supervised models at 4 folds is approximately reached by the pretrained model with correspondingly fewer training folds.
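Reading off the break-even point from such a fold sweep amounts to a simple lookup, sketched below. The helper name is hypothetical and the macro AUC values are invented purely for illustration; they are not results from this study.

```python
def folds_to_match(pretrained_scores, target, total_folds=8):
    """Smallest number of training folds at which the pretrained model
    reaches `target`, together with the fraction of training data used."""
    for folds in sorted(pretrained_scores):
        if pretrained_scores[folds] >= target:
            return folds, folds / total_folds
    return None, None  # target never reached

# Hypothetical macro AUC per number of training folds (illustration only).
pretrained = {1: 0.880, 2: 0.900, 3: 0.910, 4: 0.918,
              5: 0.925, 6: 0.930, 7: 0.933, 8: 0.935}
supervised_full = 0.925  # hypothetical supervised score on all 8 folds

folds, fraction = folds_to_match(pretrained, supervised_full)
```

With 8 training folds in total, a break-even point at 5 folds corresponds to 5/8 = 62.5% of the training data.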

Self-supervised pretraining improves robustness of downstream classifiers
In addition to quantitative performance and data efficiency, robustness is one of the key quality criteria for machine learning models. Here, we focus on robustness against input perturbations. It is well known that certain types of noise tend to occur in ECG data as a consequence of the measurement process and physiological interference [39], [40]. In Section A in the supplementary material, we briefly review typical kinds of ECG noise and propose simple ways of parameterizing them. For simplicity, we just superimpose the different noise types and the original ECG waveform. We define different noise levels by adjusting the amplitudes of these noise transformations and evaluate the performance of the models from the previous sections on perturbed versions of the original test set. We also indicate signal-to-noise ratios (SNRs) corresponding to the different noise levels, where we identify the signal with the original ECG waveform. However, one has to keep in mind that this assessment neglects the noise inherent in the original measurement, which implies that the given SNRs are only upper bounds on the actual values.
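The SNR of a perturbed recording can be sketched as follows. The text does not spell out the formula, so the usual power-ratio definition in decibels is assumed here; the function name and the toy waveforms are hypothetical.

```python
import math

def snr_db(signal, noise):
    """SNR in dB, assuming the common definition
    10 * log10(P_signal / P_noise) with power as the mean square."""
    p_signal = sum(x * x for x in signal) / len(signal)
    p_noise = sum(x * x for x in noise) / len(noise)
    return 10 * math.log10(p_signal / p_noise)

# Toy example at 100 Hz: a unit-amplitude 1 Hz wave as the "signal"
# and a 5 Hz wave with amplitude 0.1 as the "noise".
fs = 100
t = [k / fs for k in range(10 * fs)]
clean = [math.sin(2 * math.pi * 1 * x) for x in t]
noise = [0.1 * math.sin(2 * math.pi * 5 * x) for x in t]

level = snr_db(clean, noise)  # power ratio of 100
```

Because the clean waveform is treated as noise-free, a value computed this way overestimates the true SNR of a real recording, in line with the caveat above.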
The goal is to investigate whether pretrained models are less susceptible to physiological noise. The results in Figure 6 reveal an interesting pattern: For the 4FC+2LSTM+2FC models, the CPC-pretrained model shows a considerably improved robustness to noise compared to its supervised counterpart. However, both models turn out to be less robust than the considerably less complex xresnet1d models. For the latter, the BYOL-pretrained model with physiological noise shows the strongest overall performance and also performs considerably better than the corresponding supervised model, which is to a certain degree an expected result, as the model experienced a comparable kind of noise during pretraining. This result provides a strong argument for pretraining with domain-specific noise transformations even if it comes at the cost of a slight performance loss compared to the best-performing pretraining procedure during noiseless evaluation, see Table 2. Somewhat surprisingly, BYOL pretraining with the artificial (RRC, TO) transformation even led to a reduced robustness compared to the supervised xresnet1d50. As a final remark, noise levels 3 and beyond are already strongly dominated by noise and correspond to situations that will rarely be encountered in real-world scenarios.

Implications of the results
Even though we discussed the implications of our findings on a technical level, we find it appropriate to also briefly comment on the broader implications of our results. Section 3.1 addresses the question of how far one can push the linear evaluation performance and demonstrates that self-supervised learning can be implemented effectively in the domain of ECG data. In terms of (clinical) impact, however, it is primarily a necessary prerequisite for the following investigations: Section 3.2 shows that self-supervised pretraining improves the predictive performance compared to training from scratch. Most importantly, breaking down the improvements according to individual diagnoses shows that the diagnoses that perform worst when training from scratch profit most from self-supervised pretraining. This is an encouraging sign, as it opens up the prospect of eventually training models on even finer and hence less populated label hierarchies and/or of tackling rare diseases, for which only a very limited number of labeled samples exist in the first place. Section 3.3 demonstrates that the improvements achieved through pretraining directly translate into an improved label efficiency. This is also an encouraging result for the broader ECG research community, given that the sample sizes of freely accessible ECG datasets with high label quality are growing but still small compared to commercial ECG datasets. Finally, Section 3.4 stresses that self-supervised pretraining leads to benefits such as improved robustness that go beyond quantitative performance and that are also very desirable in clinical applications.

SUMMARY AND CONCLUSIONS
In this work, we put forward a comprehensive assessment of self-supervised representation learning on 12-lead clinical ECG data. Even though self-supervised algorithms have been applied successfully in computer vision and speech, ECG is a different data modality, where the degree to which self-supervised learning works is not clear from the outset.
Upon adapting self-supervised representation learning to the ECG domain, self-supervised learning turns out to be very effective: Self-supervised representations (via CPC) reach scores that fall only 0.5% behind supervised performance during linear evaluation and lead to improvements of 1.0% compared to supervised performance during finetuning, which represents a statistically significant increase in 90% of the comparisons within an analysis incorporating both systematic and statistical uncertainties. The sizable performance gap translates into an improved label efficiency, i.e. the pretrained model reaches the same performance as the supervised model using only roughly 50-60% of the samples. We also investigate the impact of self-supervised pretraining on the robustness of the corresponding finetuned classifiers against physiological noise. We find increased robustness for most pretrained models compared to the corresponding models trained from scratch, but particularly for those that were pretrained using domain-specific noise transformations. This provides a strong case for the use of domain-specific noise transformations during pretraining.
To summarize, self-supervised learning is one path towards more robust and more label-efficient training procedures, which might alleviate the problem of label scarcity that is particularly pressing in medical applications. In this work, we demonstrated clear advantages in terms of quantitative performance, label efficiency and robustness. It will be interesting to see whether these carry over to further quality dimensions. We see our work as a first step towards measurable progress in the field of representation learning for 12-lead ECGs. All used datasets and the source code underlying our study are publicly available [38].

where $C$ is a uniform random number drawn from $[0, C_{\max,\text{bls}}]$ and $c_i$ is drawn from a standard normal distribution and modulated by a random sign.
A.2.0.5 Superposition: Eventually, all four noise types are superimposed and added to the original signal $s(t)$, i.e.

$$s_{\text{physio.noise}}(t) = s(t) + n_{\text{blw}}(t) + n_{\text{pln}}(t) + n_{\text{emn}}(t) + n_{\text{bls}}(t). \tag{5}$$

The noise strength is adjusted by varying $C_{\max,\text{blw}}$, $C_{\max,\text{pln}}$, $C_{\max,\text{emn}}$, and $C_{\max,\text{bls}}$ while keeping all other parameters fixed.
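Equation (5) is a sample-wise sum, which can be sketched as below. The function name is hypothetical, the sketch handles a single lead stored as a list of samples, and the four noise components are assumed to have been generated beforehand with their respective amplitudes.

```python
def superimpose(signal, *noise_components):
    """Add all noise components sample by sample to the clean signal."""
    if not noise_components:
        return list(signal)
    for component in noise_components:
        if len(component) != len(signal):
            raise ValueError("all components must have the same length")
    return [s + sum(parts)
            for s, parts in zip(signal, zip(*noise_components))]

# Toy values standing in for s(t) and the four noise terms.
s = [1.0, 2.0]
n_blw, n_pln = [0.5, 0.5], [0.25, 0.25]
n_emn, n_bls = [0.0, 0.0], [0.0, 0.0]

noisy = superimpose(s, n_blw, n_pln, n_emn, n_bls)
```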

A.3 Parameter values used during pretraining and evaluation
During pretraining, we used $C_{\max,\text{blw}} = 0.1$, $C_{\max,\text{pln}} = 0.2$, $C_{\max,\text{emn}} = 0.5$, and $C_{\max,\text{bls}} = 1$ when using the physiological transformations. For our robustness test, we created noisy validation sets. We considered different levels of noise, which are described in Table 3.

TRANSFORMATIONS
The choice of appropriate transformations to induce two semantically equivalent views of the original instance is crucial for the effectiveness of the approach and one of the key components for the success of SimCLR [3]. In order to find the best combination of transformations during pretraining, we followed the example of [3] and performed a grid search based on six transformations, which are partly inspired by computer vision and partly by time series analysis: Gaussian noise (GN), Gaussian blur (GB), channel resize (CR), time out (TO), random resized crop (RRC) and dynamic time warp (DTW), see Section A for a detailed description. For all pairs as well as single transformations, we trained an xresnet1d50 using SimCLR for 500 epochs on the All dataset. Figure B1 shows the respective linear evaluation performances on the PTB-XL dataset. The results rather clearly identify time out in combination with random resized crop as the most effective transformation pair. This is one combination of transformations that will be used for all following experiments. For comparison, we train a model on transformations that are supposed to mimic common types of physiological noise that typically occur in ECG measurements [39], [40]. Here, we consider baseline wander, powerline noise, electromyographic noise and baseline shift, which are also described in detail in Section A.

Table 4: Comparing different contrastive learning frameworks and augmentation transformations in terms of linear evaluation and finetuning performance after 2000 epochs of pretraining on the All dataset. We report mean and standard deviation of the validation set scores over 10 finetuning runs using a concise error notation where e.g. 0.8976(11) signifies 0.8976 ± 0.0011.
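The concise error notation used for the reported scores can be expanded mechanically: the digits in parentheses give the standard deviation in units of the last quoted decimal place. A minimal sketch, with a hypothetical helper name:

```python
from decimal import Decimal

def expand_concise(value):
    """Turn a string such as '0.8976(11)' into (mean, std).

    The digits in parentheses are the uncertainty expressed in units
    of the last decimal place of the mean."""
    mean_str, err_str = value.replace(" ", "").rstrip(")").split("(")
    mean = Decimal(mean_str)
    # Shift the error digits to the decimal position of the mean.
    std = Decimal(err_str).scaleb(mean.as_tuple().exponent)
    return mean, std

mean, std = expand_concise("0.8976(11)")
```

Using `Decimal` rather than floats keeps the number of quoted digits exact, so the example yields a mean of 0.8976 and a standard deviation of 0.0011.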
In a second step, we aim to identify the most effective pretraining framework. To this end, we pretrain models using SimCLR, BYOL, and SwAV, each with (RRC, TO) and physiological noise transformations. The results are summarized in Table 4, both in terms of linear evaluation and finetuned performance. As a first observation, in both evaluation modes the models pretrained with artificial transformations are considerably stronger than their counterparts pretrained using physiological noise. In terms of linear evaluation performance, the gap is smallest in the case of BYOL, which is consistent with findings about a less pronounced sensitivity to transformation choices in computer vision [17]. However, the most interesting observation is the mismatch between linear evaluation and finetuned model scores: Whereas SimCLR clearly reaches the best linear evaluation performance, finetuning from a BYOL representation leads to a superior downstream performance. This reiterates the fact that the ranking in terms of linear evaluation performance is not necessarily a perfect proxy for the ranking in terms of downstream performance.

SPEECH
In this final section, we investigate the impact of different modifications of the CPC architecture and training procedure to demonstrate to what extent they positively impacted the performance. These results potentially convey general insights for CPC and related self-supervised approaches that go beyond the specific application to ECG data. To this end, we vary one aspect while keeping the other ones fixed and report the impact on linear evaluation and finetuning performance. Pretraining and evaluation are performed on PTB-XL for simplicity. We refer to the configuration with a fully connected encoder, an MLP during pretraining, predicting 12 steps ahead, a hidden layer, batch normalization and dropout in the classification head, two-step finetuning, and discriminative learning rates during finetuning as CPC Baseline.
The results of this investigation are summarized in Table 5: The MLP during CPC pretraining has a small but consistent positive impact both in terms of linear evaluation and in terms of downstream performance. Modifications of the classification head also have slight but consistent positive effects. However, we observe a more severe performance degradation upon reducing the number of hidden units of the LSTM modules from 512 to 256 (and 128), which confirms our assumption that the model capacity is of great importance. The most significant performance gain arises from finetuning in a two-step approach, where the head is finetuned first and the full model is only finetuned in a second step. Omitting discriminative learning rates in the final finetuning step led to a mean performance identical to that of the baseline, but the results are much less consistent across different runs, as visible from a standard deviation that is almost double the baseline value.
Finally, a comparison to CPC applied to raw audio is in order. The original CPC [2] applied to raw audio waveform data operates at 10 kHz. The encoder uses strided convolutions and the encoded data therefore undergoes a downsampling by a factor of 160. Predicting 12 steps into the future then corresponds to a look-ahead interval of 0.192s. In our case, we work with much more coarsely sampled data at 100 Hz, but the encoded data undergoes no downsampling due to the use of a fully connected encoder. In this case, predicting 12 steps into the future corresponds to 0.12s, which is of a similar order of magnitude as in speech. Using an encoder with strided convolutions led to considerably worse performance, which was already apparent for models trained in a supervised fashion.
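The two look-ahead intervals follow from the number of predicted steps, the encoder downsampling factor and the sampling rate. The short computation below reproduces the arithmetic with the values quoted above; the helper name is hypothetical.

```python
def look_ahead_seconds(steps, sampling_rate_hz, downsampling=1):
    """Time span covered when predicting `steps` encoded steps ahead.

    Each encoded step spans `downsampling` raw samples."""
    return steps * downsampling / sampling_rate_hz

# Raw audio: 10 kHz input, encoder downsampling by a factor of 160.
speech = look_ahead_seconds(12, 10_000, downsampling=160)

# ECG: 100 Hz input, no downsampling in the fully connected encoder.
ecg = look_ahead_seconds(12, 100)
```

This gives 12 · 160 / 10000 = 0.192 s for speech and 12 / 100 = 0.12 s for ECG.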

APPENDIX D PERFORMANCE IMPROVEMENTS THROUGH SELF-SUPERVISED PRETRAINING
In this section, we provide additional details on the specific effects of self-supervised pretraining. In Figure D1, we aggregate the improvements according to superclasses, revealing the largest improvements within the form and rhythm label categories. In Figure D2, we show the improvement through pretraining as a function of the supervised performance, which explicitly shows the negative correlation already mentioned in the main text.

APPENDIX E LABEL DISTRIBUTION OF CINC AND PTB-XL
PTB-XL is a dataset that contains 21837 samples and is annotated with 71 labels at the finest level. The dataset used for the Computing in Cardiology Challenge 2020, here referred to as CinC2020, was created by compiling five different datasets, including PTB-XL. In order to create a unified set of labels, the original labels were mapped to SNOMED codes. This mapping process inevitably introduces ambiguities due to the fact that the original labels were assigned based on a different set of labels. Nevertheless, the CinC2020 labels can be used to assess differences in the data distributions of the different subsets of CinC2020. Here, we are most interested in comparing the label distribution of PTB-XL to CinC2020 w/o PTB-XL, i.e. the subset of CinC2020 where all PTB-XL records have been removed. Figure E1 compares the PTB-XL dataset with the samples of the four remaining datasets in CinC2020, which sum up to 21256 signals. For the majority of pathologies, there are only a few samples in both datasets. For the pathologies that occur more frequently, large differences can be observed in many cases, indicating that the label distributions differ considerably.

Figure 1: Schematic illustration of contrastive methods based on instance-discrimination for the case of BYOL.

Figure 3: Comparison of the three self-supervised learning frameworks CPC, BYOL and SimCLR in terms of downstream performance. The previous supervised state-of-the-art result from [9] is represented by a dashed line.

Figure 4: Individual label AUCs for a 4FC+2LSTM+2FC model from supervised training (in color, sorted in descending order) and the corresponding improvements through self-supervised training with CPC (in black).

Figure 5: Finetuning downstream performance on the PTB-XL dataset of a 4FC+2LSTM+2FC model pretrained using CPC compared to its supervised counterpart. The finetuning was performed for different numbers of training folds, ranging from 1 to 8 folds. We used 10 runs for 8 folds as before and 3 runs for 7 folds or fewer. The plot shows the mean performance as a solid line and one standard deviation around it as a shaded band. To guide the eye, we indicate the performance of the supervised model trained on the full training set of 8 folds (on 4 folds) by a dashed (solid) line.

Figure A1: Artificial Transformations used for computer vision based self-supervised methods: (a) Gaussian noise, (b) Gaussian blur, (c) Channel resize, (d) Random resized crop and (e) Time out.
Figure A2: ECG-specific physiological noise transformations used in computer vision based self-supervised methods: (a) Baseline wander, (b) Powerline noise, (c) Electromyographic noise and (d) Baseline shift.

Figure B1: Linear evaluation performance (macro AUC) on the PTB-XL validation set of an xresnet1d50 model after 500 epochs of pretraining on All with SimCLR using one or two data augmentations. Diagonal entries correspond to a single transformation and off-diagonal entries correspond to the sequential composition of two transformations. We report the mean over three linear evaluation runs.

Figure D1: Improvement in classification performance for super-level diagnostic, rhythm and form labels of the PTB-XL dataset through self-supervised pretraining followed by finetuning, in comparison to purely supervised training.

Figure E1: Comparison of the label distributions of PTB-XL and CinC2020 w/o PTB-XL in terms of labels provided within the CinC2020 dataset. Blue bars represent counts for CinC2020 w/o PTB-XL. Orange bars represent PTB-XL counts.

Table 2: Linear evaluation and finetuning performance on a comprehensive clinical downstream ECG classification task (macro AUC on the PTB-XL test set). As before, we report mean and standard deviation over 10 finetuning runs.

Table 5: Impact of different architectural and procedural components during CPC pretraining and finetuning. We report the performance in comparison to our baseline result when omitting a specified component.