Evaluating regression and probabilistic methods for ECG-based electrolyte prediction

Imbalances in electrolyte concentrations can have severe consequences, and accurate and accessible measurements could improve patient outcomes. The current measurement method based on blood tests is accurate but invasive and time-consuming, and it is often unavailable, for example in remote locations or ambulance settings. In this paper, we explore the use of deep neural networks (DNNs) for regression tasks to accurately predict continuous electrolyte concentrations from electrocardiograms (ECGs), a quick and widely adopted tool. We analyze our DNN models on a novel dataset of over 290,000 ECGs across four major electrolytes and compare their performance with traditional machine learning models. For improved understanding, we also study the full spectrum from continuous predictions to a binary classification of extreme concentration levels. Finally, we investigate probabilistic regression approaches and explore uncertainty estimates for enhanced clinical usefulness. Our results show that DNNs outperform traditional models, but performance varies significantly across the different electrolytes. While discretization leads to good classification performance, it does not address the original problem of continuous concentration level prediction. Probabilistic regression has practical potential, but our uncertainty estimates are not perfectly calibrated. Our study is therefore a first step towards developing an accurate and reliable ECG-based method for electrolyte concentration level prediction—a method with high potential impact within multiple clinical scenarios.

• Lin et al. [27] use 66 321 ECG recordings from 40 180 patients, with related potassium concentrations measured within ±60 minutes of the ECG.
• Galloway et al. [26] use 2 835 059 ECG recordings from 787 661 patients and related potassium concentrations. The authors develop their model on 60% (= 449 380) of the patients. All ECGs were recorded within 4 hours before potassium measurements.
• Kwon et al. [17] have 92 140 patients, of whom 48 356 patients with 83 449 ECGs were used for model development. The study considered potassium, sodium and calcium within ±30 minutes of the ECG recordings.
We analysed our datasets in more detail to observe possible causes of errors or shortcuts for our model. In Figure S-1 we show histograms of age, recording year and the time difference between ECG recording and blood measurement. In Figure S-2 we show the distribution of concentrations for all four electrolytes; all follow a normal distribution except creatinine, which is skewed towards large values. In order to validate our inclusion filter of ±60 minutes, we analyse the concentration of electrolytes vs. the time difference and observe no clear change of concentration value over time. A similar analysis is done for age and sex. Here, we observe that older patients tend to have more extreme electrolyte concentration values for all four electrolytes.

A.3 Pre-processing
For the high-pass filter to remove the baseline (trends and low frequencies), we use an elliptic filter with a cut-off frequency of 0.8 Hz and an attenuation of 40 dB, applied in both the forward and reverse directions to avoid phase distortions. We additionally include a notch filter after observing that some ECGs are distorted by power-line noise. The notch filter removes the 50 Hz component with a quality factor of 30. This filter is also applied in the forward and reverse directions for the same reason. We use the pre-processing from the public library github.com/antonior92/ecg-preprocessing.
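The two filtering steps above can be sketched with SciPy as follows. This is an illustrative outline rather than the exact pipeline from the linked repository: the filter order (4), passband ripple (0.5 dB) and sampling rate (400 Hz) are our assumptions, since only the cut-off frequency, stop-band attenuation and quality factor are stated here.

```python
import numpy as np
from scipy.signal import ellip, filtfilt, iirnotch

def preprocess_ecg(ecg, fs=400.0):
    """Remove baseline wander and power-line noise from an ECG.

    ecg: array of shape (leads, samples); fs: sampling rate in Hz
    (400 Hz is an assumption, not stated in this section).
    """
    # High-pass elliptic filter: 0.8 Hz cut-off, 40 dB stop-band attenuation.
    # filtfilt runs the filter forward and backward, avoiding phase distortion.
    b, a = ellip(N=4, rp=0.5, rs=40, Wn=0.8, btype="highpass", fs=fs)
    ecg = filtfilt(b, a, ecg, axis=-1)

    # Notch filter at 50 Hz (power-line interference) with quality factor 30,
    # again applied in both directions via filtfilt.
    b, a = iirnotch(w0=50.0, Q=30.0, fs=fs)
    ecg = filtfilt(b, a, ecg, axis=-1)
    return ecg
```

Applying the filters with `filtfilt` squares the magnitude response, which only deepens the baseline and 50 Hz suppression while keeping zero phase.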
For the traditional machine learning methods, which we compare in "Results", "Deep Direct Regression", we further apply Principal Component Analysis (PCA) to reduce the dimensionality of the data. Here, we first concatenate all leads to get a 1D signal of length leads • samples = 8 • 4096 = 32 768. Then we fit PCA on our train dataset. We choose the number of principal components based on the eigenvalues in Figure S-3. We see that the eigenvalues decrease fast and start to converge between 200 and 300, which is why we choose to use 256 components.

We use a modified ResNet which was first developed in Ribeiro et al. [10], and later also used in Lima et al. [13], which also provides a public GitHub repository: https://github.com/antonior92/ecg-age-prediction. We adjust the last linear layer of the model for the different tasks, for example, a different number of outputs for classification.
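The PCA feature extraction described above can be sketched as follows (function and variable names are ours; the essential points are flattening the leads into one 32 768-dimensional vector and fitting PCA on the training set only):

```python
import numpy as np
from sklearn.decomposition import PCA

def fit_pca_features(X_train, X_test, n_components=256):
    """Flatten each multi-lead ECG and reduce it to n_components
    principal components, fitted on the training data only.

    X_train, X_test: arrays of shape (n_ecgs, leads, samples),
    e.g. (n, 8, 4096), giving flattened vectors of length 32 768.
    """
    X_train_flat = X_train.reshape(len(X_train), -1)
    X_test_flat = X_test.reshape(len(X_test), -1)
    pca = PCA(n_components=n_components)
    Z_train = pca.fit_transform(X_train_flat)  # fit on train set only
    Z_test = pca.transform(X_test_flat)        # reuse the fitted basis
    return Z_train, Z_test, pca
```

Fitting only on the train split avoids leaking test-set statistics into the features handed to the downstream regressors.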
Our ResNet backbone from [10] consists of one convolutional layer followed by four residual blocks. After the convolutional layer and each residual block, the sequence lengths are {4096, 1024, 256, 64, 16} and the numbers of filters are {64, 128, 196, 256, 320}, respectively. We use a kernel size of 17 and a dropout rate of 0.5.

B.2 Hyperparameters
We use the default training hyperparameters from the original network architecture repository. The only deviation is the number of epochs, which we reduced from 70 to 30, since this is sufficient for our datasets to converge. The exact hyperparameters are listed in Table S-1.

C Additional Results
Below we present additional results. First, we provide a detailed performance table (more detailed than Table 3) for all electrolytes in Table S-2 for the random test set and in Table S-3 for the temporal test set. No significant difference in performance between the test sets is observed, which shows that our model is robust to shifts and trends over time.
Second, we list more results for classification and ordinal regression.In

Figure S-1. Histogram of the metadata age (top left), recording year (top right) and minutes difference between ECG recording and blood measurement (bottom) for our four datasets.
Figure S-3. Eigenvalues of the PCA components fit on the train set. We show the first 512 of the possible 8 • 4096 = 32 768 eigenvalues. We choose to reduce the dimensionality of our signal to 256, as this covers most of the information according to this figure.
Figure S-4 we show the MAE for potassium and calcium, which complements Figure 4 showing the Macro ROC. Figure S-5 completes the picture by showing the Macro ROC and MAE for the remaining electrolytes (creatinine and sodium). Third, we show additional results for probabilistic regression. Figure S-6 gives the calibration plot for potassium. Table S-4 and Table S-5 contain numeric details for the sparsification plot with more uncertainties, and the correlation between MSE and variance to quantify the uncertainty calibration. Table S-6 lists the results of the OOD experiments. While the results for the SNR experiments are as expected (larger MAE and uncertainties for lower SNR), the results for masking are less clear: the MAE still increases, but, notably, the epistemic ensemble uncertainty decreases. This means that there is less variance in the mean predictions between the different ensemble members. Finally, Figure S-7, Figure S-8 and Figure S-9 give the results for the remaining electrolytes that were previously shown for potassium alone.
Figure S-4. Classification (C) and Ordinal regression (O) MAE: Similar to Figure 4, we plot the MAE against the number of classes. The dashed line is the MAE of the corresponding deep direct regression model.
Figure S-5. Classification (C) and Ordinal (O) regression: Same plot as Figure 4 and Figure S-4, but for creatinine and sodium (here we only used 4 seeds for the shown mean and sd).

Figure S-6. Calibration plot, potassium: Top row and bottom left: calibration plots showing standard deviation vs. absolute error (to have the same units) for different uncertainties. Colours indicate frequency via a fitted Gaussian kernel density estimate. A perfectly calibrated model would follow the diagonal. Bottom right: sparsification plot with more results than in the main paper.

Table S-2. Regression performance on the random test dataset: The table shows metrics for the different electrolytes of the regression models from "Results", "Deep Direct Regression". Target variance refers to the variance of the dataset and therefore yields a worst-case MSE, since a model that always predicts the dataset mean attains exactly this MSE.

Table S-3. Regression performance on the temporal test dataset: The table shows metrics for the different electrolytes of the regression models from "Results", "Deep Direct Regression". Target variance has the same meaning as in Table S-2.

Table S-4. Sparsification against MAE: The column headers show different levels of sparsification (in per cent), and the corresponding rows show MAE values. This table gives the numeric values of the bottom-right plot of Figure S-6.
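The sparsification values reported in such a table can be computed, in outline, as follows (a sketch with our own function and variable names): predictions are sorted by their estimated uncertainty, the most uncertain fraction is discarded, and the MAE is recomputed on the remainder.

```python
import numpy as np

def sparsification_mae(abs_errors, uncertainties,
                       fractions=(0.0, 0.1, 0.2, 0.5)):
    """MAE after removing the most uncertain fraction of predictions.

    abs_errors: per-sample absolute errors.
    uncertainties: per-sample uncertainty estimates (e.g. predicted variance).
    Returns a dict mapping each removed fraction to the remaining MAE.
    """
    order = np.argsort(uncertainties)            # most certain first
    errors_sorted = np.asarray(abs_errors, dtype=float)[order]
    n = len(errors_sorted)
    result = {}
    for f in fractions:
        keep = n - int(f * n)                    # drop the top-f uncertain part
        result[f] = errors_sorted[:keep].mean()
    return result
```

For a well-calibrated uncertainty, the MAE should decrease monotonically as the sparsification level increases.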

Table S-5. Correlation between MSE and variance: We correlate the MSE with the variance from different uncertainties. A correlation of 1 would indicate perfect calibration.
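This correlation metric can be sketched as follows (an illustrative outline; the function name and arguments are ours): per-sample squared errors are correlated with the corresponding predicted variances.

```python
import numpy as np

def uncertainty_error_correlation(y_true, y_pred, variances):
    """Pearson correlation between per-sample squared error and the
    predicted variance; a value of 1 would indicate that larger predicted
    uncertainty perfectly (linearly) tracks larger error."""
    sq_err = (np.asarray(y_true, dtype=float)
              - np.asarray(y_pred, dtype=float)) ** 2
    return np.corrcoef(sq_err, np.asarray(variances, dtype=float))[0, 1]
```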

Table S-6. OOD experiments: This is an extended version of Table 5. SNR X refers to OOD experiments with varying SNR; Mask X refers to OOD experiments where X per cent of the data is masked.