Interpretable clinical time-series modeling with intelligent feature selection for early prediction of antimicrobial multidrug resistance

Electronic health records provide rich, heterogeneous data about the evolution of the patients’ health status. However, such data need to be processed carefully, with the aim of extracting meaningful information for clinical decision support. In this paper, we leverage interpretable (deep) learning and signal processing tools to deal with multivariate time-series data collected from the Intensive Care Unit (ICU) of the University Hospital of Fuenlabrada (Madrid, Spain). The presence of antimicrobial multidrug-resistant (AMR) bacteria is one of the greatest threats to the health system in general and to the ICUs in particular due to the critical health status of the patients therein. Thus, early identification of bacteria at the ICU and early prediction of their antibiotic resistance are key for the patients’ prognosis. While intelligent data-based processing and learning schemes can contribute to this early prediction, their acceptance and deployment in the ICUs require the automatic schemes to be not only accurate but also understandable by clinicians. Accordingly, we have designed trustworthy intelligent models for the early prediction of AMR based on the combination of meaningful feature selection with interpretable recurrent neural networks. These models were created using irregularly sampled clinical measurements, both considering the health status of the patient and the global ICU environment. We explored several strategies to cope with strongly imbalance data, since only a few ICU patients are infected by AMR bacteria. It is worth noting that our approach exhibits a good balance between performance and interpretability, especially when considering the di ffi culty of the classification task at hand. A multitude of factors are involved in the emergence of AMR (several of them not fully understood), and the records only contain a subset of them. In addition, the limited number of patients, the imbalance between classes, and the irregularity of the data render the problem harder to solve. Our models are also enriched with SHAP post-hoc interpretability and validated by clinicians who considered model understandability and trustworthiness of paramount concern for pragmatic purposes. Moreover, we use linguistic fuzzy systems to provide clinicians with explanations in natural language. Such explanations are automatically generated from a pool of interpretable rules that describe the interaction among the most relevant features identified by SHAP. Notice that clinicians were especially satisfied with new insights provided by our models. Such insights helped them to trust the automatic schemes and use them to make (better) decisions to mitigate AMR spreading in the ICU. All in all, this work paves the way towards more comprehensible time-series analysis in the context of early AMR prediction in ICUs and reduces the time of detection of infectious diseases, opening the door to better hospital care.


Introduction
In the last decade, there has been a growing interest in analyzing clinical data as time-series sequences, allowing clinical experts to assess better the patient's health evolution [1,2,3]. Multivariate Time Series (MTS) have a strong presence beyond clinical applications, with relevant examples including finance, meteorology, or video processing, to name a few [4]. Focusing on healthcare applications, different data-driven approaches based on MTS have been developed [5,6].
Given the complexity and irregular patterns present in clinical datasets, deep neural networks (NNs) have emerged as a suitable alternative to model and handle MTS [7]. Lasko et al. pioneered the application of deep learning tools to healthcare, demonstrating the capacity of deep learning to generalize patterns from serum uric acid measurements [8]. Three of the most widely-used deep learning approaches for dealing with time-series sequences are the Gated Recurrent Unit (GRU) [9], the Long Short-Term Memory (LSTM) [10] and the Bidirectional LSTM (Bi-LSTM) [11]. The GRU, LSTM, and Bi-LSTM are different instances of Recurrent NNs (RNNs), which are widely used for prediction using MTS due to their ability to deal with time-varying observations and capture longterm temporal dependencies [10]. For example, Lipton et al. applied an LSTM network to classify diagnoses based on the temporal data recorded in the Electronic Health Record (EHR) of a pediatric Intensive Care Unit (ICU) [12]; Pham et al. used an LSTM to model the interaction between diagnosis and medication [13]; and Nguyen et al. developed a Bi-LSTM model to predict ICU mortality outcomes [14].
In this paper, we describe how different RNNs can predict antimicrobial multidrug resistance (AMR) in the ICU. AMR can be defined as the bacteria's ability to withstand the effects of a variety of harmful chemical agents designed to damage them [15]. The adaptation of the bacteria to different antimicrobials (to which they were previously susceptible) is a serious challenge due to the reduction of appropriate treatments and the scarcity of secondary antimicrobials [15,16]. As a result, situations such as cuts, care of premature babies, chemotherapy against cancer, or infections can cause debilitating or even lethal epidemics in the absence of effective treatments [15,17].
Understanding AMR factors (e.g., epidemiology, emergence, prevalence, or burden of infectious diseases) is crucial for early AMR prediction. It is also likely to improve decision-making processes in ICU management, e.g. by allowing early patient isolation and therefore reducing AMR rates. Even if RNNs are ready to achieve high performance, they behave in practice as black boxes, hindering their interpretation by humans. The lack of interpretability is even more severe for MTS-based models like GRUs and LSTMs due to their fairly opaque hidden states [18]. In particular, because the information stored in the hidden states is a mixture of all the MTS, it is impossible to discern the individual contribution of each time series. This lack of interpretability is one of the main reasons why data-driven machine learning (ML) models in general, and RNNs in particular, are not intensively used in healthcare applications yet [19]. Indeed, interpretability is of paramount importance to make intelligent systems ready to assist clinicians in high-risk decisionmaking processes [20]. Accordingly, intelligent clinical models need to be endowed with interpretability as a requirement to become explainable, trustworthy, and used worldwide [21].
The so-called Responsible and Trustworthy Artificial Intelligence (AI) pays attention to fairness, accountability, responsibility, and privacy, in scenarios where Explainable AI (XAI in short) plays a key role [22]. XAI is an endeavor to develop human-centric AI sensitive not only to technical but also legal and ethical issues. XAI is rooted in knowledge engineering, which transforms raw data into meaningful knowledge (through data collection, data pre-processing, feature engineering, interpretable modeling, validation, etc.) ready to be understood by humans while respecting the "chain of trust" [23]. All in all, the aim of XAI is twofold [24]: i) building "white-box" AI models (e.g., decision trees, rule-based systems, expert systems, etc.) that are interpretable by design [25]; and ii) developing novel techniques to endow opaque data-driven AI models (e.g., RNNs) with interpretability [26]. More precisely, approaches for explaining black-box models can be categorized into two main groups regarding the type of explanations: ii.a) intrinsic explanations supported by interpretable surrogate models [27]; and ii.b) extrinsic post-hoc explanations (e.g., SHAP [28]) that only pay attention to the model output while disregarding the model internal mechanisms.
In this work, we apply XAI for assisting clinicians in the discovery and understanding of how AMR develops and spreads in the ICU. This is the main novelty of this work compared to previous studies that have attempted to model ICU information as MTS to predict the AMR onset [29,30]. This paper extends the preliminary work published in [30]. Unlike the preceding study, we have further expanded the proposed model by introducing different time window lengths, new meaningful features such as the ICU occupation, the treatment provided in the ICU, and the application of XAI for assisting clinicians. Our main contributions are as follows: • Analyzing and modeling MTS related to AMR in the challenging scenario of an ICU. The dataset compiles data with the evolution of 3,470 patients. Data have been carefully cleaned and pre-processed before modeling.
• In the modeling stage, we coped with missing values in MTS and class imbalance.
• Regarding XAI, we first studied the effect of Feature Selection (FS), finding out relevant and meaningful features for clinicians. Then, we built several predictors based on RNNs (endowed with post-hoc interpretability) to model the temporal relation among the previously selected features. Then, we built linguistic (interpretable by design) models to better understand the interaction among the most relevant features in the model that exhibited the best interpretability-performance trade off.
• Validating with clinicians the interpretability of the models in AMR prediction.
The rest of the paper is organized as follows. Sec. 2 presents the notation and methods used in this paper. Sec. 3 describes the dataset and the related pre-processing tasks. Experiments and results are shown in Sec. 4. Finally, the main conclusions and associated discussions are drawn in Sec. 5.

Methods
The experimental pipeline is sketched in Figure 1 and discussed in the following sections. Data pre-processing is introduced in Sec. 2 Therefore, data associated with the i-th patient can be arranged in the matrix represents the value of the d-th feature in the t-th time slot for the i-th patient. Note that, while the value of D is the same across patients, the value of T i can be different, since the length of the patient's ICU stay depends on the condition and evolution of the particular patient.
The task that we address is cast as a binary classifier, with label '1' identifying AMR patients. We use y i to represent the label associated with the i-th patient, andŷ i to denote the output generated (predicted) by the ML model at hand.
Working with data collected from the EHR is challenging since the observations come from different sources, often have outliers and require homogenization, especially when working with MTS [31,32]. For this reason, a pre-processing stage is required to guarantee consistent and reliable results. Towards that end, we followed a process of normalization, database integration, outliers cleaning, and window modeling. Further details on the pre-processing stage will be given in Sec. 3.

MTS windowing
As already pointed out, the length of the MTS (i.e., the number of columns of X i ) can change with the index i. Since most ML models require inputs to have the same size across samples, we use a windowing technique to render the MTS length homogeneous across patients. Windowing requires setting a window length (denoted as W) and then, for every patient i, setting a time Note that the values of t ini i and t end i depend on the particular patient, since the data is asynchronous and our database was collected throughout several years.
The windowed input data for the i-th patient is given bȳ which, as desired, has the same size across patients. For notational convenience, we will usex t i ∈ R D with t = 1, 2, ..., W to denote the t-th column ofX i and x d i ∈ R W with d = 1, 2, ..., D to denote the d-th row ofX i . This way, the vector x t i collects the D values of the features of the i-th patient in the t-th instant. Analogously, x d i represents the time series of length W associated with the d-th feature of patient i.

Mechanisms for FS
FS, oftentimes disregarded as a minor task, is essential in data-science pipelines. The elimination of input features that are extremely noisy or redundant is critical to enhance classification performance, avoid overfitting and boost generalization [33,34]. In addition, and equally important for clinical applications where a substantial amount of information is recorded (so that the value of D can be very high), FS provides a disciplined data-driven approach to identify the key features for the task at hand, providing insights on the problem, eliminating redundant features and contributing to enhancing the interpretability of models and results. Mathematically, FS for MTS amounts to designing a set D ′ with cardinality D ′ such that D ′ ⊆ D = {1, 2, .., D} and D ′ < D. The set D ′ contains the features to be kept and D\D ′ those to be eliminated; hence, the smaller the value of D ′ , the more aggressive the selection mechanism is. We note that the value D ′ can be set beforehand or, alternatively, generated by the FS algorithm. Suppose now that the FS algorithm produces as output the set D ′ = {d 1 , d 2 , ..., d D ′ } where, without loss of generality, we assume that d n < d n+1 so that the elements of D ′ are ordered. Leveraging D ′ , we define the binary selection matrix S D ′ ∈ {0, 1} D ′ ×D such that: i) for every row, all the entries are zero except for a single one, and ii) for the n-th row, the entry whose value is one is that corresponding to the d n -th column. That is, [S D ′ ] n,d n = 1 for n = 1, 2, ..., D ′ and zero everywhere else. With this notation at hand, for each patient i, the original MTS inputX i ∈ R D×W is replaced with the reduced-dimensionality input S D ′X i ∈ R D ′ ×W , where we emphasize that S D ′ is the same for all i.
Next, we discuss three sounded and widely-adopted FS methods and describe how those methods can deal with MTS. Our experiments will implement the three of them, analyzing their differences, and comparing their classification performance. Last but not least, rather than choosing the method with the best classification performance, the paper advocates a voting mechanism considering the three FS methods to enhance the classification results.

Confidence intervals with bootstrap
Bootstrap resampling is a non-parametric technique used to estimate the distribution of a statistic (e.g., the mean value) taking samples from a population without replacement [35]. Bootstrapping considers that the empirical and actual distributions are not too different, is appropriate when the actual distribution is unknown, and does not make any assumption related to the properties of the actual distribution function [36].
In our work, we use bootstrap resampling to assess if the value of a particular feature for the AMR population is significantly different from the value of the same feature in the non-AMR population performing a hypothesis test. More precisely, let S AMR be the set (population) of AMR patients and S non−AMR the set of non-AMR patients. The intermediate goal is to quantify the difference between µ AMR (the mean value of a specific feature in the population S AMR ) and µ non−AMR (the mean of the same feature in the population S non−AMR ) and assess if the difference ∆ = µ AMR − µ non−AMR is relevant. Instead of computing ∆ using all patients in S AMR and S non−AMR and comparing the (single and deterministic) number obtained to a pre-specified threshold, we implement a more statistically robust resampling bootstrap approach. Thus, we resample each of the populations R times, obtaining the sets {S (r) AMR } R r=1 for AMR patients and {S (r) non−AMR } R r=1 for non-AMR ones. As explained in the experimental sections, we use R = 3, 000 and set the cardinality of the resampled sets to be the same (balancing the classes) and equal to 50% of the size of the minority class. Then, the mean statistic is computed across features and resamples, obtaining for each feature the values {µ (r) AMR } R r=1 and {µ (r) non−AMR } R r=1 . Third, we obtain the difference between the statistic in both populations for each resample, generating ∆ (r) = µ (r) AMR − µ (r) non−AMR for r = 1, 2, · · · , R. Fourth, we build the histogram of ∆ and empirically calculate the 95% CI for ∆, denoted as CI ∆ . We consider that the null hypothesis H 0 (the feature being not relevant/informative) is true if 0 ∈ CI ∆ . In other words, if there is no statistically significant difference between the mean of the feature in the two considered populations, the feature is not selected. In contrast, the alternate hypothesis H 1 (the feature being relevant) is considered true if 0 CI ∆ , indicating that a significant difference between the mean of both populations exists and, as a result, the feature is added to the set D ′ .
The bootstrapping-based FS process described above assumes that the features are one-dimensional scalars. However, in an MTS environment, the problem to solve is, given the patient-data matrices {X i } I i=1 and focusing on a particular feature (say the d-th one), whether to keep or remove the d-th row of the data matrices for all the patients in the dataset. In other words, for each and every d = 1, ..., D, we need to decide if the W-dimensional vectors {x d i } I i=1 are selected to be part of the inputs provided to our ML architectures. In this work, we have computed ∆ (r) = ∥µ (r) AMR − µ (r) non−AMR ∥ for each of the W entries of the vector x d i , assessing the relevance of each of the W entries separately and, then, implementing a majority-rule scheme where the d-th feature is selected if more than half of the values within the window were deemed relevant.

Conditional Mutual Information
Mutual Information (MI) is directly related to the well-known and widely used Shannon entropy [37]. The Shannon entropy of a generic random variable X, which is denoted as H(X), is an information metric related to the probability of occurrence of the values of X [38]. A high value of entropy means that every event in X has the same probability of occurrence, while a low value means that the probability of occurrence of each event is different. With X denoting all the values the (discrete) random variable X can take, the entropy of X is defined as H(X) = − x∈X p(x)log(p(x)), where p(x) is Pr{X = x}. If another variable Y is considered, the joint entropy can be computed as H(X, Y) = − x∈X y∈Y p(x, y)log(p(x, y)), with p(x, y) = Pr{X = x, Y = y}. We can define the conditional entropy as with p(y|x) = Pr{Y = y|X = x} = Pr{X = x, Y = y}/Pr{X = x}. The mutual information between X and Y measures the shared information between both variables, and is expressed as In other words, the MI is the amount of information that variable X has about variable Y. Lastly, the conditional MI is the expected value of the MI of two random variables given the value of a third [39,40]. The conditional MI can be defined as When using MI for FS, the goal is to select the set D ′ ⊆ {1, 2, ..., D} of D ′ features that maximizes the MI between the reduced input S D ′X and the associated label y. Such an optimization is NPhard and, hence, suboptimal schemes must be adopted. The approach in this paper is to use a greedy-selection scheme that chooses the features in D ′ one-by-one using an iterative scalar optimization of the MI metric. From an algorithmic point of view, this entails initializing D [0] sel = ∅ and D [0] non−sel = D, and running the following D ′ steps, with j denoting the iteration index and starting with j = 0: * with the highest MI and update the sets sel . If not, go to step 1.
The approach to estimating I y, sel for our MTS dataset requires simply considering that the output is binary (so that Y = {0, 1}), that the inputs are W-dimensional vectors (so that X is the Cartesian product of the value sets for each of the W entries), and that the probabilities need to be estimated using the population sets (S AMR for y = 1) and (S non−AMR for y = 0).

Group LASSO
LASSO stands for Least Absolute Shrinkage and Selection Operator and it is a popular regularization and FS selection method [41]. The three main advantages of LASSO are: i) its ability to search for the best set of features jointly, without the need to resort to a greedy algorithm; ii) a sound theoretical motivation; and iii) the existence of computationally efficient algorithms [42]. The LASSO is formulated as an optimization problem, and it can be used both in regression and classification tasks. While the regression form is presented here for simplicity, the generalization to classification tasks is straightforward. Let us suppose that we have a set with I input-output with the output being a scalar and the input x i having D dimensions. LASSO assumes that the predicted outputŷ i is estimated linearly as x ⊤ i α, where α = [α 1 , α 2 , ..., α D ] ⊤ is a vector with the D linear coefficients of the predictor. The optimal value of α (denoted as α * is then obtained as where ∥α∥ 1 = D d=1 |α d | is the ℓ 1 norm of α, and λ > 0 is a regularization parameter. The objective combines a data fitting term with a regularizer that penalizes the coefficients of the regression variables, shrinking some of them to zero. The larger λ, the higher the number of coefficients α * d that are set to zero. From an FS perspective, the approach is to solve the optimization for different values of λ, select the proper value of λ based on either the fitting error or the number of active coefficients and, then, construct the feature set D ′ with the indexes of the vector α * associated with the non-zero coefficients after the shrinking process.
Since in this work we deal with MTS, the input data are not vectors but matrices, and this calls for using a generalization of the LASSO method referred to as Group LASSO [42]. Intuitively speaking, group LASSO splits the input features into different groups and then either activates or sets to zero all the variables within each group. Mathematically, we are given .., α d W ] whose entries are associated with the W samples recorded for feature d. Since we have D of those vectors, the total number of coefficients to learn is DW, which coincides with the number of entries in the inputX i . Recalling that x d i is a vector collecting the entries of the d-th row ofX i , the optimal regularized linear regressor for the MTS case can be obtained as the solution to where The optimization resembles that in Eq. (4), but accounting for the multidimensional nature of the input and replacing |α d | with ∥α d ∥ 2 . This way, if the optimal solution sets α d * = [0, 0, ..., 0] ⊤ , then the d-th row of matrices {X i } I i=1 is not selected.

Strategies to handle imbalance data
Most (binary) classification architectures are trained assuming that the number of samples in each class is approximately the same [43]. However, there are many real-world applications, specifically in the healthcare domain, where that is not the case. Thus, in the task tackled in this paper, the number of AMR patients is lower than the number of non-AMR patients. When learning is performed with unbalanced classes, models can be biased to the majority class and led to poor generalization performance [44].
There are different strategies to deal with imbalance classes [45], including data-level approaches or cost-sensitive methods. In this work, we focus on two simple but effective methods: i) undersampling the majority class, and ii) defining asymmetric misclassification costs. When following an undersampling strategy, samples from the majority class are randomly discarded until the number of elements in the majority and minority populations is similar. The cost function used in this work to train models applying undersampling is the Binary Cross-Entropy (BCE) cost.
In the cost-sensitive approach, errors in a sample from the minority class are penalized more heavily than those from the majority class. A simple way to achieve this is to use the Balanced Binary Cross-Entropy (BBCE) function, a modification of the well-known binary cross-entropy cost function [46]. More specifically, upon setting the value of the weight β ∈ (0, 1), the BBCE cost is defined as where I ′ is the number of patients in the training set. If the training set is balanced, then β = 0.5, and Eq.(6) is the BCE cost function. When y i = 1 is associated with the minority class, then β must be chosen closer to one. On the other hand, if y i = 0 is the minority class, then β must be chosen closer to zero. Following this approach, the value of β in this paper has been set as the number of samples of the majority class divided by the number of total samples.

Strategies to handle missing values in MTS
Missing values, which affect most real-world datasets, are pervasive when dealing with time series. In the clinical context, data is recorded irregularly, with measurement frequency varying between patients and even over time. Moreover, the values are typically not missing at random but reflect the patient's health status or decisions by caregivers [47]. Equally important, when working with windowed data, there may be cases where the window length is larger than the length of the patient's record and, hence, one has to decide how to fill the initial (or last) part of the record.
Common approaches to deal with missing values include filling missing values with zeros, (linear) interpolation, or statistical imputation approaches [6]. Given that most of our data features are binary and partially inspired by the methodology proposed by Lipton et al. [48] for RNN-based prediction using missing values in clinical data, we consider three strategies to handle missing values inX i : 1. Remove from the populations the patients with missing data ("Removing"). This (filtering) approach bypasses the problem altogether, but reduces the number of training samples, hence impacting generalization. As a result, it is more suitable in setups with an abundant number of training examples. 2. Impute with zeros the missing values, including those at the beginning of the window ("Zero Padding"). This is an extremely common approach, especially dealing with binary data where the 0 value represents the "by-default" state (e.g., absence of a medical condition or a drug not being prescribed to a patient). 3. Use advanced ML architectures able to apply a masking scheme that accounts explicitly for missing values ("Masking"). This strategy is suitable for the three RNNbased architectures (GRU, LSTM, and Bi-LSTM). We implement a modified version that, for each input sample, uses as an additional input a mask indicating the positions of the input vector with missing values [49]. Due to their ability to deal with discrete data and unveil complex non-linear dependencies, artificial NNs are ML approaches widely used to deal with classification tasks [50]. Therefore, to address our binary classification task (i.e., predicting if a patient is infected by an AMR bacteria in the ICU), we consider different NN architectures. We start with a simple MLP, which will serve as a baseline, and then describe three more sophisticated RNN-based deep architectures that are able to account for the sequential (time) structure present in our MTS.

Multi-Layer Perceptron
The MLP is a feed-forward NN formed by 3 types of layers: an input layer, one (or more) hidden layers, and an output layer. Each neuron in the hidden layer computes the output of a scalar non-linear function whose input is a linear combination of the outputs of the previous layer and some linear weights. The weights are tunable during the learning process, which is performed by optimizing a non-convex (data-fitting error) cost using stochastic gradient-based approaches [50]. MLPs are fully connected architectures (meaning that there exists a weight between any pair of neurons) and, as the number of neurons grows, they are universal approximators capable of implementing any non-linear mapping [50]. In this paper, we have set the number of hidden layers of the baseline MLP to one, considered the Leaky ReLU [51] as the scalar non-linear activation function, and used the Adam algorithm to optimize the cost function [52]. We implemented an early-stopping technique to avoid overfitting, choosing the learning rate as a hyperparameter. At every epoch, the early-stopping procedure evaluates the evolution of the data-fitting cost in the validation set (20 % of the training set in this work) and stops the training if the cost deteriorates or stagnates. Also, a dropout rate has been used as a regularization technique to reduce overfitting to the training set.
It is important to emphasize that both the training cost and the optimization algorithmic approach (Adam with early stopping) described here for the MLP are also used for the NN architectures described in Secs. 2.2.2 and 2.2.3.

GRU networks
RNNs are a type of NNs where the layers (i.e., the connections between the neurons) form a directed path along a sequence, rendering them suitable to deal with time series. Similar to linear filters, RNNs use an internal state to preserve an 'artificial memory' of the previous inputs [53]. However, standard RNNs exhibit problems when dealing with long MTS, due to the successive application of gradient steps that either decay or blow exponentially (see, e.g., for more details on the so-called vanishing gradient problem [53]).
In this context, GRUs are a gating mechanism aimed at avoiding the gradient's problems of RNNs [54]. A "gate" is an NN located between two consecutive elements of the sequence chain of an RNN whose purpose is to regulate the flow of information going along the sequence chain. Taking this into account, a gate can be used, e.g., to amplify a gradient that is vanishing, guaranteeing that the error goes through all the elements of the sequence. The GRU has two mechanisms to regulate the information: i) the reset gate eliminates the information of previous time steps that is not incorporated into the hidden state (i.e., the input of the gate is the output of the previous state and the input associated with the current time step); and ii) the update gate is in charge of generating the output of the neuron; deciding what information to throw away and what new information to add. GRU networks require fewer parameters than other RNNs and this is a desirable feature in clinical applications, where the number of samples is typically limited.

LSTM and Bi-LSTM networks
The LSTM network, another RNN-based architecture, takes the definition of a GRU one step further by considering a new mechanism to transfer information from previous time steps (the cell state) and a new gate (the output gate) [10]. The cell state provides the model with a memory of past events that is longer than that of the hidden state. To handle the states, an LSTM cell implements the three different gates: the forget gate, the input gate, and the output gate. The forget gate decides the information from the previous cell state that has to be deleted.
Once the non-relevant information from the previous cell state has been eliminated, the input gate chooses the new information to store in the current cell state. Finally, the output gate layer is in charge of computing the final output of the neuron, which is a combination of the current cell state and the current input time step. While more sophisticated than their GRU counterpart, LSTMs have more parameters to learn and, as a result, require larger training datasets.
The last NNs considered in this work are Bi-LSTMs, which are MTS-processing architectures consisting of two LSTMs [11]. The first LSTM processes the MTS in a forward direction while the second one carries out the processing in a backward direction. As in classical smoothing methods for time-varying stochastic processes, the main benefit of the Bi-LSTM is the ability to leverage information from both the past and the future. While this additional ability tends to boost estimation performance, the number of parameters in Bi-LSTM models is larger and, as a result, performance gains must be expected only if the number of training samples is sufficiently high.

Model validation
This section presents the figure of merit considered for measuring the goodness of the generated models, both in terms of performance and interpretability. Performance concerns the ability of a model to make correct predictions, while interpretability concerns to what degree the model allows for human understanding. Models exhibiting the former property are many times more complex and opaque, while interpretable models may lack the necessary accuracy. The trade-off between accuracy and interpretability for predictive models is considered of paramount concern for pragmatic purposes.

Performance
Performance metrics evaluate the ability of a model to make correct predictions. Accuracy measures the ratio between the 6 J o u r n a l P r e -p r o o f Journal Pre-proof correctly classified samples and all the samples under consideration, and it is defined as follows: where TP (True Positives) are samples labeled as AMR and correctly classified; TN (True Negatives) are samples labeled as non-AMR and correctly classified; FP (False Positives) are samples labeled as non-AMR but wrongly classified as AMR; and FN (True Negatives) are samples labeled as AMR but wrongly classified as non-AMR.
In addition, we have considered two complementary metrics used worldwide in the context of binary classification problems: Specificity and Sensitivity.
On the one hand, Specificity, also known as TN rate, measures the ratio of non-AMR patients correctly classified by the model as non-AMR. On the other hand, Sensitivity, also known as TP rate, measures the ratio of AMR patients actually classified as AMR.
Finally, the Receiver Operating Characteristics (ROC) curve, and the Area Under such a Curve (AUC), measures the ability of the model under study to deal properly with both classes (AMR and non-AMR). Thus, ROC AUC provides additional details about how Specificity and Sensitivity interact.

Interpretability
Interpretability metrics evaluate the ability of a model to be understood by humans [55]. It is worthy to note that measuring interpretability is a challenging task that depends on the inherent transparency and complexity of the model (what is usually referred to as structural interpretability) but also depends on the background and expertise of the person who is expected to interpret such a model. Accordingly, there are neither a universal definition nor interpretability metrics universally recognized and used worldwide.
In practice, in the case of models that are deemed as interpretable by design, structural interpretability is measured in terms of complexity. For example, the number of parameters in a linear model, the number of nodes and leaves in decision trees, or the number of premises and rules in rule-based systems.
On the other hand, in the case of black-box models, such as neural networks, there are two main trends: i) extrinsic evaluation of post-hoc interpretability; and ii) intrinsic evaluation of interpretability of surrogate models.
In this paper, we evaluate the post-hoc interpretability of LSTMs with SHAP [28], which is inspired by Game Theory 1 . The so-called Shapley values assign a contribution ϕ j (x t j ) to each feature x t j . SHAP is a model-agnostic approach for generating local explanations as linear combinations of binary variables. All features are ranked in terms of their relevance for each single classification. SHAP is distributed as open source 2 .
In addition, we have used ExpliClas [56] and GUAJE [57] for building explainable fuzzy systems [58] that are inherently interpretable by design and ready to generate local (and global) factual (and counterfactual) explanations in natural language. Among the algorithms provided by ExpliClas for generating interpretable models, we have selected the C4. 5 Quinlan's algorithm [59] and the Fuzzy Unordered Rule Induction Algorithm (FURIA) [60]. ExpliClas provides us with a linguistic approximation of models that can be exported into an XML format complying with the IEEE Std 1855-2016 for fuzzy systems modeling [61]. These linguistic models are endowed with global semantics thanks to the use of meaningful and commonsense linguistic terms that are defined by strong fuzzy partitions and grounded on clinical expert knowledge. Accordingly, the models satisfy all required constraints to be deemed as interpretable [58]. In addition, each model includes a list of readable IF-THEN fuzzy rules (e.g., "IF Feature j is Small AND Feature k is Big THEN class is AMR"). Such rules and their interaction can be analyzed in depth by clinicians with the assistance of GUAJE, which provides them with visual and textual explanations. Moreover, GUAJE offers several metrics for measuring the interpretability of the given linguistic models. In this work, we will report the number of rules and the total rule length, i.e., the total number of premises and conclusions in the rule base.

Dataset and pre-processing
This section describes the clinical data in detail and elaborates on the pre-processing techniques adopted.

Dataset description
The data analyzed in this work corresponds to clinical MTS recorded for ICU patients at the University Hospital of Fuenlabrada in Madrid, Spain. Data were registered for 16 years, from 2004 to 2020 (both included), counting a total of 3,470 patients (627 of them were identified as AMR). For determining the AMR acquisition, the result of a clinical procedure named antibiogram is considered together with the patient culture to test if a bacterium is resistant to one or more antibiotics. Since getting the antibiogram result can take more than 48 hours, and a single patient may have several cultures with multiresistant bacteria throughout his/her stay, we limit our research to the first culture identified as multiresistant. Moreover, given that our focus is on early prediction of AMR using clinical time series, we discarded two kinds of patients: i) non-AMR patients with an ICU length stay shorter than 24 hours, and ii) AMR patients whose multiresistance was detected in the first 24 hours in the ICU. Taking into account the previous considerations, our dataset is made up of 3,178 patients (433 with AMR).
The families of the antibiotics taken by the patients during their ICU stay are: Aminoglycosides (AMG), Antifungals . We also use the feature "Others" to identify any other antibiotic not belonging to the previous list. For any given patient (say the i-th one), the feature associated with each family of antibiotics (say the d-th one) is a sequence of binary variables x d i ∈ {0, 1} T i indicating whether the patient has taken (or not) that family of antibiotics during each of the T i 24-hour periods that the patient spent in the ICU. Regarding mechanical ventilation, it is represented as a sequence of binary variables, each of them denoting whether the patient has been connected (or not) to a breathing machine at any time during the 24-hour period at hand. Furthermore, we characterize the ICU occupation and a summary of the treatment provided to the rest of the ICU patients (neighbors) during the same time interval as the one considered for the patient who is being characterized. Thus, a total of 17 additional numeric features were created: the number of neighbors of the patient, the number of patients identified with AMR bacteria (# of AMR neighbors), and the number of neighbors taking each of the 15 antibiotic families listed before. To avoid any confusion between the time series describing if the i-th patient took a particular drug and that describing the number of the neighbors of i taking the same drug, we use the subscript n to denote features referring to neighbors (e.g., AMG is the feature indicating if the patient took the drug and AMG n is the feature counting how many of his/her neighbors took the drug).
For completeness, we detail next the clinical criteria considered to identify multi-drug resistance for the most common bacteria in the ICU at HUF: Pseudomonas, Stenotrophomonas, Acinetobacter, Enterobacter, Acinetobacter, Staphylococcus Aureus, and Enterococcus. In general, Pseudomonas were considered multidrug resistant when they were resistant to three or more of the following families: CF4, CAR, QUI, AMG, POL or PAP. Staphylococcus aureus was resistant to OXA; Enterococcus was resistant to vancomycin (GLI Family); whereas Stenotrophomonas and Acinetobacter were considered resistant only upon appearance, regardless of the antibiogram result. Figure 2: Histogram and boxplots of the elapsed time (in days) from the ICU admission to the ICU discharge. Gray color is associated with non-AMR patients staying in the ICU for more than 24 hours. The green color is used for AMR patients whose first culture flagged as positive occurs at least 24 hours after their ICU admission.

Temporal windowing
We subsequently analyzed the temporal windowing for AMR and non-AMR patients. Towards that end, Figure 2 shows the elapsed time from the ICU admission to: i) the ICU discharge for non-AMR patients, and ii) the first AMR culture for AMR patients. From these representations, we concluded that the identification of the first AMR usually occurs within the first few days of the AMR patients' stay. It can be observed that 50% of the AMR patients have the first culture flagged as positive before the fifth day after ICU admission. This value is very close to the median of the duration of the stay for non-AMR patients (4 days). Taking into consideration these values, when conducting the experiments we considered four different window lengths: W = 3, W = 4, W = 5, and W = 6 days.
To gain insights on the drugs administered during the duration of the windowing, we show in Figure 3 the proportion (rate) of AMR patients taking each drug (green bars) and its counterpart for non-AMR patients (gray bars). Each rate has been computed over a different number of patients: 433 for AMR patients, and 2,745 for non-AMR ones.
Note that these figures do not carry information about the temporal nature of each family of antibiotics, only their presence/absence ('1'/'0') during the window length. The results reveal that antibiotics like CAR, GLI, or ATF are administered in higher proportion to AMR patients, while PEN is more frequently administered to non-AMR patients. No significant differences are observed for QUI, LIP, and PAP.
For a more detailed explanation about the construction of the data-patient matrix in Eq. (1), Figure 4 sketches the temporal windowing with W = 5 for two patients (patient i, who is AMR, and patient j, who is non-AMR). Each observation in the time series is defined by a 24-hour interval, with the starting time depending on the patient. More specifically, for the i-th patient, we consider the last time instant of the window t end i to be associated with the day where the culture flagged as AMR was taken and, then, defining the W − 1 remaining days backwards.
For the j-th patient and provided that his/her stay in the ICU was longer than W, the first slot of the window t ini j corresponds to the day the (non-AMR) patient was admitted to the ICU. In both cases, if the duration of the patient's stay is shorter than the window length (W), we set to zero the values associated with the first slots of the time series ("Zero Padding"). As already explained, temporal patient features contain information about the evolution of the patient during his/her stay in the ICU. Hence, the rows of the data matrixX i represent: a) the family of antibiotics taken by the patient during the time instants associated with the window, b) if the patient was under mechanical ventilation, c) the number of patients in the ICU during the time instant associated with the window, and d) the number of patients in the ICU taking each of the antibiotics. Finally, we also created a new W-dimensional binary variable called "mask" whose entries indicate if the patient was indeed present in the ICU during those W days. The default value for all the entries is "mask" = 1 and, as a result, if one of the entries of the "mask" vector (say the t-th one) is zero, the meaning is that the patient was not present in the ICU that day. This readily implies that all the 8 J o u r n a l P r e -p r o o f  values in the corresponding t-th column ofX i will be zero, according to the (zero-padding) imputation procedure described before (cf. the left-most column of patient i in Figure 4).

Experiments and results
This section starts by defining and discussing the experimental setup. We then present and discuss the FS results. After that, we analyze the prediction performance of the different ML models considered. Finally, we close the section by analyzing the interpretability of the generated models.

Experimental setup and parameter tuning
The dataset was randomly split into two independent subsets, the training set (70% of the samples) and the test set (30% of the samples). The training set was used to design the model, while its performance was evaluated with the test set. We have evaluated several strategies to deal with imbalance classes (undersampling and cost-sensitivity learning) and to handle missing values in MTS ("Removing", "Zero padding", "Masking"). Table 1 shows the total number of patients for each dataset. As expected, the number of patients changes in terms of W. In the "Removing" approach, we only consider patients who were in the ICU W days for non-AMR patients, or who stayed in the ICU at least W days before the first AMR. The number of patients in the training set decreases when W increases (354 for W = 3 and 234 for W = 6). With "Zero Padding" or "Masking" we discard those patients who did not take any drug and were not connected to a breathing machine during the window time under consideration. For this reason, the number of patients in the training set increases when the window length increases. For comparison purposes, the number of patients in the test set is the same for a specific window length regardless the procedure used for dealing with missing values. The size of the test set decreases as W increases (908 patients with W = 3; 773 with W = 4; 653 with W = 5; 531 with W = 6).
In the training set, a 5-fold cross-validation approach was followed to select the hyperparameters minimizing either the BCE or the BBCE cost function. The hyperparameters associated with the MLP, GRU, LSTM, and Bi-LSTM network ar-

FS results
We shift now our attention to Figure 5, which indicates the features selected by each of the three methods presented in Sec. 2.1.2 for four different window lengths W ∈ {3, 4, 5, 6}. A green box means that feature was selected by the method and a gray box that it was not. A two-fold strategy was considered to obtain the final feature set D ′ . Firstly, for each FS approach, we consider the d-th feature as relevant if it was selected by two or more values of W. Secondly, we implemented another majority rule scheme where the d-th feature was finally considered as relevant if it was selected by two or more FS methods. Figure 5 shows that the method using the CI obtained by bootstrap resampling selected a higher number of features (40 features out of 50) compared to Conditional MI or Group LASSO (19 features were selected for each approach). Note that Group LASSO selected a high number of variables related to the patient, whereas Conditional MI selected more features related to the ICU environment. It is also remarkable the stability of the Group LASSO results across the different time window lengths. After voting across methods, a total of 26 features were selected, being 14 of them associated with the antibiotics taken by the patients (AMG, ATF, CAR, CF1, CF3, CF4, GLI, NTI, OXA, PAP, PEN, POL, QUI, and Others), the MV, and 11 features associated with the ICU environment (# of patients, # of AMR patients, CAR n , CF3 n , GCC n , GLI n , MON n , PAP n , POL n , TTC n and Others n ). Since feature importance is considered a way of endowing explainability to make an early prediction of AMR, we discuss in detail which ones are deemed clinically relevant to train the models.
All antibiotic families involved in the clinical criteria for the appearance of AMR germs (cf. the last paragraph in Sec. 3.1) were considered. Also, it is observed how some antibiotics were always identified as relevant, despite the window length and the FS method considered (among them, ATF, CF3, PAP, MV, # of AMR patients, and CAR n ). Clinicians have validated these results, concluding that they can be a suitable alternative for building appropriate data-driven models.

Early prediction of AMR using NNs
The purpose of this work is the early prediction of AMR with MTS recorded in the EHR before the actual complication occurs. Therefore, we pay attention first to MTS with W = 5, i.e., the median of the elapsed time from the ICU admission until the first AMR culture for AMR patients (see Figure 2 for details). We compare the classification performance of conventional NNs (MLP) and RNN approaches (LSTM, GRU, and Bi-LSTM) using different methods for a) FS, b) handling class imbalance, and c) dealing with missing values.   Table 2: Mean ± standard deviation of the performance (Accuracy, Specificity, Sensitivity, and ROC AUC) on 5 test partitions when training NNs considering a 5-days window with: non-FS and with FS (first row); undersampling and BBCE as strategies to handle class imbalance (second column); "Removing", "Zero Padding" and "Masking" strategies to handle irregular MTS (third column); and MLP, GRU, LSTM, and Bi-LSTM as classifiers (fourth column). The highest performance for each figure of merit is in bold. Table 2 shows the mean and the standard deviation on 5 test partitions provided by different NNs models in terms of Accuracy, Specificity, Sensitivity, and ROC AUC. Note that, to keep the comparison fair, the same 5 test sets have been considered when evaluating all the methods. Several conclusions can be drawn from this table. In general, better performance is achieved when considering an FS strategy. For ease of comparison, the mean of the ROC AUC obtained for non-FS and FS results was computed (60.84 vs 62.09), verifying that FS offers better performance.
If we shift focus now to the strategies to handle imbalance data, we note that undersampling the training set works better than BBCE (the mean ROC AUC values are 63.95 and 62.30, respectively). However, in this work, the limited number of patients is a critical problem, and BBCE has the advantage of using a larger number of patients to train without neglecting the importance of the class imbalance problem. It can be seen also that "Masking" is the best approach of the three strategies to handle missing values in MTS, slightly outperforming the others approaches in all experiments in terms of ROC AUC (mean ROC AUC value for "Masking" is 63.66, whereas for "Removing" and "Zero Padding" is 60.56 and 61.58, respectively). The ML classifier with the best results (considering the ROC AUC as the most relevant figure of merit) is the LSTM scheme with BBCE and "Masking", achieving a ROC AUC level of 66.73%. It is also the best in terms of Sensitivity (68.58%), while for Accuracy (66.54%) and Specificity (64.88%) the best performance was achieved when BBCE was replaced with undersampling.
Since, in general, better results are obtained with LSTM, for completeness, we compare the obtained results with those provided with different windowed modelings for the GRU and Bi-LSTM models. Figure 6 shows the boxplots of the performance on five test partitions in terms of Specificity, Sensitivity, and ROC AUC when considering the selected features after a voting strategy and training with a BBCE cost. Furthermore, we explore strategies to handle irregular MTS. As previously discussed, the number of patients changes with the length of the window W. Therefore, though the comparison of results among different windows does not allow us to conclude which the best window length is, it allows us to know which one works best for this particular problem. The four and five-day windows obtained better performance than three-day and six-day windows (see Figure 6). Also, we can conclude that models based on GRU and LSTM perform slightly better than Bi-LSTM based models. The underperformance of the Bi-LSTM may be due either because the architecture is to complex or because joint consideration of past and future data is not relevant to our classification problem.

LSTM post-hoc interpretability
In the previous section, we concluded that the LSTM architecture with 26 input features, W = 5 and "Masking" provided good performance for both undersampling and BBCE. However, LSTM networks are not easy to interpret. Therefore, we present here the results of an LSTM post-hoc interpretability analysis based on SHAP (see Sec. 2). We have explored the potential of SHAP to characterize the entire population (all given patients) according to the model prediction. The first SHAP analysis was carried out for the LSTM built with W = 5, undersampling and "Masking". Then, we paid attention to the behavior of individual patients when considering LSTM models with different window lengths (W = 3, W = 4, W = 5, W = 6), undersampling and "Masking". We calculated the Shapley values related to the contributions of all time steps for each single patient separately and then computed their average. Figure 7 shows a SHAP graph with the distribution of the Shapley values generated from the LSTM model trained with undersampling and "Masking" (considering all the 26 previ-J o u r n a l P r e -p r o o f Journal Pre-proof ously selected features). Features are depicted in order of relevance. Each dot represents a patient, the dot color indicates the real value of the feature, and the position of the dot in the x-axis represents the contribution this feature has to the model output (sum of the Shapley values). The further a point deviates from the mean of predictions (which is 0 in this case), the more impact this particular feature has on the model output for that patient. For example, the Shapley values associated with Mech.Vent. are positive when the Mech.Vent. value is high. That is, for our prediction model, by SHAP interpretation, we get the result that AMR patients are more likely to appear when ICU patients are receiving Mech.Vent. Accordingly, in short, the five most important features are Mech.Vent., ATF, CF1, # of AMR patients and GLI n . These results are in agreement with clinicians' intuition and fit well with the literature. Notice that, controlling the isolation of patients with AMR germs and invasive devices are crucial tasks in tackling multi-resistance. Moreover, it is well known that the use of drugs such as CF1 is likely to reduce the chance of patients to become AMR. Interestingly, this fact is pointed out by our SHAP analysis.   Figure 8 shows the model output values and the Shapley values for four different patients. The selected patients repre-sent the four main types of patients we have in our database: AMR and non-AMR patients with stays longer than 5 days ("full data") and with stays shorter than 5 days ("no full data"). Once again, our SHAP analysis pays attention to LSTM models that were trained with undersampling and "Masking" (when considering different time windows and all the 26 features previously selected). Features are ranked and depicted in terms of relevance, with the one in the top being the most relevant feature. Gray vertical lines represent the base values associated with the underlying SHAP models. Each colored line corresponds to one specific patient. It is easy to observe how the contribution of each single feature to the model prediction varies from one patient to another. All contributions together with the baseline values form the final model outputs (see the top bar in each panel).
When joinly analyzing the four panels (each one associated with a different window length), even though slight differences exist, the following features emerge as the most relevant ones: # of AMR patients, Mech.Vent. and some drugs such as ATF, AMG and OXA. We observe that all models, except for W = 4, correctly classify (model output greater than 0.5) the AMR patient full data, whereas the AMR patient with no full data is correctly classified only by models with W = 5 and W = 6. A similar effect occurs with the non-AMR patients: the patient non-full data is correctly classified (model output lower than 0.5) by all models except for the model with W = 6, while the patient with full data is only correctly classified by models with W = 3 and W = 4. These findings illustrate that selecting the right window length is a very challenging task.

Linguistic interpretability
With the aim of providing readers with a quantitative assessment of the balance between interpretability and performance in our proposal, we built two interpretable by design models (a decision tree generated with the C4.5 algorithm [59] and a fuzzy rule-based classifier with the FURIA algorithm [60]) for the case of FS, undersampling and "Zero Padding" with W=5, that according to clinicians was the simplest to explain among all previously reported (see Table 2). Both models are enriched with linguistic interpretability, i.e., the original tree (C4.5) and rule list (FURIA) are translated into two explainable fuzzy systems as described in Sec. 2.3.2. Both linguistic models share the same global semantics, which is expressed by the same linguistic terms (Very small, Small, Medium, Big, Very big) associated with the same underlying strong fuzzy partitions (which were carefully defined in agreement with a clinician for each single feature to be meaningful). As a result, the knowledge embedded in this kind of model is described by a list of linguistic IF-THEN rules easy to understand by humans. Table 3 quantifies the structural interpretability of the generated models in terms of their number of rules (NR) and total rule length (TRL). These models were generated and validated with the same training and test datasets that were considered in Table 2. Nevertheless, for the sake of explainability, the temporal information was first aggregated to produce meaningful features. For example, "# of AMR patients std" is the standard deviation (std) of the number of patients (# of AMR patients) AMR full data non-AMR full data AMR no full data non-AMR no full data AMR full data non-AMR full data AMR no full data non-AMR no full data AMR full data non-AMR full data AMR no full data non-AMR no full data   in the considered time window; GLI n mean is the average of all temporal values associated with GLI n corresponding to antibiotics of the family Glycopeptides (GLI) taken by the neighbors of the patient; or GLI cons is '1' if the patient took GLI at any time instant and '0' otherwise. For comparison purposes, we also report performance values (Accuracy and ROC AUC) that can be compared to those reported previously in Table 2. The results reveal that the highest interpretability of FURIA (NR=8.8 and TRL=32.8) comes with a reduction of performance. However, it is worth noting that comparing FURIA to C4.5, we observe how the interpretability gain is much larger than the reduction of performance. We carried out the Friedman Aligned Ranks non-parametric statistical test [62] (all versus all with significance level α = 0.05) in order to detect significant differences among reported results for Accuracy in Table 3. The hypothesis H 0 (the means of the results of two or more algorithms are the same) is rejected when comparing FURIA versus MLP, and accepted in the rest of comparisons. Moreover, as illustrated below, FURIA rules are fairly simple and easy to interpret by clinicians: These rules provide clinicians with useful information that is complementary to that observed in Figure 7. Interestingly, the 7 rules describe the interaction among only 7 out of the 26 features in Figure 7. In addition, four of these features are in the top-5 ranking given by SHAP. Notice that, even if Mech.Vent. was identified by SHAP as the most relevant feature when considering single contributions, our linguistic analysis reveals that the interaction of the next four top features (ATF, Others, # of AMR patients and CF4) is also very relevant. Moreover, remind that ATF and # of AMR patients were also previously pointed out as two of the most relevant features by all FS methods (see Sec. 4.2).
In addition, when considering single cases like those illustrated in Figure 8, only specific rules are fired. For example, in the case of the patient with "AMR no full data" in Figure 8(c), rule R 3 is fired and we obtain the following explanation, which is a mixture of factual and counterfactual pieces of information: The patient is classified as AMR. In accordance with the third rule, a patient is AMR in case that the standard deviation of the number of AMR patients is big. It would be non-AMR if such standard deviation were smaller (0.345). On the other hand, in the case of the patient with "non-AMR full data", R 1 is fired and the related textual explanation is as follows: The patient is classified as non-AMR. It is very likely non-AMR, because in accordance with the first rule, a patient is non-AMR when the standard deviation of the number of AMR patients is very small and the consumption of ATF is small. It would be AMR if such consumption were bigger (1.430). It is worth noting that this kind of explanations highlights the interaction among different features, being a useful insight that complements the ranking of relevance given by SHAP.

Discussion and conclusions
The high rate of infections occurring in the ICU (20-30 % of all ICU admissions) [63] makes this unit one of the epicenters of the development of AMR. Previous clinical studies have analyzed the risk factors for getting AMR bacteria [64]. They concluded that the treatment with invasive devices (particularly the intensity and duration of the treatment) and the exposure to antibiotics are the principal causes. Minimizing the occurrence of AMR bacteria could be beneficial to reduce the duration of invasive devices treatment as well as to optimize its dosage [65].
Thus, AMR is nowadays a growing problem due to the inappropriate use of antimicrobials. Indeed, some bacteria that were previously treatable have become now a challenge to deal with, especially in the ICUs. In these units, AMR has created a high impact on morbidity, hospital costs, and sometimes patient survival. It is necessary to be aware of the growing problem caused by AMR, for which new research, efforts, and approaches are needed to prevent the further spread of AMR.
AI models can contribute to solving problems related to the clinical environment. These models reduce the time of detection of infectious diseases, resulting in a reduction in the number of deaths as well as health economic costs. In the healthcare domain, there has been a developing interest in breaking down clinical information as time-series sequences since it allows clinical specialists to better evaluate the progression of the patients. However, the complexity and irregular patterns present in clinical data render modeling MTS a hard and challenging task. Fortunately, RNNs have arisen as an appropriate choice to model and deal with MTS. In this work, we have explored the use of well-known RNNs such as GRU, LSTM, and Bi-LSTM. Although RNNs have demonstrated to achieve high classification performance, their lack of interpretability is a bottleneck for developing and deploying clinical MTS-based decision support systems where interpretability is of foremost concern. Note that, for the sake of interpretability, RNNs were trained using just data within short time windows. This was carried out in agreement with clinicians, who considered that the use of longer windows was harder to justify from a clinical point of view.
This work has paved the way towards comprehensible MTS analysis in the context of early AMR prediction in ICUs. We have applied different FS approaches, in combination with interpretable ML techniques, with the aim of extracting valuable insights about AMR. Our proposal has been validated with realworld data. Namely, we considered 3,178 patients, with 433 of them confirmed as AMR from 2004 to 2020 at University Hospital of Fuenlabrada in Madrid, Spain. With our study, we identified relevant clinical features for the onset of AMR which were afterward confirmed by clinicians. For example, our proposal revealed how the family of antibiotics taken by patients as well as the time each patient has been assisted with mechanical ventilation, turn up as vital indicators to isolate a patient in advance and thus controlling the spread of the bacteria among other ICU patients.
It is worth noting that we have shown how findings provided by post-hoc interpretability analysis of data-driven models may be supportive to clinical decisions before the antibiogram result. More precisely, we used SHAP to assist clinicians in understanding the outputs given by black-box models. The SHAP results have shown the importance of mechanical ventilation for the predictions, which is in accordance with the literature. The importance of ATF is also noteworthy, with results showing that AMR patients took ATF more frequently than non-AMR patients. The # of AMR neighbors is relevant in our results: the higher the # of AMR neighbors, the higher the AMR probability returned by the models. In addition, we built explainable fuzzy systems to better understand how relevant and meaningful features previously identified with SHAP actually interact. As a result, we extracted linguistic IF-THEN rules that described how early AMR prediction can be explained in terms of the interaction among several features. Moreover, such rules were automatically interpreted and translated into narrative explanations in natural language to facilitate understanding by clinicians. All in all, clinicians were satisfied with the reported results and expressed how their trust in MTS-based AMR results was higher once they understood the model output with the assistance of both visual and textual explanations.
This novel methodology can save valuable time to start the adequate treatment for an ICU patient. This study was conducted using only MTS related to the antibiotics taken by the patients and the mechanical ventilation. To generalize the conclusions, different MTS should be considered, as well as demographic and clinical data such as age, gender, or diagnoses and procedures. As future research, we plan to endow with other interpretable NNs that take into account the importance of each time step, such as attentional NNs [66] or GRU-D [49]. Interpretability is expected to make those models more trustful and acceptable by clinicians. Finally, we plan to develop individual models for predicting each type of AMR bacteria emergence, since previous studies [67] have reported promising results when using this line of work.

Availability of Data and Materials
Access to the data can be provided upon official request if approved by the Committee of Ethics of the University Hospital of Fuenlabrada.