Antimicrobial Resistance Prediction in Intensive Care Unit for Pseudomonas Aeruginosa using Temporal Data-Driven Models

One threatening medical problem for human beings is the increasing antimicrobial resistance of some microorganisms. This problem is especially difficult in Intensive Care Units (ICUs) of hospitals due to the vulnerable state of patients. Knowing in advance whether a concrete bacterium is resistant or susceptible to an antibiotic is a crux step for clinicians to determine an effective antibiotic treatment. This usual clinical procedure takes approximately 48 hours and it is named antibiogram. It tests the bacterium resistance to one or more antimicrobial families (six of them considered in this work). This article focuses on cultures of the Pseudomonas Aeruginosa bacterium because is one of the most dangerous in the ICU. Several temporal data-driven models are proposed and analyzed to predict the resistance or susceptibility to a determined antibiotic family previously to know the antibiogram result and only using the available past information from a data set. This data set is formed by anonymized electronic health records data from more than 3300 ICU patients during 15 years. Several data-driven classifier methods are used in combination with several temporal modeling approaches. The results show that our predictions are reasonably accurate for some antimicrobial families, and could be used by clinicians to determine the best antibiotic therapy in advance. This early prediction can save valuable time to start the adequate treatment for an ICU patient. This study corroborates the results of a previous work pointing that the antimicrobial resistance of bacteria in the ICU is related to other recent resistance tests of ICU patients. This information is very valuable for making accurate antimicrobial resistance predictions.


I. Introduction
A ntimicrobial resistance occurs when a germ develops the capacity to not respond to the drugs designed to combat them [1]. Nowadays, antimicrobial resistance is one of the greatest threats to the global health system [2]. Apart from the health consequences, the economic impact deriving from antimicrobial resistance is not a trivial issue, resulting in a 7% reduction in the Gross Domestic Product by 2050 [3]. Indeed, it has become more acute in recent years due to the excessive use of antibiotics in many facets of daily life [4].
The acquisition of antimicrobial resistance is favoured in hospital environments, being even worsened for patients admitted to the Intensive Care Unit (ICU). This could be motivated by the duration and intensity of the drug treatment, as well as by the use of life support devices. The critical health status of ICU patients pushes actions to anticipate the result of the cultures provided by the microbiology laboratory, which usually takes 48 hours. A culture is a biological sample collected to isolate a bacterium, aiming to analyze its susceptibility to different antibiotics. The test used to measure this susceptibility is called antibiogram, and its result (susceptible/ resistant) is commonly used by clinicians to determine the antibiotic treatment [5]. It is interesting to note that several families of antibiotics may have similar susceptibility when tested on a given germ species [6]. There are several species with high prevalence, for example, Acinetobacter spp.; Enterococcous fecalis and Enterococcus faecium; Escherichia coli; Klebsiella pneumoniae; Pseudomonas aeruginosa; and Staphylococcus aureus, among others. In this paper, we focus on Pseudomonas aeruginosa for the following reasons: (1) its virulence, specially in the ICU;(2) its ability to cause chronic infectious diseases; and (3) its ability to develop multi-drug resistance [7], [8].
For all these reasons, anticipation to the culture result in case of resistance, is vital to isolate the patient and control the spread of antimicrobial resistance among other ICU patients. Computational tools inspired on data-driven models may be supportive to clinical decisions previous to the antibiogram result. The article [6] introduces the concept drift observed in antimicrobial resistance data sets, and it uses a windowing scheme together with dynamic classifiers to perform resistance prediction. It classifies cultures as susceptible or resistant to some antibiotics using a database of EHR which includes years from 2002 to 2004, considering cases of meningitis. A high number of the state-of-the-art studies use whole genome sequencing [9]- [ 12]. Because of its considerable cost, in this study we propose to predict resistant bacteria based on Electronic Health Records (EHR) data from ICU, together with historic antibiogram results. This data is already available in most hospitals, and therefore the methodology proposed in this paper can be straightforward extrapolated. Comparable approaches are studied in previous works [6], [13]- [18]. In [17], bacterial infection in the ICU using EHR data is predicted (binary classification task) by applying a set of machine learning (ML) methods. The prediction is carried out at the patient level in order to determine which patients no longer require more antimicrobial treatment. Longitudinal data from 2001 to 2012, extracted over the 24-hour, 48-hour or 72-hour window following their first antibiotic dose, are considered. No temporal modelling was explicitly taken into account. The work in [18] presents an study for predicting bacterial resistance also using EHR data, from 2013 to 2015. An ensemble of ML methods is used to classify isolated bacterial cultures as susceptible or resistant to a particular antibiotic. The temporal relation among instances is considered here, with features indicating the proportion of past antibiotic resistance infections identified as having the highest average impact. This study also concludes that the feature encoding the date of the culture has some effect on the prediction, probably due to the fluctuating resistance frequencies through time.
Owing to the dynamics of antimicrobial resistance, we analyze in this paper electronic health records collected during 15 years, from 2004 to 2019, by the University Hospital of Fuenlabrada (UHF) in Madrid, Spain. This data have been partly considered in previous studies carried out by the authors [14], [15], [16], [19]. In particular, authors in [14] used a reduced dataset taking into account two years less (from 2004 to 2017) than in the current work. All patients admitted in the ICU in this period were considered in [14], regardless of their length stay. Additionally, authors in [14] used ML to determine whether a Pseudomona Aeruginosa bacterium will be resistant or not (binary target) to different families of antimicrobials without considering information about historic antibiogram results. In [15], we analyzed for the first time the dynamics on Pseudomona Aeruginosa by considering incremental time windows on a period of time from 2004 to 2013, with two families of antibiotics. It was also our first incursion on the use of features taking into account the result provided by previous antibiograms of other ICU patients. This current paper extends the work in [15] while considering the predictive window length (one month) that best results provided in [15]. Specifically, to carry out predictions, the Random Forest (RF) method has been added to previously considered method, Logistic Regression (LR). We have increased both the number of years under study and the number of antibiotics (from 2 to 6). We have also considered as features the result provided by previous antibiograms of each patient, weighted by a factor depending on the time elapsed since the last antibiogram was tested. Furthermore, two approaches have been explored to analyze the dynamic of antimicrobial resistance by evaluating the models in several time horizons.
The rest of the paper is as follows. In Section II, we describe the data set analyzed in this paper and provide a graphical exploration of it. Section III introduces the data preprocessing as well as the methods used for temporal modelling. Results and discussion and provided in IV. Finally, the conclusions are presented in Section V.

A. Data Set Description
Data considered in this work correspond to 3812 admissions of 3346 ICU patients, collected at the UHF during a period of 15 consecutive years (from July 2004 to May 2019). Note that, since the number of ICU admissions exceeds the number of patients, there are patients with more than one ICU admission during this period. A total of 43658 cultures were collected. Although there are more than 290 different types of bacteria and 27 antimicrobial families, we only take into account here the cultures where Pseudomonas have been detected, ending up in a total of 764 cultures. For this bacterium, the antibiograms considered in this work test the response (encoded as susceptible (s) or resistant (r)) against the following set of family of antibiotics a = {amg, car, cf4, pap, pol, qui}. Elements in the set a refer to: Aminoglycosides (AMG), Carbapenems (CAR), 4th Generation Cephalosporins (CF4), Extended-spectrum penicillins (PAP), polymyxins (POL) and Quinolones (QUI), respectively.
Since data-driven models are based on learning from instances, we consider here the target c&a i , as the antibiogram result for a specific family of antibiotic a i , for every culture collected to any patient. The feature vector associated with each target is represented by the 40 features described in Table I. We define here the instance as the pair composed by the feature vector (input features to the data-driven models) and the target (outcome of the data-driven models). As for the input features, we first analyze demographic data: age, gender, group of illness A (cardiovascular events), B (kidney failure, arthritis), C (respiratory problems), D (pancreatitis, endocrine), E (epilepsy, dementia), F (diabetes, arteriosclerosis) and G (neoplasms), and pluripathology (indicating whether the patient has more than two comorbidities). The median age of patients admitted to the ICU was 64 years (interquartile range 55-73, range 18-87), with a majority of men (70%). Pluripathological patients are 40.6% of the patients, with comorbidities mostly related to respiratory problems (33.4%), diabetes (26.3%) and neoplasms (33.1%).
We then focus on the information about the ICU admission: date of admission to the ICU, department of origin before ICU admission (surgery, internal medicine, urology,...), reason for admission (serious infection, acute respiratory failure, hypovolaemia,...) and patient category (medical or surgical). The clinical origin before the ICU admission more common was surgery (31.1% patients) and emergency department (18.4%). The reason of admission more common was serious infection (22.5% patients) and acute respiratory failure (18.4% patients). The most common patient category was medical (52.2 %).
This work also analyses the information related to the cultures. Specifically, we consider the culture type (exudate, drainage, biopsy, sputum, bronchoaspirate, etc.); first level grouping for the type of culture, which classifies the cultures into surface, liquids, respiratory, etc.; and the second level grouping for the type of culture, used to identify a clinical sample or a surface culture. Besides, the date of the culture, the weekday the culture was collected , as well as the month and the year.
Finally, to collect temporal information in each instance associated to patient p, the current study proposes to generate two kind of features linked to previous resistant antibiograms. In particular, we consider: (1) previous resistant results of the same patient, and (2) previous resistant results of all patients who recently stayed in the ICU.
Own past cultures features. The first kind of features is associated with the detection of resistant bacteria in previous antibiograms for a specific patient p, and aims to quantify the current "intensity" of these bacteria. These features consider the result of antibiograms of Pseudomonas Aeruginosa during an interval between 21 days and 48 hours previous to the current culture being studied for patient p, c ( p ) . The 48-hour limit is considered since it is usually the time the results of the antibiogram take to be available. Furthermore, cultures are gathered until 21 days before the date d of current culture c, because if the antibiogram result is positive, from a clinical point of view, it is kept as positive for the following 21 days.
Thus, when a culture is collected, a total of six features, one per antimicrobial family, are generated: p&amg, p&car, p&cf4, p&pap, p&pol and p&qui. Each feature takes into account the antibiogram results for the corresponding antimicrobial family, e.g. p&pap just consider previous results associated with patient p for the family of antibiotics PAP. Because of that, the group of own past cultures of patient p, named C (p) , is divided into six data sets . To illustrate how the value for each feature p&a i , i = 1, 2, ···, 6 is obtained, let us consider that the subset has cultures, i.e. . Each culture has associated: (1) a date when it was collected ; and (2) a susceptibility test result , which is susceptible or resistant depending on whether the bacterium is susceptible or resistant to a i , respectively. To calculate the potential contribution of a culture to the feature p&a i , i = 1, 2, ... , 6, the Negative Exponential Function (NEF) is applied as follows: (1) where the value of parameter λ is experimentally set to 0.095. To compute the feature value p&a i for the instance associated with culture c (p) of patient p, the maximum outcome in Equation (1) is obtained according to Equation (2): (2) ICU-patients past cultures features. The second kind of features are named r&amg, r&car, r&cf4, r&pap, r&pol and r&qui. These features aim to encode the "intensity" of resistant bacteria in the ICU during the time previous to the date d of the current instance and culture. Differently from the previous set of six features p&a i , now the "intensity" takes into account the number of patients (different from current patient p) that were infected by a resistant bacterium and, for each of them, the time elapsed since the bacterium was detected. For a particular feature, a single value is calculated by considering the result of past susceptibility tests of Pseudomonas Aeruginosa for the P patients, denoted as p j with j = 1 ··· P, in the ICU during the time interval between 21 days and 48 hours previous to date d of culture c (p) of patient p. An exponential decay is again considered to weight the result of each susceptibility test.
The group C (p') of past cultures of other patients is divided into six subsets too. Every particular subset is split into n disjoint subsets, as many as patients: where is composed of the antibiogram results for a i in patient p j . As previously mentioned, the set of cultures of patient p are excluded from .
Since every culture has a susceptibility test result and a date , the application of the NEF expression equivalent to that in Equation (1), just replacing , and by , and , respectively. Then, each feature r&a i is obtained by adding up the maximum value of Equation (1) for each patient p j , as indicated in Equation (4). (4)

B. Graphical Exploration
Owing to the high number of features, we start by identifying the most relevant features per family of antibiotics. For this purpose, we consider a filter approach with the Mutual Information (MI) score [20]. Thus, for each family of antibiotics, Fig. 1 shows the five features with the highest MI values, comprising among them the date of culture and the information about the previous cultures both for the own patient and for the UCI environment. According to the mutual information score, the most relevant feature is date_culture for each of the antimicrobial families considered. This results supports the importance of the antimicrobial resistance dynamics, which is common for all families of antibiotics.
To get a deeper insight on this issue, Fig. 2 graphically illustrates the evolution of the number of susceptible antibiograms (a) and resistant antibiograms (b) for each family of antimicrobials tested on Pseudomonas along time. Not all families of antibiotics were tested during the whole period considered. Specifically, clinicians first agreed to modify the range of tested antibiotics in the ICU of the UHF, first by including POL in 2007 and then by stop susceptibility testing antibiograms of QUI in 2018, due to its high resistance. Furthermore, there is a very noticeable fall in the number of resistant and susceptible antibiograms in 2013. This decrease is probably motivated because of integration problems due to software update in the ICU health information system in 2013. As stated in the literature, the number of susceptible antibiograms tend to decrease in the most recent years.
In this line, we also analyze the annual ratio of resistant antibiograms results for each family of antimicrobials. To obtain this ratio, the number of resistant cultures per year has been divided by the number of total cultures per year (both resistant and susceptible cultures). The general trend is that, as time progresses (and therefore the value of date_culture increases), a higher percentage of instances tend to be resistant.
The second most relevant feature for the antimicrobial families AMG, CAR and QUI are p&amg, p&car and p&qui, respectively. This shows the importance of the outcome of previous antibiograms of the same patient for the family under consideration. In the case of CF4, p&cf4 is the 4th most important feature. Though not presented in Fig.  1, p&pap is ranked on the 7th position for PAP, and p&pol in the 11th position for POL. It is interesting to remark here that, in all cases, the MI score for a particular family of antibiotics is higher for the p&a i feature corresponding to that particular family than to any of the other five p&a i features. This points out the relevance of considering the particular antimicrobial family when using results of previous antibiograms. Fig. 4 shows the boxplots for each of the six features named p&a i , associated to the antibiogram results of the same patient for each family of antibiotics (in rows). Blue boxplots refer to p&a i for resistant results, while black ones refer to p&a i for susceptible results. In general, we observe that the median of p&a i is higher when the culture c was resistant than when it was susceptible. The results shown in Fig. 4 for CAR and QUI are particularly interesting for susceptible cultures (black boxplots) for all the families, with most of the previous antibiogram results being susceptible. However, for CF4 and PAP, most of antibiogram results are susceptible for p&cf4, p&pap and p&pol, whereas for POL it only happens for p&pol. Note that, regardless the family of antibiotics tested, the boxplot of p&car and p&qui for resistant cultures (blue bloxplots) is very similar to the boxplot associated to the corresponding family of antibiotics considered (e.g, see p&amg, p&car and p&qui in Fig. 4 for AMG, or p&pap, p&car and p&qui for PAP. The r&a i features are also among the most relevant features according to the MI score. In this case there is no clear distinction on the ranking depending on the antimicrobial family. It supports the importance of taking into account the existence of any resistant germ in the ICU. The feature r&pol (not included among the top five features in Table  I) seems to be the one providing less information, probably because of low number of antibiograms with a resistant result for this family. Fig. 5 presents the bloxplots for the r&a i features. In comparison with boxplots in Fig. 4, note that boxplots of the r&a i features are not limited to a maximum of one, since the number of patients contributing in Equation (4) is n (usually greater than 1). For each antibiogram a i , the median values of the r&a i features resistant and susceptible results is much closer between them than when comparing the p&a i features. It is also remarkable that boxplots associated with r&pol show a median value very close to zero both for resistant and susceptible cultures, in line with previous comments. Furthermore, when analyzing POL, the median value is higher for susceptible than for resistant cultures, excepting for r&pol, showing a different behavior of this antibiotic.
Finally, among the features in the top five with a higher MI score, we also find days_to_culture (for POL) and age (for QUI). Both features are also among the top ten for the rest of the antimicrobial families. From a clinical viewpoint, it is known that both age and a longer ICU stay are risk factors to become infected [14].

A. Data Preprocessing
Before using the data set to predict the result of the susceptibility test, a previous stage of preprocessing is needed. The first aspect to be considered is that six binary classifiers are going to be built in order to predict whether a culture is susceptible or resistant to each of the six different antimicrobial families. A different approach to tackle this problem would be to train a multi-class classifier. However, generating different classifiers allows to individually tune the hyperparameters of each of them and also makes the interpretation and analysis of results easier. To train them, the main data set is divided in six smaller data sets, each of them just considering one binary target c&a i . After that, all the instances representing cultures from patients that have stayed less than 48 hours in the ICU, are removed from every of the six data sets. Table I, the number of features is 40 for every data set, considering the respective target feature. The number of instances are 755, 643, 749, 749, 483 and 708 for AMG, CAR, CF4, PAP, POL and QUI data sets, respectively. Since instances represent cultures, and cultures have an intrinsic temporal ordering, instances are sorted in a temporal manner, with older instances at the beginning of the data set and the newer ones towards the end.

As indicated in
The missing values of the data sets are found in the 12 generated features (r&a i and p&a i ). The percentages of missing values for each of the data sets and features are detailed in Table II and Table III. It is remarkable that the percentages of missing values for p&a i features are higher than those of r&a i features. This happens because, in general, during the same time interval the number of cultures associated to a group of patients will be higher than the number cultures associated to just one patient. It is also notable that, overall, p&pol and r&pol have a high percentage of missing values with respect to the rest of the features of their respective type. This is caused by the very few resistant instances there are for POL family , probably because POL started to be tested in 2007 and the rest of antimicrobial families in 2004.   In the clinical setting, dealing with missing values is an interesting and challenging topic which may have different implications. In this study, missing values are replaced by zeros because of the clinical meaning of p&a i and r&a i features. The reason for a p&a i feature not having a value is that, for the particular patient and time interval considered, it is not found a resistance test result for the specific antimicrobial family studied. If that is the case, it means that, probably, clinicians have considered that the patient may not be infected by a bacterium resistant to the antimicrobial family. Therefore, it can be inferred that likely, in the time prior of the culture being analyzed, the patient was not infected with a resistant bacterium. It seems reasonable to assign a zero in this case, since the feature gets a higher value the more recent a resistant bacterium was detected. Regarding r&a i features, a similar reasoning can be done. If in the time interval observed, none of the patients in the ICU were tested for resistance to the particular antimicrobial family, it implies clinicians considered it was unlikely to find this kind of resistant bacterium. Thus, it is probable that, prior to the culture, there were no patients infected with a bacterium resistant to the feature's antimicrobial family, causing zero to be an appropriate value.
The categorical features in the data sets are converted into numerical before using them with the machine learning methods considered in this work. The two features representing dates (date_ culture and start_date) are categorical and ordered. Because of that, dates are encoded with integers, assigning lower values to older dates, and higher values to recent dates, indicating, in that way, the ordering among them. The value of a particular date is calculated as the difference, in number of days, between the particular date to be encoded and the first date in the data sets of the specific feature.
Having all features expressed as numerical, Pearson correlation is applied to detect the most correlated ones. If two features (both different from the target feature) are highly correlated, they are adding redundant information to the prediction, and therefore one of them should be removed. In this study it is considered that two features are highly correlated if their correlation coefficient is higher than 0.9 or lower than -0.9. In all of the six data sets, the same four features (date_culture, year_culture, start_date and year_ admission) are highly correlated among them. Because of that, just date_culture is maintained and the other three are removed from the data sets. After that, the number of features in every data set is 37 including the target feature.

B. Predictive Methods
In this section, we describe briefly the data-driven classifier considered in this work. Specifically, LR is tested as base line method, and it was also used in our previous work [15]. In this study, RF has been added to carry out predictions since its interpretability capabilities.
The LR method, very common in the clinical literature, allows us to conduct a linear analysis when the dependent variable is binary. It was used in our previous study [15] because of its simplicity to serve as a baseline, and to evaluate the feasibility of learning from data. In this work, it is again used to classify the instances, now with a greater amount of data and a higher number of antimicrobial families to be analyzed. This is done in order to have a more solid insight on whether the target can be predicted with the available features and the performance this method can provide. Before using LR, each feature is standardized by removing the mean and scaling to unit variance.
The another data-driven method explored here is RF, a machine learning approach commonly used for regression and classification [21], [22]. It is an ensemble method, that is, a RF model is built from multiple decision trees named estimators, which are able to generate individual predictions. RF combines the different predictions of its decision trees (which, individually, tend to over fitting to the training set) to provide a better prediction, providing a better generalization to data not considered in training. The RF method is very robust, since it can handle data sets with an extensive number of features, high dimensionality and heterogeneous features, while having very few hyperparameters. Because of this, RF is often used as a first approach to develop machine learning systems, as it enables to get an overview of the performance on a particular task.

C. Temporal Modeling
Analyzing the problem to be solved, some special characteristics have to be considered when designing the experiments.
The first one is the temporal ordering among instances of the data sets. Since instances are associated with cultures with a susceptibility test, they have an inherent order marked by the date when they were collected. This forces to maintain this same order when predicting instances, that is, past instances cannot be predicted with instances in their respective future. This particularity arises from the fact that, in the real world, when predicting an antibiogram result, future results are not available.
Antimicrobial resistance is a phenomenon that changes over time as bacteria mutates. It allows bacteria to be more resistant to antibiotics as time progresses. As previously mentioned, the features considered include demographic data, information about the patient's admission, and information about the culture and antibiogram results. Since bacteria's mutations are not among the available features, the feature's values telling apart one class from another may change along time. This fact has been previously described as the concept drift in which the concept being studied depends on some hidden context, not explicitly given in the form of predictive features [6]. An approach that is normally used to tackle this type of problems is the so called windowing, which generalizes from a sliding window that moves over the data set instances and applies the knowledge gathered to predict only in the immediate future.
The other particularity is the data scarcity. As previously mentioned, the maximum number of cultures (755) is observed for the AMG antimicrobial family. With the time interval considered (15 years, from 2004 to 2019), there is at most an average of 50 cultures per year. Data scarcity is a trouble spot when using windowing, because in this paradigm, usually, just a small fraction of the data set (the one considered by the sliding window at each particular time) is used for training.
A solution proposed in the previous work [15] was to build an incremental training window as the one depicted in panel (d) of Fig. 6. This type of window, which grows in length, contains instances that are as temporarily close as possible to the test instances. Then, the concept drift can be avoided by predicting temporarily close instances to the training set, but it also contains instances far in the past, so that the number of available instances for training is higher than when using sliding window. In addition to the incremental training window, this work considers a more commonly used sliding training window with fixed size to compare their prediction performance. Below we first describe the characteristics of the test window, which is the same for both types of training windows. After that, we present the characteristics of the two types of training windows considered in this work.
The test window consists in a sliding window with a fixed size of just 1 month. Considering just a small amount of time, it is ensured that test instances are as close as possible to the training set. In the experiments of this study, this window begins just considering the first month (January) of 2016. After that, in each prediction step, the test window shifts one month towards later dates. In Fig. 6, steps are indicated at the end of each row as (1), (2), (3), ... (N) for every approach. In the last step, this window considers the last month of the data set. The test window, when shifted, does not overlap with its previous position, that is, in each step predicted instances are different from instances predicted in any other step.
The incremental training window, as previously mentioned, is a window of increasing size. In the experiments, this window starts containing instances from 2004 to 2015. In the following steps, the window increases in size one month at a time. In the last step, the training window includes all the instances in the data set except the last month, which is the one considered by the test window.
The sliding training window with a fixed size consists in a window just considering 4 years of instances. In every step, this window shifts 1 month towards last instances of the data set, in the same way as the test window does. Since the train and test windows always shift the same amount of time, the distance between them, if any, is always the same. The last step, as previously explained, is the one in which the test window considers the last month of the data set. This kind of window is tested with three different configurations, 0 years approach, 2 years approach and 4 years approach, which are represented in panels (a), (b) and (c) of Fig. 6. In the 0 years approach, the distance between the training and test windows is 0 years, that is, the training window is next to the test one. In this case, the training window considers years from 2012 to 2015 in the initial step. In the 2 years approach the distance among windows is 2 years, therefore taking into account that the test window initially contains the first month of year 2016, the training window includes years from 2010 to 2013, so that the desired distance is respected. Similarly, in the 4 years approach, the window starts considering years from 2008 to 2011, because of the same reason. These three different configurations are considered in order to observe how the prediction evolves as the windows move away from each other, and therefore, the concept drift is more noticeable.
For both types of training windows, at each step, a classifier is trained, and the performance is evaluated on a test set with each of the two methods considered (LR and RF). It is relevant to take into account that patients from training and test windows are different. That is, when predicting a particular patient's susceptibility test result, it is ensured that there are not other susceptibility results of the same patient in the training set. Also, in the approaches where training and test windows are next to each other (as in the incremental training window and the 0 years approach), a margin of 48 hours is considered between them, since it is the time required for getting the antibiogram's results.
As the windows traverse the data set, they encounter class imbalance, due to the temporal evolution of bacterial resistance. This causes that, in the time interval considered by test windows, there is a higher number of instances from one class. Because of that, in order to evaluate the prediction of the classifiers, is not enough to consider the global accuracy. To get a realistic approximation of the classifier performance, the success in susceptible instances and the success in resistant instances are also calculated. The names assigned to these figures of merit are Total Accuracy (A T ot ), Resistant Accuracy (A Rst ) and Susceptible Accuracy (A Scb ), respectively. For a test window with Ns susceptible instances and Nr resistant instances, if the method succeeds in predicting Ss susceptible instances and Sr resistant instances, these figures of merit are computed as follows: These three figures of merit are calculated for the test set of the particular approach considered. In order to get the mean value of these measurements, for every step, the values of Ns, Nr, Ss and Sr are accumulated and, at the end, the three figures of merit are obtained. This accumulation is carried out because test windows may have a different amount of instances, due to the fact that not all 1-month time intervals contain the same number of antibiograms. For that reason, an average would not be adequate, since some instances would have more weight than others depending on the number of instances in their test window.
In addition to the experiments using the different windows, a series of experiments are carried out considering different aspects of the prediction. First, it is analyzed the prediction contribution of the most relevant features according to the MI score. In particular, the features studied are date_culture and the two groups of features related to p&a i and r&a i . To assess their contribution, the target is predicted with and without considering these features, and the two outcomes are compared.
Secondly, since the incremental training window considers a high amount of instances (from the beginning of the data set) it is proposed to assign weights to its training instances. The purpose is to give a higher importance to the training instances that are temporarily closer to the test, which theoretically would have a more similar distribution to the test instances, and lower importance to instances far from the test. Equation (8) details how the weight is generated for each instance. (8) where d l represents the date of the last culture in the training window, and d c is the culture date for the instance which weight is being calculated. In the equation, the difference of these two dates is expressed in days. The parameter λ is empirically chosen for each experiment as the one providing the best results among the following: 0, 1e-05, 1e-04, 1e-03, 1e-02, 0.1 and 1. If λ is very small, all instances get a very similar weight, regardless of how far they are from the end of the training window. For instance, for λ = 0, all instances has a weight of 1. On the other hand, if the value of λ is high, only a very few instances very close to the end of the training set get a weight close to 1, and the great majority of instances get a weight very close to 0. Note that when the value of λ is zero, it is the same case as the incremental training window without weights. In the case of high values for λ, it is more similar to the 0 years approach of the sliding training window with a fixed size. So, in the end, these weights allow to regulate the amount of past instances considered for prediction.
To encode the models obtained from different combinations of windowing and features, a number is assigned to each model, with the following description: M1. Sliding training window with a fixed size and following the 0 years approach. It uses neither r&a i nor p&a i features.
M2. Sliding training window with a fixed size and following the 2 years approach. It uses neither r&a i nor p&a i features.
M3. Sliding training window with a fixed size and following the 4 years approach. It uses neither r&a i nor p&a i features.
M4. Sliding training window with a fixed size and following the 0 years approach. It uses r&a i features but not p&a i features.
M5. Sliding training window with a fixed size and following the 2 years approach. It uses r&a i features but not p&a i features.
M6. Sliding training window with a fixed size and following the 4 years approach. It uses r&a i features but not p&a i features.
M7. Sliding training window with a fixed size and following the 0 years approach. It uses both r&a i and p&a i features. M13. Incremental training window with instance weighting. It uses r&a i features but not p&a i features.
M14. Incremental training window with instance weighting. It uses both r&a i and p&a i features. Each of the above kind of models are designed with and without considering the date_culture feature, also with the two aforementioned machine learning methods, LR and RF.
After studying the outcomes of the different experiments, the feature relevance is calculated again, now with an embedded method from the RF model. Also, date_culture and the p&a i set of features is analyzed in more depth by making the predictions with just one of these features at a time.

IV. Results and Discussion
The Results and Discussion section is divided in two different subsections. In the Subsection A, the performance of the predictive methods is assessed by considering different experiments. In the Subsection B, the features identified as the most relevant along the study are further analyzed.

A. Prediction
The prediction results are detailed in Tables IV, V, VI, VII, VIII and

Sliding Training Windows with Temporal Distance Variation Among Training and Test Windows
The figures of merit provided by models considering the temporal distance between the training and test sets are in rows with numbers 1 to 9 in the M column of Tables from IV to IX.
In the case of the LR method when considering the feature date_culture, the evolution of the figures of merit is not consistent among antimicrobial families when analyzing the separation between training and test windows. In some families, the Total Accuracy increases as the training window approaches the test window, while the opposite happens for other families. The same is observed with Resistant Accuracy and Susceptible Accuracy, its behavior varies depending on the antimicrobial family being predicted.
Predicting with RF and using feature date_culture, the evolution of the figures of merit is more similar among the different antimicrobial families. In general, Total Accuracy increases, Resistant Accuracy increases and Susceptible Acurracy decreases as the training window approaches test window. When this pattern is less evident, it may be helpful to analyze when both r&a i and p&a i features are considered. Also, the general performance of the three figures of merit appears to be better when both r&a i and p&a i features are used.
For LR and not using the feature date_culture, the aforementioned pattern appears, in which Total Accuracy increases, Resistant Accuracy increases and Susceptible Accuracy decreases when reducing the distance between windows. Comparing these results with those provided by LR and date_culture, two remarks deserve to be underscored: for the families in which this pattern was not previously evident (such as AMG, CAR and QUI), now windows 4 and 2 years apart have lower Total Accuracy and lower Resistant Accuracy, with similar figures of merit in the 0 years-apart windows; on the other hand, for the families where this pattern was reasonably evident (such as CF4, PAP and POL), the figures of merit usually improve, while maintaining the same pattern. Also using both the r&a i and p&a i features tend to improve the performance.
Considering RF for prediction and not using the feature date_culture, the same behavior as in LR without date_culture, is  observed for all antimicrobial families: note the same pattern for the evolution of the figures of merit (Total Accuracy increases, Resistant Accuracy increases and Susceptible Accuracy decreases as the distance between train and test windows decreases). Comparing these results to previous ones of RF using date_culture, it is noticed that now, for all families, windows of 4 and 2 years apart have lower Total Accuracy and lower Resistant Accuracy, with similar or improved figures of merit in the 0 years-apart windows. Furthermore, using both r&a i and p&a i features tend to provide a better performance. In the considered experiments (from model 1 to model 9), it is also noticeable how results change depending on the antimicrobial family. It is specially remarkable for the CAR and POL families. Considering CAR, it is observed that, for the majority of models, the values of Total Accuracy and Resistant Accuracy are very high, while Susceptible Accuracy values are very low, in most cases zero. On the other hand, for the POL family, Total Accuracy and Susceptible Accuracy are very high and Resistant Accuracy is low in general, with many zero values. These results suggest that the outcomes depend on the class distribution along time, for each antimicrobial family. In Fig. 3 it is noticed that CAR is the family with the highest ratio of resistant instances (almost 1 for the last years of the data set), and POL is the family with the lowest ratio of resistant instances. Although less obvious, the rest of the families also appear to be influenced by their respective class distribution.
Firstly, it is interesting to discuss the common pattern observed in almost all families, which causes Total Accuracy to increase, Resistant Accuracy to increase and Susceptible Accuracy to decrease as the distance between train and test windows gets smaller. The reason of this behavior is the temporal class imbalance, that is, in the first years of the data set, the majority of instances belong to the susceptible class, but as time progresses, the majority of instances become resistant, as it is depicted in Fig. 3. Using sliding training windows with fixed size and the approach with 4 years of distance between windows, the training window has to shift towards the past since the test window starts in 2016 for all experiments, therefore containing years from 2008 to 2011 for the first step of the training window, as explained in Section III.C. Being in the past, it contains a higher number of susceptible instances compared to resistant ones, which causes to perform better in predicting susceptible instances (better Susceptible Accuracy) and worse in predicting resistant instances (worse Resistant Accuracy). The opposite happens when the distance between windows is 0 years. In this case the window is near the last years of the data set, therefore it contains more resistant instances (improving Resistant Accuracy) and less susceptible instances (decreasing Susceptible Accuracy). The Total Accuracy improves when the distance is small because in test window the majority of instances are, mostly, resistant. If the majority class is well predicted, the Total Accuracy is high. We conclude that not all the three figures of merit improve as expected when distance is diminishing, in fact one of them gets worse. Applying oversampling to the minority class in this kind of fixed-size temporal windows, in order to balance the number of the two kind of instances, could improve the  accuracy in the minority class.
Secondly, it is relevant the change in behavior of prediction when date_culture is not considered in both LR and RF methods. Overall, when using date_culture for prediction in the 4 years and 2 years approaches, the Resistant Accuracy increases and the Susceptible Accuracy decreases compared to models not using date_culture. This probably happens because date_culture is compensating the lack of resistant instances of training windows in 4 and 2 years approaches, by telling the classifier the most probable class in test years, which tend to be resistant, and hence Resistant Accuracy is high in most cases, causing Susceptible Accuracy to decrease. The disadvantage of using date_culture is that it causes the minority class to worsen its prediction, since it introduces bias towards classifying instances as the most probable class of the time interval. Since, in the 0 years approach, without considering the date_culture feature, the results are similar or better than when date_culture is taken into account, we conclude that it is convenient not to use this feature.

Incremental Window
The experiments concerning the results of prediction by using an incremental training window are in rows with numbers from 10 to 12 in the M column of Tables from IV to IX.
In the case of using the LR method and including the feature date_culture, adding just features r&a i does not generally improve figures of merit. With the addition of both features r&a i and p&a i , half of the antimicrobial families (AMG, CF4 and PAP) improve their results, although this improvement is mild.
With RF and using the date_culture feature, the inclusion of the r&a i features does not improve performance. Conversely, adding r&a i and p&a i features improves results in 5 out of the 6 families (AMG, CF4, PAP, POL and QUI), with no worsening of the figures of merit of the CAR family.
For both LR and RF models without date_culture, it is noticed that including just the r&a i features does not provide an improvement in performance. However, taking into account both the r&a i and p&a i features, there is a significant improvement for almost all antimicrobial families. Total Accuracy and Resistant Accuracy are, in general, considerably lower when r&a i and p&a i features are not used together, in comparison with the results provided by including date_culture.
Taking into account the results with sliding windows of fixed size of 4 years and the current ones with an incremental training window, it is observed that, in general, the best results are obtained with an incremental training window. Though for some antimicrobial families, a specific combination of sliding windows can outperform the results of the incremental training window, there is not a common approach of sliding windows with better results for all families. Furthermore, when the incremental training window outperforms, it is for very little. The exception is the POL antimicrobial family, which achieves clearly better results with the 0 years approach. With the incremental training window, best results are mostly achieved by not including date_ culture, and adding both the r&a i and p&a i features. This confirms that the use of incremental training window represents a useful temporal approach to tackle the task presented in this study.
It is notable that, although MI suggested that the set of r&a i features contain relevant information to predict the targets, its use in conjunction with other features does not appear to improve performance. On the other hand, the p&a i features show a great potential to predict the result of the susceptibility test, since they improve performance in almost all cases.
It is also worth to analyze the fact that, if date_culture is not used, Total Accuracy and Resistant Accuracy get a low value when the r&a i and p&a i features are not jointly used, in comparison with the results obtained by using date_culture. The reason of this behavior is similar as the one indicated in previous experiments when not using the date_culture feature. Without date_culture, classifiers tend to predict much of the test instances as susceptible, because it is usually the majority class in incremental training windows (windows starting at the beginning of the data set). The date_culture feature compensates this by introducing bias towards predicting the majority class in the time interval, which in test (near the end of the data set) is resistant. In any case, using date_culture worsens the Susceptible Accuracy. By adding the p&a i features, it is not necessary to count with date_culture to get a good performance. Moreover, results with p&a i features and without date_culture, improve both Resistant Accuracy and Susceptible Accuracy because this kind of features do not introduce a temporal bias towards one of the two classes.

Incremental Window with Weights
The prediction results using an incremental training window and instance weighting are in rows with numbers 13 and 14 in the M column of Tables from IV to IX. The λ values for each particular case are expressed in Table X. It is observed that, using instance weighting, results improve for most of the antimicrobial families. The following are the best figures of merit of A T ot -A Rst -A Scb provided by applying instance weighting: • AMG: 79.55%-75.76%-90.91%. Obtained using RF, without date_culture and with both the r&a i and p&a i sets of features. The weight hyperparameter is λ =1e-05. • CF4: 60.67%-58.62%-64.52%. Obtained using RF, without date_ culture and with both the r&a i and p&a i sets of features. The weight hyperparameter is λ =1e-05.
Our results show that M13 and M14 performance, in the majority of families, improves or is maintained when the p&a i set of features is taken into account, confirming what was observed in the two previous groups of experiments. The only exception to that is the POL antimicrobial family. When the date_culture feature is used, just the POL family gets better results; in any other case, it is better to not consider this feature. The substantially different behavior of POL is probably due to the very small number of resistant instances for this family, which makes it very dependent on the date_culture feature. Besides that, for half of the families (CAR, PAP and POL), the best method is LR, while for the other half (AMG, CF4 and QUI), RF gets the best results.
It is also important to analyze the hyperparameter λ used to assign weights to instances. As previously explained, when the value of λ is small, a greater number of instances get a similar high weight (close to 1); otherwise, when λ is high, just a few instances, temporally close to the test set, get a high weight and the rest of instances get very small weights. For AMG, CF4 and PAP, λ is very small and results are very similar to those of the respective incremental window without weights. This happens because almost all instances are being considered. On the other hand, families CAR, POL and QUI, with a greater λ, show results that are, mostly, more similar to the respective sliding training window with a fixed size than to the incremental window.
Comparing the results of the incremental window with the performance for the rest of experiments, it is noticed that it improves the results for 3 of the 6 families, which are AMG, PAP and QUI. In the case of CAR, the whole incremental training window achieves better results than the version with weights. As before, the family CF4 gets better performance with a specific combination of sliding windows, probably because some particularity of its distribution; POL notably gets its best result with the 0 years approach windows, without date_culture and with neither the r&a i nor p&a i sets of features.

B. Relevant Features Analysis
Taking into account previous results, it seems that some features with high MI score, such as r&a i , do not help to predict the target feature. The feature date_culture, which has the highest MI score, increases the performance in some particular cases, but also introduces bias, and the best results in previous experiments are achieved when this feature is not used. On the other hand, the set of features p&a i , also with high MI scores, appears to improve performance in almost all antimicrobial families.
Our analysis reveals the inconsistency between features ranked as relevant according to MI and those that actually increase prediction performance. In order to contrast feature relevance, they are now obtained with an embedded method. Since RF has been used as classifier, tree-based estimators have been selected to compute the new feature importance, with Fig. 7 showing the ranking in relevance. Now, the most relevant feature for AMG, CAR, CF4, PAP and QUI are p&amg, p&car, p&cf4, p&pap and p&qui, respectively. In the case of POL, p&pol is ranked on the 7th position. Regarding date_culture, it is still very important. In the case of POL, date_culture is the most important one. The set of features r&a i are not considered important overall. The new ranking in feature relevance agrees to a greater extent with the prediction performance observed. The set of p&a i features are the most important ones, except for the POL family, where the most relevant feature is date_culture. These results make sense, since date_culture was the only feature improving performance in the POL family, due to small number of resistant instances. Also, the r&a i features get low relevance values, as expected. The reason why this method provides more insightful results is probably because it takes into account all other features in the data set, while in MI the feature relevance is calculated separately for each feature.
To further analyze the impact of the most relevant features, the antibiogram result has been predicted using just one feature. Two experiments have been carried out, each for one of the most important features in the data set (the p&a i features and date_ culture). Results with the respective p&a i features are detailed in  similar and the figures of merit are relatively high for most of the families. This evidences the high prediction power of this kind of features, even when using for prediction just one of them. Table XII presents the results with just date_culture. We observe that the prediction is dramatically biased towards the majority class when the LR method is considered, which in most cases is resistant due to the fact that test instances are in the future with respect to training instances. In the case of the POL antimicrobial family, results are biased towards the susceptible class since it generally is the majority class. Using RF, prediction is also biased, although to a lesser extent. As expected, the only family improving its performance when using just date_culture feature is POL.

V. Conclusions
One important and increasing problem in daily operation of worldwide health systems, and in particular, of hospitals is antimicrobial resistance. This resistance in some microorganisms (bacterium, viruses, etc.) appears when these microorganisms become to be resistant to antimicrobial drugs to which they were susceptible before. This change is due to a mutation of the microorganism or to the acquisition of the resistance gen. This problem is even more difficult in hospital ICUs, due to the critical condition of those patients. Therefore, a reliable and anticipated prediction for a given bacterium of being resistant or not to one or more antimicrobial families in a patient culture would greatly help physicians in their fight against those microorganisms.
In this study, a real anonymized data set with information about patients staying at the ICU in the University Hospital of Fuenlabrada (UHF) has been used. The data set is related to 3812 admissions of 3346 ICU patients, collected at the UHF during a period of 15 consecutive years (from July 2004 to May 2019). The collected data set from UHF was browsed to generate the final data set under study with the information regarding the patients and their different cultures. Originally there were 40 features, but after the application of some pre-processing techniques they were reduced to 37 to avoid the use of high correlated features.
The analysis have been focused on the Pseudomonas Aeruginosa bacteria because is one of the most dangerous bacteria in the ICU and its proved ability to develop multi-drug resistance. Furthermore, six antimicrobial families were considered: Aminoglycosides (AMG), Carbapenems (CAR), 4th Generation Cephalosporins (CF4), Extendedspectrum Penicillins (PAP), Polymyxins (POL) and Quinolones (QUI).
Logistic Regression and Random Forest models were tested. Different temporal modeling strategies were proposed based on different windowing schemes (sliding training window, incremental training window) to capture the concept drift phenomenon related to the resistance process of microorganisms. In addition, some new temporally-oriented features (p&a i and r&a i features) capturing the resistance/susceptibility information regarding past cultures of the same patient or regarding the other patients were proposed and evaluated to improve the prediction accuracy of the different models. A temporal weighting scheme of the instances was proposed and it improved the prediction accuracy. Using or not some important features, according to the MI score, like date_culture, p&a i features and r&a i features were tested in fourteen models (M1 to M14). The results show that the Random Forest method with an incremental win-dow approach, using temporal weighting of the instances and the temporally-oriented features of past cultures is better, especially because both the accuracy for resistant bacteria and susceptible bacteria is more balanced.
Regarding previous studies such as [6], [17] and [18], some similarities and differences are observed with this study. There are many differences between [6] and our work, such as the time interval considered in the data set, the number of instances, the generation of new longitudinal features or the methods used, but the concept drift is observed in both works. It is even more noticeable in our work due to the long time interval considered, with the windowing approach showing great benefits when applied to this problem. Unlike the work in [17], our study applies temporal modelling with windowing, including data from the 21 days previous to the antibiogram result to be predicted. In this line, authors in [18] also consider the date of culture and apply a temporal modelling, but without windowing.
Remarkable contributions of our study are the new generated sets of features that consider temporal data contained along the data set, which regards the previous resistance of bacteria for the patient under study (p&a i ), and the resistance of bacteria previously detected in the ICU (r&a i ). In line with [18], our work also reveals that data from past cultures contain a relatively high amount of information to predict antimicrobial resistance. Particularly, the p&a i set of features showed to be the most useful for correct prediction when used in combination with some other features or even, in the case of some antimicrobial families, when used alone. Another relevant contribution of our study is the incremental training window scheme applied together with instance weighting. It allows to accurately classify cultures when the underlying data distribution dramatically changes along time. Our method introduces a more general and robust solution than those previously proposed, since it can be applied to heterogeneous data sets either with just a few or many years to be predicted, which is able to evolve along time and tackle the scarcity problem. Furthermore, it is able to provide high performance results for the majority of families, similar to the ones in other studies despite not using many of the most important risk factors identified in the literature, such as the antibiotics administered to patients. In addition, the thorough analysis of the relevance and interaction of different features will largely help in the development of future works.
There are different challenges to be addressed for future work.
On the one hand, oversampling techniques on training can be tested to check their influence on the model performance. On the other hand, we also consider including other features that could have some influence on the appearance of resistance bacteria in the ICU, like some additional patients' details about their admission, whether they required intubation or not and whether they needed mechanical ventilation or not. It would also be interesting to consider the inclusion of features encoding the antibiotic usage in a temporal context, at a patient level and ICU level. In order to properly tackle the different resistant phenotypes observed in this study, the non-uniform distribution of genotypic resistance mechanisms could be considered. It is also relevant to analyze in a different manner (such as assigning particular weights) cultures isolated from some specific sites such as tracheostomy or environmental water sources, because of their ability to generate aerosols close to patients, increasing the probability of nosocomial bacterial transmission.