TEE4EHR: Transformer Event Encoder for Better Representation Learning in Electronic Health Records

Irregular sampling of time series in electronic health records (EHRs) is one of the main challenges for developing machine learning models. Additionally, the pattern of missing data in certain clinical variables is not at random but depends on the decisions of clinicians and the state of the patient. The point process is a mathematical framework for analyzing event sequence data that is consistent with irregular sampling patterns. Our model, TEE4EHR, is a transformer event encoder (TEE) with a point-process loss that encodes the pattern of laboratory tests in EHRs. We investigate the utility of the TEE on a variety of benchmark event sequence datasets, and we conduct experiments on two real-world EHR databases to provide a more comprehensive evaluation. First, in a self-supervised learning approach, the TEE is jointly learned with an existing attention-based deep neural network, yielding superior performance in negative log-likelihood and future event prediction. In addition, we propose an algorithm for aggregating attention weights that can reveal the interactions between events. Second, we transfer and freeze the learned TEE for a downstream outcome prediction task, where it outperforms state-of-the-art models for handling irregularly sampled time series. Overall, our results demonstrate that our approach improves representation learning in EHRs and can be useful for clinical prediction tasks.


Introduction
Machine learning has the potential to revolutionize healthcare by leveraging the vast amounts of data available in electronic health records (EHRs) to develop more accurate clinical decision support systems [1,2]. EHRs store patient health information, such as medical history, medications, lab results, and diagnostic images, which can be used as input for machine learning algorithms to identify patterns and associations that could inform more precise diagnoses [3,4], better treatment plans [5], and earlier interventions [6]. Clinical decision support systems that use machine learning can provide evidence-based real-time recommendations to healthcare providers, reducing errors and improving patient outcomes [7].
Irregular sampling is one of the data challenges for machine learning (ML) with electronic health records (EHRs). EHR data are often collected at different times and frequencies, depending on a patient's healthcare needs and visit schedules, which can result in uneven and irregularly sampled time series. From a data perspective, asynchronous and incomplete observation of certain clinical variables is regarded as missingness in the data. However, the sources of missing data in EHRs must be carefully understood. For instance, lab measurements are usually ordered as part of routine care or a diagnostic workup, so the presence or absence of a data point conveys information about the patient's state [8]. As a result, in most cases the missingness is not at random (MNAR) and must be handled with care. In this paper, we refer to this type of missingness as informative missingness.
The most straightforward way to handle missing data in EHR time series is imputation, the process of filling in missing data with plausible values based on the available information. One simple approach is to impute with the mean, mode (most frequent value), or median; more sophisticated machine learning-based methods such as MissForest [9], MICE [10], and the kNN imputer [11] have been developed that can better capture complex relationships in the data. Regardless of the technique used, it is important to carefully consider the impact of the imputed values on the analysis and to report the imputation method in the results. Moreover, the amount of missing data in EHRs is often large, and imputation can be computationally expensive. In addition, the occurrence or non-occurrence of a measurement, and how often it is observed, can convey information of its own, and imputation may lead to an undesired distribution shift [12]. Therefore, filling in the missing values may not always be preferable [13].
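To make the baseline concrete, the simplest of these strategies, per-variable mean imputation, can be sketched in a few lines of numpy (an illustrative sketch, not the method used in this paper; MissForest, MICE, or a kNN imputer would replace this function in practice):

```python
import numpy as np

def mean_impute(X):
    """Fill NaNs in each column with that column's observed mean.

    A minimal illustration of mean imputation for a (samples x variables)
    matrix; more sophisticated imputers would replace this in practice.
    """
    X = X.astype(float).copy()
    col_means = np.nanmean(X, axis=0)           # per-variable mean, ignoring NaNs
    nan_rows, nan_cols = np.where(np.isnan(X))
    X[nan_rows, nan_cols] = col_means[nan_cols]
    return X

X = np.array([[1.0, np.nan],
              [3.0, 4.0],
              [np.nan, 8.0]])
X_imp = mean_impute(X)   # column means of observed values are [2.0, 6.0]
```

Even this toy example shows the problem discussed above: the imputed entries erase the information carried by the measurement pattern itself.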
Recently, new machine learning models have been developed to handle irregularly sampled time series without imputation. Gaussian Processes (GPs) can handle missing data effectively by providing a coherent framework for imputing missing values while capturing the uncertainty associated with the imputations [14]. Deep learning models have also been employed for irregularly sampled time series. Recurrent Neural Networks (RNNs) are well suited for this task because they process sequential data, taking into account not only the current input but also previous inputs. Convolutional Neural Networks (CNNs) can also extract features from the data and pass the processed data to an RNN for further analysis [15,16]. Additionally, attention weights can be integrated into RNNs and CNNs to help the model focus on important features in the data [17,18]. These deep learning models offer a powerful toolset for handling irregularly sampled time series and provide useful insights into this type of data. However, they do not explicitly account for the fact that missing data can convey unique information.
The point process is a mathematical framework for describing the distribution of events in time or space. These events can be user activities on social media [19], financial transactions [20], gene positions in bioinformatics [21], or even the patterns of laboratory tests in EHRs, which can be regarded as sequences of events ordered by clinicians [22]. At the core of the point process framework are the conditional intensity functions (CIFs) and the corresponding log-likelihood, which model the occurrence of events using the history of past events. More recently, Neural Point Processes (NPPs) have been developed to better characterize CIFs by leveraging the power of deep neural networks [23]. These models can be used for tasks such as predicting future events, estimating the rate of event occurrence, or identifying correlations between events [24,25,26]. They offer a flexible and powerful way to analyze event sequence data, as they can handle complex dependencies between events and incorporate prior knowledge about the process.

Aim of Study
Our primary objective is to enhance the capability of deep learning models for irregularly sampled time series by leveraging neural point processes. We propose a new framework, TEE4EHR, a Transformer Event Encoder (TEE) with a Deep Attention Module (DAM) for irregularly sampled time series in electronic health records (EHRs). Our TEE is inspired by attention-based neural point processes [25,26], which regard the pattern of irregularly sampled time series as a sequence of events, and it can be combined with any existing deep learning model for irregularly sampled time series. The code is available at https://github.com/hojjatkarami/TEE4EHR.

Contributions of this work
Our main contributions can be summarized as follows:
• A new transformer event encoder (TEE) for learning CIFs and future event prediction that can be trained with different point-process loss functions.
• Our TEE can improve the performance metrics on benchmark datasets in neural point process literature.
• We present a new framework, TEE4EHR, for learning TEE jointly with an existing deep learning model that is compatible with irregularly sampled time series.
• We show the utility of the proposed approach on two real-world datasets, as evidenced by its superior performance in clinical outcome prediction as well as better representation learning.
The current work is an extension of our previous work presented at the 2023 IEEE-EMBS International Conference on Biomedical and Health Informatics (BHI '23). It provides a comprehensive analysis of the proposed transformer event encoder on benchmark datasets and a more in-depth explanation of the results.
Background & Related Works

Problem Formulation
An irregularly sampled dataset can be denoted as D = {U_i}_{i=1}^N, where N is the number of samples. Each sample is represented as a sequence of tuples U_i = {(t_p, k_p, v_p)}_{p=1}^{P_i}, where P_i is the total number of observations and t_p, k_p, and v_p represent the time, name, and value of the p-th observation, respectively.
An event sequence dataset can be represented as D = {S_i}_{i=1}^N, where N is the total number of samples. Each sample S_i is a sequence of events S_i = {(t_j, e_j)}_{j=1}^{L_i}, where L_i is the total number of events, t_j is the event's timestamp, and e_j ∈ R^M is the binary representation of the event marks (one-hot for multi-class, multi-hot for multi-label). Furthermore, the history of events at time t is denoted as H_t = {(t_j, e_j) : t_j < t}.
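The two data representations above can be sketched directly as Python structures (the variable names and toy values are ours, for illustration only):

```python
import numpy as np

# One irregularly sampled series U_i: (time, variable name, value) tuples.
U_i = [(0.5, "HR", 82.0), (0.5, "SBP", 118.0), (3.2, "Lactate", 1.9)]

# One event sequence S_i: (timestamp, mark vector) pairs with M = 3 event
# types; multi-hot marks allow co-occurring events (multi-label setting).
S_i = [(0.5, np.array([1, 1, 0])),   # events 0 and 1 co-occur at t = 0.5
       (3.2, np.array([0, 0, 1]))]

def history(S, t):
    """H_t: all events with timestamp strictly before t."""
    return [(tj, ej) for tj, ej in S if tj < t]

H = history(S_i, 3.2)   # contains only the event at t = 0.5
```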

We can represent an EHR dataset as D = {(U_i, S_i, d_i, y_i)}_{i=1}^N, where N is the number of patients, U_i is the irregularly sampled time series data, and S_i is the patient's event data, such as visits and laboratory tests. Static variables such as demographics and the patient's outcome, if available, are denoted by d_i and y_i, respectively. In this work, we treat the sampling patterns of time series as events, as they are informative.

Point Process Framework
The core idea of the point process framework is the definition of conditional intensity functions (CIFs), where λ*_m(t) gives the probability of the occurrence of an event of type m in an infinitesimal time window [t, t + dt):

λ*_m(t) dt = P(event of type m in [t, t + dt) | H_t).    (1)

Here, * denotes conditioning on the history of events (H_t). The multivariate Hawkes process is the traditional approach to characterizing CIFs; it assumes a fixed parametric form of intensity that accounts for the additive influence of past events:

λ*(t) = μ + Σ_{t_j < t} ϕ(t − t_j),

where μ ≥ 0 (base intensity) is an exogenous component that is independent of the history, while ϕ(t) > 0 (excitation function) is an endogenous component that depends on the history and captures mutual influences [25]. The excitation function can be characterized in different ways, such as with exponential kernels [27] or with a linear combination of M basis functions [28].
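As a concrete illustration of the Hawkes intensity with an exponential excitation kernel, consider this small numpy sketch (the parameter values `mu`, `alpha`, `beta` are arbitrary illustrative choices, not values from the paper):

```python
import numpy as np

def hawkes_intensity(t, event_times, mu=0.2, alpha=0.8, beta=1.0):
    """Univariate Hawkes intensity with an exponential excitation kernel:

        lambda*(t) = mu + sum_{t_j < t} alpha * exp(-beta * (t - t_j))

    Each past event adds an exponentially decaying bump to the intensity.
    """
    past = np.asarray([tj for tj in event_times if tj < t])
    return mu + np.sum(alpha * np.exp(-beta * (t - past)))

events = [1.0, 2.5]
lam = hawkes_intensity(3.0, events)   # base rate plus two decayed excitations
```

The neural point processes discussed below replace this fixed parametric form with a learned function of the history embedding.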

Parameter Estimation
Based on the conditional intensity function in equation (1), it is straightforward to derive the conditional probability density function p*_m(t) in the interval (t_j, t_j+1]:

p*_m(t) = λ*_m(t) exp( − ∫_{t_j}^{t} Σ_{m'=1}^{M} λ*_{m'}(s) ds ).

The parameters of the point process can be learned using the Maximum Likelihood Estimation (MLE) framework. However, more advanced methods such as adversarial learning [29] and reinforcement learning [30,31] have also been proposed.
In the multi-class (MC) setting, where only one event can occur at a time, the log-likelihood (LL) of the point process for a single event sequence S_i is defined as:

LL_MC(S_i) = Σ_{j=1}^{L_i} Σ_{m=1}^{M} 1(e_j = m) log λ*_m(t_j) − Σ_{m=1}^{M} ∫_{0}^{T_i} λ*_m(s) ds.    (2)

Here, 1(·) is the indicator function. The log-likelihood combines two pieces of information: the log-intensities of the events that actually occurred, and the integral term, which accounts for the probability that no events occurred anywhere else in the observation window. However, in many cases, such as in EHRs, it is common to have co-occurring events (the multi-label (ML) setting). To handle this, [32] proposed replacing the categorical term with a binary cross-entropy over the event marks. Another approach is the marked case, which assumes that the marks and timestamps are conditionally independent given the history of events (H_t):

LL_marked(S_i) = Σ_{j=1}^{L_i} log p*(t_j) + Σ_{j=1}^{L_i} Σ_{m=1}^{M} 1(e_j = m) log p*(e_j = m).

This marked case is essentially an autoencoder (AE) for next-mark prediction, combined with a one-dimensional point process for the timestamps only.
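A minimal sketch of how the multi-class point-process log-likelihood can be evaluated, with the non-event integral estimated by Monte Carlo (the function names and the toy constant intensity are our assumptions; a learned decoder would supply `intensity` in practice):

```python
import numpy as np

rng = np.random.default_rng(0)

def pp_log_likelihood(event_times, marks, intensity, T, n_mc=1000):
    """Multi-class point-process log-likelihood for one sequence:

        LL = sum_j log lambda*_{m_j}(t_j) - integral_0^T sum_m lambda*_m(s) ds

    `intensity(t)` returns the vector of per-mark intensities at time t;
    the integral over [0, T] is estimated by Monte Carlo sampling.
    """
    event_term = sum(np.log(intensity(t)[m]) for t, m in zip(event_times, marks))
    s = rng.uniform(0.0, T, size=n_mc)
    integral = T * np.mean([intensity(si).sum() for si in s])
    return event_term - integral

# Toy constant two-mark intensity: the integral is exactly T * (0.5 + 0.3).
const_intensity = lambda t: np.array([0.5, 0.3])
ll = pp_log_likelihood([1.0, 2.0], [0, 1], const_intensity, T=4.0)
```

Dropping the `integral` term leaves exactly the cross-entropy objective discussed next, which is what the AE ablation in our experiments does.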
It is important to note that the main advantage of point processes lies in their ability to model non-event likelihoods in the form of integrals. If we neglect the integral terms in these loss functions, we are left with the cross-entropy and binary cross-entropy losses for predicting the next mark given the history of events, in the multi-class and multi-label settings respectively.

Neural Point Process
Encoder-decoder architectures have proven to be effective in many time series applications. The main idea of a neural point process (NPP) is to first encode the history of events up to t_j using a neural network as h_j = Enc(H_{j+1}; θ), and then use h_j to parameterize the CIFs with a decoder architecture λ*_m(t) = Dec(h_j; ϕ) for t ∈ (t_j, t_j+1] [23]. Initial works utilized recurrent encoders such as RNNs [33] or LSTMs [34], where the hidden state is updated after the arrival of a new event as h_{j+1} = Update(h_j, (t_j, e_j)). The main advantage of these models is their ability to compute history embeddings in O(L) time. However, they are susceptible to ignoring long-term inter-event dependencies. In contrast, set aggregation encoders encode all past events directly into a history embedding. One approach to capturing long-term dependencies is self-attention [26,25], which can be trained in parallel and is more computationally efficient.
We focus on neural point process models based on attention mechanisms, which can potentially address the problems of slow serial computation and loss of long-term information. In addition, attention weights bring interpretability and can reveal peer influences between events. The Self-attentive Hawkes Process (SAHP) [25] proposes a multi-head attention network as the history encoder. In addition, it uses a softplus function that can capture both excitation and inhibition effects. Similarly, the Transformer Hawkes Process (THP) [26] adopts the transformer architecture [35], using time encodings in place of the positional encodings of the original transformer. However, this model only captures mutual excitations between events. In another interesting study [32], researchers compared different combinations of encoders (self-attention and GRU) and decoders on various datasets. They demonstrated that attention-based NPPs appear to transmit pertinent EHR information and perform favorably compared to existing models. One gap in the current literature is that these models have no means of encoding additional information that could be useful for characterizing CIFs. For example, in EHRs, there are other sources of information, such as time series values, that could help characterize the CIFs.

Deep learning for irregularly sampled data
Recurrent neural networks have been modified to handle irregularly sampled time series. For example, GRU-D [24] adapts the GRU to consider missingness patterns, in the form of feature masks and time intervals, to achieve better prediction results. RETAIN [36] is based on a two-level neural attention model specifically designed for EHR data. SeFT [37] builds on recent advances in differentiable set function learning [38]; it is highly parallelizable with a beneficial memory footprint and thus scales well to large datasets of long time series and online monitoring scenarios. Its aggregation function is similar to that of transformers, computing the embeddings of set elements independently, which leads to a lower runtime and memory complexity of O(P). Although these models are nearly end-to-end and eliminate the need for an imputation pipeline, it is still unclear how much they are affected by the missingness pattern in EHRs. Additionally, they do not explicitly model the missingness pattern in the data.

Proposed Model
The proposed model, TEE4EHR, consists of two modules: a Transformer Event Encoder (TEE) for handling event data (S_i) and a Deep Attention Module (DAM) for handling irregularly sampled time series (U_i) [37]. The schematic of the model is depicted in Fig. 1. The irregularly sampled time series of an example patient is depicted as an image, where the x-axis represents time, the y-axis represents the variable, and the color indicates the value. The TEE and DAM modules encode the data from the beginning until t_j into the embeddings h_j and y_j, respectively. The concatenated vector [h_j, y_j] can be used to parameterize the conditional intensity functions in the next interval or for any other downstream task.

Architecture
We use a transformer event encoder (TEE) similar to THP [26], with a few modifications. In the first step, we embed all event marks as E_emb = E × W_emb, where E ∈ R^{L×M} is the binary encoding matrix of all event marks (multi-label or multi-class) and W_emb ∈ R^{M×d_emb} is a trainable embedding matrix. In addition, we encode the vector of timestamps t = [t_1, t_2, ..., t_L] to E_time = [TE(t_1), TE(t_2), ..., TE(t_L)] ∈ R^{L×d_time} using the following transformation:

[TE(t)]_{2i} = sin(t / T^{2i/d_time}),   [TE(t)]_{2i+1} = cos(t / T^{2i/d_time}).    (3)

Here, T represents the maximum time scale and d_time is the time embedding dimension. This transformation closely resembles the positional encodings in transformers [35], with the position index replaced by the timestamp.
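A sinusoidal time encoding of this kind can be sketched in numpy as follows (the hyperparameter values `d_time` and `T` are illustrative assumptions):

```python
import numpy as np

def time_encoding(t, d_time=8, T=10000.0):
    """Sinusoidal time encoding TE(t): analogous to transformer positional
    encodings, but indexed by the (possibly irregular) timestamp t rather
    than by the integer position in the sequence.
    """
    i = np.arange(d_time // 2)
    freqs = t / (T ** (2 * i / d_time))
    te = np.empty(d_time)
    te[0::2] = np.sin(freqs)   # even dimensions
    te[1::2] = np.cos(freqs)   # odd dimensions
    return te

te = time_encoding(3.5)   # an 8-dimensional encoding of timestamp t = 3.5
```

Because each dimension oscillates at a different time scale, nearby timestamps receive similar encodings while distant ones do not.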
Unlike THP and the original positional encoding [35], which assume d_emb = d_time and add the time encoding to the event embedding, we propose to concatenate these two vectors before providing them as input to the transformer block (our first modification):

X = Concat(E_emb, E_time) ∈ R^{L×(d_emb + d_time)}.

Finally, we use the standard transformer architecture with multiple layers and attention heads to encode the embedded event matrix X into the encoded matrix H = (h_1, ..., h_j, ..., h_L).
Causal masking is essential in our transformer architecture to prevent information leakage from the future to the past. Specifically, the vector h_j should contain all the information available up to the j-th event, which is later used to parameterize the CIFs within the interval (t_j, t_j+1]. By default, the masking matrix M_0 is an upper triangular matrix in which all elements above the diagonal are one (one indicates the elements to be masked). Our second modification is to introduce an additional masking parameter w: performing a left column shift of w on M_0 yields M_w, so that h_j contains the information of only the first (j − w) events. This type of masking can prevent overfitting and is investigated in more detail in the experimental section.
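The shifted mask construction can be sketched as follows (a minimal numpy sketch of the idea; the actual implementation details in the released code may differ):

```python
import numpy as np

def shifted_causal_mask(L, w=0):
    """Build the masking matrix M_w (True = masked attention position).

    M_0 masks strictly-future positions (standard causal mask); shifting
    its columns left by w additionally hides the w most recent events,
    so the j-th output only attends to the first (j - w) events.
    """
    m = np.triu(np.ones((L, L), dtype=bool), k=1)  # M_0
    if w > 0:
        m = np.roll(m, -w, axis=1)   # left column shift by w
        m[:, -w:] = True             # columns shifted in from the left stay masked
    return m

M1 = shifted_causal_mask(4, w=1)
# Row j of M1 leaves exactly max(j + 1 - w, 0) positions unmasked.
```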

Attention Aggregation
We can use the attention matrix that the transformer event encoder produces for each sample for model interpretability. In addition, we can aggregate the attention matrices of a group of samples to extract an influence matrix that reveals the interactions between events. Consider the attention matrix A^i ∈ R^{L_i×L_i} of the i-th sample, where the elements of each row sum to one. We multiply each row by the number of unmasked events to compensate for different event counts. From the rescaled attention we derive two event-level matrices: an event frequency matrix C^i, which counts how often event n occurs before event m, and an event interaction matrix I^i, which counts only those pairs whose attention values are significant. Finally, we aggregate these matrices over N samples into C^agg and I^agg. Here, C^agg_mn can be interpreted as the average number of times that event n occurs before event m. Similarly, I^agg_mn reveals the fraction of C^agg_mn in which event n plays a significant role in the prediction of event m.
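The aggregation idea can be sketched as follows (the function name, the projection onto event types via the mark matrix, and the fixed significance threshold are our assumptions for illustration, not the paper's exact algorithm):

```python
import numpy as np

def aggregate_attention(attn_list, marks_list, threshold=0.1):
    """Aggregate per-sample attention into event-level matrices.

    For each sample: rescale attention rows, project the L x L matrix onto
    the M event types using the multi-hot mark matrix E, count n-before-m
    occurrences (C) and significant attention values (I), then average
    over samples.
    """
    M = marks_list[0].shape[1]
    C_agg = np.zeros((M, M))
    I_agg = np.zeros((M, M))
    for A, E in zip(attn_list, marks_list):       # A: L x L, E: L x M multi-hot
        L = A.shape[0]
        A = A * np.arange(1, L + 1)[:, None]      # rescale by #unmasked events
        C_agg += E.T @ np.tril(np.ones((L, L)), k=-1) @ E   # n occurs before m
        I_agg += E.T @ (A > threshold) @ E        # significant-attention pairs
    n = len(attn_list)
    return C_agg / n, I_agg / n

A = np.array([[1.0, 0.0], [0.5, 0.5]])
E = np.array([[1, 0], [0, 1]])                    # event 0 at step 1, event 1 at step 2
C, I = aggregate_attention([A], [E])
```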

Deep Attention Module
We use a deep attention module (DAM) [37] for encoding all additional information, including the irregularly sampled time series. Each observation is represented as u_p = (TE(t_p), k_p, v_p) ∈ R^{d_s}, where TE(t_p) is the same time-encoding transformation as in Equation (3).
We define U_p to be the set of the first p observations. The goal is to compute an attention weight a(U_p, u_k), k ≤ p, which captures the relevance of the k-th observation u_k to the set U_p. This is achieved by computing an embedding of the set using smaller set functions and projecting the concatenation of the set representation and the individual set element into a lower-dimensional space. The set representation is obtained as

f'(U_p) = g'( (1/p) Σ_{k=1}^{p} h'(u_k; θ') ),

where h'(·; θ') is a multilayer perceptron (MLP) applied to each of the first p observations, the mean is taken over those observations, and a second transformation g' produces the set embedding. We then compute the key values K_p using the key matrix W_k ∈ R^{(d_g' + d_s)×d_prod}:

K_p = [f'(U_p), u_k]_{k≤p} W_k.

Using a query vector w_q ∈ R^{d_prod}, we compute the desired attention weights:

a(U_p, u_k) = softmax_k( K_p w_q / sqrt(d_prod) ).

Finally, we compute a weighted aggregation of the set elements using these attention weights, y'_p = Σ_{k≤p} a(U_p, u_k) h'(u_k; θ'), and consider y'_p ∈ R^{d_g} as the representation of the first p observations. The matrix Y' = [y'_1, y'_2, ..., y'_P] ∈ R^{P×d_g} collects all representations of the data. To use the state information (Y') together with the event encodings (X) for CIF characterization, we down-sample Y' to Y = [y_1, y_2, ..., y_L], where y_j = y'_{p*} with p* = argmax_{t_p ≤ t_j} p.
Without loss of generality, we can also consider multiple heads by adding another dimension to the keys and queries, yielding a more expressive model. We also embed the static variables d_i using a separate MLP module and concatenate the result with y_p.
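Putting the DAM steps together, a single-head forward pass can be sketched in numpy (random stand-in weights replace the learned MLPs h' and g', so this is a shape-level sketch of the mechanism, not the trained module):

```python
import numpy as np

def dam_embedding(obs, d_prod=4, rng=np.random.default_rng(0)):
    """Minimal single-head sketch of the deep attention module (DAM).

    obs: (p, d_s) array of encoded observations u_1..u_p.
    Steps: per-element transform h', mean-pooled set representation,
    keys from [set representation, element], scaled dot-product attention
    with a query vector, then weighted aggregation.
    """
    p, d_s = obs.shape
    W_h = rng.standard_normal((d_s, d_s))
    h = np.tanh(obs @ W_h)                         # h'(u_k), stand-in MLP
    f = h.mean(axis=0)                             # set representation (mean pooling)
    K = np.concatenate([np.tile(f, (p, 1)), obs], axis=1)   # [f'(U_p), u_k]
    W_k = rng.standard_normal((K.shape[1], d_prod))
    w_q = rng.standard_normal(d_prod)
    scores = (K @ W_k) @ w_q / np.sqrt(d_prod)
    a = np.exp(scores - scores.max()); a /= a.sum()         # softmax attention
    return a @ h                                   # weighted aggregation y'_p

y = dam_embedding(np.ones((3, 5)))                 # representation of 3 observations
```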

Event Decoder
Once we obtain a representation of a sample through the embedded events (h_j) and states (y_j), we can parameterize the conditional intensity functions (CIFs) of the events.
In the neural point process literature, many approaches have been proposed to decode either conditional or cumulative intensity functions. We use a decoder similar to [25], as it can model both exciting and inhibiting effects in the CIFs. The parameters η_{m,j}, μ_{m,j}, and γ_{m,j} are obtained from the concatenated embedding [h_j, y_j] through linear layers with gelu activations, where gelu is the Gaussian Error Linear Unit, which has been empirically shown to be superior to other activation functions for self-attention [39]. Finally, we can express the intensity function as:

λ*_m(t) = softplus( μ_{m,j} + (η_{m,j} − μ_{m,j}) exp(−γ_{m,j}(t − t_j)) ),   for t ∈ (t_j, t_j+1],

where the softplus is used to constrain the intensity function to be positive. Here, η_{m,j} is the initial intensity at t_j, μ_{m,j} is the baseline intensity as t → ∞, and γ_{m,j} controls the decay rate.
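The softplus-constrained decay between the initial and baseline intensities described above can be evaluated as follows (the parameter values are illustrative, not learned values from the model):

```python
import numpy as np

def softplus(x):
    """Numerically stable softplus: log(1 + exp(x))."""
    return np.log1p(np.exp(-np.abs(x))) + np.maximum(x, 0)

def intensity(t, t_j, eta, mu, gamma):
    """Decoded CIF on (t_j, t_{j+1}]:

        lambda*_m(t) = softplus(mu + (eta - mu) * exp(-gamma * (t - t_j)))

    eta: initial intensity at t_j; mu: baseline as t -> infinity;
    gamma: decay rate. A negative (eta - mu) models inhibition.
    """
    return softplus(mu + (eta - mu) * np.exp(-gamma * (t - t_j)))

lam = intensity(t=2.0, t_j=1.0, eta=1.5, mu=0.2, gamma=1.0)
```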

Experiments
We conduct various experiments to empirically demonstrate the effectiveness of each component of our model. First, we compare the performance of our proposed TEE module with baseline models from the neural point process literature on four common event sequence datasets. Second, we show the advantage of our main model, TEE4EHR, on two real-world healthcare datasets for handling irregularly sampled time series.

Datasets
The overall characteristics of the datasets are summarized in Table 1. The event sequence datasets are as follows: ReTweets (RT) includes sequences of tweets, each consisting of an original tweet and its follow-up tweets. For each tweet, the time of posting and the user tag are recorded. Additionally, users are classified into three groups ('small', 'medium', and 'large') based on their number of followers. We use two versions of this dataset, for the multi-class (RT-MC) [25] and multi-label (RT-ML) [32] scenarios.
Stack Overflow (SO) is an online platform where people can ask questions and get answers. To encourage active participation, users are awarded badges, and a user can receive the same badge multiple times. The data were collected over a span of two years, and each user's badge history is treated as an event sequence in which each event represents the receipt of a specific badge. We use the same processed dataset as in the literature [26].
Synthea (SYN) is a synthetic patient-level EHR generated using human-expert-curated Markov processes [40]. We reuse the processed version of this dataset from [32].
We also use two real-world ICU datasets from PhysioNet [41]: the PhysioNet 2012 Mortality Prediction Challenge (P12) contains 12,000 ICU stays, each lasting at least 48 hours [42]. At admission, general descriptors such as gender and age are recorded for each stay. Throughout the stay, up to 37 time series variables, such as blood pressure, lactate, and respiration rate, are measured depending on the patient's condition. These variables may be measured at regular intervals, such as hourly or daily, or only when necessary. The task is to predict in-hospital mortality, and the prevalence is 14.2%.
The PhysioNet 2019 Sepsis Prediction Challenge (P19) has a data format similar to P12, and the task is to predict whether a patient will develop sepsis within 6 hours. The prevalence of patients with septic shock is 4.2%.

Preliminary evaluation of TEE
In the first series of experiments on event sequence datasets, we investigate the TEE module with three different loss functions:
• PP(multi-class or multi-label) uses the multi-class or multi-label point-process loss, respectively.
• PP(marked) is a marked point process, in which we assume marks and timestamps are conditionally independent.
• AE is a simple auto-encoder for predicting future events from event embeddings, without CIF characterization. Its loss function is cross-entropy or binary cross-entropy for multi-class and multi-label datasets, respectively. We use this loss as an ablation to isolate the utility of the integral terms in the point-process losses.
Throughout the experiments, time concatenation and a masking parameter of w = 1 were found to be more effective than the alternatives, and they are used as the default setting for all of our experiments. However, we also report performance with different time-encoding strategies (concatenation and summation) and different values of the masking parameter (w ∈ {0, 1, 2, 3}) to isolate the effect of each component.

TEE for EHRs
After evaluating our TEE module on event sequence datasets, we aim to demonstrate its effectiveness on two real-world healthcare datasets, P12 and P19. Initially, we utilize both TEE and DAM to model the patterns of laboratory tests, which serve as our events, by characterizing their conditional intensity functions. Notably, this approach operates as a form of self-supervised learning, as we do not explicitly rely on any labels. In the subsequent phase, we employ the TEE module transferred from the initial step alongside the DAM for outcome prediction, transitioning into a supervised learning framework. Our objective is to assess the potential utility of the TEE module, which encodes the patterns of laboratory tests, for accurate outcome prediction.

Self-supervised Learning
In this step, we train the TEE together with the DAM using the different point-process loss functions. Compared to the standard neural point process framework, there are two main differences. First, we use the DAM separately to embed the irregularly sampled time series. Second, we use the embeddings of both the TEE and the DAM to parameterize the CIFs. To assess the effectiveness of jointly learning the DAM, we compare against training the TEE alone.
Here, we regard the occurrence of certain laboratory variables (or patterns of irregularly sampled laboratory tests) as events that are part of routine care (for more details, refer to Appendix A).
To assess the quality of learned representations during the self-supervised learning step, we use a simple multilayer perceptron layer for predicting the outcome from embeddings.This layer is detached from the main network and does not affect the training of TEE and DAM.

Supervised Learning
In the last series of experiments, we investigate the utility of the TEE learned in the self-supervised step for the downstream tasks of predicting septic shock and in-hospital mortality in P19 and P12, respectively. In particular, we transfer and freeze the learned TEE so that it is not affected by the supervised loss function.

Baselines
For the preliminary experiments, we compare TEE with the following baselines: SAHP [25], which was the first model to utilize attention weights; THP [26], which was the first work to introduce transformers to event sequence data; Latent graphs [44], which is based on a probabilistic graph generator; and GRU-CP [32] which uses a GRU decoder with conditional Poisson (CP) decoder for CIF characterization.

Metrics
We report the log-likelihood normalized by the number of events (LL/#events) as a goodness-of-fit measure for CIF characterization [25,26]. However, we do not compare this metric in the preliminary evaluation of TEE, as it is problematic to compare across different loss functions and across models with different event decoders.
For future event type prediction, we report the weighted F1-score and the area under the receiver operating characteristic curve (AUROC) in the multi-class and multi-label settings, respectively. In the supervised binary prediction task, we report the F1-score, the area under the precision-recall curve (AUPRC), and the AUROC.
We use t-Distributed Stochastic Neighbor Embedding (t-SNE), a machine learning algorithm for visualizing high-dimensional data in a lower-dimensional space [47], to show the learned representations in the downstream tasks.
We also introduce a similarity metric between the pattern of laboratory tests of a patient and its 10 nearest neighbors in the embedding space (learned representations). To compute this metric, we first calculate the measurement density of each laboratory variable during the patient's stay:

d^i_m = (1/T_i) Σ_{j=1}^{L_i} 1(e_j = m),

where T_i is the duration of the i-th patient's stay in hours, so that d^i_m represents the average hourly frequency of the m-th laboratory variable for the i-th patient. Then, for each patient i, we calculate the cosine similarity between d^i and each of its 10 closest neighbors in the embedding space and take the average (CS^i_avg). The average of CS^i_avg over all positive patients is referred to as the 10-nearest-neighbor pattern similarity (10nn-ps) and is reported for each dataset.
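The metric just described can be sketched as follows (function and variable names are ours; the density vectors `density` correspond to the d^i defined above):

```python
import numpy as np

def knn_pattern_similarity(emb, density, labels, k=10):
    """Sketch of the k-nearest-neighbour pattern similarity (10nn-ps).

    emb:     (N, d) learned patient embeddings
    density: (N, M) lab-test density vectors d^i
    labels:  (N,) binary outcomes; the score averages over positives.
    For each positive patient, average the cosine similarity between its
    density vector and those of its k nearest embedding-space neighbours.
    """
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

    scores = []
    for i in np.where(labels == 1)[0]:
        dist = np.linalg.norm(emb - emb[i], axis=1)
        nn = np.argsort(dist)[1:k + 1]            # skip the patient itself
        scores.append(np.mean([cos(density[i], density[j]) for j in nn]))
    return float(np.mean(scores))

emb = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
density = np.ones((4, 3))                          # identical patterns -> score 1
labels = np.array([1, 0, 0, 0])
s = knn_pattern_similarity(emb, density, labels, k=2)
```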

Preliminary evaluation of TEE
Table 2 shows the performance of our proposed TEE with different loss functions compared to various baselines. TEE with the AE loss achieves F1-scores of 0.63 and 0.38 on RT-MC and SO, respectively, and an AUROC of 0.90 on the SYN dataset. TEE with the PP(ML) loss achieves an AUROC of 0.74 on RT-ML. All of these scores improve upon the baselines.
The Stack Overflow and ReTweets datasets have been widely used in the neural point process literature to demonstrate the advantages of modeling non-event likelihoods with the point-process loss. However, we show for the first time that a simple autoencoder can perform even better. RT-ML is the only dataset where the PP(ML) loss outperforms the AE loss. Upon examining this dataset, we observe that the occurrence of the same event at consecutive positions in a sequence (self-exciting behavior) is more frequent, which may explain why the point-process loss achieves better results than the AE loss. As mentioned previously, the point-process loss is the sum of the AE loss and an integral term that models non-event likelihoods. Therefore, for future studies on neural point processes, we recommend that researchers compare their proposed models against ablated versions in which the integral term is omitted.
Table 3 presents the effect of different time-encoding strategies and masking parameters on the performance of TEE with different loss functions. In general, time concatenation and a masking parameter of w = 1 yield the best results.
While adding time encodings to event embeddings has been common practice in the point process literature [25,26], we demonstrate for the first time that time concatenation can improve both the log-likelihood and next-event-type prediction. In natural language processing, summation is typically preferred over concatenation due to its lower memory requirements, fewer parameters, and reduced runtime. With concatenation, the model has direct access to the positional encodings, whereas with summation it must disentangle the positional information within the hidden layers [48]. We speculate that the datasets used in our study may not be as large as NLP datasets, which may be why concatenation yields better performance here. We recommend that researchers experiment with the time concatenation strategy on their own datasets to determine whether it improves results.
By using additional masking in our TEE, we effectively regularize the model by limiting its access to the most recent events during training. This constraint can prevent the model from memorizing specific patterns or sequences in the training data that may not generalize well to future events, encouraging it to learn more robust and generalized representations. This is very similar to time series forecasting, where additional masking has been shown to improve model performance [49,50]. We recommend that future research treat the masking parameter as a hyperparameter to be optimized.

Future pattern prediction
The results for the prediction of future laboratory tests are reported in Table 4 for different loss functions as well as different architectures (TEE and TEE+DAM). The full model (TEE+DAM) with the PP(single) and AE losses achieves the best AUROC on P12 and P19, respectively.
In all cases, adding DAM results in higher AUROC and LL/#events for CIF characterization. In the hospital setting, it is plausible that the absolute values of patient states are advantageous for predicting the order of future laboratory events. To the best of our knowledge, this is the first work that investigates the effectiveness of a sequential neural network (DAM in our work) for characterizing CIFs as well as for future event prediction. Although we have evaluated our model on a healthcare database, a more detailed assessment could be performed on other event sequence data that includes additional information.
We have also reported the AUPRC of a label prediction layer detached from the learned representation. The TEE+DAM model with the PP(single) loss has the highest AUPRC. This indicates that although our model is not trained on the labels, the representations learned through the self-supervised approach have predictive value for patient outcomes in the ICU.

Model interpretability
Fig. 2 shows the results of our attention aggregation algorithm on the P12 dataset. Fig. 2-a illustrates the t-SNE plot of the learned representations of all patients in the self-supervised learning task, where each data point is colored by its true label (red points are positive patients, i.e., mortality). Two subgroups of patients are selected for the aggregation analysis. The first subgroup (G1) contains a larger proportion of positive patients (22%) than the second subgroup (G2, 7%). This indicates that self-supervised learning of sampling patterns can distinguish different classes in our dataset.
Furthermore, we select the 10 most frequent patterns of laboratory tests as events (Fig. 2-b). For example, the second most frequent pattern (P2) consists of the (PaCO2, PaO2, pH) laboratory tests. We show the matrices of aggregated event frequency (C_agg) and aggregated event interaction (I_agg) for G1 and G2. In G1, C_agg is dominated by FiO2-FiO2, which is expected since the patients in this group are more likely to be intubated and to have longer stays in the ICU. In G2, however, C_agg reveals other event frequencies with lower magnitude than in G1. I_agg reveals quite different insights. In G1, patterns with more measurements (P4, P8, P9, P10) influence the patterns with fewer measurements. In G2, however, the influences are smaller and more scattered. In this work, we considered patterns of laboratory tests as events, so explaining the event interactions with domain knowledge may not be intuitive. However, our TEE can encode other medical events such as medications, procedures, and diagnoses, in which case the aggregated event interaction matrix could reveal more interesting insights from a clinical perspective.
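The aggregation idea can be sketched as follows; treat this as an illustrative simplification of the algorithm rather than its exact normalization. Here `A[i, j]` denotes the attention weight that event i places on event j, and event types are integer pattern ids.

```python
import numpy as np

def aggregate_attention(attn_list, type_list, n_types):
    """Aggregate per-patient attention maps over a subgroup of patients.

    For each patient, `A` is an (L, L) attention matrix and `types` is a
    length-L vector of event-type ids. C counts how often each ordered
    type pair co-occurs within sequences (aggregated frequency), and I
    averages the attention between the pair (aggregated interaction).
    """
    C = np.zeros((n_types, n_types))
    S = np.zeros((n_types, n_types))
    for A, types in zip(attn_list, type_list):
        for i, ti in enumerate(types):
            for j, tj in enumerate(types):
                C[ti, tj] += 1
                S[ti, tj] += A[i, j]
    # Mean attention per pair; zero where a pair never co-occurs.
    I = np.divide(S, C, out=np.zeros_like(S), where=C > 0)
    return C, I

# Tiny example: one patient with two events of types 0 and 1.
A = np.array([[0.5, 0.5],
              [0.2, 0.8]])
C_agg, I_agg = aggregate_attention([A], [[0, 1]], n_types=2)
```

Averaging the per-pair attention rather than summing it keeps I_agg comparable between subgroups of different sizes, which matters when contrasting G1 and G2.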

Model performance
The results of outcome prediction (mortality and septic shock for P12 and P19, respectively) are reported in Table 5. A TEE module indicated inside brackets is the module transferred from the self-supervised learning task. For the P12 dataset, one of our model's variations consistently performs better than the baselines on all metrics. Among the variations of our proposed models, TEE+DAM with the AE and PP(single) losses trained from scratch performs better than the models with a TEE transferred from the self-supervised learning step, as well as the DAM-only model. On P19, the Raindrop model performs best only in AUPRC, while two variations of our model (TEE+DAM with AE and [TEE w PP(single)]+DAM) perform better in AUROC and F1-score.
In our experiments, the three transferred modules do not consistently improve over the non-transferred models. One possible reason could be the small size of the pretraining data, especially in the P12 dataset, with approximately 12,000 samples. In P19, however, the dataset is larger (around 38,000 samples), and we observe that the transferred modules improve AUPRC and F1-score compared to models trained from scratch. One advantage of transfer learning in this application is the assurance that the TEE module is not biased by the labels, which could be problematic in healthcare applications [8,51].

Learned Representations
Despite the minor improvements in outcome prediction, we contend that the use of TEE leads to more effective representation learning in our EHRs. Table 6 shows the 10nn-ps metric for different variations of our models on the P12 and P19 datasets. Adding the TEE module with the AE or PP(single) loss significantly increases our similarity metric compared to the DAM-only model.
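A simplified sketch of a k-nearest-neighbor pattern-similarity metric in the spirit of 10nn-ps follows; the cosine similarity between binary measurement masks is an illustrative choice, and the paper's exact similarity function may differ.

```python
import numpy as np

def knn_pattern_similarity(embeddings, patterns, k=10):
    """k-NN pattern similarity over a patient cohort (illustrative sketch).

    For every patient, find the k nearest neighbors in embedding space
    (Euclidean distance) and average the cosine similarity between their
    flattened binary measurement patterns. Higher values mean that
    patients with similar sampling patterns sit closer in the embedding.
    """
    X = np.asarray(embeddings, dtype=float)
    P = np.asarray(patterns, dtype=float)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)          # exclude each patient itself
    scores = []
    for i in range(len(X)):
        nbrs = np.argsort(dists[i])[:k]
        p, Q = P[i], P[nbrs]
        sims = (Q @ p) / (np.linalg.norm(Q, axis=1) * np.linalg.norm(p) + 1e-12)
        scores.append(sims.mean())
    return float(np.mean(scores))

# Toy cohort: two embedding clusters whose members share a pattern.
emb = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
pats = [[1, 0], [1, 0], [0, 1], [0, 1]]
score = knn_pattern_similarity(emb, pats, k=1)
```

In the toy cohort, each patient's nearest neighbor shares its pattern exactly, so the metric approaches 1; an embedding that ignored sampling patterns would score lower.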
Fig. 3-a illustrates an example patient in the P19 dataset. We also visualize the four nearest neighbors in the embedding spaces learned by the TEE+DAM and DAM models in Fig. 3-b and 3-c, respectively. As can be seen, the four nearest neighbors under TEE+DAM exhibit patterns much more similar to the example patient (some of the similar patterns are highlighted in yellow).

Conclusions
In this study, we introduce TEE4EHR, a new transformer event encoder for handling the patterns of irregularly sampled time series in EHRs based on point process theory. TEE achieves state-of-the-art performance on common event sequence datasets from the neural point process literature. Furthermore, by combining TEE with an existing deep attention module, we improve the performance metrics for outcome prediction on two real-world healthcare datasets. TEE also achieves better patient embeddings by leveraging the patterns of laboratory tests; that is, the learned representations keep patients with similar sampling patterns closer in the embedding space. More powerful representations of EHRs could have various applications in healthcare, such as synthetic data generation and patient phenotyping.
We also highlighted various drawbacks of current neural point process architectures and provided guidelines for future research in this area. In the future, we plan to validate our TEE on other event sequence datasets. Another avenue of investigation is substituting DAM with alternative architectures to explore potential performance gains when combined with TEE.

Figure 1: Schematic representation of the TEE4EHR model, combining a transformer-based event encoder (TEE) with a deep attention module (DAM) to handle irregularly sampled time series.

Figure 2: Attention aggregation on the P12 dataset. (a) t-SNE plot of patients, with the aggregated frequency matrix and aggregated interaction matrix for two subgroups of patients. (b) Top ten frequent measurement patterns for laboratory tests.

Figure 3: Patterns of similar patients in the embedding space. (a) An example patient in the P19 dataset. (b, c) The four nearest neighbors in the embedding spaces learned by TEE+DAM (b) and DAM (c). Similar patterns are highlighted in yellow.
λ_m(t_j | H_{t_j}) dt represents the likelihood of observing an event of type m in the interval [t_j, t_j + dt), conditioned on the past events H_{t_j}. Secondly, we can calculate the likelihood of not witnessing any event of type m over the rest of the interval [t_1, t_L] by computing exp(−∫_{t_1}^{t_L} λ_m(t | H_t) dt).

Table 2: Performance of TEE on common event sequence datasets. For each metric, the mean (std) across splits is reported. The best results are highlighted in bold. NA: not applicable. NR: not reported in the original work.

Table 3: Effect of time-encoding strategies and masking parameters on TEE performance. For each metric, the mean (std) across splits is reported. The best results are highlighted in bold.

Table 4: Self-supervised learning results. For each metric, the mean (std) across splits is reported. The best results are highlighted in bold. NA: not applicable.

Table 5: Supervised learning results. For each metric, the mean (std) across splits is reported. The best and second-best results are highlighted in bold and underlined, respectively. [MODEL] indicates the module transferred from the self-supervised learning task.

Table 6: 10-nearest-neighbor pattern similarity (10nn-ps) for different variations of our model. The mean (std) across splits is reported. The best and second-best results are highlighted in bold and underlined, respectively.