Time series cluster kernels to exploit informative missingness and incomplete label information

The time series cluster kernel (TCK) provides a powerful tool for analysing multivariate time series subject to missing data. TCK is designed using an ensemble learning approach in which Bayesian mixture models form the base models. Because of the Bayesian approach, TCK can naturally deal with missing values without resorting to imputation, and the ensemble strategy ensures robustness to hyperparameters, making it particularly well suited for unsupervised learning. However, TCK assumes missing at random and that the underlying missingness mechanism is ignorable, i.e. uninformative, an assumption that does not hold in many real-world applications, such as medicine. To overcome this limitation, we present a kernel capable of exploiting the potentially rich information in the missing values and patterns, as well as the information from the observed data. In our approach, we create a representation of the missing pattern, which is incorporated into mixed mode mixture models in such a way that the information provided by the missing patterns is effectively exploited. Moreover, we also propose a semi-supervised kernel, capable of taking advantage of incomplete label information to learn more accurate similarities. Experiments on benchmark data, as well as a real-world case study of patients described by longitudinal electronic health record data who potentially suffer from hospital-acquired infections, demonstrate the effectiveness of the proposed methods.


Introduction
Multivariate time series (MTS) frequently occur in a whole range of practical applications such as medicine, biology, and climate studies, to name a few. A challenge that complicates the analysis is that real-world MTS are often subject to large amounts of missing data. Traditionally, missingness mechanisms have been categorized into missing completely at random (MCAR), missing at random (MAR) and missing not at random (MNAR) [1]. The main difference between these mechanisms lies in whether the missingness is ignorable (MCAR and MAR) or non-ignorable (MNAR) [1,2,3]. In e.g. medicine, non-ignorable missingness can occur when the missing patterns R are related to the disease under study Y. In this case, the distribution of the missing patterns for diseased patients is not equal to the corresponding distribution for the control group, i.e. p(R | Y = 1) ≠ p(R | Y = 0). Hence, the missingness is informative [4,5,6]. By contrast, uninformative missingness will be referred to as ignorable in the remainder of this paper.
Both ignorable and informative missingness occur in real-world data. An example from medicine of ignorable missingness occurs e.g. if a clinician orders lab tests for a patient and the tests are performed, but because of an error the results are not recorded. On the other hand, informative missingness could occur if it is decided not to perform lab tests because the doctor thinks the patient is in good shape. In the latter case, the missing values and patterns potentially contain rich information about the diseases and clinical outcomes for the patient. Efficient data-driven approaches aiming to extract knowledge, perform predictive modeling, etc., must be capable of capturing this information.
Various methods have been proposed to handle missing data in MTS [7,8,9]. One simple approach is to create a complete dataset by discarding the time series with missing data. However, this gives unbiased predictions only if the missingness mechanism is MCAR. As an alternative, a preprocessing step involving imputation of missing values with some estimated value, such as the mean, is common. Other so-called single imputation methods exploit machine learning based methods such as multilayer perceptrons, self-organizing maps, k-nearest neighbors, recurrent neural networks and regression-based imputation [10,11]. Alternatively, one can impute missing values using various smoothing and interpolation techniques [12,10]. Among these, a prominent example is the last observation carried forward (LOCF) scheme that imputes the last non-missing value for the following missing values. Limitations of imputation methods are that they introduce additional bias and they ignore uncertainty associated with the missing values.
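The LOCF scheme mentioned above is straightforward to implement. The following is a minimal illustrative sketch (the function name and interface are our own, not from any cited work): leading missing values, which have no earlier observation to carry forward, are left untouched.

```python
import numpy as np

def locf_impute(x):
    """Last observation carried forward (LOCF) for a 1-D series with NaNs.
    Leading NaNs (no earlier observation) are left as NaN."""
    x = np.asarray(x, dtype=float).copy()
    last = np.nan
    for t in range(len(x)):
        if np.isnan(x[t]):
            x[t] = last        # carry the last observed value forward
        else:
            last = x[t]        # remember the most recent observation
    return x
```

For example, `locf_impute([1.0, nan, nan, 4.0, nan])` fills the gaps with 1.0, 1.0 and 4.0, respectively. Note that, as discussed above, this discards the information about which values were missing.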
Multiple imputation [13] resolves this problem, to some extent, by estimating the missing values multiple times and thereby creating multiple complete datasets. Thereafter, e.g. a classifier is trained on all datasets and the results are combined to obtain the final predictions. However, although multiple imputation and other imputation methods can give satisfying results in some scenarios, these are ad-hoc solutions that lead to a multi-step procedure in which the missing data are handled separately and independently from the rest of the analysis. Moreover, the information about which values are actually missing (the missing patterns) is lost, i.e. imputation methods cannot exploit informative missingness.
Due to the aforementioned limitations, several research efforts have been devoted over the last years to process incomplete time series without relying on imputation [6,14,15,16,17,18,19]. In this regard, powerful kernel methods have been proposed, of which the recently proposed time series cluster kernel (TCK) [20] is a prominent example. The TCK is designed using an ensemble learning approach in which Bayesian mixture models form the base models. An advantage of TCK, compared to imputation methods, is that the missing data are handled automatically and no additional tasks are left to the user. Multiple imputation instead requires a careful selection of the imputation model and other variables are needed to do the imputation [7], which particularly in an unsupervised setting can turn out to be problematic.
A shortcoming of the TCK is that unbiased predictions are only guaranteed for ignorable missingness, i.e. the kernel cannot take advantage of informative missing patterns frequently occurring in medical applications. To overcome this limitation, in this work, we present a novel time series cluster kernel, TCK_IM. In our approach, we create a representation of the missing patterns using masking, i.e. we represent the missing patterns using binary indicator time series. By doing so, we obtain MTS consisting of both continuous and discrete attributes. To model these time series, we introduce mixed mode Bayesian mixture models, which can effectively exploit information provided by the missing patterns.
The time series cluster kernels are particularly useful in unsupervised settings. In many practical applications such as medicine it is not feasible to obtain completely labeled training sets [21], but in some cases it is possible to annotate a few samples with labels, i.e. incomplete label information is available. In order to exploit the incomplete label information, we propose a semi-supervised MTS kernel, ssTCK. In our approach, we incorporate ideas from information theory to measure similarities between distributions. More specifically, we employ the Kullback-Leibler divergence to assign labels to unlabeled data.
Experiments on benchmark MTS datasets and a real-world case study of patients suffering from hospital-acquired infections, described by longitudinal electronic health record data, demonstrate the effectiveness of the proposed TCK_IM and ssTCK kernels.
The remainder of this paper is organized as follows. Section 2 presents background on MTS kernels. The two proposed kernels are described in Sections 3 and 4, respectively. Experiments on synthetic and benchmark datasets are presented in Section 5, whereas the case study is described in Section 6. Section 7 concludes the paper.

Multivariate time series kernels to handle missing data
Kernel methods have been of great importance in machine learning for several decades and have applications in many different fields [22,23,24]. Within the context of time series, a kernel is a similarity measure that also is positive semidefinite [25]. Once defined, such similarities between pairs of time series may be utilized in a wide range of applications, such as classification or clustering, benefiting from the vast body of work in the field of kernel methods. Here we provide an overview of MTS kernels, and describe how they deal with missing data.
The simplest of all kernel functions is the linear kernel, which for two data points represented as vectors, x and y, is given by the inner product ⟨x, y⟩, possibly plus a constant c. One can also apply a linear kernel to pairs of MTS once they are unfolded into vectors. However, by doing so, the information that they are MTS, with possible inherent dependencies in time and between attributes, is lost. Nevertheless, in some cases such a kernel can be efficient, especially if the MTS are short [26]. If the MTS contain missing data, the linear kernel requires a preprocessing step involving e.g. imputation.
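The unfold-and-inner-product construction, together with the mean imputation it requires for incomplete MTS, can be sketched as follows (the function names are illustrative, not from the cited works):

```python
import numpy as np

def mean_impute(X):
    """Replace NaNs in each attribute (row) of a V x T MTS by that
    attribute's observed mean; a common preprocessing step for kernels
    that cannot handle missing data."""
    X = np.array(X, dtype=float)
    for v in range(X.shape[0]):
        m = np.nanmean(X[v])
        X[v, np.isnan(X[v])] = m
    return X

def linear_kernel_mts(X, Y, c=0.0):
    """Linear kernel between two (complete) MTS: unfold each V x T
    matrix into a vector and take the inner product, plus a constant c."""
    return float(np.dot(np.ravel(X), np.ravel(Y)) + c)
```

As the text notes, the unfolding step discards the matrix structure of the MTS: permuting time steps consistently in both arguments leaves the kernel value unchanged.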
The most widely used time series similarity measure is dynamic time warping (DTW) [27], where the similarity is quantified as the alignment cost between the MTS. More specifically, in DTW the time dimension of one or both of the time series is warped to achieve a better alignment. Despite the success of DTW in many applications, similarly to many other similarity measures, it is non-metric and therefore cannot non-trivially be used to design a positive semi-definite kernel [28]. Hence, it is not suited for kernel methods in its original formulation. However, because of its popularity there have been attempts to design kernels exploiting the DTW. For example, Cuturi et al. designed a DTW-based kernel using global alignments [29]. An efficient version of the global alignment kernel (GAK) is provided in [30]. The latter has two hyperparameters, namely the kernel bandwidth and the triangular parameter. GAK does not naturally deal with missing data and incomplete datasets, and therefore also requires a preprocessing step involving imputation.
Two MTS kernels that can naturally deal with missing data without having to resort to imputation are the learned pattern similarity (LPS) [31] and TCK. LPS generalizes the well-known autoregressive models to local autopatterns using multiple lag values for autocorrelation. These autopatterns are supposed to capture the local dependency structure in the time series and are learned using a tree-based (random forest) learning strategy. More specifically, a time series is represented as a matrix of segments. Randomness is injected into the learning process by randomly choosing a time segment (column in the matrix) and a lag p for each tree in the random forest. A bag-of-words type compressed representation is created from the output of the leaf-nodes for each tree. The final time series representation is created by concatenating the representations obtained from the individual trees, which in turn are used to compute the similarity using a histogram intersection kernel [32].
The TCK is based on an ensemble learning approach wherein robustness to hyperparameters is ensured by joining the clustering results of many Gaussian mixture models (GMM) to form the final kernel. Hence, no critical hyperparameters have to be tuned by the user, and the TCK can be learned in an unsupervised manner. To ensure robustness to sparsely sampled data, the GMMs that are the base models in the ensemble, are extended using informative prior distributions such that the missing data is explicitly dealt with. More specifically, the TCK matrix is built by fitting GMMs to the set of MTS for a range of number of mixture components. The idea is that by generating partitions at different resolutions, one can capture both the local and global structure of the data. Moreover, to capture diversity in the data, randomness is injected by for each resolution (number of components) estimating the mixture parameters for a range of random initializations and randomly chosen hyperparameters. In addition, each GMM sees a random subset of attributes and segments in the MTS. The posterior distributions for each mixture component are then used to build the TCK matrix by taking the inner product between all pairs of posterior distributions. Eventually, given an ensemble of GMMs, the TCK is created in an additive way by using the fact that the sum of kernels is also a kernel.
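The additive ensemble construction described above can be illustrated schematically. In the sketch below, the Bayesian GMM base model is replaced by a toy soft-assignment stand-in (softmax of negative squared distances to randomly chosen centroids), so only the ensemble mechanics are shown, not the actual MAP-EM fitting; all names are our own.

```python
import numpy as np

rng = np.random.default_rng(0)

def soft_assignments(X, centroids):
    """Toy stand-in for a base model's posterior: softmax of negative
    squared distances to component centroids. (The actual TCK fits
    Bayesian GMMs with MAP-EM; this only illustrates the ensemble step.)"""
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    p = np.exp(-d2)
    return p / p.sum(axis=1, keepdims=True)

def tck_like_kernel(X, n_models=10, components=(2, 3, 4)):
    """Additive ensemble kernel: for each randomized base model, compute
    an N x G posterior matrix P and add P @ P.T. Since the sum of
    kernels is again a kernel, K is positive semi-definite."""
    N = X.shape[0]
    K = np.zeros((N, N))
    for _ in range(n_models):
        G = int(rng.choice(components))                      # random resolution
        centroids = X[rng.choice(N, size=G, replace=False)]  # random init
        P = soft_assignments(X, centroids)
        K += P @ P.T
    return K
```

Varying the number of components across base models corresponds to generating partitions at different resolutions, as described in the text.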
Although the LPS and TCK kernels share many properties, the way missing data are dealt with is very different. In LPS, the missing data handling abilities of decision trees are exploited. Along with ensemble methods, fuzzy approaches and support vector solutions, decision trees can be categorized as machine learning approaches for handling missing data [10], i.e. the missing data are handled naturally by the machine learning algorithm. One can also argue that the way missing data are dealt with in the TCK belongs to this category, since an ensemble approach is exploited. However, it can also be categorized as a likelihood-based approach since the underlying models in the ensemble are Gaussian mixture models. In the likelihood-based approaches, the full, incomplete dataset is analysed using maximum likelihood (or, equivalently, maximum a posteriori), typically in combination with the expectation-maximization (EM) algorithm [7,9]. These approaches assume that the missingness is ignorable.

Time series cluster kernel to exploit informative missingness
In this section, we present the novel time series cluster kernel, TCK_IM, which is capable of exploiting informative missingness.
A key component in the time series cluster kernel framework is ensemble learning, in which the basic idea consists in combining a collection of many base models into a composite model. A good such composite model will have statistical, computational and representational advantages, respectively: lower variance, lower sensitivity to local optima, and increased expressiveness (the ability to represent a broader span of functions) compared to the individual base models [33]. Key to achieving this is diversity and accuracy [34], i.e. the base models cannot make the same errors on new test data and have to perform better than random guessing. This can be done by integrating multiple outcomes of the same (weak) base model as it is trained under different, often randomly chosen, settings (parameters, initialization, subsampling, etc.) to ensure diversity [35].
In the TCK_IM kernel, the base model is a mixed mode Bayesian mixture model. Next, we provide the details of this model.

Notation
The following notation is used. A multivariate time series (MTS) X is defined as a (finite) combination of univariate time series (UTS), X = {x_v ∈ R^T | v = 1, 2, . . . , V}, where each attribute, x_v, is a UTS of length T. The number of UTS, V, is the dimension of X. The length T of the UTS x_v is also the length of the MTS X. Hence, a V-dimensional MTS, X, of length T can be represented as a matrix in R^{V×T}. Given a dataset of N MTS, we denote by X^{(n)} the n-th MTS. An incompletely observed MTS is described by the pair U = (X, R), where R is a binary MTS of the same size as X with entries r_v(t) = 1 if x_v(t) is observed and r_v(t) = 0 if it is missing. Further, we assume that U is generated from a finite mixture density,

p(U | Θ, Φ) = Σ_{g=1}^{G} θ_g f(U | φ_g),

where G is the number of components, f is the density of the components parametrized by Φ = (φ_1, . . . , φ_G), and Θ = (θ_1, . . . , θ_G) are the mixing coefficients, 0 ≤ θ_g ≤ 1 and Σ_{g=1}^{G} θ_g = 1. Now, introduce a latent random variable Z, represented as a G-dimensional one-hot vector Z = (Z_1, . . . , Z_G), whose marginal distribution is given by

P(Z | Θ) = Π_{g=1}^{G} θ_g^{Z_g}.

The unobserved variable Z records the membership of U, and therefore Z_g = 1 if U belongs to component g and Z_g = 0 otherwise.
Conditionally on Z_g = 1, U is generated by component g, and it follows that U = (X, R) consists of two modalities X and R. We now naively assume that the two modalities factorize within each component,

f(U | φ_g) = f(X | R, µ_g, Σ_g) f(R | β_g),

where f(X | R, µ_g, Σ_g) is a density function given by

f(X | R, µ_g, Σ_g) = Π_{v=1}^{V} Π_{t : r_v(t)=1} N(x_v(t) | µ_gv(t), σ_gv),   (4)

i.e. a diagonal Gaussian evaluated only over the observed entries, and f(R | β_g) is a probability mass given by

f(R | β_g) = Π_{v=1}^{V} Π_{t=1}^{T} β_gvt^{r_v(t)} (1 − β_gvt)^{1−r_v(t)}.   (5)

The parameters of each component are φ_g = (µ_g, Σ_g, β_g), where µ_gv ∈ R^T is the time-dependent mean of attribute v, σ²_gv is the variance of attribute v, and β_gvt ∈ [0, 1] are the parameters of the Bernoulli mixture model (5). The idea is that even though the missingness mechanism is ignored in f(X | R, µ_g, Σ_g), which is only computed over the observed data, the Bernoulli term f(R | β_g) will capture information from the missing patterns.
The conditional probability of Z given U can be found using Bayes' theorem,

π_g ≡ P(Z_g = 1 | U, Θ, Φ) = θ_g f(U | φ_g) / Σ_{g'=1}^{G} θ_{g'} f(U | φ_{g'}).

Similarly to [20], we introduce a Bayesian extension and put informative priors over the parameters of the normal distribution, which enforce smoothness over time and force clusters containing few time series to have parameters similar to the mean and covariance computed over the whole dataset. A kernel-based Gaussian prior is defined for the mean, P(µ_gv) = N(µ_gv | m_v, S_v), where m_v are the empirical means and the prior covariance matrices, S_v, are defined as S_v = s_v K, where s_v are empirical standard deviations and K is a kernel matrix whose elements are K_{tt'} = b_0 exp(−a_0 (t − t')²), t, t' = 1, . . . , T, with a_0, b_0 user-defined hyperparameters. An inverse Gamma distribution prior, whose strength is controlled by a user-defined hyperparameter N_0, is put on the standard deviation σ_gv.
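The component density and the posterior computation can be sketched as follows. This is an illustrative sketch only: the parameter shapes (per-attribute variance, time-dependent means) follow the text, but the EM updates and the priors are omitted, and all function names are our own.

```python
import numpy as np

def log_component_density(x, r, mu, sigma, beta, eps=1e-12):
    """Log-density of one mixed mode component. x, mu, beta are V x T
    arrays, sigma holds one std dev per attribute, r is the binary
    indicator (1 = observed). Gaussian terms are summed over observed
    entries only; Bernoulli terms over the whole indicator."""
    s = sigma[:, None]
    x0 = np.where(r == 1, x, 0.0)         # neutralize NaNs at missing entries
    lg = -0.5 * np.log(2 * np.pi * s**2) - 0.5 * ((x0 - mu) / s) ** 2
    log_gauss = np.where(r == 1, lg, 0.0).sum()
    log_bern = (r * np.log(beta + eps) + (1 - r) * np.log(1 - beta + eps)).sum()
    return log_gauss + log_bern

def responsibilities(x, r, theta, params):
    """Posterior P(Z_g = 1 | U) via Bayes' theorem, stabilized in log-space."""
    logs = np.array([np.log(t) + log_component_density(x, r, *p)
                     for t, p in zip(theta, params)])
    logs -= logs.max()
    w = np.exp(logs)
    return w / w.sum()
```

Note how the Bernoulli term contributes even at missing entries, which is precisely where the information in the missing patterns enters.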

Forming the kernel
We now explain how the mixed mode mixture model is used to form the TCK_IM kernel.
We use the mixed mode Bayesian mixture model as the base model in an ensemble approach. To ensure diversity, we vary the number of components for the base models by sampling from a set of integers I_C = {I, . . . , I + C}. For each number of components, we apply Q different random initial conditions and hyperparameters. We let Q = {q = (q_1, q_2) | q_1 = 1, . . . , Q, q_2 ∈ I_C} be the index set keeping track of initial conditions and hyperparameters (q_1) and the number of components (q_2). Each base model q is trained on a random subset of MTS {(X^{(n)}, R^{(n)})}_{n ∈ η(q)}. Moreover, for each q, we select a random subset of attributes V(q) as well as a random time segment T(q).
Algorithm 2 Time series cluster kernel. Training phase.
1: Compute posteriors Π^{(n)}(q) ≡ (π^{(n)}_1, . . . , π^{(n)}_{q_2})^T by fitting a mixed mode mixture model with q_2 clusters to the dataset, randomly selecting:
i. a time segment T(q) of length T_min ≤ |T(q)| ≤ T_max to extract from each X^{(n)} and R^{(n)},
ii. a subset of attributes V(q), with cardinality V_min ≤ |V(q)| ≤ V_max, to extract from each X^{(n)} and R^{(n)},
iii. a subset of MTS, η(q), with N_min ≤ |η(q)| ≤ N,
iv. initialization of the mixture parameters Θ(q) and Φ(q).

2: Update the kernel matrix, K_{nm} ← K_{nm} + Π^{(n)}(q)^T Π^{(m)}(q) / (‖Π^{(n)}(q)‖_2 ‖Π^{(m)}(q)‖_2).

The inner products of the normalized posterior distributions from each mixture component are then added up to build the TCK_IM kernel matrix. Note that, in addition to introducing novel base models to account for informative missingness, we also modify the kernel by normalizing the vectors of posteriors to have unit length in the l_2-norm. This provides an additional regularization that may increase the generalization capability of the learned model. The details of the method are presented in Algorithm 2. The kernel for MTS not available during training can be evaluated according to Algorithm 3.
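The kernel update step with l2-normalized posteriors can be sketched as follows (an illustrative sketch; the function name is our own):

```python
import numpy as np

def update_kernel(K, P, eps=1e-12):
    """One ensemble step: l2-normalize each row of the N x G posterior
    matrix P, then add the matrix of pairwise inner products to K."""
    Pn = P / np.maximum(np.linalg.norm(P, axis=1, keepdims=True), eps)
    return K + Pn @ Pn.T
```

After normalization, each MTS contributes a unit vector, so the per-model kernel entries lie in [0, 1] for non-negative posteriors and every diagonal entry increases by exactly one per base model.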

Semi-supervised time series cluster kernel
This section presents a semi-supervised MTS kernel, ssTCK, capable of exploiting incomplete label information. In ssTCK, the base mixture models are learned exactly in the same way as in TCK or TCK_IM, i.e. if there is no missing data, or the missingness is ignorable, the base models are the Bayesian GMMs. Conversely, if the missingness is informative, the base models are the mixed mode Bayesian mixture models presented in the previous section. Both approaches associate each MTS X^{(n)} with a q_2-dimensional posterior Π^{(n)} ≡ (π^{(n)}_1, . . . , π^{(n)}_{q_2})^T, where π^{(n)}_g represents the probability that the MTS belongs to component g and q_2 is the total number of components in the base mixture model.
In ssTCK, label information is incorporated in an intermediate processing step in which the posteriors Π^{(n)} are transformed, before the transformed posteriors are fed into Algorithm 2 or 3. More precisely, the transformation consists in mapping the posterior over the mixture components to a class "posterior" (probability), i.e. we seek a function M : [0, 1]^{q_2} → [0, 1]^{N_c}. Hence, we want to exploit the incomplete label information to find a transformation that merges the q_2 components of the mixture model into N_c clusters, where N_c is the number of classes.
The mapping M can be thought of as a (soft) N_c-class classifier, and hence there could be many possible ways of learning M. However, choosing a too flexible classifier for this purpose leads to an increased risk of overfitting and could also unnecessarily increase the algorithmic complexity. For these reasons, we restrict ourselves to searching for a linear transformation, M(Π^{(n)}) = W^T Π^{(n)}, with weight matrix W ∈ [0, 1]^{q_2 × N_c}. Since the N_c-dimensional output Π̃^{(n)} = M(Π^{(n)}) should represent a probability distribution, we add the constraint Σ_{i=1}^{N_c} W_{ji} = 1, j = 1, . . . , q_2.
A natural first step is to assume that the label information is complete and consider the corresponding supervised kernel. In the following two subsections, we describe our proposed methods for learning the transformation M in the supervised and semi-supervised settings, respectively.

Supervised time series cluster kernel (sTCK)
Supervised setting. Each base mixture model consists of q_2 components, and we assume that the number of components is greater than or equal to the number of classes N_c. Further, assume that each MTS X^{(n)} in the training set is associated with an N_c-dimensional one-hot vector y^{(n)}, which represents its label. Hence, the labels of the training set can be represented via a matrix Y ∈ {0, 1}^{N×N_c}, where N is the number of MTS in the training set.

Algorithm 4 Supervised posterior transformation
Require: Posteriors {Π^{(n)}}_{n=1}^{N} from mixture models consisting of q_2 components and labels {y^{(n)}}_{n=1}^{N}
1: for i = 1, . . . , q_2, j = 1, . . . , N_c do
2:   W_{ij} = Σ_{n=1}^{N} π^{(n)}_i y^{(n)}_j / Σ_{n=1}^{N} y^{(n)}_j
3: end for
We approach this problem by considering one component at a time. For a given component g, the task is to associate it with a class. One natural way to do this is to identify all members of component g and then simply count how many times each label occurs. To account for class imbalance, one can then divide each count by the number of MTS in the corresponding class. One possible option would then be to assign the component to the class with the largest normalized count. However, by doing so, one does not account for uncertainty or disagreement within the component. Hence, a more elegant alternative is to simply use the normalized counts as the weights in the matrix W. Additionally, one has to account for the fact that each MTS can simultaneously belong to several components, i.e. each MTS X^{(n)} has only a soft membership to component g, determined by the value π^{(n)}_g. This can be handled by using the posteriors Π^{(n)} as weights in the first step. This procedure is summarized in Algorithm 4.
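The counting procedure described above can be sketched in matrix form (an illustrative sketch of the described steps; the row normalization enforcing the constraint on W is made explicit, and the function name is our own):

```python
import numpy as np

def supervised_transform(P, Y):
    """Learn the linear map W as described in the text: soft-count the
    label mass in each component (using the posteriors P as weights),
    divide by class size to handle imbalance, then normalize each row
    of W so the output is a distribution over the N_c classes.
    P: N x q2 posteriors, Y: N x Nc one-hot label matrix."""
    counts = P.T @ Y                                  # q2 x Nc soft label counts
    counts = counts / Y.sum(axis=0, keepdims=True)    # class-size normalization
    W = counts / counts.sum(axis=1, keepdims=True)    # rows sum to one
    return W
```

The transformed class posteriors are then obtained as P @ W. Note that a component whose posterior mass comes only from unlabeled MTS yields an all-zero row in `counts`, which makes the final row normalization divide by zero; this is exactly the failure mode that motivates the semi-supervised extension below.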

Semi-supervised time series cluster kernel (ssTCK)
Setting. Assume that the labels {y^{(n)}}_{n=1}^{L}, L < N, are known and {y^{(n)}}_{n=L+1}^{N} are unknown. In this setting, if one naively tries to apply Algorithm 4 based on only the labeled part of the dataset, one ends up dividing by zero. The reason is that some of the components in the mixture model will contain only unlabeled MTS (the soft-label analogy is that the probability that any of the labeled MTS belongs to that particular component is zero or very close to zero). Hence, we need a way to assign labels to the components that do not contain any labeled MTS.
Note that each component is described by a probability distribution. A natural measure of dissimilarity between probability distributions is the Kullback-Leibler (KL) divergence [38]. Moreover, since the components are described by parametric distributions, the KL divergence has a simple closed-form expression. The KL divergence between two components, i and j, is given by

D_KL(i ‖ j) = ∫ f(X | φ_i) log [ f(X | φ_i) / f(X | φ_j) ] dX,

where f is the density given in Eq. (4). The KL divergence can be made symmetric via the transformation

D^S_KL(i, j) = D_KL(i ‖ j) + D_KL(j ‖ i).
The underlying idea in our semi-supervised framework is to learn the transformation W for the clusters with only unlabeled points by finding the nearest cluster (in the D^S_KL sense) that contains labeled points. This leads to Algorithm 5.
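For the diagonal Gaussians of Eq. (4), the directed KL divergence has the well-known closed form sketched below. The symmetrization via the sum of the two directed divergences matches the transformation used above; the function names are our own.

```python
import numpy as np

def kl_diag_gauss(mu_i, var_i, mu_j, var_j):
    """Closed-form KL divergence KL(N_i || N_j) between Gaussians with
    diagonal covariances, summed over all dimensions."""
    return 0.5 * np.sum(np.log(var_j / var_i)
                        + (var_i + (mu_i - mu_j) ** 2) / var_j - 1.0)

def sym_kl(mu_i, var_i, mu_j, var_j):
    """Symmetrized divergence D^S_KL: sum of the two directed KLs."""
    return (kl_diag_gauss(mu_i, var_i, mu_j, var_j)
            + kl_diag_gauss(mu_j, var_j, mu_i, var_i))
```

A purely unlabeled component can then be assigned the W row of its nearest labeled component under `sym_kl`.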

Experiments on synthetic and benchmark datasets
The experiments in this paper consist of two parts. The purpose of the first part was to demonstrate, within a controlled environment, situations where the proposed TCK_IM and ssTCK kernels might prove more useful than the TCK. In the second part (Sec. 6), we present a case study from a real-world medical application in which we compared to several baseline methods.
In the first part, we considered synthetic and benchmark datasets. The following experimental setup was considered. We performed kernel principal component analysis (KPCA) using time series cluster kernels and let the dimensionality of the embedding be 10. Thereafter, we trained a kNN-classifier with k = 1 on the embedding and evaluated performance in terms of classification accuracy on an independent test set. We let Q = 30 and I_C = {N_c, . . . , N_c + 20}. An additional hyperparameter h was introduced for ssTCK. We set h to 10^{-1} in our experiments. We also standardized each attribute to zero mean and unit standard deviation.
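The KPCA + 1NN evaluation pipeline for a precomputed kernel matrix can be sketched as follows (an illustrative sketch; function names are our own, and the embedding dimensionality is a parameter):

```python
import numpy as np

def kernel_pca_embed(K, dim=10):
    """Kernel PCA on a precomputed kernel matrix: double-center K, then
    project on the top-`dim` eigenvectors scaled by sqrt(eigenvalue)."""
    N = K.shape[0]
    H = np.eye(N) - np.ones((N, N)) / N
    vals, vecs = np.linalg.eigh(H @ K @ H)
    idx = np.argsort(vals)[::-1][:dim]
    return vecs[:, idx] * np.sqrt(np.clip(vals[idx], 0.0, None))

def knn1_predict(train_emb, train_y, test_emb):
    """1-nearest-neighbour classification in the embedding space."""
    d = ((test_emb[:, None, :] - train_emb[None, :, :]) ** 2).sum(-1)
    return train_y[np.argmin(d, axis=1)]
```

In practice, the kernel matrix would be the TCK, TCK_IM or ssTCK matrix evaluated over training and test MTS jointly.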
To simulate MNAR and inject informative missing patterns, we let x^{(n)}_i(t) have a probability p^{(n)} of being missing, conditioned on the values of the time series itself. We let p^{(n)} = 0.9 if y^{(n)} = 1 and p^{(n)} = 0.8 otherwise. By doing so, the missing ratio was roughly 63% in both classes.
Tab. 1 shows the accuracy on the test data for the different kernels. As expected, the TCK gives the lowest accuracy, 0.826. The ssTCK improves the accuracy considerably (0.854), and the supervised version (sTCK) gives further improvement (0.867). However, as we can see, the effect of explicitly modeling the missingness mechanism in the TCK_IM is larger. In this case the accuracy increases from 0.826 to 0.933. The two corresponding embeddings are plotted in Fig. 1(a) and 1(d), respectively. In the TCK embedding, many points from different classes overlap with each other, whereas for the TCK_IM the number of overlapping points is much lower.
The ssTCK_IM improves the accuracy to 0.967 (from 0.933 for TCK_IM and 0.854 for ssTCK). The two embeddings obtained using the semi-supervised methods are shown in Fig. 1(b) and 1(e). The supervised version sTCK_IM yields a slight improvement in terms of accuracy compared to ssTCK_IM (0.970 vs 0.967). Plots of the supervised embeddings are shown in Fig. 1(c) and 1(f). We can see that for the sTCK_IM the classes are clearly separated.

Performance of ssTCK on benchmark datasets
The purpose of the experiments reported in the following paragraphs was to evaluate the impact of incorporating incomplete label information in the ssTCK. Towards that end, we considered benchmark datasets and artificially modified the number of labeled MTS in the training sets. We applied the proposed ssTCK to four MTS benchmark datasets from the UCR and UCI databases [39,40] and other published work [41], described in Tab. 2. Since some of the datasets contain MTS of varying length, we followed the approach of Wang et al. [42] and transformed all the MTS in the same dataset to the same length, T, determined by T = ⌈T_max / ⌈T_max / 25⌉⌉, where T_max is the length of the longest MTS in the dataset and ⌈·⌉ is the ceiling operator. The number of labeled MTS was set to max{20, 3 · N_c}. ssTCK was compared to ordinary TCK and sTCK (assuming complete label information in the latter case). Tab. 3 shows the performance of ssTCK for the 4 benchmark datasets. As we can see, compared to TCK, the accuracy in general increases using ssTCK. For the Wafer dataset, ssTCK yields the same performance as the supervised kernel. For the three other datasets, the performance of ssTCK is slightly worse than that of sTCK. These experiments demonstrate that ssTCK is capable of exploiting incomplete label information. Further, we created 8 synthetic datasets by randomly removing 50% and 80%, respectively, of the values in each of the 4 benchmark datasets. As we can see from the results presented in Tab. 4, also in the presence of missing data the accuracy in general increases using ssTCK, compared to TCK.
For comparison, in Tab. 4 we also added the results obtained using three other kernels: GAK, the linear kernel, and LPS. GAK and the linear kernel cannot process incomplete MTS, and therefore we created complete datasets using mean imputation for these two kernels. LPS was run using default hyperparameters, with the exception that we adjusted the segment length to be sampled from the interval [6, 0.8T] to account for the relatively short MTS in our datasets. In accordance with [43], for GAK we set the bandwidth σ to 0.1 times the median distance of all MTS in the training set, scaled by the square root of the median length of all MTS, and the triangular parameter to 0.2 times the median length of all MTS. Distances were measured using the canonical metric induced by the Frobenius norm. In the linear kernel we set the constant c to 0. As we can see, the performance of these kernels is considerably worse than that of the time series cluster kernels for 7 out of 8 datasets. For uWave with 50% missingness, the performance of GAK and the linear kernel is similar to that of the TCK kernels.
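The median heuristics used for the GAK hyperparameters can be sketched as follows (an illustrative sketch of the rule from [43]; the function name is our own, and equal-length, preprocessed MTS are assumed so that the Frobenius-induced distance is defined):

```python
import numpy as np

def gak_hyperparams(mts_list):
    """Median heuristics for the two GAK hyperparameters: bandwidth
    sigma = 0.1 * median pairwise Frobenius distance * sqrt(median
    length); triangular = 0.2 * median length. Each MTS is a V x T array
    and all are assumed to share the same shape."""
    lengths = [x.shape[1] for x in mts_list]
    med_len = float(np.median(lengths))
    dists = [np.linalg.norm(a - b)
             for i, a in enumerate(mts_list) for b in mts_list[i + 1:]]
    sigma = 0.1 * float(np.median(dists)) * np.sqrt(med_len)
    triangular = 0.2 * med_len
    return sigma, triangular
```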

Exploiting informative missingness in synthetic benchmark datasets
To evaluate the effect of modeling the missing patterns in TCK_IM, we generated 8 synthetic datasets by manually injecting missing elements into the Wafer and Japanese vowels datasets using the following procedure. For each attribute v ∈ {1, . . . , V}, a number c_v ∈ {−1, 1} was randomly sampled with equal probabilities. If c_v = 1, the attribute v is positively correlated with the labels, otherwise negatively correlated. For each MTS X^{(n)} and attribute, a missing rate γ_nv was sampled from an interval determined by c_v, the label and a parameter E. This ensures that the overall missing rate of each dataset is approximately 50%. Here, y^{(n)} ∈ {1, . . . , N_c} is the label of the MTS X^{(n)}, and E is a parameter which we tune for each dataset in such a way that the absolute value of the Pearson correlation between the missing rates γ_v for the attributes and the labels y^{(n)} takes the values {0.2, 0.4, 0.6, 0.8}, respectively. The higher the value of the Pearson correlation, the more informative the missingness.
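A schematic version of this injection procedure is sketched below. The exact sampling interval for γ_nv used in the experiments is not reproduced here; the `base` rate is an illustrative stand-in, and all names are our own. Only the mechanism (per-attribute sign, label-dependent missing rate, independent masking) follows the text.

```python
import numpy as np

rng = np.random.default_rng(42)

def inject_informative_missingness(X, y, E=0.1, base=0.4):
    """Label-dependent missingness injection (schematic). For each
    attribute a sign c_v is drawn; the per-series missing rate then
    increases (c_v = +1) or decreases (c_v = -1) with the label y^(n).
    X: N x V x T array, y: integer labels starting at 1.
    Returns the masked data and the binary indicator R (1 = observed)."""
    N, V, T = X.shape
    Xm = X.astype(float).copy()
    R = np.ones((N, V, T), dtype=int)
    c = rng.choice([-1, 1], size=V)
    for n in range(N):
        for v in range(V):
            gamma = float(np.clip(base + c[v] * E * (y[n] - 1), 0.0, 0.95))
            mask = rng.random(T) < gamma
            Xm[n, v, mask] = np.nan
            R[n, v, mask] = 0
    return Xm, R
```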
Tab. 5 shows the performance of the proposed TCK_IM and three baseline models (TCK, TCK_B, and TCK_0). The first baseline is ordinary TCK, which ignores the missingness mechanism. For the Wafer dataset, the performance of this baseline was quite similar across all four settings. For the Japanese vowels dataset, the performance actually decreases as the information in the missing patterns increases. In the second baseline, TCK_B, we tried to model the missing patterns by concatenating the binary missing indicator MTS R to the MTS X, creating a new MTS with 2V attributes. Then, we trained ordinary TCK on this representation. For the Wafer dataset, the performance decreases considerably as the informative missingness increases. For the Japanese vowels, this baseline yields the best performance when the correlation is 20%. However, the performance actually decreases as the informative missingness increases. Hence, informative missingness is not captured with this baseline. In the last baseline, TCK_0, we investigated whether it is possible to capture informative missingness by imputing zeros for the missing values and then training the TCK on the imputed data. This baseline yields similar performance across all 4 settings for the Wafer dataset, and for Japanese vowels, TCK_0 behaves similarly to TCK_B, i.e. it does not capture informative missing patterns. The proposed TCK_IM achieves the best accuracy for 7 out of 8 settings and shows the expected behaviour, namely that the accuracy increases as the correlation between missing values and class labels increases. The performance is similar to TCK when the amount of information in the missing patterns is low, whereas TCK is clearly outperformed when the informative missingness is high. This demonstrates that TCK_IM effectively utilizes informative missing patterns.
To also test whether TCK_IM is capable of exploiting other types of informative missingness, we generated 8 synthetic datasets from uWave and Character trajectories using the following approach. Both of these datasets consist of 3 attributes. For each attribute v ∈ {1, . . . , V}, a number c_v ∈ {−1, 1} was randomly sampled with equal probabilities. For the attribute(s) with c_v = −1, we let the missingness be negatively correlated with the labels by sampling the missing rate γ_nv from U[0.7 − E · (y^(n) − 1), 1 − E · (y^(n) − 1)]. For the attribute(s) with c_v = 1, we let it be positively correlated with the labels by sampling the missing rate γ_nv from U[0.3 + E · (y^(n) − 1), 0.6 + E · (y^(n) − 1)]. We then let each element with x_v^(n)(t) > µ_v have a probability γ_nv of being missing, where µ_v is the mean of attribute v computed over the complete dataset. Because the probability of being missing depends on the (missing) values themselves, the missingness mechanism within each class is MNAR. We tuned the parameter E such that the mean absolute value of the Pearson correlation between γ_v and the labels took the values {0.2, 0.4, 0.6, 0.8}. By doing so, the overall missing rate was approximately 32% for uWave and 45% for the Character trajectories. However, we note that in this case the overall missing rate varies slightly as a function of the Pearson correlation.
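The value-dependent (MNAR) injection can be sketched as follows, using the uniform bounds given in the text; `inject_mnar_missingness` is a hypothetical helper name, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)

def inject_mnar_missingness(X, y, E, c):
    """Value-dependent (MNAR) missingness: only entries above the
    per-attribute mean can go missing, with a label-dependent rate.

    X : (N, T, V) array, y : (N,) labels in {1, ..., N_c},
    E : coupling strength, c : (V,) array of +/-1 per attribute.
    """
    N, T, V = X.shape
    X = X.astype(float).copy()
    mu = np.nanmean(X, axis=(0, 1))  # per-attribute mean over the whole dataset
    for n in range(N):
        for v in range(V):
            if c[v] == 1:    # missing rate positively correlated with the label
                gamma = rng.uniform(0.3 + E * (y[n] - 1), 0.6 + E * (y[n] - 1))
            else:            # missing rate negatively correlated with the label
                gamma = rng.uniform(0.7 - E * (y[n] - 1), 1.0 - E * (y[n] - 1))
            # Only above-mean elements are candidates for removal -> MNAR.
            above = X[n, :, v] > mu[v]
            mask = above & (rng.random(T) < gamma)
            X[n, mask, v] = np.nan
    return X
```

The key difference from the first procedure is the `above` condition: whether an element is removed depends on its own value, which is exactly what makes the mechanism non-ignorable within each class.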
Tab. 5 shows the performance on the 8 synthetic datasets created from uWave and Character trajectories. One thing to notice here is the poor performance of TCK_B, which demonstrates the importance of using the mixed mode mixtures to model the two modalities in U = (X, R): naively applying TCK based on GMMs to the concatenated MTS does not provide the desired performance. Further, we see that TCK_IM achieves the best accuracy in 7 out of 8 settings and that the accuracy increases as the correlation increases. For the Character trajectories, the performance of TCK_IM is similar to TCK for low correlation but increases as the missingness information increases, whereas the performance of TCK actually decreases. One possible explanation is that for this dataset, two of the variables were positively correlated with the labels and therefore the missing rate increases with increasing correlation. Regarding the results for uWave, it is somewhat surprising that the largest difference in performance between TCK and TCK_IM occurs when the correlation is lowest. There might be several reasons for this: a peculiarity of the dataset and/or that the MNAR missingness created missing patterns that negatively affect TCK.

Case study: Detecting infections among patients undergoing colorectal cancer surgery
In this case study, the focus was to detect Surgical Site Infection (SSI), which is one of the most common types of nosocomial infections [44] and represents up to 30% of hospital-acquired infections [45,46]. The importance of SSI prediction is reflected in several recent initiatives. For instance, the current study is part of a larger research effort by the current team on SSI prediction and detection of postoperative adverse events related to gastrointestinal surgery, within the context of improving the quality of surgery [21,24,47,48,49,50]. The reason for this broad interest is clear: a reduction in the number of postoperative complications such as SSI would be of great benefit both to patients and to society.

Data collection
Ethics approval for the parent study was obtained from the Data Inspectorate and the Ethics Committee at the University Hospital of North Norway (UNN) [50]. In [50], a cohort consisting of 7741 patients was identified by extracting the electronic health records of all patients who underwent a gastrointestinal surgical procedure at UNN in the years 2004–2012. In this case study, we were particularly interested in detecting SSI, an infection closely associated with colorectal cancer surgery [60]. Therefore, patients who did not undergo this type of surgery were excluded, reducing the size of the cohort to 1137 patients.
In collaboration with a clinician (author A. R.), we extracted data for 11 of the most common blood tests from the patients' EHRs. The value of a patient's blood test, e.g. his or her hemoglobin level, can be considered a continuous variable over time. However, blood tests are usually measured on a daily basis, and therefore, for the purpose of the current analysis, we discretized time and let each time interval be one day. Hence, the blood samples could naturally be represented as MTS and needed no further feature preprocessing in our framework.
Not all blood tests were taken every day for each patient, which means that the dataset contained missing data, and we expected the missing patterns to be informative, since whether a test is performed depends on whether the doctor thinks it is needed. We focused on detection of SSI within 10 days after surgery, and therefore the length of the time series is 10 days. Patients with no recorded blood tests from day 1 until day 10 were removed from the cohort, which led to a final cohort consisting of 858 patients. The average proportion of missing data in the cohort was 80.7%. Tab. 6 lists the blood tests we considered in this study and their corresponding missing rates.
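The representation described above (daily discretization with an explicit missing-indicator) can be illustrated with a small sketch; the record layout, column names and toy values here are hypothetical, not the actual UNN data format.

```python
import numpy as np
import pandas as pd

# Toy records in the long format typical of EHR lab data; the column
# names and values are hypothetical illustrations.
records = pd.DataFrame({
    "patient": [1, 1, 1, 2, 2],
    "day":     [1, 1, 3, 2, 10],
    "test":    ["Hemoglobin", "CRP", "CRP", "Hemoglobin", "CRP"],
    "value":   [13.2, 55.0, 30.1, 11.8, 12.4],
})

tests = ["Hemoglobin", "CRP"]   # the study used 11 blood tests
days = range(1, 11)             # postoperative days 1..10

def to_mts(df):
    """Pivot one patient's records into a (10 days x V tests) matrix;
    days without a measurement stay NaN (missing)."""
    m = df.pivot_table(index="day", columns="test", values="value")
    return m.reindex(index=days, columns=tests).to_numpy()

X = np.stack([to_mts(g) for _, g in records.groupby("patient")])
R = np.isnan(X).astype(int)     # binary missing-indicator MTS
```

Even in this toy example most entries of R are 1, mirroring the very high (80.7%) missing rate reported for the real cohort.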
Guided by input from clinicians, the International Classification of Diseases (ICD-10) and NOMESCO Classification of Surgical Procedures (NCSP) codes related to severe postoperative complications were used to identify the patients in the cohort who developed postoperative SSI. Patients who did not have these codes and did not have the word "infection" in any of their postoperative text documents were considered controls. This led to a dataset with 227 infected patients (cases) and 631 non-infected patients (controls).

Experimental setup
The objective of this case study was to evaluate how the proposed MTS kernels perform in a real-world application from medicine. We would like to emphasize that the proposed kernels are mainly designed for situations when there are no, or only a few, ground-truth labels available. However, in order to evaluate the quality of these kernels, we adopted a supervised scheme. Hence, we followed the scheme presented in Fig. 2, i.e. we computed the kernel from the MTS representations of the blood tests and performed KPCA, followed by kNN classification in the KPCA space. We set the dimensionality of the KPCA-representation to 10 in all experiments. The number of neighbors k was set using 5-fold cross validation.
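The evaluation scheme (precomputed kernel, KPCA embedding, then kNN with cross-validated k) can be sketched with scikit-learn; `evaluate_kernel` is a hypothetical helper, and the kernel matrix K is assumed to be computed elsewhere (e.g. by TCK).

```python
import numpy as np
from sklearn.decomposition import KernelPCA
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

def evaluate_kernel(K, y, n_components=10):
    """Embed a precomputed N x N kernel matrix with KPCA, then classify
    with kNN in the embedding space, choosing k by 5-fold CV.

    Sketch of the evaluation scheme only; the kernel computation
    (TCK, LPS, GAK, ...) is assumed to happen upstream.
    """
    # KPCA on the precomputed kernel gives a low-dimensional embedding.
    Z = KernelPCA(n_components=n_components,
                  kernel="precomputed").fit_transform(K)
    # kNN in the KPCA space; the number of neighbors is cross-validated.
    grid = GridSearchCV(KNeighborsClassifier(),
                        {"n_neighbors": [1, 3, 5, 7, 9]}, cv=5)
    grid.fit(Z, y)
    return grid.best_score_, grid.best_params_["n_neighbors"]
```

Because the kernel is precomputed, the same pipeline works unchanged for every kernel compared in the study, which is what makes the comparison fair.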
Four baseline kernels were considered, namely TCK, LPS, GAK and the linear kernel. GAK and the linear kernel cannot handle incomplete datasets, and therefore we created two complete datasets using mean and LOCF imputation. In order to investigate whether the information in the missing patterns can be better exploited by the LPS, GAK and linear kernels, we also created baselines by concatenating the binary indicator MTS R^(n) to the MTS X^(n).
We performed 5-fold cross validation and reported results in terms of F1-score, sensitivity, specificity and accuracy. Sensitivity is the fraction of actual positives (has SSI) correctly classified as positive, whereas specificity is the fraction of actual negatives that are correctly classified as negative. F1-score is the harmonic mean of precision and sensitivity, where precision is the fraction of actual positives among all those that are classified as positive cases.
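The four reported metrics follow directly from the confusion-matrix counts; a minimal sketch (with 1 = SSI case, 0 = control):

```python
import numpy as np

def evaluation_metrics(y_true, y_pred):
    """F1-score, sensitivity, specificity and accuracy for binary labels
    (1 = SSI case, 0 = control)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    sensitivity = tp / (tp + fn)            # fraction of actual positives found
    specificity = tn / (tn + fp)            # fraction of actual negatives found
    precision = tp / (tp + fp)              # positives among predicted positives
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    accuracy = (tp + tn) / len(y_true)
    return f1, sensitivity, specificity, accuracy
```

With an imbalanced cohort like this one (227 cases vs. 631 controls), accuracy alone is misleading, which is why F1-score, sensitivity and specificity are reported alongside it.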

Results
Tab. 7 shows the performance in terms of 4 evaluation metrics for 11 baseline kernels as well as the proposed TCK_IM kernel on the task of detecting patients suffering from SSI. We see that the kernels that rely on imputation perform much worse than the other kernels in terms of F1-score, sensitivity and accuracy. These methods do, however, achieve a high specificity. Note, though, that any classifier can achieve a specificity of 1 simply by classifying all cases as negative, which of course leads to lower F1-score and sensitivity. The main reasons why these methods do not perform better are probably that the imputation methods introduce strong biases into the data and that the missingness mechanism is ignored. The TCK and LPS kernels perform quite similarly across all 4 evaluation metrics (LPS slightly better). The F1-score, sensitivity and accuracy achieved by these methods are considerably higher than the corresponding scores for GAK and the linear kernel. One reason why these methods outperform the imputation methods is that ignoring the missingness leads to lower bias than replacing missing values with biased estimates. The performance of the linear kernel and GAK improves a bit when accounting for informative missingness, whereas the performance of LPS decreases. TCK_IM performs similarly to the baselines in terms of specificity, but considerably better in terms of F1-score, sensitivity and accuracy. This demonstrates that the missing patterns in the blood test time series are informative and that TCK_IM is capable of exploiting this information to improve performance on the task of detecting patients with infections. Fig. 3 shows KPCA embeddings corresponding to the two largest eigenvalues obtained using 5 different kernels.
While the representations obtained using GAK and the linear kernel are noisy and to a large degree mix the infected and non-infected patients, the two classes (SSI and non-SSI) are more separated in the representations obtained using TCK and LPS. TCK_IM is even better at keeping the SSI patients in the same region or cluster, while at the same time spreading out the patients without infection, revealing the diversity among these patients.

Conclusions and future directions
In this work, we presented robust multivariate time series kernels capable of exploiting informative missing patterns and incomplete label information. In contrast to other frameworks that exploit informative missingness [6,16], which require complete label information, the time series cluster kernels are specifically designed for situations in which no labels, or only a few labels, are available. Lack of labels and large amounts of missing data are two challenges that characterize the medical domain, and therefore we believe the proposed kernels will be particularly useful in this domain, as we demonstrated in this work through a case study of postoperative infections among colorectal cancer patients. However, the kernels are not limited to this domain, and we believe they could be useful tools in other application domains facing similar challenges.
A limitation of TCK_IM is that if the missingness is not correlated with the outcome of interest, there will be limited gain in performance compared to TCK, and performance might even decrease. For this reason it is important that the user has some domain knowledge and understands the process that generated the missing values in the data, as illustrated in our case study from healthcare.
Another limitation of the time series cluster kernels is that they are designed for MTS of the same length. A possible next step would be to work on a formulation that can deal with varying lengths. In further work, we would also like to investigate the possibility of introducing a Bayesian formulation for the discrete modality in the mixed mode mixture models by putting informative priors over the parameters in the Bernoulli part of the model.