TSInsight: A local-global attribution framework for interpretability in time-series data

With the rise in the employment of deep learning methods in safety-critical scenarios, interpretability is more essential than ever before. Although many different directions regarding interpretability have been explored for visual modalities, time-series data has been neglected, with only a handful of methods tested due to their poor intelligibility. We approach the problem of interpretability in a novel way by proposing TSInsight, where we attach an auto-encoder to the classifier with a sparsity-inducing norm on its output and fine-tune it based on the gradients from the classifier and a reconstruction penalty. TSInsight learns to preserve features that are important for prediction by the classifier and suppresses those that are irrelevant, i.e., it serves as a feature attribution method to boost interpretability. In contrast to most other attribution frameworks, TSInsight is capable of generating both instance-based and model-based explanations. We evaluated TSInsight along with 9 other commonly used attribution methods on 8 different time-series datasets to validate its efficacy. Evaluation results show that TSInsight naturally achieves output space contraction and is therefore an effective tool for the interpretability of deep time-series models.


Introduction
Deep learning models have been at the forefront of technology in a range of different domains including image classification [13], object detection [7], speech recognition [5], text recognition [4] and image captioning [10]. These models are particularly effective in automatically discovering useful features. However, this automated feature extraction comes at the cost of lack of transparency of the system. Therefore, despite these advances, their employment in safety-critical domains like finance [12], self-driving cars [11] and medicine [31] is limited due to the lack of interpretability of the decision made by the network.
Numerous efforts have been made for the interpretation of these black-box models. These efforts can be mainly classified into two separate directions. The first set of strategies focuses on making the network itself interpretable by trading off some performance. These strategies include Self-Explainable Neural Network (SENN) [2] and Bayesian non-parametric regression models [8]. The second set of strategies focuses on explaining a pretrained model i.e. they try to infer the reason for a particular prediction. These attribution techniques include saliency map [29] and layer-wise relevance propagation [3]. However, all of these methods have been particularly developed and tested for visual modalities which are directly intelligible for humans. Transferring methodologies developed for visual modalities to time-series data is difficult due to the non-intuitive nature of timeseries. Therefore, only a handful of methods have been focused on explaining time-series models in the past [14,22].
We approach the attribution problem in a novel way by attaching an auto-encoder on top of the classifier. The auto-encoder is fine-tuned based on the gradients from the classifier. Rather than optimizing the auto-encoder to reconstruct the whole input, we optimize the network to only reconstruct parts which are useful for the classifier, i.e., are correlated or causal for the prediction. In order to achieve this, we introduce a sparsity-inducing norm onto the output of the auto-encoder. In particular, the contributions of this paper are twofold:
- A novel attribution method for time-series data which makes it much easier to interpret the decision of any deep learning model. The method also leverages dataset-level insights when explaining individual decisions, in contrast to other attribution methods.
- Detailed analysis of the information captured by 11 different attribution techniques using the suppression test on 8 different time-series datasets. This also includes analysis of the different out-of-the-box properties achieved by TSInsight, including generic applicability and contraction in the output space.

Related Work
Since the resurgence of deep learning in 2012 after a deep network comprehensively outperformed its feature engineered counterparts [13] on the ImageNet visual recognition challenge comprising of 1.2 million images [19], deep learning has been integrated into a range of different applications to gain unprecedented levels of improvement. Significant efforts have been made in the past regarding the interpretability of deep models, specifically for image modality. These methods are mainly categorized into two different streams where the first stream is focused on explaining the decisions of a pretrained network while the second stream is directed towards making models more interpretable by trading off accuracy. The first stream for explainable systems which attempts to explain pretrained models using attribution techniques has been a major focus of research in the past years. The most common strategy is to visualize the filters of the deep model [30,23,29,17,3]. This is very effective for visual modalities since images are directly intelligible for humans. [30] introduced deconvnet layer to understand the intermediate representations of the network. They not only visualized the network, but were also able to improve the network based on these visualizations to achieve state-of-the-art performance on ImageNet [19]. [23] proposed a method to visualize class-specific saliency maps. [29] developed a visualization framework for image-based deep learning models. They tried to visualize the features that a particular filter was responding to by using regularized optimization. Instead of using first-order gradients, [3] introduced a Layer-wise Relevance Propagation (LRP) framework which identified the relevant portions of the image by distributing the contribution to the incoming nodes. [24] introduced the SmoothGrad method where they computed the mean gradients after adding small random noise sampled from a zero-mean Gaussian distribution to the original point. 
[26] introduced the Integrated gradients method which works by computing the average gradient from the original point to the baseline input (zero-image in their case) at regular intervals. [8] used Bayesian non-parametric regression mixture model with multiple elastic nets to extract generalizable insights from the trained model. Recently, [6] presented the extremal perturbation method where they solve an optimization problem to discover the minimum enclosing mask for an image that retains the network's predictive performance. Either these methods are not directly applicable to time-series data, or are inferior in terms of intelligibility for time-series data. [17] introduced yet another approach to understand a deep model by leveraging auto-encoders. After training both the classifier and the auto-encoder in isolation, they attached the auto-encoder to the head of the classifier and finetuned only the decoder freezing the parameters of the classifier and the encoder. This transforms the decoder to focus on features which are relevant for the network. Applying this method directly to time-series yields no interesting insights ( Fig. 2c) into the network's preference for input. Therefore, this method is strictly a special case of the TSInsight's formulation.
In the second stream for explainable systems, [2] proposed Self-Explaining Neural Networks (SENN) where they learn two different networks. The first network is the concept encoder which encodes different concepts, while the second network learns the weightings of these concepts. This transforms the system into a linear problem with a set of features, making it easily interpretable for humans. SENN trades off accuracy in favor of interpretability. [11] attached a second network (video-to-text) to the classifier, which was responsible for producing natural-language explanations of the decisions taken by the network using the saliency information from the classifier. This framework relies on an LSTM for the generation of the descriptions, adding yet another level of opaqueness and making it hard to decipher whether an error originated from the classification network or from the explanation generator. [14] made the first attempt to understand deep learning models for time-series analysis, where they specifically focused on financial data. They computed the input saliency based on the first-order gradients of the network. [22] proposed an influence computation framework which enabled exploration of the network at the filter level by computing the per-filter saliency map and filter importance, again based on first-order gradients. However, both methods fail to provide useful insights due to the noise inherent to first-order gradients. Another major limitation of saliency-based methods is the sole use of local information. TSInsight improves on these methods in identifying the important regions of the input by combining local information for the particular example with generalizable insights extracted from the entire dataset in order to reach a particular description.
Due to the use of auto-encoders, TSInsight is inherently related to sparse [16] and contractive auto-encoders [18]. In sparse auto-encoders [16], the sparsity is induced on the hidden representation by minimizing the KL-divergence between the average activations and a hyperparameter which defines the fraction of nonzero units. This KL-divergence is a necessity for sigmoid-based activation functions. However, in our case, the sparsity is induced directly on the output of the auto-encoder, which introduces a contraction on the input space of the classifier, and can directly be achieved by using the Manhattan norm on the activations as we obtain real-valued outputs. Although sparsity is introduced in both cases, the sparsity in the case of sparse auto-encoders is not useful for interpretability. In the case of contractive auto-encoders [18], a contraction mapping is introduced by penalizing the Frobenius norm of the Jacobian of the encoder along with the reconstruction error. This makes the learned representation invariant to minor perturbations in the input. TSInsight, on the other hand, induces a contraction on the input space for interpretability, thus favoring a sparsity-inducing norm.
The overview of our methodology is presented in Fig. 1. As the purpose of TSInsight is to explain the predictions of a pretrained model, we train a vanilla auto-encoder on the desired dataset as the first step. Once the auto-encoder is trained, we stack the auto-encoder on top of the pretrained classifier to obtain a combined model. We then fine-tune only the auto-encoder within the combined model using the gradients from the classifier, with a specific loss function to highlight the causal/correlated points. We will first cover some basic background and then dive into the formulation of the problem presented by Palacio et al. [17]. We will then present the proposed formulation, adapting the basic one for the interpretability of deep learning based time-series models.

Pretrained Classifier
A classifier (Φ : X → Y) is a mapping from the input space X to the output space Y. As the emphasis of TSInsight is interpretability, we assume the presence of a pretrained classifier whose predictions we are willing to explain. For this purpose, we trained a classifier using standard empirical risk minimization on the given dataset. The objective for the classifier training can be represented as:
$$W^{*} = \arg\min_{W} \frac{1}{|X|} \sum_{(x, y) \in X \times Y} \mathcal{L}\big(\Phi(x; W), y\big) \tag{1}$$
where Φ defines the mapping from the input space X to the output space Y, W denotes the parameters of the classifier and L denotes the classification loss.

Auto-Encoder
An auto-encoder (D ∘ E : X → X) is a neural network whose objective is to reconstruct the provided input by embedding it into an arbitrary feature space F; it is therefore a mapping from the input space X to the input space X itself after passing through the feature space F. The auto-encoder is usually trained with the mean-squared error as the loss function. The optimization problem for an auto-encoder can be represented as:
$$(W_E^{*}, W_D^{*}) = \arg\min_{W_E, W_D} \frac{1}{|X|} \sum_{x \in X} \big\lVert x - D\big(E(x; W_E); W_D\big) \big\rVert_2^2 \tag{2}$$
where E defines the encoder with parameters W_E while D defines the decoder with parameters W_D. Similar to the case of the classifier, we train the auto-encoder using empirical risk minimization on a particular dataset. A sample reconstruction from the auto-encoder is visualized in Fig. 2b for the forest cover dataset. It can be seen that the network did a reasonable job in the reconstruction of the input.

Formulation by Palacio et al. [17]
Palacio et al. (2018) [17] presented an approach for discovering the preference the network had for the input by attaching the auto-encoder on top of the classifier. The auto-encoder was fine-tuned using the gradients from the classifier. The new optimization problem for fine-tuning the auto-encoder can be represented as:
$$(W_E', W_D') = \arg\min_{W_E^{*}, W_D^{*}} \frac{1}{|X|} \sum_{(x, y) \in X \times Y} \mathcal{L}\Big(\Phi\big(D(E(x; W_E^{*}); W_D^{*}); W^{*}\big), y\Big) \tag{3}$$
where W_E^* and W_D^* are initialized from the auto-encoder weights obtained after solving the optimization problem specified in Eq. 2, while W^* is obtained by solving the optimization problem specified in Eq. 1. This formulation is slightly different from the one proposed by Palacio et al. (2018), where only the decoder part of the auto-encoder was fine-tuned. We update both the encoder and the decoder, as this is a much more natural formulation than fine-tuning only the decoder. This complete fine-tuning becomes significantly more important once we move towards advanced formulations, since we would like the network to also adapt the encoding in order to better focus on important features. Fine-tuning only the decoder changes the output without the network learning to compress the signal itself.

TSInsight: The Proposed Formulation
In contrast to the findings of Palacio et al. (2018) [17] for the image domain, directly optimizing the objective defined in Eq. 3 for time-series yields no interesting insights into the input preferred by the network. This effect is amplified with the increase in the dataset complexity. Fig. 2c presents an example from the forest cover dataset. It is evident from the figure that the resulting reconstruction from the fine-tuned auto-encoder provides no useful insights regarding the causality of points for a particular prediction. Therefore, instead of optimizing this raw objective, we modify the objective by adding a sparsity-inducing norm on the output of the auto-encoder. Inducing sparsity on the auto-encoder's output forces the network to only reproduce relevant regions of the input to the classifier, since the auto-encoder is optimized using the gradients from the classifier. However, just optimizing for sparsity introduces misalignment between the reconstruction and the input, as visualized in Fig. 2d. In order to ensure alignment between the two sequences, we additionally introduce the reconstruction loss into the final objective. Therefore, the proposed TSInsight optimization objective can be written as:
$$(W_E', W_D') = \arg\min_{W_E^{*}, W_D^{*}} \frac{1}{|X|} \sum_{(x, y) \in X \times Y} \Big[ \mathcal{L}\big(\Phi(\hat{x}; W^{*}), y\big) + \gamma \lVert x - \hat{x} \rVert_2^2 + \beta \lVert \hat{x} \rVert_1 \Big], \quad \hat{x} = D\big(E(x; W_E^{*}); W_D^{*}\big) \tag{4}$$
where L represents the classification loss function, which is cross-entropy in our case, Φ denotes the classifier with pretrained weights W^*, while E and D denote the encoder and decoder respectively with corresponding pretrained weights W_E^* and W_D^*. We introduce two new hyperparameters, γ and β. γ controls the auto-encoder's focus on reconstruction of the input. β, on the other hand, controls the sparsity enforced on the output of the auto-encoder. After training the auto-encoder with the TSInsight objective function, the output is both sparse and aligned with the input, as evident from Fig. 2e.
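To make the three terms of the objective concrete, the following is a minimal per-example sketch in plain Python. It assumes the classifier's class probabilities for the auto-encoder output have already been computed; the function names `cross_entropy` and `tsinsight_loss`, the flattened-list representation of the sequence, and the toy numbers are illustrative assumptions rather than the authors' implementation.

```python
import math


def cross_entropy(probs, label):
    """Negative log-likelihood of the true class given class probabilities."""
    return -math.log(max(probs[label], 1e-12))


def tsinsight_loss(x, x_hat, probs, label, beta, gamma):
    """Per-example sketch of the objective described in the text.

    x      : flattened input sequence (list of floats)
    x_hat  : auto-encoder output for x, same length as x
    probs  : classifier probabilities computed on x_hat (assumed given)
    label  : index of the true class
    beta   : weight of the sparsity (L1) term on the auto-encoder output
    gamma  : weight of the reconstruction (squared L2) term
    """
    classification = cross_entropy(probs, label)
    reconstruction = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    sparsity = sum(abs(b) for b in x_hat)
    return classification + gamma * reconstruction + beta * sparsity


# Toy example: the reconstruction keeps only the spike at index 2.
x = [0.0, 0.1, 2.0, 0.1]
x_hat = [0.0, 0.0, 1.5, 0.0]
loss = tsinsight_loss(x, x_hat, probs=[0.2, 0.8], label=1, beta=0.1, gamma=1.0)
```

Raising `beta` in this sketch rewards outputs with fewer non-zero points, while raising `gamma` pulls the output back towards the input, which mirrors the trade-off between sparsity and alignment discussed above.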
The hyperparameters play an essential role for TSInsight to provide useful insights into the model's behavior. Performing a grid search to determine their values is not possible, as large values of β result in models which are more interpretable but inferior in terms of performance, thereby presenting a trade-off between performance and interpretability which is difficult to quantify. Although we found manual tuning of hyperparameters to be superior, we also investigated the employment of feature importance measures [22,28] for the automated selection of these hyperparameters (β and γ). The simplest candidate for this importance measure is saliency. This can be written as:
$$I(x) = \left| \frac{\partial a_L}{\partial x} \right|$$
where L denotes the number of layers in the classifier and a_L denotes the activations of the last layer in the classifier. This saliency-based importance computation is only based on the classifier. Once the corresponding importance values are computed, they are scaled to the range [0, 1] to serve as the corresponding reconstruction weight, i.e., γ. The inverted importance values then serve as the corresponding sparsity weight, i.e., β.
Therefore, the final term imposing sparsity on the classifier can be written as: In contrast to the instance-based value of β, we used the average saliency value in our experiments. This ensures that the activations are not sufficiently penalized so as to significantly impact the performance of the classifier. Due to the low relative magnitude of the sparsity term, we scaled it by a constant factor C (we used C = 10 in our experiments).
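The mapping from importance values to the two weights can be sketched as follows. This is only one possible reading of the description above: min-max scaling, inversion as one minus the scaled value, and averaging over the points of a single example are all assumptions, as are the function names `scale_to_unit_interval` and `importance_to_weights`.

```python
def scale_to_unit_interval(values):
    """Min-max scale a list of importance values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def importance_to_weights(importance, c=10.0):
    """Turn saliency-style importance values into (gamma, beta).

    gamma : per-point reconstruction weights (scaled importance)
    beta  : a single sparsity weight, taken here as the average inverted
            importance multiplied by the constant factor C (assumption)
    """
    scaled = scale_to_unit_interval([abs(v) for v in importance])
    gamma = scaled
    inverted = [1.0 - s for s in scaled]
    beta = c * sum(inverted) / len(inverted)
    return gamma, beta


gamma, beta = importance_to_weights([0.0, 2.0, 4.0], c=10.0)
```

With this reading, points the classifier is sensitive to receive a high reconstruction weight, and the single averaged sparsity weight keeps the penalty from dominating the classification loss.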

Experimental Setup
This section will cover the evaluation setup that we used to establish the utility of TSInsight in comparison to other commonly used attribution techniques. We will first define the evaluation metric we used to compare different attribution techniques. Then we will discuss the 8 different datasets that we used in our experimental study followed by the 11 different attribution techniques that we compared.

Evaluation Metric
Model attributions in visual modalities are commonly compared using the pointing game or the suppression test [6]. Since the pointing game is not directly applicable to time-series data, we compare TSInsight with other attribution techniques using the suppression test. The suppression test attempts to quantify the quality of the attribution by preserving only those parts of the input that are considered to be important by the method. This suppressed input is then passed on to the classifier. If the selected points are indeed causal/correlated to the prediction generated by the classifier, no evident effect on the prediction should be observed. On the other hand, if the points highlighted by the attribution technique are not the most important ones for prediction, the network's prediction will change. It is important to note that unless there is a high amount of sparsity present in the signal, suppressing the signal itself will result in a loss of accuracy for the classifier, since the suppressed inputs differ slightly from the inputs seen during training. We compared TSInsight with a range of different saliency methods.
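The suppression test described above can be sketched in a few lines of plain Python. The sketch assumes that suppressed points are replaced by zero and that the input is a flattened sequence; `suppress` and `toy_classifier` are hypothetical names, and the threshold classifier is only a stand-in for a pretrained model.

```python
def suppress(x, attribution, keep_fraction):
    """Keep the top `keep_fraction` of points by |attribution|, zero the rest."""
    n_keep = max(1, round(keep_fraction * len(x)))
    ranked = sorted(range(len(x)), key=lambda i: abs(attribution[i]), reverse=True)
    keep = set(ranked[:n_keep])
    return [v if i in keep else 0.0 for i, v in enumerate(x)]


def toy_classifier(x):
    """Hypothetical stand-in for a pretrained classifier: flags large spikes."""
    return 1 if max(abs(v) for v in x) > 3.0 else 0


x = [0.1, -0.2, 5.0, 0.3, 0.0, -0.1, 0.2, 0.1, 0.0, -0.3]
good_attr = [0.0, 0.1, 0.9, 0.1, 0.0, 0.0, 0.1, 0.0, 0.0, 0.1]  # highlights the spike
bad_attr = [0.9, 0.1, 0.0, 0.1, 0.0, 0.0, 0.1, 0.0, 0.0, 0.1]   # misses the spike

kept_good = suppress(x, good_attr, keep_fraction=0.1)
kept_bad = suppress(x, bad_attr, keep_fraction=0.1)

# A faithful attribution preserves the prediction; an unfaithful one changes it.
same_with_good = toy_classifier(kept_good) == toy_classifier(x)
same_with_bad = toy_classifier(kept_bad) == toy_classifier(x)
```

In an actual evaluation, the same procedure would be repeated over the test set and over several values of `keep_fraction`, reporting accuracy on the suppressed inputs for each attribution method.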

Datasets
We employed 8 different time-series datasets in our study. The summary of these datasets is available in Table 1. We will now cover each of these datasets in detail.
Synthetic Anomaly Detection Dataset: The synthetic anomaly detection dataset [22] is a synthetic dataset comprising three different channels referring to the pressure, temperature and torque values of a machine running in a production setting, where the task is to detect anomalies. The dataset only contains point-anomalies. If a point-anomaly is present in a sequence, the whole sequence is marked as anomalous. Anomalies were intentionally never introduced on the pressure signal in order to identify how the network treats that particular channel.
Electric Devices Dataset: The electric devices dataset [9] is a small subset of the data collected as part of the UK government's sponsored study, Powering the Nation. The aim of this study was to reduce the UK's carbon footprint. The electric devices dataset comprises data from 251 households, sampled in two-minute intervals over a month.
Character Trajectories Dataset: The character trajectories dataset contains hand-written characters recorded using a Wacom tablet. Only three dimensions are kept for the final dataset, namely x, y and pen-tip force. The sampling rate was set to 200 Hz. The data was numerically differentiated and Gaussian smoothed with σ = 2. The task is to classify the characters into 20 different classes.
FordA Dataset: The FordA dataset was originally used for a competition organized by IEEE at the IEEE World Congress on Computational Intelligence (2008). It is a binary classification problem where the task is to identify whether a certain symptom exists in the automotive subsystem. The FordA dataset was collected with minimal noise contamination in typical operating conditions.
Forest Cover Dataset: The forest cover dataset [27] has been adapted from the UCI repository for the classification of forest cover type from cartographic variables. The dataset has been transformed into an anomaly detection dataset by selecting only 10 quantitative attributes out of a total of 54. Instances from the second class were considered to be normal, while instances from the fourth class were considered to be anomalous. The ratio of the anomalies to normal data points is 0.9%. Since only two classes were considered, the rest of them were discarded.
WESAD Dataset: The WESAD dataset [20] is a classification dataset introduced by Bosch for classifying a person's affective state into three different classes, namely neutral, amusement and stress.
ECG Thorax Dataset: The non-invasive fetal ECG Thorax dataset is a classification dataset comprising 42 classes.
UWave Gesture Dataset: The UWave gesture dataset [15] contains accelerometer data where the task is to recognize 8 different gestures.

Attribution Techniques
We compared TSInsight against a range of commonly employed attribution techniques. Each attribution method provided us with an estimate of the features' importance, which we used to suppress the signal. In all cases, we used the absolute magnitude of the corresponding feature attribution to preserve the most important input features. Two methods, i.e., ε-LRP and DeepLift, were shown to be similar to input gradient [1]; therefore, we compare only against input gradient. We don't compute class-specific saliency, but instead compute the saliency w.r.t. all the output classes. For all the methods computing class-specific activation maps, e.g., GradCAM, guided GradCAM, and occlusion sensitivity, we used the class with the maximum predicted score as our target. The description of the 11 different attribution techniques evaluated in this study is provided below:
None: None refers to the absence of any importance measure. Therefore, in this case, the complete input is passed on to the classifier without any suppression for comparison.
Random: Random points from the input are suppressed in this case.
Input Magnitude: We treat the absolute magnitude of the input as a proxy for the features' importance.
Occlusion sensitivity: We iterate over different input channels and positions, mask the corresponding input features with a filter size of 3, and compute the difference in the confidence score of the predicted class (i.e., the class with the maximum score on the original input). We treat this sensitivity score as the features' importance. This is a brute-force measure of feature importance which is commonly employed in prior literature [30] and served as a strong baseline in our experiments. A major limitation of occlusion sensitivity is its execution speed, since it requires iterating over the complete input and running inference numerous times.
TSInsight: We treat the absolute magnitude of the output from the auto-encoder of TSInsight as the features' importance.
Palacio et al.: Similar to TSInsight, we use the absolute magnitude of the auto-encoder's output as the features' importance [17].
Gradient: We use the absolute value of the raw gradient of the classifier w.r.t. all of the classes as the features' importance [22,14].
Gradient ⊙ Input: We compute the Hadamard (element-wise) product between the gradient and the input, and use its absolute magnitude as the features' importance [26].
Integrated Gradients: We use the absolute value of the integrated gradient with 100 discrete steps between the input and the baseline (which was zero in our case) as the features' importance [26].
SmoothGrad: We use the absolute value of the smoothed gradient, computed using 100 different random noise vectors sampled from a Gaussian distribution with zero mean and a variance of 2/(max_j x_j − min_j x_j), where x is the input, as the features' importance measure [24].
Guided Backpropagation: We use the absolute value of the gradient provided by guided backpropagation [25]. In this case, all the ReLU layers were replaced with guided ReLU layers which mask negative gradients, hence filtering out negative influences for a particular class to improve visualization.
GradCAM: We use the absolute value of the Gradient-based Class Activation Map (GradCAM) [21] as our feature importance measure. GradCAM computes the importance of the different filters present in the input in order to come up with a metric to score the overall output. Since GradCAM visualizes a class activation map, we used the predicted class as the target for visualization.
Guided GradCAM: Guided GradCAM [21] is a guided variant of GradCAM which computes the Hadamard (element-wise) product of the signal from guided backpropagation and GradCAM. We again use the absolute value of the guided GradCAM output as the importance measure.
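As an illustration of one of the gradient-based baselines listed above, the following sketch approximates Integrated Gradients with 100 discrete steps and a zero baseline. The quadratic toy model, its analytic gradient, and the use of the midpoint rule for the path integral are assumptions made so that the example runs without a deep learning framework; they are not taken from the evaluated implementation.

```python
def model(x, w):
    """Hypothetical differentiable toy model: f(x) = sum_i w_i * x_i^2."""
    return sum(wi * xi * xi for wi, xi in zip(w, x))


def model_grad(x, w):
    """Analytic gradient of the toy model with respect to the input."""
    return [2.0 * wi * xi for wi, xi in zip(w, x)]


def integrated_gradients(x, w, baseline=None, steps=100):
    """Approximate Integrated Gradients along the straight path baseline -> x."""
    if baseline is None:
        baseline = [0.0] * len(x)  # zero baseline, as in the text
    total = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoint rule (assumption)
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        grad = model_grad(point, w)
        total = [t + g for t, g in zip(total, grad)]
    # Average gradient along the path, scaled by the input difference.
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, total)]


x = [1.0, -2.0, 0.5]
w = [0.5, 1.0, 2.0]
ig = integrated_gradients(x, w)
importance = [abs(v) for v in ig]  # absolute value used as feature importance
```

A useful sanity check for any Integrated Gradients implementation is completeness: the attributions should sum to the difference between the model output at the input and at the baseline, which holds for this toy model up to floating-point error.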

Results
The results we obtained with the proposed formulation were highly intelligible for the datasets we employed in this study. TSInsight produced a sparse representation of the input focusing only on the salient regions. In addition to interpretability, with a careful tuning of the hyperparameters, TSInsight outperformed the pretrained classifier in terms of accuracy in most of the cases, which is evident from Table 2. However, it is important to note that TSInsight is not designed for the purpose of performance, but rather for interpretability. Therefore, we expect that the performance will drop in many cases depending on the amount of sparsity enforced.
Fig. 3: Output from different attribution methods as well as the input after suppressing all the points except the top 5% highlighted by the corresponding attribution method on an anomalous example from the synthetic anomaly detection dataset (best viewed digitally).
In order to qualitatively assess the attribution provided by TSInsight, we visualize an anomalous example from the synthetic anomaly detection dataset in Fig. 3, along with the attributions from all the commonly employed attribution techniques (listed in Section 4.3) and the corresponding suppressed inputs. Since there were only a few relevant discriminative points in the case of the forest cover and synthetic anomaly detection datasets, TSInsight suppressed most of the input, making the decision directly interpretable. As described in Section 4.1, we compare the performance of different attribution techniques using the input suppression test. The results with different amounts of suppression are visualized in Fig. 4 and are computed based on 5 random runs. Since the datasets were picked to maximize diversity in terms of the features, there is no single method which generalizes perfectly to all the datasets. TSInsight produced the most plausible-looking explanations while also being the most competitive saliency estimator on average in comparison to all other attribution techniques.

Properties of TSInsight
We will now discuss some of the interesting properties that TSInsight achieves out-of-the-box, which include output space contraction, generic applicability and model-based (global) explanations. Since TSInsight induces a contraction in the input space, this also results in slight gains in terms of adversarial robustness. However, these gains are not consistent across datasets and strong adversaries, and are therefore omitted here for clarity. An in-depth evaluation of the adversarial robustness of TSInsight can be an interesting future direction.
Model-based vs Instance-based Explanations: Since TSInsight poses the attribution problem itself as an optimization objective, the data based on which this optimization problem is solved defines the explanation scope. If the optimization problem is solved for the complete dataset, this tunes the auto-encoder to be a generic feature extractor, enabling extraction of model/dataset-level insights using the attribution. In contrast, if the optimization problem is solved for a particular input, the auto-encoder discovers an instance's attribution. This differs from most other attribution techniques, which are only instance-specific.
Auto-Encoder's Jacobian Spectrum Analysis: Fig. 5 visualizes the histogram of singular values of the average Jacobian on the test set of the forest cover dataset. We compare the spectrum of the formulation from [17] and TSInsight. It is evident from the figure that most of the singular values for TSInsight are close to zero, indicating a contraction being induced in those directions. This is similar to the contraction induced in contractive auto-encoders [18], but without explicitly regularizing the Jacobian of the encoder.
Generic Applicability: TSInsight is compatible with any base model. We tested our method with two prominent architectural choices for time-series data, i.e., CNN and LSTM. The results highlight that TSInsight was capable of extracting the salient regions of the input regardless of the underlying architecture. It was interesting to note that since the LSTM uses memory cells to remember past states, the last point was found to be the most salient. For the CNN, on the other hand, the network had access to the complete information, resulting in an equal distribution of the saliency. A visual example is presented in Fig. 6.