Innovative And Additive Outlier Robust Kalman Filtering With A Robust Particle Filter

In this paper, we propose CE-BASS, a particle mixture Kalman filter which is robust to both innovative and additive outliers, and able to fully capture multi-modality in the distribution of the hidden state. Furthermore, the particle sampling approach re-samples past states, which enables CE-BASS to handle innovative outliers which are not immediately visible in the observations, such as trend changes. The filter is computationally efficient as we derive new, accurate approximations to the optimal proposal distributions for the particles. The proposed algorithm is shown to compare well with existing approaches and is applied to both machine temperature and server data.

Figure 1: (a) A realisation of the model in Equation (2) with σ_A = 1, σ_I = 0.1, and outliers defined by W_100 = 3600, V_400 = 100, and W_700 = 10000. (b) A realisation of the model in Equation (3) with σ_A = 1.
Here the additive noise ε_t ∈ R^p and the innovations ν_t ∈ R^q are both i.i.d. standard multivariate Gaussian. The diagonal matrices Σ_A and Σ_I denote the covariances of the additive and innovation noise respectively. The diagonal matrices V_t and W_t capture additive and innovative outliers respectively: large diagonal entries of V_t correspond to additive outliers, and large diagonal entries of W_t correspond to innovative outliers. The classical Kalman model is recovered by setting W_t = I and V_t = I for all times t.
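Equation (1) itself is not reproduced in this excerpt, so the following simulation sketch assumes the standard form in which the innovation and additive covariances are inflated to W_t Σ_I and V_t Σ_A respectively; all names are illustrative:

```python
import numpy as np

def simulate(A, C, Sigma_I, Sigma_A, W, V, x0, rng):
    """Simulate a state-space model of the form described in the text.

    W and V are sequences of diagonal matrices; identity entries recover
    the classical Kalman model, while large diagonal entries inject
    innovative and additive outliers respectively.  (Sketch: the exact
    noise placement in Equation (1) is assumed, not reproduced.)
    """
    T = len(W)
    q, p = A.shape[0], C.shape[0]
    xs, ys = [], []
    x = x0
    for t in range(T):
        # innovation noise with covariance W_t @ Sigma_I (diagonal, PSD)
        x = A @ x + np.linalg.cholesky(W[t] @ Sigma_I) @ rng.standard_normal(q)
        # additive observation noise with covariance V_t @ Sigma_A
        y = C @ x + np.linalg.cholesky(V[t] @ Sigma_A) @ rng.standard_normal(p)
        xs.append(x)
        ys.append(y)
    return np.array(xs), np.array(ys)
```

Setting, say, W[100] = 3600·I reproduces the kind of innovative outlier shown in Figure 1a.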
The model in Equation (1) can be used to capture a range of time series behaviours. We will use the following two examples throughout the paper. Example 1: a random walk model with both changepoints and outliers, similar to the problem considered by [17]. It can be formulated as in Equation (2). Here, atypically large values of V_t correspond to outliers, whilst atypically large values of W_t correspond to changes. A realisation of this model can be found in Figure 1a.
Example 2: a time series with changes in trend, level shifts, and outliers, similar to the model considered by [18]. It can be formulated as in Equation (3), with the first component of the hidden state denoting the current position and the second the current trend. Here, outliers are modelled by large values of V_t, whilst level shifts and changes in trend are modelled by atypically large values of W_t^(1) and W_t^(2) respectively. A realisation of this model can be found in Figure 1b.
A key feature of this second model is that an outlier in the trend component of X_t may only become detectable many observations after it occurs; this challenging issue, mentioned in the introduction, is addressed via the methods in Section 4. A wide range of other commonly used time series features, such as auto-correlation and moving averages, can be incorporated in the model.
To infer the locations of anomalies we use the model in Equation (4), for 1 ≤ i ≤ p and 1 ≤ j ≤ q. The random variables λ_t^(i) ~ Ber(r_i) and γ_t^(j) ~ Ber(s_j) are indicators that determine whether an anomaly is present in the ith additive and jth innovative component respectively. For additional interpretability, we impose that at most one anomaly is present at any given time t, and define r_i and s_j to be the probabilities that λ_t^(i) = 1 and γ_t^(j) = 1 respectively. The inverse scale, or precision, of an anomaly (if present) is given by the random variables Ṽ_t^(i,i) ~ σ̃_i Γ(a_i, a_i) and W̃_t^(j,j) ~ σ̃_j Γ(b_j, b_j) for 1 ≤ i ≤ p and 1 ≤ j ≤ q respectively.
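A sketch of drawing (V_t, W_t) from this anomaly prior, under the assumption (consistent with the proposal notation used later in the paper) that an anomaly inflates the relevant diagonal entry to 1 + 1/precision:

```python
import numpy as np

def sample_prior_VW(p, q, r, s, sigma_add, sigma_inn, a, b, rng):
    """Draw one (V_t, W_t) pair from the anomaly prior.

    At most one component is anomalous at any time.  If additive
    component i is anomalous, V gets 1 + 1/Vtilde on its ith diagonal,
    where the precision Vtilde ~ sigma_add[i] * Gamma(a[i], a[i])
    (shape a, rate a, hence scale 1/a); analogously for W.  The
    '1 + 1/precision' inflation is assumed from the form
    pi(I + (1/Wtilde) I^(j), I | Y) used for the proposals.
    """
    V, W = np.eye(p), np.eye(q)
    probs = np.concatenate([r, s, [1.0 - r.sum() - s.sum()]])
    k = rng.choice(p + q + 1, p=probs)
    if k < p:                                   # additive anomaly in component k
        prec = sigma_add[k] * rng.gamma(shape=a[k], scale=1.0 / a[k])
        V[k, k] += 1.0 / prec
    elif k < p + q:                             # innovative anomaly in component k - p
        j = k - p
        prec = sigma_inn[j] * rng.gamma(shape=b[j], scale=1.0 / b[j])
        W[j, j] += 1.0 / prec
    return V, W                                 # else: no anomaly, V = W = I
```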
The proposed model bears similarities to the model used by [11]. Both use a mixture of Gaussian and heavy-tailed noise. The main difference is that, in our model, anomalous behaviour is characterised by noise which is the sum of a Gaussian and a t-distribution, as opposed to just a t-distribution in the model used by [11]. This ensures that anomalies coincide with strictly greater noise, making the result more interpretable. In practice, however, the noise distributions considered in this paper and in [11] are likely to be of very similar shape.

Particle Filter
We now turn to filtering the model defined by Equations (1) and (4). The main feature we exploit is that, if we knew the value of (V_t, W_t) at all times t, we could simply run the classical Kalman filter over the data. Consequently, our approach consists of sampling particles for (V_t, W_t), conditional on which the classical Kalman update equations for the hidden state x_t can be used. This approach, very similar to the mixture Kalman filter [15, 19], is summarised by the pseudocode in Algorithm 1.
For each time t, the code loops over the existing particles (V_t, W_t) and simulates M̃ descendants for each of them in step 4. These are stored in a set of candidate particles. If we have N particles at time t, keeping all candidates would produce NM̃ particles at time t + 1. To avoid the number of particles growing exponentially in t, step 7 resamples the candidates down to N particles. The filtering distribution for each of these particles is then calculated using the Kalman filter updates in step 10.
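Conditional on a sampled (V, W), the per-particle update in step 10 is a single classical Kalman step with inflated noise covariances; a minimal sketch (the exact inflation form W Σ_I and V Σ_A is assumed from the model description):

```python
import numpy as np

def kalman_step(mu, Sigma, y, A, C, Sigma_A, Sigma_I, V, W):
    """One Kalman predict/update step conditional on a sampled (V, W).

    With (V, W) fixed the model is linear-Gaussian, so each particle
    simply carries classical Kalman recursions whose noise covariances
    are inflated to W @ Sigma_I and V @ Sigma_A (sketch; assumed form).
    """
    mu_pred = A @ mu
    Sigma_pred = A @ Sigma @ A.T + W @ Sigma_I
    S = C @ Sigma_pred @ C.T + V @ Sigma_A      # predictive variance of y
    K = Sigma_pred @ C.T @ np.linalg.inv(S)     # Kalman gain
    resid = y - C @ mu_pred
    mu_new = mu_pred + K @ resid
    Sigma_new = (np.eye(len(mu)) - K @ C) @ Sigma_pred
    return mu_new, Sigma_new, resid, S
```

With V = W = I this reduces to the classical Kalman filter, and the predictive variance S matches the Σ̃ defined later in the text.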
Adopting ideas from [16] and [20], we overcome the above challenge by sampling particles from an approximation to the conditional distribution of (V_{t+1}, W_{t+1}) given the observation Y_{t+1}. Denote the model's prior distribution for (V_{t+1}, W_{t+1}) in (4) by π_0(·). The conditional distribution π(W_{t+1}, V_{t+1} | Y_{t+1}) for the descendants of a particle whose filtering distribution for x_t is N(µ, Σ) is then proportional to the expression in (5). Here we have dropped time indices for convenience, and L(x, µ, Σ) denotes the likelihood of an observation x under an N(µ, Σ) model. Since at most one component is anomalous, we can rewrite this as a sum over which component, if any, is anomalous. Here, we use the shorthand π̃_i(Ṽ^(i,i)) = π(I, I + (1/Ṽ^(i,i)) I^(i) | Y). Since the target distribution π(W, V | Y) is intractable, we construct an approximation to it, which we denote q(W, V | Y), and use this as our proposal distribution. Clearly, there is no benefit in simulating multiple identical descendants, so we wish to sample precisely one descendant corresponding to no outliers. To do this, and also to have the same number of descendant particles for each possible type of outlier, we set β_0 accordingly and use stratified subsampling as in [19]. This leads to M̃ = M(p + q) + 1 total descendants per particle: M for each of the p additive and q innovative outlier types, and one for no outlier. Each of these particles is then given an importance weight proportional to the ratio of target to proposal.
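The descendant bookkeeping implied by the stratified scheme can be sketched as follows (names are illustrative):

```python
def descendant_types(p, q, M):
    """Enumerate the descendants of one parent particle: exactly one
    'no outlier' descendant, plus M for each of the p additive and
    q innovative components, giving M * (p + q) + 1 in total (the
    Mtilde of the text)."""
    types = [("none", None)]
    for i in range(p):
        types += [("additive", i)] * M
    for j in range(q):
        types += [("innovative", j)] * M
    return types
```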
The main challenge now consists of obtaining proposal distributions q̃_i(·) for 1 ≤ i ≤ p and q̃_j(·) for 1 ≤ j ≤ q that provide good approximations to the conditional posteriors proportional to π̃_i(·) and π̃_j(·) respectively. In the next subsection, we therefore derive proposal distributions that provide leading order approximations to these conditional posteriors. To simplify notation, we define the predictive variance Σ̃ = CAΣA^T C^T + Σ_A + CΣ_I C^T and use it throughout the remainder of this paper. We also begin by assuming that C contains no 0-columns. The proposal introduced in the following subsection also forms the basis of the back-sampling introduced in Section 4, which allows us to relax this assumption on C.

Proposal Distributions
For 1 ≤ i ≤ p, we would like the proposal distribution q̃_i(Ṽ^(i,i)) for the precision Ṽ^(i,i) to be as close as possible to π̃_i(Ṽ^(i,i)).
It should be noted that the two intractable terms can be expanded using the matrix determinant lemma and the Sherman-Morrison formula respectively, as they are rank-1 updates of a determinant and of an inverse. Indeed, by the matrix determinant lemma, the leading order term is conjugate to the prior of Ṽ^(i,i). Moreover, by the Sherman-Morrison formula, the second term in Equation (5) expands into three terms: the first two are constant in Ṽ, while the third is linear in Ṽ and therefore returns a term which is conjugate to the prior of Ṽ. Furthermore, we are most concerned about accurately sampling the particle when an anomaly occurs in the ith component, which happens when the precision Ṽ^(i,i), and with it the higher order terms, become small.
Keeping only the leading order terms in the determinant and the exponential results in the proposal distribution q̃_i(Ṽ^(i,i)). More detailed derivations, including the associated weight, are given by Theorem 1 in the appendix. This proposal has the property that, as the observed anomaly in the ith component becomes larger, the proposal mean diverges from the prior mean and behaves asymptotically like the inverse of the squared residual. Consequently, the variance and the squared residual will be on the same scale, thus achieving computational robustness.
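The conjugate sampling step can be sketched as follows. The exact shape and rate of the Gamma proposal are given in Theorem 1 and are elided in the text above, so the update below uses illustrative, assumed expressions; what the sketch does reproduce is the robustness property that the proposed variance scales with the squared standardised residual:

```python
import numpy as np

def propose_additive_precision(z2, a, sigma, rng, M=1):
    """Sketch of a conjugate Gamma proposal for the anomaly precision.

    Hypothetical shape/rate update (the exact expressions are those of
    Theorem 1): a Gamma(a, a) prior combined with an exponential term
    linear in the precision yields a Gamma posterior whose rate grows
    with the squared residual z2.  As z2 -> infinity the proposed
    variance 1/precision scales like z2, the computational-robustness
    property noted in the text.
    """
    shape = a + 0.5                  # assumed conjugate update
    rate = a + 0.5 * z2 / sigma      # assumed conjugate update
    return sigma * rng.gamma(shape=shape, scale=1.0 / rate, size=M)
```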
A very similar approach can be used to obtain a proposal distribution q̃_j(W̃^(j,j)) which provides a leading order approximation to the distribution proportional to π(I + (1/W̃^(j,j)) I^(j), I | Y). The proposal consists of sampling W̃^(j,j) and is of very similar form to the proposal distribution for particles with an additive outlier; it is well defined provided that C has no 0-columns. Further details, including the associated weight, are given in Theorem 2 in the appendix. Like the proposal distribution for particles with an additive anomaly, this proposal is computationally robust: it ensures that the squared residual and the variance remain on the same scale as the anomaly in the jth innovative component becomes stronger.
Finally, the "proposal" for particles without anomalies consists of deterministically setting V = I and W = I. The weight associated with this particle is proportional to the likelihood, the closed form of which is given in Theorem 3 in the appendix.

Choices of Parameters
The choice of hyper-parameters, particularly σ̃_i and σ̃_j, has a significant effect on the performance of the proposed filter. One reason for this is that an outlying observation could be the result of either an additive or an innovative outlier, and it may be that the root cause can only be determined after further observations are made. We therefore wish to choose hyper-parameters in such a way that observed anomalies which are equally well explained by different classes of anomaly are given similar importance weights. The following result describes such a choice. Theorem 4: Let the prior for the hidden state X_t be N(µ, Σ) and let an observation Y_{t+1} := Y be available. When the σ̃_i and σ̃_j are set accordingly and a_1 = ... = a_p = b_1 = ... = b_q = c, the weights of additive and innovative anomalies are asymptotically proportional as δ → ∞. The above choice of hyper-parameters therefore leads to all components being given equal asymptotic importance weight under an anomaly they are able to account for. Setting all the a_i and b_j to the same constant is advisable because the convolution of two t-distributions whose means drift further and further apart yields two stable, i.e. non-vanishing, modes if and only if they have the same scale parameter.
While Σ̃^{-1} is not fixed but time dependent, it nevertheless converges to a limit under an observable Kalman filter model. In practice, we therefore use this limit to set σ̃_i and σ̃_j.

Example 1 -revisited
The proposed filter can be applied to the data displayed in Figure 1a to detect anomalies in an online fashion. It is worth pointing out that the filter re-evaluates past anomalies as more data becomes available. This can be seen in Figure 2: when initially encountering the anomaly at time t = 100, the filter gives approximately equal weight to the possibility of it being an additive outlier and to it being an innovative one. Only when the next observation becomes available does the filter (correctly) classify it as an innovative anomaly. Note that only N = 20 particles were used and only M = 1 descendant of each anomaly type was sampled per particle.

Particle Filter With Back-Sampling -CE-BASS
As mentioned in the introduction, it is possible that innovative outliers may not be immediately observed. One such example is an innovative outlier in the trend component of the model described in (3). The filter described in Algorithm 1 cannot deal with such anomalies, as it only inflates the variance of the innovative process at time t when there is evidence in the observation at the same time t that an outlier occurred. This can be remedied by back-sampling particles representing innovative outliers at a later time t + k, once more observations, and therefore more evidence for an anomaly, are available. This can be done using nearly identical approximation strategies to those of the previous section, and allows us to relax the assumption of the previous section from C not having any 0-columns to merely requiring that the system be observable.

Back-Sampling Particles Using the Last k + 1 Observations
The proposed back-sampling strategy at time t consists of sampling particles for the noise variances at time t + 1 − k. Specifically, we sample particles with a single innovative anomaly in W_{t+1−k}, assuming no other innovative or additive anomalies. Conditional on these augmented particles, classical Kalman updates can once more be used, as shown in Algorithm 2. It should be noted that Algorithm 1 is a special case of Algorithm 2, arising from a particular choice of the back-sampling sets.

Algorithm 2 (excerpt): Inn_Des ← BS_inn(µ, Σ, Ỹ, A, C, Σ_A, Σ_I, M, hor); for each (V, W, prob) ∈ Inn_Des, the candidate set is extended.
To sample a particle with an innovative anomaly in the jth component of W_{t+1−k}, we define an augmented observation vector Ỹ^(k) = ((Y_{t+1−k})^T, ..., (Y_{t+1})^T)^T. This is normally distributed with mean C̃^(k) A µ_{t−k} and a variance determined by the model, where C̃^(k) = ((C)^T, (CA)^T, ..., (CA^k)^T)^T denotes the augmented matrix mapping the hidden state to the observations. In a similar spirit, we define an augmented predictive variance. As a result of this reformulation, we retrieve update equations consisting of a single Kalman step, albeit with a different dimension of the observation: (k + 1)p instead of p. It is therefore possible to use the sampling procedure for innovative outliers introduced in Section 3.1. This consists of sampling particles for W̃ based on the augmented residual z̃. The associated weight is given in Theorem 5 in the appendix.
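The augmented matrix C̃^(k) stacking C, CA, ..., CA^k can be constructed directly; a short sketch:

```python
import numpy as np

def augmented_C(A, C, k):
    """Stack C, CA, ..., CA^k into the augmented observation matrix
    Ctilde^(k), which maps the hidden state at the back-sampling time
    to the block of the following k + 1 observations (sketch of the
    back-sampling reformulation)."""
    blocks, Ak = [], np.eye(A.shape[0])
    for _ in range(k + 1):
        blocks.append(C @ Ak)   # C A^0, C A^1, ..., C A^k
        Ak = A @ Ak
    return np.vstack(blocks)
```

For the trend model of Example 2, with A = [[1, 1], [0, 1]] and C = [1, 0], this yields a (k + 1) × 2 matrix whose rows expose the trend component after one step, which is exactly why back-sampling makes delayed trend anomalies detectable.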
As in Section 3.2, we want to give different particles equal weights if they explain anomalies equally well. In particular, we want to balance the weights given to the back-sampled particles against those of the descendants of particles with an anomaly sampled at time t + 1 − k using just Y_{t+1−k}. To do so, consider observations Y_{t+1−k}, ..., Y_{t+1} which perfectly fit an innovative outlier in the jth innovative component at time t + 1 − k.
As δ grows, the importance weight behaves, up to the likelihood term and the (1 − Σ_{i=1}^p r_i − Σ_{j=1}^q s_j)^k factor, like the weight derived in Section 3.2. However, these terms are also present in the weights of the descendants of the particles sampled at t + 1 − k if no further anomaly was sampled at times t + 2 − k, ..., t + 1. Therefore, setting σ̂_j appropriately results in the same asymptotic probabilities as the ones obtained in Section 3.2; given that σ̂_j can only take a single value, we set it accordingly. Several considerations guide the choice of the sets B_j for 1 ≤ j ≤ q. We assume that the Kalman model is observable, i.e. that there exists a k such that the matrix ((C)^T, (CA)^T, ..., (CA^k)^T)^T has full column rank, and let k* denote the smallest such k. It is advisable to choose the set B_j such that it contains at least one element greater than or equal to k*, since any innovative anomaly capable of eventually influencing the observations must do so within k* observations of occurring. It should also be noted that a horizon h can only be in the set B_j if the jth column of the augmented mapping from the hidden states to the observations, C̃^(h), is non-zero, as this is required by the proposal.
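The observability index k* can be computed by growing the stacked matrix until it reaches full column rank; a sketch:

```python
import numpy as np

def min_observability_horizon(A, C, k_max=50):
    """Smallest k such that [C; CA; ...; CA^k] has full column rank
    (the k* of the text).  Back-sampling sets B_j should contain at
    least one element >= k*.  Returns None if no such k <= k_max."""
    n = A.shape[0]
    O, Ak = C.copy(), np.eye(n)
    for k in range(k_max + 1):
        if np.linalg.matrix_rank(O) == n:
            return k
        Ak = A @ Ak
        O = np.vstack([O, C @ Ak])
    return None
```

For the trend model (position observed, trend hidden) this returns k* = 1: the trend only influences the observations after one step.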

Example
With back-sampling, we are now able to tackle the example from Figure 1b. We used B_1 = {1, ..., 40} and B_2 = {1, ..., 40}, allowing back-sampling of up to 40 observations. We maintained N = 40 particles and sampled M = 1 descendant of each type. The output of the particle filter can be seen in Figure 3. As before, the filter updates its output as new observations become available. Whilst the trend innovation occurs at time t = 800, the anomaly is first detected around time t = 820. Even then, there is a large amount of uncertainty regarding the precise location of the anomaly, which is only resolved at a later time.

Simulations
We now turn to comparing CE-BASS against other methods. In particular, we compare against the t-distribution based additive outlier robust filter by [8], the Huberisation based additive outlier robust filter by [9], the Huberisation based innovative outlier robust filter by [9], and the classical Kalman Filter [4]. All these algorithms are implemented in the accompanying package.
We consider four different models and generate 1000 observations from each. For each of the four models, we consider a case in which no anomalies are present, a case in which only additive anomalies are present, a case in which only innovative anomalies are present, and a case in which both additive and innovative anomalies are present. When anomalies are added, they are added at times t = 100, t = 300, t = 600, and t = 900. Specifically, we considered the following four models: 1. The model of Example 1 with σ_A = 1 and σ_I = 0.1. We consider a case with only additive outliers, a case with only innovative outliers, and a case where an additive outlier at t = 100 is followed by two innovative outliers at times t = 300 and t = 600, which are then followed by an additive outlier at time t = 900. To simulate additive anomalies, we inflated the relevant entries of V_t.

2. The random walk model with two measurements, where σ_A = 1 for i = 1, 2 and σ_I = 0.1. We consider a case with only additive outliers (one in the first component, then two in the second, then one in the first), a case with only innovative outliers, and a case where an additive outlier in the first component at time t = 100 is followed by two innovative outliers at times t = 300 and t = 600, which are then followed by an additive outlier in the second component at time t = 900. For additive anomalies, we inflated the relevant entries of V_t.
3. A model with σ_I = 0.01. We consider a case with only additive outliers, a case with only innovative outliers (one in the second component, then one in the first, then one in the second, then one in the first), and a case with an additive outlier at t = 100, followed by an innovative outlier affecting the first component of the hidden state at time t = 300, followed by an innovative outlier affecting the second component of the hidden state at time t = 600, followed by an additive outlier at time t = 900. The additive anomalies were instances where we set the relevant diagonal entries of V_t to 30, and the innovative outliers instances where we inflated the relevant entries of W_t.
4. An extension of Example 2 where the position is also observed. The equations governing the hidden state are as before, whilst the equations governing the observations include the additional position measurement.
We evaluate the different methods based on average predictive log-likelihood and average predictive mean squared error. We exclude all observations corresponding to anomalies from the calculation of these averages, since the filters cannot be expected to predict them. When calculating the average mean squared error, we additionally remove one observation after each anomaly in the first setting and two observations in the third setting from the performance metric. This gives the filter enough information to determine which type of anomaly the outlier corresponds to and to return to a unimodal posterior, the MSE being an appropriate metric only for unimodal posteriors.
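The first evaluation metric can be sketched as follows for the univariate case, with the anomaly times passed as the exclusion set:

```python
import numpy as np

def avg_predictive_loglik(y, mu_pred, var_pred, exclude):
    """Average one-step predictive Gaussian log-likelihood, excluding
    the observation indices in `exclude` (the anomaly times), as used
    for the comparison in the text.  Univariate sketch: mu_pred and
    var_pred are the filters' one-step predictive means and variances."""
    keep = np.setdiff1d(np.arange(len(y)), exclude)
    ll = -0.5 * (np.log(2 * np.pi * var_pred[keep])
                 + (y[keep] - mu_pred[keep]) ** 2 / var_pred[keep])
    return ll.mean()
```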
The average log-likelihoods across all models can be found in Figure 4, while the qualitatively very similar results for the mean squared error can be found in the appendix. We see that the performance of CE-BASS compares favourably with that of the competing methods. In particular it is as accurate as the Kalman filter in the absence of anomalies and is more accurate than the additive outlier and innovative outlier robust filters even when only additive or innovative outliers are present, i.e. the settings for which these algorithms were designed.

Application
In this section, we apply CE-BASS to two real datasets, using a different type of model for each to illustrate the ways in which CE-BASS can be used. The first dataset is a labelled benchmark dataset which consists of temperature readings on a large industrial machine. Here, we will use a model which considerably restricts the movements of the hidden states when no anomalies are present, and thus emulates a changepoint model. The second is an unlabelled dataset which consists of repeated throughput measurements on a router. For that application, we will use a model which has a considerable amount of flexibility, in which the hidden states tend to follow the observations, and which therefore detects localised anomalies.
Figure 5: Machine temperature dataset. (a) Raw data with labels. (b) CE-BASS output. The labelled anomalies are: a planned shutdown, an early warning sign of a problem, and the catastrophic system failure caused by the problem.

Machine Temperature Data
We now apply CE-BASS to the machine temperature data taken from the Numenta Anomaly Benchmark (NAB, [21]), which can be accessed at https://github.com/numenta/NAB. The data consists of over 20000 readings from a temperature sensor on a large industrial machine and is displayed in Figure 5a alongside the three periods of anomalous behaviour labelled by an engineer. The first corresponds to a planned shutdown and the second to an early warning sign of the third anomaly, a catastrophic failure.
To detect persistent changes in mean, we use the random walk model from Example 1. We use a maximum back-sampling horizon of 250 by setting B_1 = {1, 5, 10, 20, 40, 80, 150, 250}, and fix σ_I = σ_A/10000 to ensure that long and weak anomalies will not be interpreted as a persistent shift in the typical state. We use the first 15% of the data, marked by [21] as training data, to estimate the standard deviation σ_A and the initial mean µ_0, using the median absolute deviation and the median respectively. Using robust covariance methods, we also detect very strong auto-correlation (ρ = 0.99) and therefore raise the default anomaly probabilities to the power 1/(1 − ρ). The results of this analysis can be seen in Figure 5b. We note that all anomalies flagged by the engineer are also detected by CE-BASS. Two additional innovative anomalies are detected around a prolonged drop which preceded the planned shutdown. These could be false positives, or an early warning sign, unnoticed by the engineer, of an anomaly prevented by the shutdown.
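The robust initialisation described above can be sketched as follows; the default anomaly probability r_default is an assumed placeholder, and the factor 1.4826 makes the MAD a consistent scale estimator for Gaussian data:

```python
import numpy as np

def robust_init(train, rho=0.0):
    """Robust initialisation as in the machine-temperature analysis:
    median for mu_0, scaled median absolute deviation for sigma_A, and
    anomaly probabilities raised to the power 1/(1 - rho) to compensate
    for auto-correlation.  r_default is an assumed placeholder value."""
    mu0 = np.median(train)
    sigma_A = 1.4826 * np.median(np.abs(train - mu0))
    r_default = 1e-3                       # assumed default anomaly probability
    r_adj = r_default ** (1.0 / (1.0 - rho))
    return mu0, sigma_A, r_adj
```

With ρ = 0.99 the exponent is 100, which drastically shrinks the per-observation anomaly probability, reflecting that strongly auto-correlated data offers many near-duplicate opportunities to flag the same event.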

Router Data
The online analysis of aggregated traffic data on servers is an important challenge in both predictive maintenance and cyber security. This is because anomalies in throughput can point towards problems in the network, such as malfunctions or malicious behaviour. Detecting anomalies as soon as possible therefore means that the root cause can be addressed more quickly, potentially even before user experience is affected or harm is caused.
Figure 6: CE-BASS applied to 9 days of de-seasonalised router data. Lines correspond to innovative anomalies, i.e. spikes or level shifts.
In this section, we consider 19 days' worth of data from a network IP router, gathered at a frequency of one observation every 30 seconds. To preserve confidentiality, we de-seasonalised the data for days 11 to 19 using a seasonality model trained on days 1 to 10 and, for the purpose of this paper, consider only the de-seasonalised data for days 11 to 19, which can be found in Figures 6a to 6i. The main features apparent in the daily series are spikes, outliers, and changepoints. To capture these, we use an AR(1) model with slowly changing mean to model the observations Y_t. Here, anomalies in ε_t correspond to isolated outliers, anomalies in η_t^(1) correspond to level shifts, and anomalies in η_t^(2) correspond to spikes.
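The exact model equations are elided in this excerpt; a hypothetical state-space cast consistent with the description (a near-constant level η^(1) plus an AR(1) component η^(2) using the ρ estimated below) would be:

```python
import numpy as np

# Hypothetical state-space cast of the AR(1)-with-slowly-changing-mean model.
# The hidden state holds a near-constant level eta^(1) (tiny innovation
# variance, so shifts in it are level-shift anomalies) and an AR(1)
# component eta^(2) whose innovation anomalies appear as spikes.
rho = 0.815                    # AR coefficient, as estimated in the text
A = np.array([[1.0, 0.0],      # level: random walk with very small innovation
              [0.0, rho]])     # AR(1) component: spikes decay at rate rho
C = np.array([[1.0, 1.0]])     # observation = level + AR component + noise eps_t
```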
We use the first 1000 observations of the first day to obtain the estimates σ_A = 0.0516, σ_I = 0.516, and ρ = 0.815. The results obtained from running CE-BASS with these parameters on the daily router data are displayed in Figures 6a to 6i. We note that very few of the anomalies returned can be classed as false positives. At the same time, a large number of anomalies are flagged, including many outliers and spikes, but also some level shifts (day 14). Discussions with engineers highlighted that the detected anomalies matched their knowledge of the data well. This demonstrates CE-BASS's ability to return a large number of diverse features which can be used as inputs to a supervised algorithm, should labels become available.
Proof: We wish to sample from the posterior distribution of Ṽ, where f_i(·) denotes the PDF of a σ̃_i Γ(a_i, a_i)-distribution. The intractable part of the above involves I^(i) = e_i e_i^T, a matrix which is 0 everywhere with the exception of the ith entry of the ith row, which is 1. Note that I^(i) has rank 1, so the Sherman-Morrison formula applies. Furthermore, given tr(Σ̃^{-1} I^(i)) = (Σ̃^{-1})_{(i,i)}, the resulting expression simplifies. Crucially, the first term is constant in Ṽ. We can therefore rewrite the posterior of Ṽ in Equation (6) in conjugate form and, using conjugacy, sample M particles for Ṽ, giving each particle an importance weight proportional to the ratio of target to proposal.
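The rank-1 identity invoked here can be checked numerically; the matrix S below is an arbitrary SPD stand-in for Σ̃:

```python
import numpy as np

# Numeric check of the Sherman-Morrison rank-1 update used in the proof:
# (S + c e_i e_i^T)^{-1}
#   = S^{-1} - c (S^{-1} e_i)(e_i^T S^{-1}) / (1 + c (S^{-1})_{ii}),
# where the denominator uses tr(S^{-1} I^(i)) = (S^{-1})_{ii}.
rng = np.random.default_rng(0)
B = np.eye(3) + 0.1 * rng.standard_normal((3, 3))
S = B @ B.T                     # symmetric positive definite stand-in
i, c = 1, 2.5                   # perturbed index and rank-1 weight
e = np.zeros(3); e[i] = 1.0
Sinv = np.linalg.inv(S)
lhs = np.linalg.inv(S + c * np.outer(e, e))
rhs = Sinv - c * np.outer(Sinv @ e, e @ Sinv) / (1.0 + c * Sinv[i, i])
```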

Theorem 2
Theorem 2: Let the prior for the hidden state X_t be N(µ, Σ) and an observation Y_{t+1} := Y be available. Then the samples for W̃^(j,j) are distributed as stated above. The proof is almost identical to that of Theorem 1 and has been omitted.

Theorem 3
Theorem 3: Let the prior for the hidden state X_t be N(µ, Σ) and an observation Y_{t+1} := Y be available. Then the proposal particle (I_p, I_q) for (V_t, W_t) has weight proportional to an expression involving |Σ̃|. This is immediate from the Gaussian likelihood and the Bernoulli priors for λ_t^(i) and γ_t^(j), both for the particles containing an anomaly in the ith additive component and for the particles containing an anomaly in the jth innovative component.
As mentioned in the section on proposal distributions, the mean of the proposal for the ith additive component behaves asymptotically as stated. Proof: Identical (up to variable names) to that of Theorem 2.

Additional Simulations
Violin plots for the predictive mean squared error are displayed in Figure 7.

Complete Pseudocode