Error sensitivity analysis of Delta divergence - a novel measure for classifier incongruence detection

The state of classifier incongruence in decision making systems incorporating multiple classifiers is often an indicator of anomaly caused by an unexpected observation or an unusual situation. Its assessment is important as one of the key mechanisms for domain anomaly detection. In this paper, we investigate the sensitivity of Delta divergence, a novel measure of classifier incongruence, to estimation errors. Statistical properties of Delta divergence are analysed both theoretically and experimentally. The results of the analysis provide guidelines on the selection of threshold for classifier incongruence detection based on this measure.



Introduction
Many sensor data analysis systems involve multiple classifiers to interpret input data, which leads to improved performance by virtue of exploiting complementary information derived from multiple modalities of sensing, multiple representations, contextual information, and hierarchical structuring of the interpretation process. In addition to increased performance, an important corollary of involving multiple experts in decision making is the ability to flag anomalies by looking for discrepancy between their outputs, referred to as incongruence.
Anomaly detection, i.e. finding patterns in data that do not conform to expected normal behaviour [1], has been studied in many areas including statistical signal processing and pattern recognition [2][3][4][5][6][7], as well as a wide variety of applications, such as intrusion detection for cyber-security [8][9][10][11], surveillance [12,13], video-based crowd-behaviour analysis [14][15][16] and fault detection in sensor systems [17,18]. A large number of techniques have been developed for this problem, including the methods based on e.g. classification, clustering, statistical modelling, among many others, as surveyed by Chandola et al. [1], Markou and Singh [6,7], and Patcha and Park [19]. The basic approach to anomaly detection adopted in all these techniques is to compare incoming data against a reference model that embodies normality. This approach is also known as outlier detection.
Despite this effort, the development of good models of normality for diverse applications is not without challenges. Moreover, detecting anomalies in multiple classifier systems raises additional issues. It has been argued in [20] that in order to identify and distinguish the multifaceted nature of anomaly and take appropriate control actions, a more complex system consisting of several other mechanisms is needed in addition to outlier detection. They include data quality assessment, classifier decision confidence estimation and classifier incongruence detection [20]. Among these mechanisms, classifier incongruence detection, in other words measuring the disagreement between the classifiers embodied in the system, is of paramount importance. It helps to differentiate between certain types of anomalous events such as an out-of-context event, where an event is unexpected, a rare event, where a given configuration of components occurs very infrequently, or an unknown structure [20]. This mechanism is the subject and focus of this paper.
A simple example of anomaly detection using incongruence is out-of-vocabulary word detection in speech recognition [21]. A speech recognition system would typically involve a hierarchical decision making strategy based on the outputs of noncontextual and contextual classifiers. Noncontextual classifiers operating at a low level of representation attempt to identify phonemes based on the speech content, whereas contextual classifiers combine this low level symbolic representation with prior knowledge to segment and recognise larger semantic units such as words. Implicitly, in this complex decision making process, we get two opinions about the identity of each phoneme: one derived from the contextual classifier and one from its noncontextual counterpart. For successful speech understanding, we do not necessarily need to be concerned with the low level interpretation process. However, by monitoring the outputs of both contextual and noncontextual classifiers we may glean very useful information which could enable us to qualify the failure of the speech recognition system to interpret input data. For instance, if the low level classifier makes confident decisions about the identity of the phonemes, but a sequence of the detected phonemes does not produce a meaningful output, the system may be encountering an out-of-vocabulary word. Discerning such nuances in sensor data interpretation would allow us to act accordingly. This, however, requires a reliable method of classifier incongruence detection which can spot and discriminate disagreements in classifier opinions about one or more hypotheses.
Detecting incongruence can be formulated as a statistical hypothesis testing problem [6]. This typically involves some proposition, referred to as a null hypothesis, and a test statistic. If the outcome of the test statistic is consistent with its known distribution model, then the null hypothesis is accepted. An outlier of that distribution would lead to the rejection of the hypothesis. An observation is considered an outlier at a given level of significance, i.e. if the test statistic value exceeds a threshold corresponding to some vestigial probability, such as 5% or 1%. Accordingly, the proposition in incongruence detection is that two classifier outputs are congruent. If the test statistic exceeds a threshold corresponding to the required level of significance then the hypothesis is rejected, that is the classifier outputs are deemed incongruent. Let us emphasise here that measuring classifier incongruence is meaningful only when a dominant class probability output by a classifier exceeds a certain confidence level and there is sufficient margin between the probabilities of the dominant class and the next strongest class.
Clearly the test statistic is a crucial component of a hypothesis testing process. The choice not only influences its statistical properties, but also how faithfully it reflects the concept tested. For instance, the throw of a coin and counting the number of heads in testing whether the coin is biased introduces a statistical element in the test process. A much more transparent test would consist in looking at both sides of the coin, which would immediately, in unambiguous terms, establish whether the coin is biased or not. It is the choice of the experiment of repeated trials, and the head count, which makes the hypothesis testing more difficult than it needs to be, and injects randomness in the experimental outcome. Moreover, this particular choice only reflects the phenomenon to be tested indirectly, rather than in the most transparent way possible.
A classical classifier incongruence test statistic is the Kullback-Leibler (KL) divergence known as Bayesian surprise [22]. However, it has recently been pointed out that this measure has some deficiencies. In particular in multiclass problems, it has been shown to be unpredictably affected by the probabilities of nondominant classes (referred to as clutter) and a variant of the KL divergence, referred to as Decision-Cognizant KL (DC-KL) divergence has been proposed instead [23]. Some other undesirable properties of KL type divergence, induced by its log function, have been rectified by the recently proposed Delta divergence [24]. However, the key question not addressed so far, is whether the superior theoretical properties of Delta divergence are robust to estimation errors. For example, in multiple classifier fusion, sensitivity to errors changed the ranking of the product and sum fusion rules, although the former is founded on sound theoretical principles.
The aim of this paper is to investigate error sensitivity of Delta divergence as a measure of classifier incongruence. The study includes a theoretical analysis of a few special cases to gain an intuitive feeling for the behaviour of Delta divergence in noisy conditions. A more comprehensive investigation is carried out by simulation studies where the space of class a posteriori probabilities is sampled to estimate the probability distribution of noise-free Delta divergence values for various scenarios. The samples of the a posteriori probability distributions are then corrupted by estimation errors and their impact on Delta divergence is measured experimentally. The aggregation of the statistical distributions of Delta divergence over different scenarios and the distribution of noise-free Delta divergence values produces the final test statistic distribution which can be used to determine appropriate classifier incongruence detection thresholds. Although the simulation studies are limited by the assumptions made regarding the estimation noise, their main merit is to give the reader a better understanding of the behaviour of Delta divergence. For practical purposes we propose guidelines for incongruence detector design, given a training set of class probability estimates. The design procedure is illustrated on a problem of detecting incongruence of noncontextual and contextual classifiers developed to recognise action and activity in Breakfast dataset videos.
In summary, the contributions of the paper include:
• An error sensitivity analysis of Delta divergence utilising marginalisation of the test statistics over different scenarios
• Estimation of the statistical distribution of Delta divergence as a basis for classifier incongruence threshold selection
• Guidelines for classifier incongruence threshold selection in practical anomaly detection systems
The paper is structured as follows. The background and related work are the subjects of Section 2. In Section 3, Delta divergence is introduced as a novel classifier incongruence measure and its properties are related to the Bayesian surprise measure which is used as a baseline both theoretically and experimentally. The statistical properties of the proposed measure are investigated in Section 3.1. In Section 4, a discussion on how to determine the classifier incongruence threshold is carried out via experimental analysis on synthetic and real data. Finally, in Section 5, the main results of this study are summarised and the paper is drawn to conclusion.

Related work
The idea of using classifier incongruence for anomaly detection has been advocated by Weinshall et al. in [25]. As in [25], we consider just two decision making experts, classifying the data into one of $m$ possible categories. Let $\tilde{P}(\omega_j|x)$ and $P(\omega_j|x)$, $j = 1, \ldots, m$, denote the a posteriori probabilities associated with the hypothesis that model $\omega_j$ explains the input data, $x$, which have been estimated by the two experts. If the two distributions are identical or similar, then the classifier outputs would be considered congruent. For measuring incongruence, Weinshall et al. [25] advocated the adoption of Itti's Bayesian surprise measure [22] originally proposed for detecting content changes in video. In particular, by considering the a posteriori class probability distribution output by one of the experts as a reference, one can detect incongruence by calculating
$$\sum_{j=1}^{m} \tilde{P}(\omega_j|x) \log \frac{\tilde{P}(\omega_j|x)}{P(\omega_j|x)} \qquad (1)$$
which is basically the Kullback-Leibler divergence between the two distributions.
The Kullback-Leibler divergence primarily measures the similarity between the two probability distributions through an inverse relationship. If the distributions are identical, or similar, the measure will tend to zero. A high value of the measure would indicate differences in the a posteriori probabilities, and therefore high incongruence between the classifier outputs. There are other information theory divergences that could be used for the same purpose [26,27].
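The behaviour described above can be illustrated with a minimal Python sketch. It assumes the Kullback-Leibler form with the first argument acting as the reference distribution; the function name `kl_divergence` and the three-class probability vectors are hypothetical and chosen purely for illustration.

```python
import math


def kl_divergence(p_ref, p_other):
    """Return sum_j p_ref[j] * log(p_ref[j] / p_other[j]) in nats."""
    total = 0.0
    for a, b in zip(p_ref, p_other):
        if a == 0.0:
            continue  # a class with zero reference mass contributes nothing
        if b == 0.0:
            return math.inf  # reference supports a class the other expert rules out
        total += a * math.log(a / b)
    return total


# Hypothetical three-class outputs of the two experts.
p_tilde = [0.7, 0.2, 0.1]
p = [0.6, 0.3, 0.1]

print(kl_divergence(p_tilde, p_tilde))           # identical outputs give zero
print(kl_divergence(p_tilde, p))                 # similar outputs give a small value
print(kl_divergence(p, p_tilde))                 # swapping the reference changes the value
print(kl_divergence(p_tilde, [0.9, 0.1, 0.0]))   # unbounded when a class is ruled out
```

The last two calls hint at why a bounded, symmetric alternative is attractive: the value depends on which expert is taken as the reference, and a single vanishing probability makes it unbounded.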
Alternatively, one could adapt any statistical measure of similarity between two distributions and use it as a test statistic for detecting classifier incongruence. More specifically, mapping the classes onto consecutive numbers (bins) will create two discrete probability distribution functions, resembling normalised histograms, which sum up to unity. This analogy suggests that well-known criteria, namely histogram similarity measures, mainly used for calculating the goodness-of-fit between an empirical and a reference distribution, could be adapted for the purpose of measuring classifier incongruence, although there are no reported attempts in the literature to adopt them for this purpose. A comprehensive analysis of the tests that can be used for measuring the similarity between two histograms can be found in [28]. Examples are Chi-square, Kolmogorov-Smirnov [29], Cramér-von-Mises [30,31], and Anderson-Darling [32] tests; Geometric test using Bhattacharyya distance, and likelihood-ratio and likelihood-value tests. We plan to investigate the applicability of these histogram matching methods to the problem of incongruence detection in the future, but here we are focusing on the established state of the art methodology of incongruence detection constituted by the Bayesian surprise measure.
It should be noted that the term measures of surprise in Bayesian analysis also refers to test statistics developed for outlier detection. This confusing terminology relates to the classical notion of anomaly detection where instead of measuring the similarity between two probability distributions, the aim is to compare a single observation with the hypothesised distribution model [33][34][35][36][37][38][39]. Recently in [40], some state-of-the-art measures of surprise in Bayesian analysis have been thoroughly analysed and modifications have been proposed. However, these techniques are not relevant to the topic addressed in this paper.
Accordingly, Itti's Bayesian surprise [22] and its decision cognizant variant DC-KL [23] are the key existing techniques for assessing classifier incongruence in the literature. Thus, we shall adopt them as a reference for our deliberation. The issues with the Bayesian surprise measure can be listed as follows:
1. It goes to infinity for any hypothesis $\omega$ for which $P(\omega|x) \to 0$ while $\tilde{P}(\omega|x) \neq 0$. This can occur even for insignificant hypotheses and result in producing false alarms of incongruence.
2. The measure is not symmetric, in the sense that if we use the distribution of $P(\omega|x)$ as a reference instead of $\tilde{P}(\omega|x)$, we will get a different value of the divergence.
3. The divergence function may produce the same value for completely different scenarios and may diverge to infinity. Hence, it is difficult to assess which values imply congruence / incongruence, and define a suitable threshold.
4. The measure is classifier decision agnostic. In other words, all hypotheses (classes) are involved in the calculation of the surprise.
5. By virtue of Property 4, it is also strongly affected by estimation errors on probabilities $P(\omega|x)$ and $\tilde{P}(\omega|x)$.
In contrast, DC-KL is decision cognizant, that is the measure ignores all the terms associated with the classes that are not selected by the decision rule. The main argument for ignoring the contribution of the classes with non maximum posterior is that they contribute a lot of irrelevant jitter to the value of the similarity measure. This contamination is proportional to the number of hypotheses. In other words, in multi-hypothesis problems, this background jitter potentially can bury the useful information, i.e. the probability differences for the classes selected by the decision rule. The elimination of this clutter impacts favourably also on Property 5. However, both KL and DC-KL share Properties 1-3 which limit their ability to distinguish between classifier congruence and incongruence robustly. Let us illustrate the limitation on the real data application discussed in Section 4, which is concerned with action and activity recognition in videos.
Breakfast dataset [41] is used for performing action and activity recognition from breakfast scenario videos, and is comprised of 10 activities and 52 action classes. In our approach, the action in each segment of a video is interpreted by a noncontextual and a contextual classifier, the latter taking into account the complete sequence of actions to identify the breakfast scenario activity captured by the video. As an example, for the video segment represented by the key frame shown in Fig. 1
To avoid the problems associated with KL and DC-KL, we have previously proposed alternatives, which not only focus on the dominant hypotheses flagged by the two experts [20,42], but have the additional advantage over [23] that their values are confined to a finite range of [0, 1]. Although the methods in [20,42] have attractive properties, their main disadvantage is that they are heuristic. Overcoming this shortcoming, in a recent paper [24] we have proposed a novel divergence, called Delta divergence ($D$), which exhibits all the desirable properties of a test statistic ideally suited for detecting classifier incongruence. Moreover, it is a proper information theoretic divergence, with all the advantages of a measure underpinned by information theory. Note that in [23], a detailed theoretical and experimental analysis demonstrates the superiority of Delta divergence over KL divergence.
The rest of this paper focuses on Delta divergence. The aim is to verify that the attractive properties of Delta divergence are robust to estimation errors on the class probabilities output by the two classifiers. We investigate the sensitivity of $D$ both analytically and experimentally. Moreover, we show how the empirical distribution of this novel incongruence measure could provide a basis for selecting an appropriate classifier incongruence detection threshold at a given level of statistical significance. Note that in practice, the only observable information is the classifier outputs, which are already subject to estimation errors. For such scenarios, we propose practical incongruence detection guidelines and illustrate their use on a real data application concerned with action and activity recognition in breakfast scenario videos.

Statistical properties of D
Delta divergence, proposed in [24], has been developed from f-divergence [27], known as variation distance, by merging all the non-dominant class hypotheses into a single set. This preserves the nature of the measure as a proper divergence of differences between two probability distributions, but has the beneficial effect of reducing the "clutter" injected by the terms associated with the non-dominant hypotheses. The positive impact of this clutter reducing modification grows with the number of classes. Let us denote the dominant hypotheses identified by the two classifiers by $\tilde{\mu} = \arg\max_\omega \tilde{P}(\omega|x)$ and $\mu = \arg\max_\omega P(\omega|x)$. Also, for the sake of notational simplicity, in the following we shall drop explicit references to the specific observation $x$ and denote the a posteriori class probabilities $P(\omega|x)$ simply as $P_\omega$, and $\tilde{P}(\omega|x)$ simply as $\tilde{P}_\omega$. Delta divergence is defined as
$$D = \begin{cases} |P_\mu - \tilde{P}_\mu| & \mu = \tilde{\mu} \\ \frac{1}{2}\left[ |P_\mu - \tilde{P}_\mu| + |P_{\tilde{\mu}} - \tilde{P}_{\tilde{\mu}}| + |P_\mu + P_{\tilde{\mu}} - \tilde{P}_\mu - \tilde{P}_{\tilde{\mu}}| \right] & \mu \neq \tilde{\mu} \end{cases} \qquad (2)$$
The focus of Delta divergence ($D$) given in (2) is solely on differences between a posteriori probabilities of dominant classes (most probable classes identified by the two classifiers). When the two classifiers agree on the identity of the dominant hypothesis, Delta divergence measures only the difference between the corresponding a posteriori class probabilities. When they disagree, consider for each classifier the difference between the probability it assigns to its own dominant class and the probability the other classifier assigns to that class, i.e. $P_\mu - \tilde{P}_\mu$ and $\tilde{P}_{\tilde{\mu}} - P_{\tilde{\mu}}$. When the signs of these differences differ, Delta divergence equals the sum of the absolute values of the respective differences. When the labels disagree, and both of the differences of the a posteriori class probabilities are positive, it picks the maximum of the absolute values of these differences.
Apart from clutter reduction, $D$ has a number of other attractive properties. It is independent of the actual values of a posteriori class probabilities, and therefore of their surprisal content. In other words, classifier incongruence measurement is not modulated by the likelihood of the dominant hypotheses. The measure is bounded and symmetric. In Section 4 we show that the robustness to clutter also reduces the sensitivity of Delta divergence to a posteriori class probability estimation errors. All these characteristics jointly make Delta divergence ideal for gauging classifier incongruence.
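A minimal Python sketch of the measure may help to make the three regimes concrete. It assumes that Delta divergence is the variation distance evaluated over the three sets {μ}, {μ̃} and the merged remaining classes, which reduces to a single absolute difference when the labels agree; the function name `delta_divergence` and the example probability vectors are hypothetical.

```python
def delta_divergence(p, p_tilde):
    """Variation distance over {mu}, {mu_tilde} and the merged remaining classes."""
    mu = max(range(len(p)), key=p.__getitem__)                  # dominant class of P
    mu_t = max(range(len(p_tilde)), key=p_tilde.__getitem__)    # dominant class of P~
    d_mu = p[mu] - p_tilde[mu]
    if mu == mu_t:
        return abs(d_mu)  # label agreement: one absolute difference
    d_mu_t = p[mu_t] - p_tilde[mu_t]
    # The third term is the difference in mass of the merged non-dominant set.
    return 0.5 * (abs(d_mu) + abs(d_mu_t) + abs(d_mu + d_mu_t))


# Label agreement: only the dominant-class difference counts.
print(delta_divergence([0.7, 0.2, 0.1], [0.5, 0.3, 0.2]))
# Label disagreement, each classifier more confident in its own choice: maximum.
print(delta_divergence([0.6, 0.3, 0.1], [0.2, 0.7, 0.1]))
# Label disagreement with differing signs: sum of the absolute differences.
print(delta_divergence([0.5, 0.45, 0.05], [0.3, 0.4, 0.3]))
# Extreme disagreement reaches the upper bound of one.
print(delta_divergence([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))
```

Swapping the two arguments leaves the value unchanged, and no input can push it outside [0, 1], in line with the boundedness and symmetry noted above.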
$D$ takes values from the interval [0, 1]. In order to provide insight into the frequency of occurrence of its values, we sample the space of different combinations of class probability distribution outputs ($P$ and $\tilde{P}$) uniformly, and make a note of the resulting incongruence measure values after they enter the calculation defined in (2). We then identify the scenarios in which classifiers agree on the most probable hypothesis, or disagree (cases of label agreement and disagreement) separately, and create histograms, on which averaging over bins and normalization are applied to end up with probability distributions. The graphs given in Fig. 3 are estimated using a total number of $10^6$ of such probability distribution pairs for problems involving a number of classes equal to 3 and 6. Fig. 3(a) shows the probability density functions of $D$ for the cases of label agreement, Fig. 3(b) shows distributions for the cases of label disagreement, and Fig. 3(c) depicts the aggregate distributions for all cases (combination of label agreement and disagreement). In each subfigure, 3 class problems are indicated by dashed lines, whereas 6 class problems are indicated by the solid curves. Note that the indicated values of $m$ are selected for illustration purposes and the trend for other values follows in accordance with the following analysis.
The effect of the number of classes, $m$, on the incongruence measure distribution can be observed by comparing the solid and dashed lines in Fig. 3. As $m$ increases from 3 to 6, high values of incongruence become more likely for the label agreement case. This can also be deduced from (2); the upper limit for incongruence can be shown to equal $1 - 1/m$. Note that as $m$ goes to infinity, this value becomes equal to 1. A related observation for this case is the decrease in the likelihood of observing incongruence values close to zero when $m$ increases. In the case of label disagreement, contrary to the findings for agreement, the realizations of lower $D$ values are more probable for $m = 6$ than $m = 3$. Accordingly, for high $D$ values, the probability densities are lower for $m = 6$ compared to $m = 3$. Note that the upper limit of the disagreement case is equal to 1 for all $m$, as a result of the second condition in (2). Combining the two sets of observations for the label agreement and disagreement cases, it can be concluded that their corresponding distributions get shifted towards each other as $m$ increases. This means that the bigger $m$ is, the more difficult it becomes to tell if an obtained/measured incongruence value emerges from a scenario of agreement in the most probable hypothesis, or disagreement. On the other hand, for smaller $m$, the overall incongruence distribution has a higher variation within the range [0, 1]. The effect of the incongruence distribution on hypothesis thresholding is going to be further discussed in Section 4.3.
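The sampling experiment can be sketched in a few lines of Python. This is only an illustration under stated assumptions: "uniform" sampling is taken to mean uniform on the probability simplex (obtained here by normalising exponential draws), Delta divergence is taken in its variation-distance form over {μ}, {μ̃} and the merged remaining classes, and the helper names, sample size, bin count and seed are all hypothetical choices that are far smaller than the $10^6$ pairs used for Fig. 3.

```python
import random


def sample_simplex(m, rng):
    """Draw one class-probability vector uniformly from the m-class simplex."""
    draws = [rng.expovariate(1.0) for _ in range(m)]
    total = sum(draws)
    return [d / total for d in draws]


def dominant(p):
    """Index of the most probable class."""
    return max(range(len(p)), key=p.__getitem__)


def delta_divergence(p, p_tilde):
    """Variation distance over {mu}, {mu_tilde} and the merged remaining classes."""
    mu, mu_t = dominant(p), dominant(p_tilde)
    d_mu = p[mu] - p_tilde[mu]
    if mu == mu_t:
        return abs(d_mu)
    d_mu_t = p[mu_t] - p_tilde[mu_t]
    return 0.5 * (abs(d_mu) + abs(d_mu_t) + abs(d_mu + d_mu_t))


def simulate(m, n_pairs, seed=0):
    """Return the D values for label agreement and for label disagreement."""
    rng = random.Random(seed)
    agree, disagree = [], []
    for _ in range(n_pairs):
        p, p_tilde = sample_simplex(m, rng), sample_simplex(m, rng)
        d = delta_divergence(p, p_tilde)
        (agree if dominant(p) == dominant(p_tilde) else disagree).append(d)
    return agree, disagree


def density(values, bins=10):
    """Normalised histogram on [0, 1]; the bin heights integrate to one."""
    counts = [0] * bins
    for v in values:
        counts[min(int(v * bins), bins - 1)] += 1
    return [c * bins / len(values) for c in counts]


for m in (3, 6):
    agree, disagree = simulate(m, 20000)
    print("m =", m, "max agree:", round(max(agree), 3), "max disagree:", round(max(disagree), 3))
    print("  agreement   ", [round(h, 2) for h in density(agree)])
    print("  disagreement", [round(h, 2) for h in density(disagree)])
```

Printing the two histograms side by side for several values of $m$ gives a rough, low-resolution counterpart of the curves discussed above; the agreement values never exceed $1 - 1/m$ by construction.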

Error sensitivity
In reality, the a posteriori probabilities for the various hypotheses will be estimated by the two classifiers subject to estimation errors. The aim of the error sensitivity study is for the reader to get a feel for the effect of these estimation errors on the properties of Delta divergence. The intention is not to provide a comprehensive theoretical analysis, but instead to consider a few simple cases where analysis is possible, to gain an intuitive idea of the impact of estimation errors. The subsequent simulation studies explore the scenario landscape more thoroughly, but it should be noted that even here the aim of the study is more educational than to present definitive findings. The justification for this is that in practice we will not have access to ground truth class probabilities, nor to the estimation errors, and a more practical methodology will be required to design a class incongruence detector. Such a design methodology will be presented in Section 4.6 and its application illustrated in Section 4.6.1.
Let us denote the estimates of $P(\omega|x)$ and $\tilde{P}(\omega|x)$ by $P(\omega|x) + \eta_\omega(x)$ and $\tilde{P}(\omega|x) + \tilde{\eta}_\omega(x)$ respectively, where $\eta_\omega(x)$ and $\tilde{\eta}_\omega(x)$ are the estimation errors. We refer to the probability density functions of these errors as $q(\eta)$ and $\tilde{q}(\eta)$ accordingly. For the sake of simplicity, we shall assume that $q(\eta)$ and $\tilde{q}(\eta)$ are normal distributions with zero mean and standard deviation $\sigma$. However, it should be emphasized that estimation errors have to satisfy the conditions
$$\sum_{\omega} \eta_\omega(x) = 0 \qquad (3)$$
$$-P(\omega|x) \le \eta_\omega(x) \le 1 - P(\omega|x) \qquad (4)$$
and likewise for $\tilde{\eta}_\omega(x)$. Thus, as probabilities have to be nonnegative as well as not exceeding unity, the normality assumption for $q(\eta)$ has to break down for a posteriori probabilities close to zero or one. In order to satisfy these constraints, we shall simply assume that the tail of the Gaussian, constrained by any of the conditions, is clipped, and the remaining part of the distribution is normalized so that the area under the curve equals 1. Dropping again explicit references to observation $x$, for a noise-free posterior $P$, the resulting error distribution, $p(\eta, P)$, becomes
$$p(\eta, P) = \begin{cases} \dfrac{q(\eta)}{\int_{-P}^{1-P} q(\eta')\, d\eta'} & -P \le \eta \le 1 - P \\ 0 & \text{otherwise} \end{cases} \qquad (5)$$
An example is shown in Fig. 4 for $P = 0.1$ and $q(\eta) = N(0, 0.15)$. In Fig. 4(a), the thin solid line depicts $q(\eta)$. The thick solid line illustrates $p(\eta, P)$, obtained by clipping the tail of $q$ at the cut off point, $-P = -0.1$, as indicated by the dashed line, followed by normalization. On the other hand, in Fig. 4(b), the thick solid line illustrates the probability density function $r(s)$ of the estimate $s = P + \eta$. It should be remembered that $r$ is obtained as a convolution of the distributions of $P$ and $\eta$, such that, for a fixed noise-free $P$,
$$r(s) = \int \delta(t - P)\, p(s - t, P)\, dt = p(s - P, P) \qquad (6)$$
Finally, the thin line in Fig. 4(b) is provided for convenience and depicts what $r(s)$ would look like if the condition (4) did not exist.
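The clipping and renormalisation step can be sketched in Python as follows. The sketch assumes that the 0.15 in the Fig. 4 example denotes the standard deviation of $q$, and that clipping simply restricts the Gaussian to the interval on which $P + \eta$ remains a valid probability; the function names are hypothetical, and rejection sampling is just one convenient way of drawing from the clipped density.

```python
import math
import random


def normal_cdf(x, sigma):
    """Cumulative distribution function of a zero-mean Gaussian."""
    return 0.5 * (1.0 + math.erf(x / (sigma * math.sqrt(2.0))))


def clipped_error_pdf(eta, p, sigma):
    """Density of the error for a noise-free posterior p: Gaussian clipped to [-p, 1 - p]."""
    if eta < -p or eta > 1.0 - p:
        return 0.0
    mass = normal_cdf(1.0 - p, sigma) - normal_cdf(-p, sigma)  # area kept after clipping
    q = math.exp(-0.5 * (eta / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))
    return q / mass


def sample_clipped_error(p, sigma, rng):
    """Rejection sampling: redraw until p + eta is a valid probability."""
    while True:
        eta = rng.gauss(0.0, sigma)
        if -p <= eta <= 1.0 - p:
            return eta


rng = random.Random(0)
errors = [sample_clipped_error(0.1, 0.15, rng) for _ in range(20000)]
print("mean error:", round(sum(errors) / len(errors), 4))
print("density at eta = 0:", round(clipped_error_pdf(0.0, 0.1, 0.15), 4))
```

Because only the left tail is cut for a posterior close to zero, the sampled errors have a positive mean, so the estimate $s = P + \eta$ is pushed away from the boundary on average.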
The estimation errors corrupting class a posteriori probabilities will cause estimation errors on the computed incongruence values. It is evident that for incongruence measures involving summation over all the classes these probability estimation errors will create a high background noise level which will make it difficult to measure incongruence (surprise) reliably. Hence, the proposed incongruence measure in (2), which involves summation over at most two classes (when $\mu \neq \tilde{\mu}$), should be considerably more robust to noise. Let us now investigate the statistical properties of $D$. With the contamination by estimation errors, the incongruence measure can be expressed as
$$D = \begin{cases} |P_\mu - \tilde{P}_\mu + \eta_\mu - \tilde{\eta}_\mu| & \mu = \tilde{\mu} \\ \frac{1}{2}\left[ |P_\mu - \tilde{P}_\mu + \eta_\mu - \tilde{\eta}_\mu| + |P_{\tilde{\mu}} - \tilde{P}_{\tilde{\mu}} + \eta_{\tilde{\mu}} - \tilde{\eta}_{\tilde{\mu}}| + |P_\mu + P_{\tilde{\mu}} - \tilde{P}_\mu - \tilde{P}_{\tilde{\mu}} + \eta_\mu + \eta_{\tilde{\mu}} - \tilde{\eta}_\mu - \tilde{\eta}_{\tilde{\mu}}| \right] & \mu \neq \tilde{\mu} \end{cases} \qquad (7)$$
In the two class case, referring to (3), the estimation errors are not independent. However, as we consider problems involving several classes, we make the simplifying assumption that the probability estimation errors are statistically independent. The useful signal in each term defined by absolute value operators in (7), which is constituted by the difference of a posteriori class probabilities, is corrupted by the difference of the two probability estimation errors. As we assume that the errors are independent, the probability distribution $\tau(\nu)$ of their difference $\nu = \eta_\mu - \tilde{\eta}_\mu$ can be given by a convolution of the two component distributions, i.e.
$$\tau(\nu) = \int p(\eta, P_\mu)\, \tilde{p}(\eta - \nu, \tilde{P}_\mu)\, d\eta \qquad (8)$$
where $\tilde{p}$ denotes the clipped error distribution of the second classifier, defined analogously to $p$ in (5), without loss of generality (w.l.o.g.) for all pairs of error terms. It would be difficult to perform an exhaustive bias and variance analysis of (7). However, to get a feel for the effect of the estimation errors, we shall consider a few special cases.
If the a posteriori probability of the most probable class for any expert is close to the cut-off points, then the corresponding estimation error distribution will result in tail clipping as given in (5) to satisfy (4). Any clipping affecting individual components would then show its effect on the joint error distribution defined by (8). Therefore, while computing the expected value of the incongruence measure given in (7), the absolute value operation in expectations would create additional bias of the estimated value as a result of clipping.
In order to keep the analysis simple, in the following few cases we will assume that no tail clipping of the error distributions occurs. In order to obtain closed forms, a further assumption that the identities of the most probable hypotheses do not change has been made. Note that these constraints are not invoked in the comprehensive experimental study given in Section 4.

Case 1: Both classifiers produce identical probability outputs for the most probable hypothesis
In this case, we assume that the expert probability outputs are identical for the most probable hypothesis before the addition of estimation noise. Hence,
$$D = |\eta_\mu - \tilde{\eta}_\mu| \qquad (9)$$
As no tail clipping occurs, the difference of errors will also be distributed normally with zero mean, but with variance $2\sigma^2$. The absolute value operation will result in $D$ having a half normal distribution with mean
$$E\{D\} = \frac{2\sigma}{\sqrt{\pi}} \qquad (10)$$
The implication of the result is that even when there is 100% congruence between the classifiers, the incongruence measure will on average be nonzero, with the bias defined by the variance of the a posteriori probability estimation errors. The variance of the incongruence measure in this ideal case will be given by the variance of the half normal distribution, i.e.
$$\mathrm{Var}\{D\} = 2\sigma^2 \left( 1 - \frac{2}{\pi} \right) \qquad (11)$$
Thus, the standard deviation $\sigma_D$ of errors on $D$ in this scenario is
$$\sigma_D = \sigma \sqrt{2 - \frac{4}{\pi}} \qquad (12)$$
The results in (10) and (12) have bearing on the selection of a threshold on the incongruence measure to detect unusual events.
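The Case 1 statistics can be checked numerically with a short Python simulation. It relies on the standard half-normal results for the absolute value of a zero-mean Gaussian with variance $2\sigma^2$, namely a mean of $2\sigma/\sqrt{\pi}$ and a standard deviation of $\sigma\sqrt{2 - 4/\pi}$; the value $\sigma = 0.05$, the sample size and the seed are arbitrary illustrative choices, and clipping is ignored as in the analysis.

```python
import math
import random
import statistics

sigma = 0.05      # assumed standard deviation of each estimation error
n = 50000         # number of simulated congruent classifier pairs
rng = random.Random(0)

# With identical noise-free outputs, D reduces to the absolute error difference.
d_values = [abs(rng.gauss(0.0, sigma) - rng.gauss(0.0, sigma)) for _ in range(n)]

mean_theory = 2.0 * sigma / math.sqrt(math.pi)
std_theory = sigma * math.sqrt(2.0 - 4.0 / math.pi)

print("mean: simulated", round(statistics.fmean(d_values), 5), "theory", round(mean_theory, 5))
print("std:  simulated", round(statistics.pstdev(d_values), 5), "theory", round(std_theory, 5))
```

Even though the two classifiers are perfectly congruent in this setting, the simulated mean of $D$ is clearly above zero, which is the bias a detection threshold has to allow for.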

Case 2: Both classifiers agree on the most probable hypothesis
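In this scenario the incongruence measure is
$$D = |P_\mu - \tilde{P}_\mu + \eta_\mu - \tilde{\eta}_\mu| \qquad (13)$$
Let us denote $\Delta P = P_\mu - \tilde{P}_\mu$, so that the true value of Delta divergence is $a = |P_\mu - \tilde{P}_\mu| = |\Delta P|$ and the argument of the absolute value operation in (13) is $\Delta P + \nu$. Assuming none of the component estimation noise values violates the axiomatic properties of probabilities, the compound noise distribution $\tau(\nu)$ in (8) is symmetric. $\Delta P$ can be either positive or negative. However, due to the symmetry induced by the absolute value operation, we need to consider only the case when $\Delta P$ is positive, i.e. $\Delta P = a$, as the result for negative $\Delta P$ will be exactly the same. In this scenario the argument $a + \nu$ can be either positive or negative. Let $\gamma = \int_a^\infty \tau(\nu)\, d\nu$, which by symmetry is also the probability that $\nu < -a$. In the first case, which will occur with probability $1 - \gamma$, the true value $a$ will be corrupted by a noise term with the distribution of a clipped Gaussian, rescaled by factor $1/(1 - \gamma)$, and the contribution to the expected value will be
$$c_1 = \frac{1}{1 - \gamma} \int_{-a}^{\infty} (a + \nu)\, \tau(\nu)\, d\nu$$
In the second case, which will occur with probability $\gamma$, the sign of the argument is inverted by the absolute value operation and the contribution will be
$$c_2 = -\frac{1}{\gamma} \int_{-\infty}^{-a} (a + \nu)\, \tau(\nu)\, d\nu$$
The expected value will be given by the weighted sum of these two contributions, namely
$$E\{D\} = (1 - \gamma)\, c_1 + \gamma\, c_2 \qquad (14)$$
This can be alternatively expressed as
$$E\{D\} = \int_{-a}^{\infty} (a + \nu)\, \tau(\nu)\, d\nu - \int_{-\infty}^{-a} (a + \nu)\, \tau(\nu)\, d\nu \qquad (15)$$
which after rearrangement, using the symmetry of $\tau(\nu)$, becomes
$$E\{D\} = 2 \int_{a}^{\infty} \nu\, \tau(\nu)\, d\nu + a (1 - 2\gamma) \qquad (16)$$
Noting that $\int_a^\infty \nu\, \tau(\nu)\, d\nu \ge a\gamma$, we find that the expected value in (16) will be positively biased, i.e.
$$E\{D\} \ge a \qquad (17)$$
For a given $\sigma$, this bias will diminish with increasing $a \le 0.5$ and $\gamma \to 0$ as well. When $a = 0$ the bias will be equivalent to (10) of Case 1.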
The positive bias of Delta divergence will suggest that the classifiers are less congruent than in reality. As $a$ increases, the clipping will monotonically decrease, reducing the positive bias. For large enough differences in the support for the dominant hypothesis (larger $a$) provided by the two classifiers, the expected value of the incongruence measure will become unbiased, as the contribution of the first term of the expression in (16) will go to zero. This is because there will be no clipping caused by the absolute value operation at the boundary of 0, and the distribution of error differences $\tau(\nu)$ will remain Gaussian. At the same time the factor $\gamma$ will also approach zero. In general, however, estimation error will be introducing a positive bias and the measured incongruence will appear to be stronger than its true underlying value (noise-free case).
When the distributions of estimation noise on the probabilities of the dominant hypothesis cease to be Gaussian due to the boundary constraint effects, the compound estimation noise distribution becomes complicated, rendering Case 2 intractable. In any case, the argument of the absolute value operation will be distributed according to $\tau(\nu)$ in (8). The inversion of the negative values of $\nu$ by the absolute value operation is likely to render the estimated magnitude of Delta divergence once again positively biased.

Case 3: Classifiers disagree on the most probable hypothesis
In this case, as the classifiers disagree on the most probable hypothesis, there is likely to be a gap between the a posteriori probabilities determined by the classifiers for class $\mu$ and $\tilde{\mu}$. Let us focus on the scenario where the signs of the probability distributions are positive. Under the assumption that the differences in the estimated a posteriori probabilities of the dominant hypotheses avoid clipping, the form of $\tau(\nu)$ will remain Gaussian for all error terms and the expected value of the incongruence measure will be with the bias $b$ dependent on the relationship between the arguments of the max operator and the estimation noise distributions, as discussed in Case 2. The limiting case of Case 3 is when for one classifier the maximum a posteriori probability is equal to one while for the other it is zero, and vice versa. Then the estimation error distributions are subject to severe clipping. Note that in this case the estimation noise will tend to reduce the underlying difference between the a posteriori class probabilities and consequently, the expected value of $\nu$ will be negatively biased by an offset equal to the mean in (10). Note that the effect of estimation noise will be studied experimentally in Section 4.

Incongruence measure thresholding
To flag incongruence between two classifiers, a suitable threshold must be selected for the incongruence measure. When there is complete agreement between the classifiers (i.e. Case 1), the threshold for the half normal error distribution, $|\eta_\mu - \tilde{\eta}_\mu|$, should be set, say, 3 standard deviations from the mean of the (unclipped) normal distribution. Recalling that the variance of the normal distribution of the compound noise is $2\sigma^2$, it follows that the threshold is $3\sqrt{2}\sigma \approx 4.24\sigma$. In practice the estimated a posteriori probabilities will be different.
For instance, a contextual classifier is likely to have a sharper distribution of probabilities over the various hypotheses than a non-contextual classifier. For a difference in a posteriori probabilities which would result in no error distribution folding and for absolute value operator that would cause no bias, i.e. $|P_\mu - \tilde{P}_\mu| = 3\sqrt{2}\sigma$, the threshold should be set at

Experimental sensitivity analysis
The theoretical analyses presented in Sections 3.1-3.2 provide some insight into the incongruence measure distribution and hypothesis testing in the presence of noise. However, the basis they provide for selecting the test statistic threshold is incomplete for several reasons:
• In general, it is not possible to obtain closed forms.
• Each solution is for a specific scenario defined by the class probability distribution, the corresponding noise-free incongruence measure value, the level of noise, and its distribution, which changes dynamically as a function of the class probabilities for the dominant hypotheses.
The aim of the simulation studies designed and reported in this section is to obtain a more comprehensive picture of the properties of the test statistic and to develop a practical basis for setting an appropriate incongruence measure threshold. This will be achieved by
• conducting empirical studies of the effect of class probability estimation noise on the distributions of the proposed test statistic, which is parameterised by fixed noise-free incongruence measure values and the number of classes involved in decision making,
• exploring the variations of the test statistic distribution as a function of different scenarios giving rise to the same noise-free incongruence measure value,
• integrating the test statistic distribution over different scenarios, and
• integrating the test statistic distributions over a range of noise-free incongruence values deemed to reflect the state of the two classifiers being congruent.
The successive integrations will yield a resulting test statistic distribution which can be presented in terms of the area under its tail, facilitating the selection of a threshold that would meet a specified level of confidence in the acceptance of the hypothesis of classifier congruence based on the proposed measure.
In Section 4.1, we first consider an example scenario where the two classifiers estimate identical posterior distributions for the most probable hypothesis and there is no noise tail clipping. The results of this section are expected to confirm the theoretical findings analysed in Case 1 given in Section 3.1. In Section 4.2 we consider more general scenarios, parameterised by noise-free incongruence measure values and estimation noise statistics. Further experimental studies regarding hypothesis thresholding are carried out in Sections 4.3-4.5. Finally, in Section 4.6, the practical implications and guidelines for incongruence detection are provided. This section also includes an example real data application for utilising the provided guidelines.
It should be mentioned that in all experiments, each of the probability distributions employed ($P$ and $\tilde{P}$) has been created by uniform sampling. Specifically, for a given $m$ class problem and an instance $x$, the a posteriori probability output belonging to class $\omega_n$, $P(\omega_n | x)$, is obtained by drawing a random sample from within the range $[0, 1 - \sum_{y<n} P(\omega_y | x)]$. Note that the upper limit is updated so that $\sum_{\omega} P(\omega | x) = 1$, and the probability belonging to the last class, $\omega_m$, is assigned the remaining probability mass rather than being sampled. After creating $10^3$ $P$ and $\tilde{P}$ distributions separately, the set of all possible combinations of ($P$, $\tilde{P}$) is used in the experiments. Hence, the total number of instances, $x$, is equal to $10^6$.
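The sampling procedure described above can be sketched as follows. This is a minimal illustration rather than the code used in the experiments: the function names and the seed are hypothetical, and it assumes that the last class simply receives the probability mass left over after the first $m - 1$ draws.

```python
import random

def sample_posterior(m, rng):
    """Draw one m-class posterior by sequential uniform sampling."""
    probs = []
    remaining = 1.0
    for _ in range(m - 1):
        # Sample within [0, 1 - sum of the probabilities drawn so far].
        p = rng.uniform(0.0, remaining)
        probs.append(p)
        remaining -= p
    # Assumption: the last class takes whatever mass is left, so the sum is 1.
    probs.append(remaining)
    return probs

def build_pairs(m, n_per_classifier, seed=0):
    """Yield all (P, P_tilde) combinations of two independently sampled sets."""
    rng = random.Random(seed)
    first = [sample_posterior(m, rng) for _ in range(n_per_classifier)]
    second = [sample_posterior(m, rng) for _ in range(n_per_classifier)]
    return ((p, q) for p in first for q in second)

if __name__ == "__main__":
    # 10**3 distributions per classifier would give 10**6 pairs;
    # a small number is used here to keep the demonstration quick.
    pairs = list(build_pairs(m=3, n_per_classifier=5))
    print(len(pairs), pairs[0])
```

Returning a generator avoids holding all $10^6$ pairs in memory when the full experiment size is used.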

Identical class probability outputs for the most probable hypothesis
For this simple and somewhat unrealistic case, we assume that the underlying posterior probabilities output by the two classifiers are identical for the most probable hypothesis (i.e. $D = 0$), and that the identity of the most probable hypothesis (label) does not change after the addition of the estimation noise (Case 1 of Section 3.1).
There is of course an infinite number of posterior class probability distributions which fit this specification. In this case study, we sample them subject to the constraint that the probability of the dominant hypothesis for any expert is sufficiently far away from the boundaries of their interval of support so as not to cause the estimation error distribution to have its tail clipped. The qualifying distributions, $P$ and $\tilde{P}$, are then corrupted by zero mean Gaussian noise, and finally, incongruence measure distributions are acquired from the corrupted distributions. The noisy incongruence measures obtained are denoted as $\tilde{D}$. The resulting distribution of $\tilde{D}$ given in Fig. 5, which is obtained for the standard deviation of the estimation noise $\sigma = 0.1$, supports the theoretical findings in Section 3.1. The curve has the form of a half normal distribution as discussed in Section 3.1, and the use of any value greater than $4.24\sigma = 0.424$ as a surprise threshold is shown to retain at least ∼99.7% of the distribution, as given in Section 3.2. Note that the number of classes, $m$, does not have an effect in this particular case, as the terms to do with $P$ and $\tilde{P}$ disappear from the calculation of surprise as shown in (9).
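A quick Monte Carlo check of this case can be written as below. The sketch assumes, as in Case 1, that the noisy measure reduces to the absolute difference of two independent zero mean Gaussian errors with standard deviation $\sigma$, with no clipping. Function names, sample size and seed are illustrative choices, not values from the paper.

```python
import math
import random

def case1_threshold(sigma):
    """Three standard deviations of the compound noise, i.e. 3*sqrt(2)*sigma."""
    return 3.0 * math.sqrt(2.0) * sigma

def simulate_case1(sigma, n, seed=0):
    """Samples of |eta - eta_tilde| with eta, eta_tilde ~ N(0, sigma**2)."""
    rng = random.Random(seed)
    return [abs(rng.gauss(0.0, sigma) - rng.gauss(0.0, sigma)) for _ in range(n)]

def fraction_below(values, threshold):
    """Share of the simulated measure values not exceeding the threshold."""
    return sum(v <= threshold for v in values) / len(values)

if __name__ == "__main__":
    sigma = 0.1
    samples = simulate_case1(sigma, 200_000)
    t = case1_threshold(sigma)
    print(f"threshold = {t:.3f}")
    print(f"retained fraction = {fraction_below(samples, t):.4f}")
    print(f"empirical mean = {sum(samples) / len(samples):.4f}")
```

With these assumptions the retained fraction should land close to the 99.7% figure quoted above, and the empirical mean close to the half normal mean $2\sigma/\sqrt{\pi}$.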

Distributions of ˜ D for arbitrary class posterior probability and estimation error distributions
In this set of experiments, we parameterise the scenarios by varying noise-free $D$, and study the impact of noise without applying restrictions on its characteristics such as tail clipping or label change.
Initially, for a given noise-free $D$, all possible pairs of the probability distributions $P$ and $\tilde{P}$ which output this value from (2) are recorded. The process of selecting the probability distribution pairs takes the cases involving agreement and disagreement in the most probable hypothesis into account separately. As a second step, noise drawn from the distribution $p(\eta)$, which is obtained by regularising $N(0, \sigma)$ as given in (5), is added to the selected $P$ and $\tilde{P}$ pairs. In these experiments, $\sigma$ is set to 0.10. The resulting distributions of noisy $\tilde{D}$ are acquired from the corrupted $P$ and $\tilde{P}$.
Using the histograms given in Fig. 3, a few representative (noise-free) $D$ values have been selected to perform the analysis. These values are 0.3 for the case of label agreement, and 0.3 and 0.7 for disagreement. The probability distribution functions of $\tilde{D}$ obtained for the label agreement case are given in Fig. 6 (a) and (b) for 3 and 6 class problems respectively. As for label disagreement, Fig. 7 presents the results for the fixed value of $D = 0.3$, and Fig. 8 for $D = 0.7$.
It can be observed for all scenarios of label agreement and disagreement that the peak of the noisy incongruence measure distribution appears at the value where the input noise-free measure is originally defined. However, the noise shows its effect throughout the [0,1] range, and the intensity of this effect depends not only on the values of $D$ and $\sigma$, but also on the number of classes, $m$. For greater $m$, the impact can be observed to be marginally smaller, and hence a narrower spread of the surprise within the range [0,1] is obtained.

Integration over scenarios
In this section, we concentrate on further experimental analysis regarding hypothesis testing, where the task is to find a threshold on our test statistic which would allow us to reject the hypothesis at a given level of significance.
The experimental analysis reported in Section 4.2 was based on a variety of incongruence measure probability distributions obtained for fixed input noise-free surprise values, sampled by our experimental procedure. However, as we will not know the characteristics of the underlying scenarios in practice, it is more appropriate to integrate over the various scenarios by taking their prior probability of occurrence into account. This integration can then be represented by a plot of the area-under-the-tail belonging to the $\tilde{D}$ distribution as a function of threshold. The rationale for this integration can be explained using a simple example. Looking at Fig. 8, it can be observed that a threshold of 0.5 can leave an important portion of some distribution curves out and cause false alarms during surprise detection. However, it may turn out that the cases with large lower tail areas for the given threshold may not be likely to occur with high probability, e.g. they might only happen when the estimation noise causes a label change. In other words, the contribution of these cases to the probability of false alarm might be expected to be low. Hence, in this set of experiments, by taking the likelihood of the distributions into consideration, the average sizes of the upper tail areas (% over the total area) are gauged for given threshold points. Note that the area estimates are parameterised by noise level. In Figs. 9 and 10, the resulting graphs illustrating the upper tail area (%) versus threshold are given for 3 and 6 class problems respectively. In each figure, the results are obtained for different fixed noise-free surprise values and they are depicted using different line types. The graphs in the top row are acquired using a noise distribution with standard deviation $\sigma = 0.05$, whereas those in the bottom row use $\sigma = 0.1$. The first column corresponds to the results obtained from the case of label agreement, and the second column applies to disagreement.
Confirming the experimental results presented in Section 4.2, a comparison of Fig. 9 (a) with Fig. 10 (a) shows that for any fixed surprise threshold, the upper tail area size is greater for 3 class problems ($m = 3$) compared to 6 classes ($m = 6$) in the label agreement case. This observation is valid for all values of $\sigma$ and noise-free $D$ values. For the case of label disagreement, let us analyse, for instance, the scenario in which noise-free $D = 0.5$ and noise $\sigma = 0.05$ by comparing Fig. 9 (b) and Fig. 10 (b). The observation that the spread of the surprise distribution within the [0,1] range is greater for $m = 3$ than for $m = 6$ (as previously shown in Section 4.2) is again reflected in the respective area-under-the-tail curves. For example, for $\tilde{D} = 0.6$, the upper tail area is just under 0.1 for $m = 3$, whereas it is almost zero for $m = 6$.
In Figs. 9 (a) and 10 (a), a threshold around 0.7 can be observed to cover more than 95% of the lower tail areas for the label agreement cases in all scenarios. This means that almost all scenarios which incorporate classifier agreement in the most probable hypothesis will be perceived as congruence. However, it should be borne in mind that a scenario where there is high discrepancy between the probability outputs of two classifiers, giving rise to a high noise-free incongruence value, e.g. one greater than 0.5, should not necessarily be labelled as congruence even though there is label agreement regarding the most probable hypotheses identified by these classifiers. Hence, depending on the choice of a noise-free $D$ cut-off for labelling congruence/incongruence, a more suitable threshold for $\tilde{D}$ should be selected. Let us say we are using the cut-off value of $D = 0.5$ such that all noise-free surprise values below 0.5 are to be detected as congruence, and above this value as incongruence. Utilizing $\sigma = 0.05$, it can be observed from Figs. 9 (a) and 10 (a) that the threshold of $\tilde{D} = 0.4$ labels all cases with $D = 0.6$ as incongruence, and $D = 0.1, 0.3$ as congruence with confidence around ∼95%.
Proceeding with the $D = 0.5$ cut-off value and $\sigma = 0.05$, and looking at Figs. 9 (b) and 10 (b) to analyse the case of label disagreement, it can be seen that employing a threshold of 0.4 (as in the case of label agreement) results in identifying the scenarios with noise-free $D = 0.5, 0.7$ as incongruence, and scenarios with $D = 0.2$ as congruence with ∼90% confidence.
Although the findings in this section give an important insight into the behaviour of $\tilde{D}$ for fixed $D$ in the cases of agreement and disagreement separately, it should be noted that in practice it is not possible to know in advance the values of the noise-free incongruence measures or the nature of the problem (giving rise to label agreement or disagreement). Hence, in Section 4.4, we will be marginalizing over these concepts after selecting a cut-off value for the noise-free measure to define the congruence-incongruence boundary.

Integration over noise-free congruence values
In Section 4.3 we have integrated over various scenarios, each defined by a fixed noise-free surprise value, and presented the findings for the cases of label agreement and disagreement separately. Here, we further integrate these area-under-the-tail distributions by aggregating over all noise-free surprise values below 0.5 for congruence, and above for incongruence. This process takes the prior distributions of noise-free values into account and marginalises over the scenarios of label agreement/disagreement to reflect the use of the proposed measure in practice. Hence, the thresholds suggested as a result of the experiments in this section will be different to those from Section 4.3.
The results of the experiments are provided in Fig. 11 for a 6 class problem and for $\sigma = 0.05$. Fig. 11 (a) indicates the confidence in the decision to accept the hypothesis that the two classifiers are congruent as a function of $\tilde{D}$. It can be observed that, for instance, a threshold of 0.5 on the proposed measure would capture the classifier congruence cases at ∼95% confidence. Setting the threshold to 0.6 would raise the confidence level to ∼100%. However, the plot in Fig. 11 (b) clearly indicates that we should not be too ambitious, as setting the threshold to yield high confidence levels for detecting classifier congruence will inevitably lead to an unacceptable level of false negatives, i.e. declaring incongruent classifier outputs as congruent. For example, at 0.4 threshold we will correctly detect ∼100% of classifier incongruence instances, but this figure goes down to ∼80% for the threshold set at 0.5.
Thus choosing a suitable classifier incongruence detection threshold is a question of trade-off between low false positives and low false negatives. In this context, it is important to bear in mind that in practical applications we will not normally be able to generate the area-under-the-tail curves for incongruence cases. The threshold selection will have to be based on such curves for classifier congruence cases only.
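The trade-off can be made concrete with a small sketch that tabulates, for a list of candidate thresholds, the confidence in accepting congruence and the share of incongruence cases detected. It assumes that two sets of noisy measure values are available, one for congruent and one for incongruent cases. The toy values in the demonstration are synthetic placeholders, not data from the experiments, and the function names are illustrative.

```python
def upper_tail_area(values, threshold):
    """Fraction of measure values strictly above the threshold."""
    return sum(v > threshold for v in values) / len(values)

def tradeoff_table(congruent, incongruent, thresholds):
    """Rows of (threshold, congruence confidence, incongruence detection rate)."""
    rows = []
    for t in thresholds:
        false_positive = upper_tail_area(congruent, t)  # congruent cases flagged
        detected = upper_tail_area(incongruent, t)      # incongruent cases flagged
        rows.append((t, 1.0 - false_positive, detected))
    return rows

if __name__ == "__main__":
    # Synthetic placeholder values, for illustration only.
    congruent = [0.05, 0.10, 0.20, 0.30, 0.45]
    incongruent = [0.42, 0.55, 0.70, 0.90]
    for t, confidence, detected in tradeoff_table(congruent, incongruent, [0.4, 0.5]):
        print(f"threshold={t:.2f}  confidence={confidence:.2f}  detected={detected:.2f}")
```

Raising the threshold increases the confidence column and lowers the detection column, which is the behaviour described for Fig. 11 (a) and (b).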

Relationship of Delta divergence with KL and DC-KL under noise
The relationship of noise-free Delta divergence to Bayesian surprise (KL) was already shown and discussed in [24] as the main motivation for the development of the novel decision cognizant divergence and its validation. It is pertinent to investigate whether the favourable properties of Delta divergence vis-a-vis KL and its decision cognizant variant DC-KL are preserved even when the a posteriori class probability estimates are subject to errors.
The results in Fig. 12 demonstrate that the ROC curves for the noise-free Bayesian surprise and DC-KL measures are quite remote from the top left corner (perfect separation) due to clutter, although DC-KL shows better performance than KL. In the presence of estimation noise, the areas under the ROC curves for $\tilde{D}_K$ and $\tilde{D}_D$ are much lower than that for $\tilde{D}$. Moreover, as anticipated, the areas for $\tilde{D}_K$ and $\tilde{D}_D$ are also smaller than those for $D_K$ and $D_D$. Note that with increasing levels of noise, the areas under the curves decrease, but the ranking of the measures is maintained. It is also interesting to mention that the area under the ROC curve for $\tilde{D}$ is larger than the areas related to the noise-free $D_K$ and $D_D$ as given in Fig. 12 for $\sigma = 0.05$, and this observation still holds for $\sigma = 0.1$.
In order to show the superiority of Delta divergence over KL and DC-KL for a varying number of classes, in Fig. 13 we show the results of an experiment where we set the confidence level for false positives to 0.05 and calculate the corresponding true positive rates of the given measures for 2, 3, 6, 8, 10 and 15 classes. The measurements are performed for $\sigma = 0.05$. It can be observed that the TPR for Delta divergence is better than that of DC-KL for all numbers of classes, and DC-KL outperforms KL except for 2 classes, where DC-KL becomes identical to KL as there is no "clutter" class in this scenario. Note that the plots remain approximately constant for numbers of classes higher than 10 (not shown in the figure).
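The operating point used in this comparison, the true positive rate at a fixed false positive level, can be computed from two sets of measure values as sketched below. The function name and the toy values are hypothetical, and the sketch assumes that larger measure values indicate incongruence. It is a simple quadratic-time illustration rather than an optimised ROC routine.

```python
def tpr_at_fpr(congruent_scores, incongruent_scores, max_fpr=0.05):
    """Best true positive rate among thresholds whose FPR does not exceed max_fpr.

    A case is flagged as incongruent when its score is strictly above the threshold.
    """
    candidates = sorted(set(congruent_scores) | set(incongruent_scores))
    best = 0.0
    for t in candidates:
        fpr = sum(s > t for s in congruent_scores) / len(congruent_scores)
        if fpr <= max_fpr:
            tpr = sum(s > t for s in incongruent_scores) / len(incongruent_scores)
            best = max(best, tpr)
    return best

if __name__ == "__main__":
    # Synthetic placeholder scores, for illustration only.
    congruent = [0.05 * i for i in range(10)]
    incongruent = [0.3, 0.5, 0.6, 0.8]
    print(tpr_at_fpr(congruent, incongruent, max_fpr=0.15))
```

Applying the same function to each measure (Delta divergence, KL, DC-KL) at the same false positive level gives directly comparable true positive rates.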

Practical implications
The theoretical analysis and the simulation studies presented in the paper are intended to provide an intuitive insight into the properties of the proposed incongruence measure. However, in practice, setting the decision thresholds is more likely to be based on empirical distributions of $\tilde{D}$ estimated on some anomaly-free content. This can simply be achieved by histogramming the incongruence measure values computed from the estimated class a posteriori probabilities on a stream of training sensor data. From such a histogram the graph relating the upper tail area to the threshold on $\tilde{D}$, similar to that plotted in Fig. 11 (a), can be determined and a suitable threshold selected, corresponding to a given level of confidence. This is a very pragmatic approach, as it makes use of the posterior probabilities for all the hypotheses estimated by the data interpretation system. We need neither the ground truth values of these probabilities nor noise estimates.
The selected threshold can be tested on an independent set of anomaly-free data of the same quality to check for false positives. This again is realistic, and can be done without any ground truth annotation of the validation data.
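A minimal sketch of this empirical procedure is given below. It assumes that the incongruence measure values have already been computed on anomaly-free training and validation data. The function names and the parameter `alpha` (the tolerated upper tail area) are illustrative and do not come from the paper.

```python
import math

def select_threshold(train_values, alpha):
    """Smallest observed value whose upper tail fraction on the training data is <= alpha."""
    ordered = sorted(train_values)
    n = len(ordered)
    # At most alpha*n training values may lie strictly above the chosen value.
    k = min(n - 1, max(0, math.ceil((1.0 - alpha) * n) - 1))
    return ordered[k]

def false_alarm_rate(values, threshold):
    """Fraction of anomaly-free values that would be flagged as incongruent."""
    return sum(v > threshold for v in values) / len(values)

if __name__ == "__main__":
    # Synthetic placeholder values standing in for measured incongruence values.
    train = [i / 100 for i in range(1, 101)]
    validation = [i / 50 for i in range(1, 51)]
    t = select_threshold(train, alpha=0.05)
    print(f"threshold = {t:.2f}")
    print(f"training false alarms = {false_alarm_rate(train, t):.3f}")
    print(f"validation false alarms = {false_alarm_rate(validation, t):.3f}")
```

A validation false alarm rate close to `alpha` suggests that the training distribution is representative. A markedly higher rate would indicate that the threshold should be re-estimated on more data.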
Commonly, the decision threshold would be based on a desired level of confidence that the classifier outputs are congruent. This is compatible with the standard methodology of statistical hypothesis testing for outliers in statistical anomaly detection [6]. It is also consistent with the underlying philosophy that any anomaly detection system should be designed on anomaly-free training data, as anomalies, by definition, are very rarely observed, and therefore cannot be used in training. However, in some cases a few anomaly observations may be available or even synthetically generated; anomalous objects or events could be inserted in the data, or alternatively, some object models could be removed from the model database. For example, a few items could be removed from the speech recognition system vocabulary, which would result in incongruence between the outputs of phoneme and word classifiers, indicating an out-of-vocabulary word anomaly. The incongruence threshold level could then be checked for anomaly under-detection (false negatives) on realistic examples of anomalous, or at least incongruous, situations.
A set of guidelines to be utilized for measuring incongruence in practice can be given as a four-step procedure (Steps 1-4). Note that if it is possible to create a validation set with synthetically injected anomalies, Steps 1-3 can be repeated so as to obtain an under-the-tail distribution equivalent to Fig. 11 (b). In this scenario, it will be possible to compute a ROC curve similar to those provided in Fig. 12, and threshold selection can be made to reflect a suitable balance between false positives and false negatives.
Let us now demonstrate the use of these guidelines on the action recognition problem defined on the Breakfast dataset.

Incongruence detection on Breakfast Dataset
The Breakfast dataset [41] is a current benchmark for action and activity recognition from videos, which comprises 10 activities related to breakfast preparation, performed by 52 different individuals in 18 different kitchens. Each activity consists of a number of action units, and 48 different action units are observed in total. For this dataset, the goal is twofold: (1) to recognise simple, primitive actions (such as cut fruit, take bowl), and (2) to recognise high level, complex activities (such as prepare salad) by utilising the detected actions. In this section, we focus on the outputs of a contextual and a non-contextual classifier, to illustrate the design of a classifier incongruence detector based on the Delta divergence in a practical scenario.
We first extract low-level local features with improved dense trajectories (iDTFs) [43] and reduce their size to half (from 426 to 213 elements) with PCA. Using the training set defined by the experimental protocol for the Breakfast dataset [41] we estimate a 16 mode Gaussian mixture model of the empirical distribution of the extracted features. The features are encoded to Fisher vectors [44] with the VLFeat toolbox [45]. Finally, L2 and power normalisations are applied to the Fisher vectors. The resulting Fisher vectors are of size $2 \times K \times D$, where $K$ is the number of clusters of the GMM and $D$ is the dimensionality of the PCA compressed iDTF descriptor. In our case, for $K = 16$ and $D = 213$ the size of each Fisher vector is 6816 dimensions. We reduce this size to 64 dimensions with a second PCA. Having obtained the reduced dimensionality Fisher vectors, we recognise actions in the dataset with the HTK toolkit [46].
The HTK toolkit performs non-contextual action recognition. For each detected action, HTK provides its temporal extent (i.e. its start and end point within the video), its class (e.g. pour water, stir milk) and a detection score in the form of log-likelihood. The HTK toolkit contextual classifier performs activity recognition by utilising information regarding each action's neighbouring actions.
The contextual classifier partitions each video in the Breakfast dataset and assigns action labels to each of the resulting segments. The non-contextual classifier uses the segmentation information derived from the contextual classifier and also labels each segment individually. Delta divergence values are then computed for all segments, using the class probability values output by the classifiers. Afterwards, the set of all segments is divided into two random partitions for training and test, and anomalies are eliminated from the training set. By selecting monotonically increasing values of threshold for $\tilde{D}$, the area under the tail of the Delta divergence distribution can be computed for the anomaly-free training set, as given in Fig. 14.
The operational threshold for incongruence detection can then be selected to produce an appropriate level of confidence in the acceptance of the congruence hypothesis. Specifically, as an example, for the distribution in Fig. 14 we identify the threshold of 0.63 at 2.5% confidence level. The amount of the false negatives detected by this threshold in a separate test set is 2.6%, which is close to the set confidence level, as expected. Interestingly, the test set contains a few instances of classifier outputs producing an incongruence value close to unity. An example of such a case is shown in Fig. 15.
This true incongruence flags a situation where the video happens to contain a main action sequence of coffee making, as demonstrated by the key frame in Fig. 15 (a), and a secondary sequence which takes place in the background after the completion of the main action, as given by the key frame in Fig. 15 (b). The contextual classifier recognises the final segment of this video, upon the completion of coffee making, as "no action". However, the non-contextual classifier labels it as "take bowl", as this is an action carried out by the background object at this time instance. Hence, each of the classifiers produces a sensible response; however, they focus on different interpretations, and this disagreement is detected correctly by the incongruence detector.

Conclusion
We addressed the problem of classifier incongruence detection for decision making systems engaging multiple classifiers (contextual/non-contextual, multimodal). The problem has been cast as one of statistical hypothesis testing, with the focus of the paper directed on the choice of a suitable test statistic. It has been argued that the challenging nature of classifier incongruence detection lies in the inherent fuzziness of the concept of incongruence, and the effect of estimation errors on the classifier outputs. After reviewing the deficiencies of the state-of-the-art methods for classifier incongruence detection, we carried out a theoretical and experimental investigation of a recently proposed measure, Delta divergence, with the aim of providing an intuitive feel for its behaviour. The simulation studies were designed to estimate the probability distribution of the test statistic for various scenarios defined in terms of noise-free classifier incongruence measure values and estimation error statistics. The area under the tail of the distribution for various thresholds on the test statistic can then be determined to illustrate the effect of estimation noise on incongruence threshold selection. Based on the theoretical findings, a set of guidelines has been developed for selecting the classifier incongruence threshold in practice. The use of these guidelines has been illustrated on the problem of action and activity recognition in breakfast scenario videos recording the preparation of different types of dishes for breakfast.
As for future work, the analysis can further be expanded to account for scenarios where more than two decision making experts are taken into consideration. Moreover, it would be interesting to conduct an extensive comparative study of Delta divergence with other families of divergence measures such as Bregman [47] and Renyi [26] divergences. However, these divergences would have to be extended to decision cognizant equivalents first.
(a), the top ten hypotheses output by the two classifiers are shown in Fig. 1 (b). The classifiers are clearly incongruent. Yet the corresponding KL and DC-KL incongruence values, $\tilde{D}_D = \tilde{D}_K = 1.63$, are very low in the context of the normal range of values of these test statistics shown in the histograms in Fig. 2 (a) and (b), respectively. The histograms have been computed on the training set outputs of the two classifiers described in detail in Section 4.

Fig. 1. (a) Key frame taken from an example Breakfast dataset segment. (b) Probability distribution values belonging to the contextual and non-contextual classifiers given for a sample taken from the Breakfast dataset, for which $\tilde{D}_K = \tilde{D}_D = 1.63$.

Fig. 3. Probability density functions (pdf) of $D$ for classifier label agreement on the most probable hypothesis (a), for classifier label disagreement (b), and for all cases (c). Dashed lines indicate the distributions obtained for 3 class problems, and solid lines for 6.

Fig. 5. Pdf curve for $\tilde{D}$ obtained for identical classifier outputs for the most probable hypothesis, affected by $N(0, 0.10)$ noise with no tail clipping.

Fig. 6. Pdf curves of $\tilde{D}$ for the case of label agreement, obtained for $D = 0.3$ corrupted by noise $p(\eta)$, for 3 class problems (a) and 6 class problems (b).

Fig. 7. Pdf curves of $\tilde{D}$ for the case of label disagreement, obtained for $D = 0.3$ corrupted by noise $p(\eta)$, for 3 class problems (a) and 6 class problems (b).

Fig. 8. Pdf curves of $\tilde{D}$ for the case of label disagreement, obtained for $D = 0.7$ corrupted by noise $p(\eta)$, for 3 class problems (a) and 6 class problems (b).

Fig. 9. Upper tail area size versus $\tilde{D}$ threshold for different noise levels and different noise-free $D$. Given for 3 class problems under the scenarios of classifier label agreement (a), and disagreement (b).

Fig. 10. Upper tail area size versus $\tilde{D}$ threshold for different noise levels and different noise-free $D$. Given for 6 class problems under the scenarios of classifier label agreement (a), and disagreement (b).

Fig. 12. ROC curves showing the capacity of $\tilde{D}$, $D_K$, $\tilde{D}_K$, $D_D$ and $\tilde{D}_D$ to separate the state of congruence from incongruence, computed for $\sigma = 0.05$ and 6 classes.

In Fig. 12, we plot the Receiver Operating Characteristic (ROC) curves of the noisy Delta divergence ($\tilde{D}$), noisy Bayesian surprise ($\tilde{D}_K$) and noisy DC-KL ($\tilde{D}_D$) for the 6 class problem and $\sigma = 0.05$. The ROC curves are computed by setting the boundary between congruence and incongruence at $D = 0.5$. The figure also shows the ROC curves for the noise-free Bayesian surprise and DC-KL measures, given as $D_K$ and $D_D$ respectively.

Fig. 14. Upper tail area for anomaly-free Breakfast dataset samples for the training set.

1. Using an anomaly-free training set of sensor data, the a posteriori probabilities, which are computed by the classifiers for various hypotheses as part of the data interpretation process, are recorded.
2. The adopted incongruence measure values are computed from the probabilities obtained in Step 1, and their distribution is estimated.
3. The area under the tail of the distribution determined in Step 2 as a function of threshold on the test statistic is computed. This will produce a graph equivalent to the one shown in Fig. 11 (a).
4. Using the plot derived in Step 3, a classifier incongruence hypothesis testing threshold is selected for a specified confidence level, as described in Section 4.3.

Fig. 15. Key frames belonging to the main action (a) and the background action (b), extracted from an example test sequence in the Breakfast dataset.