Point process models for novelty detection on spatial point patterns and their extremes

Novelty detection is a particular example of pattern recognition identifying patterns that departure from some model of “normal behaviour”. This article considers the classiﬁcation of point patterns ˜x = { x 1 , . . . , x N } deﬁned as sets of N observations of a multivariate random variable X and where the value N follows a discrete stochastic distribution. The use of point process models is introduced that allow us to describe the length N as well as the geometrical conﬁguration in data space of such patterns ˜x . It is shown that such inﬁnite dimensional study can be translated into a one-dimensional study that is analytically tractable for a multivariate Gaussian distribution. Moreover, for other multivariate distributions, an analytic approximation is obtained, by the use of extreme value theory, to model point patterns that occur in low-density regions as deﬁned by X . The proposed models are demonstrated on synthetic and real-world data sets.


Introduction
Novelty detection is the task of recognising test data that differ in some respect from the data that were available during training [1]; it is typically used when there is a large quantity of "normal" data available, but an insufficient quantity of "abnormal" data, thus preventing accurate estimation of the "abnormal" class in a two-class classification setting [2]. Closely related to novelty detection is anomaly or outlier detection, where one also wish to detect abnormalities but where these may not necessarily be entirely novel with respect to the training data. A probabilistic approach starts with a statistical model describing the "normal" state and then detects deviations from this model. The majority of such work deals with a point-wise approach where the novelty of individual points x is evaluated. However when multiple points are evaluated, this can lead to a large number of misclassifications due to the multiple-hypothesis testing problem [3,4]. the density is low. Moreover, in several applications it is not the distribution of the bulk of data that is of interest, but rather the behaviour of extremes, e.g., earth science or financial applications [5].
Traditional point anomaly detection techniques often use a single distribution to describe deviations from a model of normal behaviour. A PPM is based on infinitely many random variables that fully characterise the configuration of a patternx in its data space. Unlike existing approaches, a PPM follows-up the length N of patterns as well as the spatial configuration of the values x i in data space. Furthermore, PPMs have the additional advantage that they can be adopted to followup the extremes within a pattern as well using extreme value theory (EVT).
The remainder of the paper is structured as follows. In Section 2, the problem setting is described, followed by an overview of related work in Section 3. In Section 4, an introduction to the theory of PPMs and EVT is given. Section 5 introduces our main novel approaches to novelty detection. In Section 6, the method is illustrated on synthetic and real-world data sets and compared with existing models that are commonly used for novelty detection. Conclusions are presented in Section 7.

Problem setting
In this article, the problem is addressed to determine whether or not a realized pattern of vectors: is anomalous, where as well the length n is evaluated with respect to a discrete distribution N as the locations x i are evaluated with respect to a distribution X on R d . When n = 0, the pattern is treated as empty.
The problem setting is an example of a collective novelty detection problem where neither the individual instances within a patternx nor the length n itself are classified. Instead, the entire patternx of vectors is considered to be one single instance that is assigned a single label. In terms of statistical hypothesis testing, the problem can be stated as: H 0 :x is a set of vectors drawn from X and n is drawn from N H 1 :x is an anomalous pattern with respect to X and N where H 0 denotes the null-hypothesis and H 1 the alternative hypothesis. The probability of wrongly classifying a 'normal' patternx as anomalous (known as a type I-error) is given by the significance level of the test denoted as α (typically α = 0.05 or α = 0.01). From the point of view of hypothesis testing, it is clear that the problem is related to one of multiple testing. The problem of multiple (hypothesis) testing refers to testing more than one hypothesis at a time and is a well known statistical problem [3]. In this article, we will show how PPMs can be applied to obtain the correct boundary of normality corresponding with the significance level α.
The use of a PPM will allow us to model the spatial configuration of the locations as well as the stochastic length of the pattern. To illustrate this, Figure 1 shows three artificial samples that are anomalous with respect to a standard 2-dimensional Gaussian distribution. The configuration of these points with respect to the boundary of the region A clearly differs as well as the number of point that are situated within the region. While the first sample (Figure 1(a)) contains one point that is situated far beyond the boundary defined by A, the second example ( Figure 1(b)) contains multiple points near the boundary indicating that there is probably a a shift in the underlying process. The third sample (Figure 1(c)), however, indicates an accumulation of points near the centre which probably indicates that the variance of the underlying process is decreased.
We will assume in this article that the locations x i of the patternx are independent and identically distributed (i.i.d.). However, results may be extended to their use on time series data as well by considering the residuals after detrending. The latter will be demonstrated in Section 6.3 using a real-world data set.

Related work
In this section, an overview of related work is given for the main subjects treated in this work: (i) novelty detection, (ii) EVT, and (iii) PPMs.

Novelty detection
Most of the literature of novelty detection deals with a point-wise approach classifying individual points x i and therefore only gives an answer to our problem setting in the case when N = 1. Widely-used examples include the one-class support vector machine (OCSVM) [6]; active outlier (AO) [7], and local outlier factor (LOF) models [8]. For a complete review of the literature on novelty detection techniques, we refer to [9].
Closely related to our problem setting is sequence classification in which the point patternx is considered as one instance that is assigned a single classification label. A commonly-used strategy in the literature is a sequential learning approach, where each point x i within a pattern {x 1 , . . . , x N } with a fixed length N = k is given a label; the labels for all points in a pattern are then combined to yield a single classification for the whole pattern; e.g., this could be the mean of the individual novelty scores learned by an OCSVM. Similarly, a hidden Markov model (HMM) or a conditional random field (CRF) can be used to decide whether a pattern of data points is novel or not [10]. This approach, however is much in line with a point-wise approach where the number of false alarms can increase considerably due to the multiple hypothesis testing problem [3].
Alternatively, times series cluster models have been used to cluster sequences where the instances x i are time dependent. Such approaches, however, heavily depend on the similarity metric and alignment method that are used [11]. Furthermore, such methods are not suited to incorporate the stochastic properties of the spatial configuration of a patternx. Group anomaly detection on the other hand aims to detect interesting aggregate behaviours of data points among several groups [12].
All these approaches, however, lack a joint model for the stochastic properties of the lengths and those of the spatial configuration in data space R d . A suitable PPM will be applicable to an arbitrary bags of points of which sequences of a fixed length is a special case. Moreover, due to its link with EVT which will be explained in Section 4, PPMs will allow us to follow-up the extremes within a pattern as well.

Extreme value theory
In many applications, it is not the distribution of the bulk of data that is of interest, but rather the behaviour of the extremes. Modelling the stochastic behaviour of such extremes is the subject of EVT, which has already been used for many applications ranging from biomedical engineering, structural health monitoring, meteorology, to risk assessment in financial domains [13].
In [14,15,4], the use of univariate EVT is proposed to classify patternsx of fixed length N = k based on their extremes. The proposed EVT approaches were based on the so-called block model and peaks over threshold (POT) model. In such approach, only the single most extreme element inx (i.e. the vector where the density defined by a PDF y = p(x) of a variable X is lowest) is used to obtain a decision. However, the most extreme element is expected to capture limited information about the tails of X that are defined as those regions where the density y = p(x) of the variable X is below a (low) threshold e −u . In [16,17], it is shown how EVT can be used to include information contained in the number of exceedances and the average amount of exceedances present in: with respect to a low threshold e −u on the densities p(x i ), In this work a PPM is proposed that is able to fully capture the spatial configuration that is hidden in the instances ofỹ where the density defined by some PDF is lower than a threshold e −u . Such PPM of exceedances will unify the existing EVT approaches discussed above. By working directly with PPM, the higher-order information arising from the configuration of exceedances can be incorporated efficiently.

Point process models
PPMs are random processes that describe the geometrical structure of patterns formed by objects that are randomly distributed in a multidimensional space. They are well-studied in probability theory and are mainly used to model and analyse spatial data. PPMs are applied in fields as diverse as astronomy, agriculture field trials, epidemiology, and computational neuroscience [18].
The use of PPMs for machine learning and pattern recognition applications is relatively new. PPMs show some links with random fields that are often applied in pattern recognition (e.g., conditional and Markov random fields [19]). Where a random field {Z(x)} on R d is a family of random variables having values in all x of R d , a PPM on R d describes values occurring in random locations in R d . The use of Poisson and determinantal PPMs have recently been introduced in machine learning tasks as image search, tracking and text summarisation [20,21]. In this article, the use of cluster PPMs and PPM of exceedances are introduced for novelty detection applications.

Point processes on a Euclidean space
In this section, general concepts for PPMs are reviewed [22]. After starting with an informal definition in Section 4.1, some typical characteristics of PPMs are given in Section 4.2. In Sections 4.3 and 4.4, two general classes of PPMs are given that will be of high practical use in the application of novelty detection.

Informal definition
Informally, a PPM on a Euclidean space can be viewed as a random variable M with a distribution over all possible point patterns (or "point configurations") in some subspace D of R d , d ∈ N 0 , and where a point pattern is given by: PPMs are able to describe the stochastic properties of the number of points ofx as well as their location in space R d . The set-theoretical notation in (2) indicates that (i) the ordering of the points in a point pattern is irrelevant and that (ii) the points are different and thus do not coincide (a property referred to as simplicity 1 ). PPMs are often characterised by counting measures. The latter are random variables that map each configurationx of the PPM to the number of points falling in a bounded subset 2 A ⊂ R d : where: A PPM is defined such that each N A is a finite random variable, implying that only configurationsx are considered that are locally finite, meaning that they contain a finite number of points in each bounded subset A. In fact, the values of these counting measures N A for all subsets A give sufficient information to reconstruct completely the positions of a configurationx. Indeed N {x} (x) > 0 only applies for those x ∈x.

Distribution and intensity measure
The distribution of a PPM is defined by a measure P that enables us to calculate probabilities of events X 0 : This describes the probability that the realisation of the PPM M belongs to a set X 0 of point patterns. For example, setting X 0 as the set of point patterns with a predefined length k that fall in some bounded subset A ⊂ D gives: where P denotes the probability measure associated with N A . It is clear from (4) that the distribution P of the PPM completely defines the distribution of the random variables N A . Conversely it can be shown that the finite-dimensional distributions (N(A 1 ), . . . , N(A n )), n ∈ N and A i ⊂ D bounded characterise the distribution P of the PPM [18]. A fundamental concept related to the distribution of the counting measures N A are their expected values. The intensity measure of a PPM is defined as: and is a deterministic function operating on sets. The derivative function (provided it exists) of this measure is the so-called intensity function λ(x), x ∈ D ⊂ R d and satisfies:

Finite point processes
PPMs are called finite when each realised point patternx almost surely consists of a finite number of points. A well-known class of such finite point processes is the class of independent and identically distributed (i.i.d.) cluster models [22]. These are PPMs such that the point patternsx consist of a finite number of points that are i.i.d. distributed according to some PDF y = f (x). In particular, an i.i.d. cluster model X on D ⊂ R d associated with a random variable X is uniquely defined by: (i) A random variable N describing the total number N of points in a point pattern, and which is distributed according to a discrete distribution on N: (ii) The random variable X on D ⊂ R d with a PDF y = f (x) generating the locations of the points in the Euclidean space.
Conditioned on the length N of a patternx, the number of points N A of the patternx that fall in a subset A ⊂ D follows a binomial distribution B(N, p A ) with p A = A f (x)dx: Unconditionally the distribution of N A can be found by marginalisation: The density function associated with the probability measure P of a cluster PPM X can be found by calculation of the probability of an event X A of point patterns falling in a subset A ⊂ D: where we have defined: The density functionf associated with the probability measure P of the PPM X is also called the Janossy density function of the PPM X.
There is one final point to be noted here. As PPMs are treated as a theory of unordered point patterns (2), realisation of the random variable X can be considered as a point in a quotient space 3 in which point patterns are determined up to permutations. To be consistent, the density functionf is likewise considered on this quotient space, yielding the additional factorial k! in (7).

Point processes of exceedances
PPMs are closely related to the study of exceedances in EVT. To see this, the PPM of exceedances must be considered studying those observations from a sequence of i.i.d. univariate random variables W 1 , . . . W n which exceed a given threshold u.
A basis result of EVT, termed peaks over threshold (POT), models complete tails of a univariate distribution W, defined as those measurements that fall above some threshold u. In [15] the use of the POT approach is extended to its multivariate use. For this purpose, 3 A space where points are identified with respect to some equivalence relation. those measurement x of a multivariate variable X are described for which z = − log( f (x)) falls above some threshold u: It can be shown that when, the distribution of the exceedances Z − u, conditional on Z > u, satisfies the limiting property: for some scaling factor σ, then: The limiting distribution in (9) is an exponential distribution which belongs to the family of generalized Pareto distributions.
For a fixed choice of n ∈ N, the PPM of exceedances associated to Z is then defined on regions of the form ]0, 1[×]u, +∞[: where we use the superscript e to denote exceedances. The indices are divided by the factor n + 1 to rescale the process to the interval ]0, 1[ as illustrated in Figure 2.
The link between PPMs and EVT is obtained by letting n → +∞ and u → +∞ [17,5]. It can be shown that when the limit in (9) holds for the random variable Z for some scale parameter σ, the corresponding sequences of PPMs of exceedances Z e n will converge to a Poisson point process (PPP) for large u, meaning that the corresponding sequence of counting measures N n A associated with Z e n converge in distribution to a Poisson distribution: on sets and where the intensity measure Λ(A) can be parametrised in terms of the scale parameter σ and a rate parameter λ: Thus, for large u and n, the PPM of exceedances on ]0, 1[ × ]u, +∞[ can be approximated by a PPP where the number of exceedances is distributed according to a Poisson distribution with a rate parameter λ and where the locations of the exceedances are distributed according to an exponential distribution with scale σ, whenever the limit in (9) holds. The choice of u for this approximation to be valid is well-studied in the field of EVT and can be assessed by means of a mean excess plot. The latter is a graphic diagnostic tool in which the sample means of the excesses (Z − u) are plotted against a range of thresholds [13]. When the approximation is valid for u > u 0 , this plot should be linear for u > u 0 . Alternatively, an empirical rule-of-thumb can be chosen that specifies the tail fraction of exceedances above u.
One commonly-used choice is to set u as the quantile at 1 − n 2/3 n log log(n) of a sample of length n of the distribution Z [23]. The parameters σ and λ may then be estimated by means of maximum likelihood estimation.

Novelty detection for point patterns
In this section, we treat the problem of classifying patterns as introduced in Section 2: with respect to a distribution X modelling location in space R d and a distribution N modelling the stochastic behaviour of the length of the pattern. To tackle this novelty detection problem in its full generality, a PPM is proposed wherex is viewed as a realised point pattern of the i.i.d. cluster process X associated with the random variable X. In Section 5.1 the infinite-dimensional study of the distribution X on the quotient space of configurations is translated into a one-dimensional study by considering a distribution of Janossy densitiesf (x). In Section 5.2 it is shown that for a normal distribution X, this distribution is analytically tractable. In Section 5.4, PPMs of multivariate exceedances are introduced that completely characterise the low-density regions of a point pattern and this yields a model that unifies the methods previously introduced in [14,15,17].

Distributions of Janossy densities
Consider a finite i.i.d. cluster PPM X, as introduced in Section 4.3, defined by a random variable N describing the length of the point patterns and a multivariate distribution X with a PDF y = f (x) on D ⊂ R d describing the locations of the points. In this section a numerical method is proposed to evaluate patterns realised by X that fall in a subset A ⊂ D.
For this purpose consider a given subset A ⊂ D and denote X A as being the PPM describing those points within the point patternsx that fall in A: The length of these patterns is governed by a discrete distribution η k = P(N A = k) as given in (6) depending on the random variable N. We remark that X D = X. The distribution of all possible point patternsx A on A ⊂ D is impossible to visualise. Therefore it can be very useful to reduce the analysis of a PPM to the study of a univariate variable V describing the Janossy densities of point patterns v =f (x A ) distributed according to some cumulative distribution function (CDF) G(v). This distribution will make it possible to evaluate configurations of point patternsx A by using novelty scores that have a suitable probabilistic meaning.
The corresponding CDF describes the probability that we might observe a point pattern with a Janossy density that is smaller than some density v. In particular the following event in the probability space of X is considered: where we introduced the disjoint sets: and with η 0 =f ({∅}) being the probability for X A to generate an empty point pattern. The univariate distribution G(v) is now given by the probability of X v ; i.e., G(v) = P(X v ) or: Applying the same reasoning used in (7) for the events A k v and using the abbreviation dx for dx 1 · · · dx k , one finds: where we have used the Heavyside step function: to describe the probability mass associated with an empty pattern. The distribution G(v) can in any case be calculated numerically for a given PDF y = f (x); e.g., obtained by a kernel density estimation (KDE) [24]. For this purpose, G(v) can be approximated by an empirical CDF of Janossy densities of point patterns simulated from the cluster PPM X A .

Point patterns of multivariate normal distributions
In this section an analytic expression is derived for the CDF G(v) and PDF g(v) of Janossy densities of an i.i.d. cluster PPM X associated with a random variable X distributed according to a multivariate normal distribution N(µ, Σ). In particular, in equation (13), one sets: with: In the theorem below we will show that the distribution of V is one of mixed type [25]. Random variables of mixed type are neither discrete or continuous, but are a mixture of both. The discrete component results from the Janossy density η 0 of empty patterns which have a strictly positive probability mass η 0 > 0. This implies a discontinuity in the CDF that we will describe using the Heavyside step function (14).
Theorem 1 Consider the random variable V describing Janossy densities v =f (x) of random point patterns drawn from an i.i.d. cluster PPM X of normal distributed variables X i ∼ N(µ, Σ). The distribution of V is one of mixed-type with CDF: and for each k > 0, F kd denotes the CDF of a chi-squared distribution with kd degrees of freedom. Furthermore, conditioned on the non-empty patterns, the random variable V has a PDF given by: Proof We proceed by the derivation of the distribution of the random variable W = −2 log(V) whereafter a transformation V = e − W 2 recovers the original distribution. From (8), one finds for k > 0: and clearly w > 2 log c k η k . Conditioned on the length k > 0 of a non-empty pattern, the random variable W − 2 log c k η k is given by a sum of k Mahalanobis distances which is distributed according to a chi-squared distribution with k degrees of freedom [25]. Therefore the CDF G(w) associated with W = −2 log(V) is given by: To recover the distribution of V we transform back by means of V = e − W 2 : where we have used the fact that k≥0 η k = 1. This yields us the expression for the CDF G(v).
For the conditional PDF g(v|x ∅), we proceed by calculating the conditional probability which is given by: Using the expression of the PDF of a chi-squared distribution one finds: where α k (v) 2 = −2 log c k η k v for v > c k η k and zero elsewhere. Using the substitution u = η k c k e −ρ 2 k /2 with inverse ρ k = −2 log( c k η k u), this integral simplifies to: (16) yielding the expression of the conditional PDF g(v) given earlier.

Examples and special cases
As an example, Figure 3(a) shows the CDFs of logtransformed Janossy densities w drawn from multidimensional standard normal distributions where the number of dimensions is respectively given by d = 2,3, and 4. The length N of the patterns is distributed according to a binomial B(n, p) with parameters n = 20 and p = 0.7. At w = −2 log η 0 a discontinuity appears as is shown in Figure 3(b) for d = 3. It is clear that the Janossy densities are generally decreasing with increasing dimension. This is a consequence of increasing dimensionality implying that the mass of the Gaussian distribution moves away from the mode [24].
The special case in which the length of the point patterns is fixed and known is worth studying in more detail. In this case the random variable N describing the length of patterns is deterministic meaning that η k = 1 for some k > 0 and all other η i = 0 such that: For k = 1, one obtains the distribution of densities of samples x drawn from a normal distribution with PDF: obtaining a result that was previously found for the special case in [14].

Multivariate point patterns of exceedances
As mentioned in Section 5.1 the distribution of Janossy densities (13) is not analytically tractable for general random variables. However, in this section it is shown that the distribution of Janossy densities of point patterns that occur in low-density regions of a multivariate distribution can be analytically approximated using EVT.
Consider a i.i.d. cluster PPM X associated with the random variable of X distributed according to a PDF y = f (x) on D ⊂ R d such that point patterns (almost surely) have a minimal length n min . In this section a PPM of exceedances (PPM-ex) X e u is defined describing patterns in low-density region  A realisation of X e u is a pattern of exceedances, defined as those points of a point patternx realized by X that fall in the low-density region: In the theorem that follows, an analytic approximation of the distribution of the PPM-ex is found when n min and u are sufficiently large.
Using the transformation z : R d → R : x → − log f (x) the PPM X is transformed into an i.i.d. cluster process Z associated with the random variable Z of NLLs. An observed non-empty point pattern of exceedances: is transformed into a point pattern of univariate exceedances of Z i = − log f (X i ) above u: where K denotes the counting variable associated with the PPM-ex X e u . From Section 4.4, we know that for large n min and u, the distribution of K, conditioned on the length of the underlying point patternx of X, is Poisson with a rate λ: and the exceedances are distributed according to an exponential distribution with scale σ, whenever the distribution of Z satisfies the limiting property in (9). Therefore, the Janossy density of an observed point pattern of K = k extremes (19) can be approximated by: where we have introduced α k = P(K = k) and by using (20): For k = 0 one obtains the likelihood of an empty point pattern of extremes: In the following theorem an analytical expression is obtained for the distribution of Janossy densities f (z e ) of the univariate point patterns of exceedances {z e 1 , . . . z e K }.
Theorem 2 Consider the random variable V e describing Janossy densities v e = f (z e ) as defined in (21) of point patterns of exceedances that are distributed according to an exponential distribution with scale parameter σ. The CDF of V e is given by: where F 2k denotes the CDF of a chi-squared distribution with 2k degrees of freedom.
Proof We proceed as in the proof of Theorem 1 and first determine the distribution of W e = −2 log V e . From (21), one finds for k > 0: The rescaled exceedances Z e i −u σ are distributed according to an expontial with scale 1 which coincides with a chi-squared distribution with 2 degrees of freedom. Conditioned on a length k > 0, the random variables W e − 2 log σ k k!α k are therefore distributed according to a chi-squared distribution with 2k degrees of freedom and thus the CDF G e (w e ) associated with W e is given by:  shows the analytic approximation obtained in Theorem 2 of point patterns of exceedances drawn from an i.i.d cluster PPM associated with a Gaussian mixture model (GMM) X with 4 components. The value of G e (w e ) corresponding to the pattern of exceedancesz e 0 is given by 84%. Therefore, at a significance level of 5%, one cannot reject the null hypothesis that the samplex 0 was generated from X.
The approximation obtained in Theorem 2 holds whenever the PPM of exceedances associated with Z is approximately a PPP for large u and n min as noted in Section 4.4. For Gaussian mixture models X, a minimum length of n min = 10 may already lead to valid approximations [16,14].

Experimental results
In this section the finite PPM introduced in Sections 5.1 and 5.2 and the PPM-ex model introduced in Sec- For large u the PDF is approximately exponential. (c) Emperical CDF of Janossy densities of 10 4 patterns of exceedances generated from the GMM after the transformation w = −2 log(v) together with the analytic approximation obtained via Theorem 2. The transformed Janossy density w e = 5.14 of the pattern of exceedances corresponding tox 0 is indicated with a cumulative probability of 84%. tion 5.4, are demonstrated using three datasets. In Section 6.1, the responsiveness of the PPM to changes of the variance of the components in GMMs is studied using artificial data. Section 6.2 examines the detection of adverse outcomes in patients during their stay in a post-operative ward, using vital-sign data acquired in a clinical trial at the Oxford University Hospitals [26]. Finally, in Section 6.3, an online novelty detection problem is considered that is based on the PPM-ex model to monitor the extremes of a time-series. Calculations 4 were performed in Matlab R and R 3.4.2 [27,28].

Capability of industrial processes
In the field of statistical process control, a vital part of an overall quality-improvement program of a manufacturing process is capability analysis [29]. One way to express process capability is by the ratio of the natural or inherent variance a process experiences and the range of the specification limits in which it is allowed to operate. When the natural variance of the process increases, the capability ratio decreases indicating a decrease in the overall quality of the manufacturing process. Conversely a decrease in natural variance can indicate unnecessarily precision of the process, which may be too expensive to maintain in practice.
In this section, two experiments are considered where data from the normal class are generated via GMMs. In a first experiment, a multivariate normal distribution is considered centred at the origin with a covariance matrix given by Σ = 3I 3 , where I 3 denotes the three-dimensional identity matrix. In a second experiment a GMM is considered with two modes centered at respectively the origin and (3,3,3) with covariance matrices that are respectively given by 3Σ = 3I 3 and I 3 . In both experiments, the performance of a one-class classifier is studied when a change of δI 3 is adapted to the covariance matrix 3I 3 of the component centred at the origin, where δ ranges from −2.5 to 3 with a step size of ∆δ = 0.5.
Training data consists of 400 patterns with a length governed by a binomial distribution with a probability parameter p = 0.8 and n = 25 trials. For each change of the covariance matrix 320 patterns were simulated that constitute the abnormal class. (The choice of 320 ensures that test and validation sets are balanced in 5fold cross-validation). Training involves a 5-fold crossvalidation process, where in each run, the data set is randomly partitioned into 5 subsets; one subset is used for training; two subsets are used to optimize the hyperparameters in a validation step; and two subsets are hold out for testing. F 1 scores on the test data are used to compare performances of our models, which is the harmonic mean of precision and recall [31]. The PPMs are trained using a KDE with isotropic Gaussian kernels (i.e., the covariance matrix Σ = σI n , is given by a scalar multiple of the identity matrix in n variables) to estimate a distribution X of the training data. Performance scores are compared with a OCSVM algorithm (ν-SVM with a Gaussian kernel [6]) and a HMM [30], which are commonly-used methods for sequence classification, as described earlier [10]. During training, the kernel width of the OCSVM and the KDE was varied in the range [10 −3 , 0.3]. The number of states of the HMM was varied between 3, 6, and 9, where each state presented a Gaussian mixture with a number of components that varied between 1 and 2. The novelty scores for each of the points within a sequence that are obtained by applying the OCSVM method are combined by calculating the mean score. Figure 5(a) shows the results of the simulation based on the unimodal multivariate normal distribution. In this experiment, Theorem 1 is used to obtain novelty scores of patterns in the test sets. Performance of the models are compared using the F 1 score which considers both sensitivity (SS) and positive predictive value (PPV) [32]. When the variance increases, the PPMs are competitive with the OCSVM and outperform a HMM. However, none but the finite PPM is able to detect a decrease in variance. For this a PPM, X A is considered for the region A that contains 50% of the training data and that is estimated using a KDE. When variance decreases, sequences situated in these high-density regions are indeed expected to be longer inducing a decrease in the probabilities η k = P(N A = k), (6), and hence in their Janossy density. Figure 5(b) shows the results of the simulation performed using a GMM consisting of two components. Novelty scores of patterns in the test sets are obtained numerically by simulation. In this experiment, one sees that the finite PPM is able to outperform each method indicating that a change in the spatial configuration of patterns drawn from the perturbed GMM is more prominent than a change in the boundary of the normal class.

Predictive monitoring of patients
In this section a clinical data set is considered coming from a study that is carried out at the Oxford Cancer Hospital in the Oxford University Hospitals NHS Trust (Oxford, UK) [26]. During this study, 407 patients were monitored using bedside monitors during a stay in a post-operative ward after an upper-gastrointestinal cancer sugery. The data set involves manual observations of heart rate (HR), systolic blood pressure (BP), and respiratory rate (RR) that are taken at regular times with a mean of 3.5 hours between two observations (but which can rise to as long as 14 hours). The novelty detection problem in this section addresses the prediction of physiologically deterioration resulting in adverse outcomes (such as readmission to the intensive care unit, or death) by detection of novelties in the observation sequence of a patient. The data set is highly unbalanced, where only 13% of the patients suffered from post-surgical complications at some point in their stay.
Clinical guidance in the UK recommends the use of an early warning score (EWS) system in combination with vital-sign measurements [33]. This system applies a univariate scoring to each vital sign and warns of patient risk when any of the scores, or the sum of all scores, exceed some threshold. However this current standard of care treats each vital sign independently and thus disregards the correlations between vital signs. Furthermore, it is expected that information is contained in the length of observation sequences of patients, as a higher frequency of measurement indicates increased concern of the clinician taking the measurements [34]. A PPM will allow us to combine information obtained from the values of the vital signs with that obtained from the measurement frequency.
As is clear from Figure 6(a), the length of the stay in the post-operative ward varies substantially from patient to patient. The distribution of the durations is skewed with a maximum of 71 days and a mean of 10 days. It is expected that shortly after the surgery (at t = 0) care is given at a higher level and so more measurements are expected to take place during the start of each patients' stay on the ward. Figure 6(b) shows the expected number of measurements taken in the past 5 hours as a func- tion of time t (the length of this window is empirically chosen to be 5 hours for illustration). The expectation (dashed curve) and its confidence interval is obtained by fitting a Poisson distribution to the number of measurements at each time t i using a maximum likelihood estimator (MLE). To obtain a continuous estimate at each time t, a model for λ(t) was obtained: and plugged into the likelihood of all Poisson counts N t i taken at available times t i : Maximising L leads to an overall estimate of the parameters (a, b, c) given by (23.27, 3.62, 0.90), as shown in Figure 6(b). The fit of λ(t) was used to apply the PPM. For this purpose, a multivariate normal distribution was fitted to the measurements after an appropriate Box-Cox transformation [29]. To mimic the unbalanced nature of the data set in our experiments, unbalanced test sets were constructed containing 20 abnormal patients and 142 normal patients (noting that only 13% of the patients had a readmission to the intensive care unit or death at the end of their stay). Accounting for the underlying prior probability of readmissions allows us to obtain realistic estimates of how the system will perform in practice. Unlike previous sections, the PPMs were trained in a semi-supervised manner, meaning that the covariance matrix of the Gaussian kernel in a KDE on the training data was chosen based on data from the 'normal class' only (i.e. data from patients not having any postsurgical complication during their stay). This is a pragmatic assumption, as in practice no (or only few data) from the abnormal class are available, making it difficult to optimise parameters using a validation step with cross-validation. The covariance matrix of the KDE was estimated using a minimum covariance determinant estimator which is robust to outliers [35]. Using Theorem 1, risk score for each patient were obtained at each time t by considering patterns of measurements from the past 5 hours. This risk score was threshold at G e (w e ) = 95% corresponding to a significance level α = 5%. The selection of the patients was randomised in a 5fold cross-validation experiment to train a HMM and a OCSVM on the point patterns. The same partitions in training and test set were used to calculate the performance scores of a PPM and the EWS system to allow for a consistent comparison. As in Section 6.1, the scores obtained from the OCSVM for each of the points within a pattern were combined by calculating the mean score. For the EWS scores, a pattern was classified as anomalous when one of the scores exceeded the threshold defined by the EWS system [33]. The kernel width of the OCSVM was optimized over the range [10 −4 , 10 −2 ]. The number of states of the HMM was varied between 3, 6, and 9, where each state presented a Gaussian mixture with a number of components that varied between 1 and 2. Table 1 shows performance scores for the various classifiers considered. It may be seen that the PPM outperformed both the HMM and OCSVM. Compared to the clinically-recommended EWS score, the PPM was able to increase the PPV scores without decreasing the SS score, which is one of the challenges when working with unbalanced data sets [2]. Figure 7 shows empirical estimates of the expected FP rates invoked by each classifier during the stay of patients that had no post-surgical complications. Estimates were obtained by averaging rates over the different patients on time intervals of a 0.5 day. The number of stays with a duration longer than 35 days was too small to obtain reliable estimates (see Figure 6(a)). A higher number of FPs is expected during the start of  each stay as during that period more measurements are expected to take place. However, the PPM is able to reduce this increase in FP rate considerably compared to the other classifiers. This is because a PPM enables to prevent those misclassifications induced by the multiple hypothesis testing problem. The OCSVM and HMM classifier both show an increased FP rate during the complete period. Compared to the EWS method, the FP rate of the PPM shows a lower variability and fluctuates around 5% which corresponds to the choice of the 95% threshold on the Janossy likelihoods.

Online novelty detection
In this section, a public available time series data set is used that consists of the global mean land-ocean temperature index from 1880 to 2016 5 . In particular the data are deviations from measures in degrees centigrade, from the 1951 − 1980 average. To study climate changes, often it's not the distribution of the bulk of deviations that is of interest, but rather the behaviour of patterns in larger deviations [37].
As noted in previous studies, there is an apparent upward trend in the data as indicated by the linear regression fit in Figure 8. In the latter part of the twentieth century, however, there is an increase in the upward trend starting in the period 1980 − 1990 and which may be used as an argument for an acceleration of the global warming hypothesis [38]. We will show in this section how the PPM-ex introduced in Section 5.4 can be used to test whether this increase during the last decennia is statistically significant with respect to the overall increasing trend that is observed. Furthermore there is a levelling off at about 1940 that may be interesting to investigate further.
We will apply an online novelty detector based on the use of a Kalman filter model (KFM) and the PPM-ex approach and will compare its performance with the EVT approach introduced in [14]. In particular, an online linear regression model y t = β 0 +β 1 t + t with t ∼ N(0, σ 2 ) is fitted to the data through the use of a KFM with the following state space form:  where y t denotes the temperature deviation at time t, H t the observation matrix consisting of the explanatory data H t = (1 t) T and ξ t the state vector consisting of the unknown regression coefficients ξ t = (β 0 β 1 ) T .
The KFM allows at each time step t an update of the estimated regression coefficientsξ t|t−1 given the past data. This update can be calculated by a set of wellknown recursive linear estimation steps: where P t|t−1 denotes the variance-covariance matrix of the estimated coefficientsξ t|t−1 : and e t denotes the error y t − y t|t−1 on the forecast y t|t−1 = H T tξt|t−1 of y t at time t given the past observations. The distribution of the random variable Y t conditioned on the past observations Y 1 , . . . Y t−1 is given by a normal distribution: with S t = σ 2 + H t P t|t−1 H T t . The first 30 observation are used to find the unknown variance σ 2 and the initial state vector ξ 1|0 by a maximum likelihood criterion [39]. The matrix P 1|0 is set to a diffuse prior stating that there is no prior knowledge available about the true regression coefficients ξ t = (β 0 β 1 ) T .
For each time t the window of the past n = 10 measurementsỹ = {y t−n+1 , . . . , y t } is considered together with the patternẽ = {e t−n+1 , . . . , e t } of i.i.d. standardized errors: The patternsẽ may now be evaluated by considering the corresponding patterns of exceedancesẽ e together with their Janossy likelihoods v e and corresponding cumulative probabilities G e (v e ) as obtained from Theorem 2. The exponential fit on the exceedances e i − u is estimated using a MLE and a mean excess plot [13] using the first 30 observations. The scale parameter σ of the exponential model and Poisson rate λ of the Poisson model on the number of exceedances were estimated as (σ,λ) = (3.35, 1) forû = 3.5. The fit can be assessed by a quantile-quantile (QQ) plot that shows he empirical quantiles versus the theoretical quantiles obtained from the fitted exponential distribution, see Figure 9. While the fit follows the data well over the majority of q q q q q q q q qq q q q qq q q q q q q q q qq the range, the quality of the fit is lower for higher quantiles. This divergence of the fit from the more extremal data is known in the EVT literature and is also noticed in [4]. Figure 10(a) shows the cumulative probabilities G e (v e ) at each time step t ≥ n associated with the patternỹ = {y t−n+1 , . . . , y t } of the past n = 10 measurements. The scores identify the levelling off starting at 1940 and the increase in the upward trend starting at 1990 by setting a probabilistic threshold of 95%. Figure  10(b) shows the NLLs of the observations y t with respect to the normal distributions N(H T tξt|t−1 , σ 2 ). Commonly used methods for novelty detection rely on a threshold for this likelihoods, e.g. by setting a threshold as the 95% quantile that is estimated during a run-in period of e.g. the first 30 observations. However, even though novelties could be detected, the threshold should be further optimized to overcome the false alarms during the period 1900 − 1920, which is undesirable for a novelty detection problem. Furthermore, the likelihoods oscillate over the whole period while the cumulative probabilities G e (v e ) are much easier to interpret as they are low where they have to be and show steep peaks where patterns become less likely. Figure 10(c) show novelty scores that are defined by using a Gumbel model for the most extreme measurement as proposed in [40]. In this approach, at each time t, only the most extreme standardized error is used to evaluate a patterñ y = {y t−n+1 , . . . , y t } of the past n = 10 measurements. In particular, for each time t, the maximum M n of the NLLs of the standardized errors {e t−n+1 , . . . , e t } as defined in (25) is modelled. These maxima will follow ap- proximately a Gumbel distribution and can be evaluated using: where µ = u + σ log λ. As can be seen from figure 10(c), these scores show an increase at the levelling off at about 1940, but do no exceed the probabilistic 95% threshold border. Moreover, the score G e (ṽ e ) is able to detect the increase in upward trend in the latter part of the twentieth century earlier. The score χ(ỹ) only includes information about the most extreme error and is not able to recognize the change in pattern that occurs before the year 1998.

Conclusion
This paper is concerned with the problem of identifying novel point patternsx = {x 1 , ...x N } with respect to a statistical distribution X and with a length governed by a discrete random variable N. PPMs explore the spatial configuration contained inx while jointly modelling the random length ofx.
It is shown that the complex distribution of a PPM for a space of point patterns can be translated to a univariate formulation (in Janossy densities) that is readily analysed. For multivariate normal data, our formulation is exact. Moreover, for other multivariate distributions, point patterns occurring in regions of lower density can be evaluated by an analytic approximation that is obtained by the use of EVT. This model is of particular importance when not the bulk of data is of interest, but rather the behaviour of extremes.
We have demonstrated the use of our PPMs using multiple synthetic and real-world data sets, and showed that for these data our models can outperform commonly-used methods for sequence classification, such as HMMs and OCSVMs. For a synthetic data set it was shown how PPMs can detect reductions in variances in the components of Gaussian mixture models by monitoring the expected number of instances in high density regions. The PPM can easily be trained when few or no data of the abnormal class are available in a novelty detection setting. This is in contrast with OCSVMs and HMMs that need the tuning of several hyperparameters. We demonstrated this on a real-world data set consisting of vital signs of patients staying in a postoperative hospital ward. The data set was highly unbalanced as readmission was infrequent compared to the normal state present in the data. For this particular application and data set, the proposed models matched clinical best-practice for sensitivity, while substantially improving PPV in comparison to HMMs and OCSVMs. Furthermore, a real-world time series data set was used to illustrate the use of the method when observations are not independent. The combination of a KFM and the PPM-ex model allowed us to define an online approach to detect novelties in the extremes of time series. In contrast to other EVT methods, the PPM was able to monitor the spatial configurations of exceedances leading to models that were able to detect changes in patterns rather than detecting point anomalies.
There are several interesting extensions of the method possible for future research. Firstly, an extension of the calculation of the distribution of Janossy densities can be considered for random variables that are not multivariate normally distributed. Secondly, we would like to study how this method can be used to model dependent sequences of point patterns including a temporal variation into the model.