Model-Based Multiple Instance Learning

While Multiple Instance (MI) data are point patterns -- sets or multi-sets of unordered points -- appropriate statistical point pattern models have not been used in MI learning. This article proposes a framework for model-based MI learning using point process theory. Likelihood functions for point pattern data derived from point process theory enable principled yet conceptually transparent extensions of learning tasks, such as classification, novelty detection and clustering, to point pattern data. Furthermore, tractable point pattern models as well as solutions for learning and decision making from point pattern data are developed.


INTRODUCTION
Point patterns -- sets or multi-sets of unordered points -- arise in numerous data analysis problems, where they are commonly known as 'bags'. For example, the 'bag' in multiple instance learning [1], [2], the 'bag-of-words' in natural language processing and information retrieval [3], [4], [5], the 'bag-of-visual-words' in image and scene categorization [6], [7], and the 'bag-of-features' in sparse data [8], [9], are all point patterns. However, statistical point pattern models have not received much attention in machine learning for point pattern data.
A statistical data model is specified by the likelihood function, which can be interpreted as how likely an observation is, given the parameters of the underlying model. The likelihood can be used to determine the "best" labels for input observations in classification (supervised learning), the "best" cluster parameters in clustering (unsupervised learning), and outliers in novelty detection (semi-supervised learning) [10], [11], [12]. The data likelihood function thus plays a fundamental role in model-based data analysis. To motivate the development of suitable likelihood functions for point patterns, let us consider an example in novelty detection. Suppose that apples fallen from an apple tree land on the ground independently from each other, and that the daily point patterns of landing positions are independent from day to day. Further, the probability density, p_f, of the landing position, learned from 'normal' training data, is shown in Fig. 1. Since the apple landing positions are independent, following common practice (see e.g., [3], [4], [5], [6], [13]) the likelihood that the apples land at positions x_1, ..., x_m is given by the joint (probability) density

p(x_1, ..., x_m) = p_f(x_1) ... p_f(x_m).

Suppose on day 1 we observe one apple landing at x_1, and on day 2 we observe two apples landing at x_2 and x_3 (see Fig. 1). Which of these daily landing patterns is more likely to be a novelty? Since there is no 'novel' training data in novelty detection, the common practice (see e.g., [10]) is to examine the 'normal' likelihoods of the landing patterns, p(x_1) = p_f(x_1) = 0.2 and p(x_2, x_3) = p_f(x_2) p_f(x_3) = 0.36, to identify outliers. Intuitively, the pattern observed on day 1 is more likely to be a novelty since p(x_1) < p(x_2, x_3). However, had we measured distance in centimeters (p_f is scaled by 10^-2), then p(x_1) = 0.002 > p(x_2, x_3) = 0.000036, thereby contradicting the previous conclusion!
This phenomenon arises from the incompatibility of the measurement units of the likelihoods: p(x_1) is measured in "m^-1" or "cm^-1", whereas p(x_2, x_3) is measured in "m^-2" or "cm^-2", i.e., we are not "comparing apples with apples." The joint density of the landing positions also suffers from another problem. To eliminate the effect of unit incompatibility, we assume that there are only 201 positions numbered from −100 to 100, evenly spaced on the interval [−1 m, 1 m]. Thus, instead of a probability density on [−1 m, 1 m] we now have a (unit-less) probability mass function on the discrete set {−100, ..., 100}, as shown in Fig. 2a. Four point patterns from the 'normal' training data set are shown in Fig. 2b, while Fig. 2c shows 2 new observations X_1 and X_2. Since X_2 has only 1 feature, whereas X_1 and the 'normal' observations each have around 10 features, it is intuitive that X_2 is novel. However, its likelihood is much higher than that of X_1 (0.009 versus 2×10^-23). This counter-intuitive phenomenon arises from the lack of appropriate cardinality information in the likelihood. The simple example above demonstrates that the joint probability density of the constituent points is not the likelihood of a point pattern. In particular, it suffers from incompatibility in the unit of measurement and does not appropriately account for the number of elements in each point pattern. Worse, such inconsistency in a likelihood function could lead to erroneous results in more complex point pattern learning tasks. Hence, a proper notion of probability density for random point patterns is necessary.
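The flip in the novelty ranking under a change of units can be verified numerically. In this sketch, the individual values p_f(x_2) = p_f(x_3) = 0.6 are assumptions made for illustration (the example only fixes their product, 0.36):

```python
# Feature density values from the apple example (assumed: p_f(x2) = p_f(x3) = 0.6).
p_x1 = 0.2                  # p_f(x1) in m^-1
p_x2 = p_x3 = 0.6           # p_f(x2), p_f(x3) in m^-1

day1_m = p_x1               # 0.2   (units: m^-1)
day2_m = p_x2 * p_x3        # 0.36  (units: m^-2)

# Re-express the same densities in cm^-1 (each factor scaled by 1e-2).
day1_cm = p_x1 * 1e-2                    # 0.002
day2_cm = (p_x2 * 1e-2) * (p_x3 * 1e-2)  # 3.6e-5

print(day1_m < day2_m)    # True  -- day 1 looks more novel in metres
print(day1_cm < day2_cm)  # False -- the ordering flips in centimetres
```

The comparison is not invariant to the unit of measurement because the two "likelihoods" carry different units.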
This paper proposes a model-based approach for learning from point pattern data using point process theory [14], [15], [16]. Likelihood functions derived from point process theory are probability densities of random point patterns, which incorporate both cardinality and feature information, and avoid the unit-of-measurement inconsistency. Moreover, they enable the extension of model-based formulations for learning tasks such as classification, novelty detection, and clustering to point pattern data in a conceptually transparent yet principled manner. Such a framework facilitates the development of tractable point pattern models as well as solutions for learning and decision making. Specifically:
• In classification, we propose solutions based on learning point process models from fully observed training data, and develop an inexpensive classifier using a tractable class of models;
• In novelty detection, where observations are ranked according to their likelihoods, we show that standard point process probability densities are not suitable for point patterns, and develop suitable ranking functions;
• In clustering, we introduce point process mixture models, and develop an inexpensive Expectation-Maximization clustering algorithm for point patterns using a tractable class of models.
These developments have been partially reported in [17], [18], [19], respectively. This article provides a more complete study, under a unified exposition. In Section 2 we review basic concepts from point process theory. Subsequent sections present the proposed framework for model-based point pattern learning, in progression from: supervised, namely classification, in Section 3; semi-supervised, namely novelty detection, in Section 4; to unsupervised, namely clustering, in Section 5. Numerical studies for these learning tasks are presented in Section 6, followed by some concluding remarks in Section 7.

BACKGROUND
Tools for modeling and analysis of point patterns are available from point process theory, and have proven to be effective in a number of diverse areas, see e.g., [14], [20], [21], [22] and references therein. This section outlines the elements of point process theory and presents some basic models for point pattern data. For further detail on point processes, we refer the reader to textbooks such as [15], [14], [16].

Point Process
A point pattern is a set or multi-set of unordered points. While a multi-set differs from a set in that it may contain repeated elements, a multi-set can be equivalently expressed as a set. Specifically, a multi-set with elements x_1 of multiplicity N_1, ..., x_m of multiplicity N_m can be represented as the set {(x_1, N_1), ..., (x_m, N_m)}. A point pattern can be characterized as a counting measure on the space X of features. Given a point pattern X, a counting measure N is defined, for each (compact) set A ⊆ X, by

N(A) = number of points of X falling in A.

The values of the counting variables N(A) for all subsets A provide sufficient information to reconstruct the point pattern X [15], [14]. The points of X are the set of x such that N({x}) > 0. A point pattern is said to be: finite if it has a finite number of points, i.e., N(X) < ∞; and simple if it contains no repeated points, i.e., N({x}) ≤ 1 for all x ∈ X. Formally, a point process is defined as a random counting measure. A random counting measure N may be viewed as a collection of random variables N(A) indexed by A ⊆ X. A point process is finite if its realizations are finite almost surely, and simple if its realizations are simple almost surely.
In this work we are interested in likelihood functions for finite point patterns. For a countable feature space X, the likelihood function f is simply the probability of the point pattern. More concisely,

f({x_1, ..., x_i}) = p_c(i) Σ_π p(x_π(1), ..., x_π(i) | i),

where: π denotes a permutation of {1, 2, ..., i}; p(x_1, ..., x_i | i) is the joint probability of the features x_1, ..., x_i, given that there are i features; and p_c(i) is the probability that there are i features. Conceptually, likelihoods for point patterns in a countable space are straightforward and require no further discussion. Hereon, we consider point processes on a compact subset X of R^d.

Probability Density
The probability density of a point process is the Radon-Nikodym derivative of its probability distribution with respect to a dominating measure µ, usually an unnormalised probability distribution of a Poisson point process. Let ν be a (non-atomic σ-finite) measure on X. A Poisson point process on X with intensity measure ν is a point process such that: (i) for every (compact) A ⊆ X, the number of points N(A) is Poisson distributed with mean ν(A); and (ii) for any disjoint A_1, ..., A_k ⊆ X, the counts N(A_1), ..., N(A_k) are independent random variables. In general the probability density of a point process may not exist [23], [24]. To ensure that probability densities are available, we restrict ourselves to finite point processes [24]. Further, in many applications involving uncountable feature spaces, the observed point patterns do not have repeated elements, and hence can be modeled as a simple point process. A simple finite point process is equivalent to a random finite set [24], i.e., a random variable taking values in F(X), the space of finite subsets of X.
The probability density f : F(X) → [0, ∞) of a random finite set is usually taken with respect to the dominating measure µ, defined for each (measurable) T ⊆ F(X) by (see e.g., [25], [16], [26]):

µ(T) = Σ_{i=0}^∞ (1 / (i! U^i)) ∫ 1_T({x_1, ..., x_i}) d(x_1, ..., x_i),   (4)

where U is the unit of hyper-volume in X, 1_T(·) is the indicator function for T, and by convention the integral for i = 0 is the integrand evaluated at ∅. The measure µ is the unnormalized distribution of a Poisson point process with unit intensity 1/U when X is bounded. For this choice of reference measure, it was shown in [26] that the integral of f,

∫ f(X) µ(dX) = Σ_{i=0}^∞ (1 / (i! U^i)) ∫ f({x_1, ..., x_i}) d(x_1, ..., x_i),   (5)

is equivalent to Mahler's set integral [27], [28], and that densities relative to µ can be computed using Mahler's set derivative [27], [28]. Note that the reference measure µ and the integrand f are dimensionless.
The probability density of a random finite set, with respect to µ, evaluated at {x_1, ..., x_i} can be written as (see [23, p. 27], Eqs. (1.5)-(1.7)):

f({x_1, ..., x_i}) = p_c(i) i! U^i f_i(x_1, ..., x_i),   (6)

where p_c(i) is the cardinality distribution, and f_i(x_1, ..., x_i) is a symmetric function denoting the joint probability density of x_1, ..., x_i given cardinality i. Note that by convention f_0 = 1, and hence f(∅) = p_c(0). It can be seen from (6) that the probability density f captures the cardinality information as well as the dependence between the features. Also, U^i cancels out the unit of the probability density f_i(x_1, ..., x_i), making f unit-less and thereby avoiding the unit mismatch.
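The density (6) can be evaluated directly once the cardinality distribution and the conditional feature densities are supplied. In the sketch below, the toy cardinality distribution (uniform on {0, 1, 2}) and the uniform feature density on [0, 1] are assumptions made for the example:

```python
import math

def point_pattern_density(X, p_c, f_card, U=1.0):
    """Probability density of a point pattern relative to mu, as in Eq. (6):
    f(X) = p_c(|X|) * |X|! * U**|X| * f_{|X|}(X), with f(emptyset) = p_c(0)."""
    n = len(X)
    if n == 0:
        return p_c(0)  # convention: f_0 = 1
    return p_c(n) * math.factorial(n) * U**n * f_card(X)

# Toy check: cardinality uniform on {0, 1, 2}; features i.i.d. uniform on
# [0, 1], so the conditional joint density f_n is identically 1.
p_c = lambda n: 1.0 / 3 if n in (0, 1, 2) else 0.0
f_card = lambda X: 1.0
print(point_pattern_density([0.3, 0.7], p_c, f_card))  # (1/3) * 2! * 1 * 1 = 2/3
```

Note that the result is unit-less because U^n cancels the units of f_card.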

Intensity and Conditional Intensity
The intensity function λ of a point process is a function on X such that, for any (compact) A ⊆ X,

E[N(A)] = ∫_A λ(x) dx.

The intensity value λ(x) is interpreted as the instantaneous expected number of points per unit hyper-volume at x. For a hereditary probability density f, i.e., one for which f(X) > 0 implies f(Y) > 0 for all Y ⊆ X, the conditional intensity at a point u is given by [24]

λ(u, X) = f(X ∪ {u}) / f(X).

Loosely speaking, λ(u, X)du can be interpreted as the conditional probability that the point process has a point in an infinitesimal neighbourhood du of u, given all points of X outside this neighbourhood. The intensity function is related to the conditional intensity by

λ(u) = E[λ(u, N)].

For a Poisson point process the conditional intensity equals the intensity: λ(u, X) = λ(u). The probability density of a finite point process is completely determined by its conditional intensity [14], [16]. Certain point process models are more convenient to formulate in terms of the conditional intensity than the probability density. Using the conditional intensity also eliminates the normalizing constant needed for the probability density. However, the functional form of the conditional intensity must satisfy certain consistency conditions.

IID cluster model
Imposing the independence assumption among the features, the model in (6) reduces to the IID-cluster model [15], [14]:

f(X) = |X|! U^|X| p_c(|X|) p_f^X,   (10)

where |X| denotes the cardinality (number of elements) of X, p_f is a probability density on X, referred to as the feature density, and h^X denotes the product Π_{x∈X} h(x), with h^∅ = 1 by convention. Sampling from an IID-cluster model can be accomplished by first sampling the number of points from the cardinality distribution p_c, and then sampling the corresponding number of points independently from the feature density p_f.
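The two-step sampling procedure just described can be sketched as follows; the cardinality distribution on {0, ..., 4} and the one-dimensional Gaussian feature density are assumptions chosen for illustration:

```python
import numpy as np

def sample_iid_cluster(p_c, sample_feature, rng):
    """Draw one point pattern from an IID-cluster model: first sample a
    cardinality n from p_c, then draw n i.i.d. features from p_f."""
    n = rng.choice(len(p_c), p=p_c)          # step 1: cardinality
    return [sample_feature(rng) for _ in range(n)]  # step 2: features

rng = np.random.default_rng(0)
# Hypothetical cardinality distribution on {0,...,4}; Gaussian feature density.
p_c = np.array([0.1, 0.2, 0.4, 0.2, 0.1])
X = sample_iid_cluster(p_c, lambda r: r.normal(0.0, 1.0), rng)
print(len(X))  # a cardinality between 0 and 4
```

Each realization is an unordered collection whose size varies from draw to draw, which is exactly the structure of a 'bag'.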
When the cardinality distribution p_c is Poisson with rate ρ, i.e., p_c(n) = e^{-ρ} ρ^n / n!, the model reduces to the celebrated Poisson point process [15], [14]:

f(X) = e^{-ρ} U^|X| (ρ p_f)^X.
The Poisson point process model is completely determined by the intensity function λ = ρp f . Note that the Poisson cardinality distribution is described by a single non-negative number ρ, hence there is only one degree of freedom in the choice of cardinality distribution for the Poisson point process model.

Finite Gibbs model
A well-known general model that accommodates dependence between its elements is the finite Gibbs process, which has probability density of the form [14], [16]

f(X) = exp( V_0 + Σ_{x∈X} V_1(x) + Σ_{{x_1,x_2}⊆X} V_2(x_1, x_2) + ... ),   (12)

where V_i is called the i-th potential (a symmetric function of i points) and V_0 is a normalizing constant. Gibbs models arise in statistical physics, where log f(X) may be interpreted as the potential energy of the point pattern. The term −V_1(x) can be interpreted as the energy required to create a single point at a location x, and the term −V_2(x_1, x_2) can be interpreted as the energy required to overcome the force between the points x_1 and x_2. Note that any hereditary probability density of a finite point process can be expressed in the Gibbs form [24]. The Poisson point process is indeed a first-order Gibbs model. Another example is the hardcore model, in which every pair of distinct points is at least r units apart. In this case, V_1(x) is a constant and

V_2(x_1, x_2) = 0 if ||x_1 − x_2|| ≥ r, and −∞ otherwise,

with all higher-order potentials zero. The next three sections show how point process models are used in model-based point pattern classification, novelty detection and clustering.
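As a concrete sketch of the hardcore model, the unnormalized density can be evaluated directly; here β is a hypothetical first-order term exp(V_1), and the intractable normalizing constant exp(V_0) is deliberately omitted:

```python
import numpy as np

def hardcore_density_unnormalized(X, beta, r):
    """Unnormalized hardcore Gibbs density: beta**|X| if every pair of
    points is at least r apart (constant V_1 = log(beta), V_2 = -inf
    within distance r), and 0 otherwise. The normalizing constant
    exp(V_0) is intractable and omitted in this sketch."""
    if len(X) == 0:
        return 1.0
    pts = np.asarray(X, dtype=float)
    for i in range(len(pts)):
        for j in range(i + 1, len(pts)):
            if np.linalg.norm(pts[i] - pts[j]) < r:
                return 0.0  # a pair violates the hardcore constraint
    return beta ** len(pts)

print(hardcore_density_unnormalized([[0.0, 0.0], [1.0, 0.0]], 2.0, 0.5))  # 4.0
print(hardcore_density_unnormalized([[0.0, 0.0], [0.2, 0.0]], 2.0, 0.5))  # 0.0
```

The zero value for the second pattern reflects the −∞ pairwise potential: configurations with points closer than r have zero density.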

MODEL-BASED CLASSIFICATION
Classification is the supervised learning task that uses fully observed training input-output pairs D_train = {(X_n, y_n) : n = 1, ..., N_train} to determine the output class label y ∈ {1, ..., K} of each input observation [11], [12]. This fundamental machine learning task is the most widely used form of supervised learning, with applications spanning many fields of study.
Model-based classifiers for point pattern data have not been investigated. In multiple instance learning, existing classifiers in the Bag-Space paradigm are based on distances between point patterns, such as the Hausdorff [29], [30], Chamfer [31], and Earth Mover's [32], [33] distances. Such classifiers do not require any underlying data models and are simple to use. However, they may perform poorly with high-dimensional inputs due to the curse of dimensionality, are often computationally intractable for large datasets [12], and offer no clear decision-theoretic procedure. On the other hand, knowledge of the underlying data model can be used to exploit statistical patterns in the training data, and to devise optimal decision procedures.
Using the notion of probability density for point processes from subsection 2.2, the standard model-based classification formulation directly extends to point pattern classification:
• In the training phase, we seek likelihoods that "best" fit the training data. Specifically, for each k ∈ {1, ..., K}, we seek a likelihood function f(·|y = k) that best fits the training input point patterns in D_train^(k) = {X : (X, k) ∈ D_train}, according to criteria such as maximum likelihood (ML), or Bayes optimality if suitable priors on the likelihoods are available.
• In the classifying phase, the likelihoods (learned from training data) are used to classify input observations. When a point pattern X is passed to query its label, the Bayes classifier returns the mode of the class label posterior p(y = k | X), computed from the likelihood and the class prior p via Bayes' rule:

p(y = k | X) = p(k) f(X | y = k) / Σ_{j=1}^K p(j) f(X | y = j).

The simplest choices for the class priors are the uniform distribution, and the categorical distribution, usually estimated from the training data via

p(k) = (1/N_train) Σ_{n=1}^{N_train} δ_{k, y_n},

where δ_{i,j} is the Kronecker delta, which takes on the value 1 when i = j, and zero otherwise. Hence, the main computational effort in model-based classification lies in the training phase.
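The classifying phase above can be sketched in a few lines; the two Poisson likelihoods (rates 2 and 8, uniform features on [0, 1] so that U = 1 and p_f = 1) are assumptions for this toy example:

```python
import numpy as np

def bayes_classify(X, log_likelihoods, log_priors):
    """Return the class k maximizing log p(k) + log f(X | y = k); the
    shared normalizer in Bayes' rule does not affect the argmax."""
    scores = [lp + ll(X) for lp, ll in zip(log_priors, log_likelihoods)]
    return int(np.argmax(scores))

# Toy two-class example: Poisson point processes with uniform features on
# [0, 1] (U = 1, p_f = 1), so the log-density reduces to -rho + |X| log(rho).
log_f = lambda rho: (lambda X: -rho + len(X) * np.log(rho))
k = bayes_classify([0.1] * 7, [log_f(2.0), log_f(8.0)],
                   [np.log(0.5), np.log(0.5)])
print(k)  # 1: a 7-point pattern is better explained by rho = 8
```

Because the features are uninformative here, the decision is driven entirely by the cardinality information in the point process likelihoods.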

Learning Point Process Models
Learning the likelihood function for class k amounts to finding the value(s) of the parameter θ_k such that the (parameterized) probability density f(· | y = k, θ_k) best explains the training point patterns in D_train^(k). In this subsection, we consider a fixed class label and its corresponding observations X_1, ..., X_N, and omit the dependence on k.
Methods for learning point process models have been available since the 1970s; see e.g., [16], [24]. We briefly summarize some recognized techniques and present ML estimation for IID-cluster models as a tractable point pattern classification solution.

Model fitting via summary statistics
The method of moments seeks the parameter θ such that the expectation of a given statistic of the model point process parameterized by θ equals the statistic of the observed point patterns [24]. However, this approach is tractable only when the solution is unique and the expectation is a closed-form function of θ, which is usually not the case in practice; moreover, the moments themselves are often difficult to calculate.
The method of minimum contrast seeks the parameter θ that minimizes some dissimilarity between the expectation of a given summary statistic (e.g., the K-function) of the model point process and that of the observed point patterns [24]. Provided that the dissimilarity functional is convex in the parameter θ, this approach can avoid some of the problems of the method of moments. However, in general the statistical properties of the solution are not well understood, nor is the numerical behaviour of the algorithm used to determine the minimum.

Maximum likelihood (ML)
In the ML approach, we seek the ML estimate (MLE) of θ:

θ̂ = argmax_θ Σ_{n=1}^N log f(X_n | θ).

The MLE has desirable statistical properties such as asymptotic normality and optimality [24]. However, in general, there are problems with non-unique maxima. Moreover, analytic MLEs are not available because the likelihood (12) of many Gibbs models contains an intractable normalizing constant (which is a function of θ) [16].
To the best of our knowledge, there is currently no general ML technique for learning generic models such as Gibbs from real data. The numerical approximation method proposed in [34] and the Markov chain Monte Carlo (MCMC) method proposed in [35] are highly specific to the chosen model, computationally intensive, and require careful tuning to ensure good performance. Nonetheless, simple models such as the IID-cluster model (10) admit an analytic MLE (see subsection 3.1.4).
Remark: The method of estimating equations replaces the ML estimating equation

Σ_{n=1}^N (∂/∂θ) log f(X_n | θ) = 0

by a sample approximation

Σ_{n=1}^N Ψ(θ, X_n) = 0,   (15)

where Ψ is chosen so that E_θ[Ψ(θ, X)] = 0, i.e., (15) is an unbiased estimating equation. Setting Ψ(θ, X_n) to the difference between the empirical value and the expectation of the summary statistic recovers the method of moments. Takacs-Fiksel estimation is another well-known family of estimating equations [36], [37].

Maximum Pseudo-likelihood
Maximum pseudo-likelihood (MPL) estimation is a powerful approach that avoids the intractable normalizing constant present in the likelihood, while retaining desirable properties such as consistency and asymptotic normality in a large-sample limit [38], [39]. The key idea is to replace the likelihood of a point process (with parameterized conditional intensity λ_θ(u; X)) by the pseudo-likelihood

PL(θ; X) = [ Π_{x∈X} λ_θ(x; X) ] exp( −∫_X λ_θ(u; X) du ).

The rationale behind this strategy is discussed in [38]. Up to a constant factor, the pseudo-likelihood is exactly the likelihood if the model is Poisson, and approximately equal to the likelihood if the model is close to Poisson. The pseudo-likelihood may be regarded as an approximation to the likelihood that neglects the inter-point dependence.
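A minimal sketch of the log-pseudo-likelihood, with the integral over the feature space approximated by a Riemann sum on a hypothetical quadrature grid; for a homogeneous Poisson model the conditional intensity is a constant, so the pseudo-likelihood coincides with the likelihood:

```python
import numpy as np

def log_pseudo_likelihood(X, cond_intensity, grid, cell_volume):
    """log PL(theta; X) = sum_{x in X} log lambda_theta(x; X \ {x})
                          - integral of lambda_theta(u; X) du,
    with the integral approximated by a Riemann sum over grid points."""
    data_term = sum(
        np.log(cond_intensity(x, [y for y in X if y is not x])) for x in X)
    integral = sum(cond_intensity(u, X) for u in grid) * cell_volume
    return data_term - integral

# Sanity check on [0, 1] with a homogeneous Poisson conditional intensity.
lam = lambda u, X: 5.0
X = [0.2, 0.5, 0.9]
grid = np.linspace(0.005, 0.995, 100)   # 100 cells of volume 0.01
print(log_pseudo_likelihood(X, lam, grid, 0.01))  # 3*log(5) - 5, about -0.172
```

Maximizing this quantity over θ, rather than the true log-likelihood, is what sidesteps the normalizing constant.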
An MPL algorithm was developed by Baddeley and Turner in [40] for point processes of sufficient generality, such as Gibbs models whose conditional intensity has the log-linear form

λ_θ(u; X) = exp( θ^T S(u, X) ),

where S(u, X) is a vector of sufficient statistics. By turning the pseudo-likelihood of a general point process into a classical Poisson point process likelihood, MPL can be implemented with standard generalized linear regression software [40]. Due to its versatility, the Baddeley-Turner algorithm is the preferred model-fitting tool for point processes.
The main hurdle in applying the Baddeley-Turner algorithm to point pattern classification is its computational requirement. While this may not be an issue in spatial statistics applications, the computational cost is still prohibitive for the large data sets often encountered in machine learning. On the other hand, disadvantages of MPL (relative to ML), such as small-sample bias and inefficiency [39], [41], become less significant with large data. Efficient algorithms for learning general point process models remain an ongoing area of research.

ML Learning for IID-Clusters
Computationally efficient algorithms for learning point process models are important because machine learning usually involves large data sets (compared to applications in spatial statistics). Since learning a general point process is computationally prohibitive, the IID-cluster model (10) provides a good trade-off between tractability and versatility by neglecting interactions between the points.
Since an IID-cluster model is uniquely determined by its cardinality and feature distributions, we consider a parameterization of the form

f(X | ξ, ϕ) = |X|! U^|X| p_ξ(|X|) p_ϕ^X,   (17)

where p_ξ and p_ϕ are the cardinality and feature distributions parameterized by ξ and ϕ, respectively. Learning the underlying parameters of an IID-cluster model amounts to estimating the parameter θ = (ξ, ϕ) from training data. The form of the IID-cluster likelihood function allows the MLE to separate into the MLE of the cardinality parameter ξ and the MLE of the feature parameter ϕ. This is stated more concisely in Proposition 1 (the proof is straightforward, but included for completeness).

Proposition 1. Let X_1, ..., X_N be N i.i.d. realizations of an IID-cluster with parameterized cardinality distribution p_ξ and feature density p_ϕ. Then the MLE of (ξ, ϕ) is given by

ξ̂ = argmax_ξ Σ_{n=1}^N log p_ξ(|X_n|),   (18)
ϕ̂ = argmax_ϕ Σ_{x ∈ X_1 ⊎ ... ⊎ X_N} log p_ϕ(x),   (19)

where ⊎ denotes disjoint union.
Proof: Using (17), we have

Π_{n=1}^N f(X_n | ξ, ϕ) = [ Π_{n=1}^N |X_n|! U^|X_n| ] [ Π_{n=1}^N p_ξ(|X_n|) ] [ Π_{n=1}^N p_ϕ^{X_n} ].

The first product does not depend on (ξ, ϕ). Hence, to maximize the likelihood we simply maximize the second and last products separately, which is achieved by (18) and (19).
Observe from Proposition 1 that the MLE of the feature density parameter is identical to that used in NB. For example: if the feature density is a Gaussian, then the MLEs of the mean and covariance are

μ̂ = ( Σ_{x ∈ X_1 ⊎ ... ⊎ X_N} x ) / ( Σ_{n=1}^N |X_n| ),
Σ̂ = ( Σ_{x ∈ X_1 ⊎ ... ⊎ X_N} (x − μ̂)(x − μ̂)^T ) / ( Σ_{n=1}^N |X_n| );

if the feature density is a Gaussian mixture, then the MLE of the Gaussian mixture parameters can be determined by the EM algorithm. Consequently, learning the IID-cluster model requires only one additional, but relatively inexpensive, task: computing the MLE of the cardinality parameters. For a categorical cardinality distribution, i.e., ξ = (ξ_1, ..., ξ_M) where ξ_k = Pr(|X| = k) and Σ_{k=1}^M ξ_k = 1, the MLE of the cardinality parameter is given by

ξ̂_k = (1/N) Σ_{n=1}^N δ_k[|X_n|],

where δ_k[·] is the Kronecker delta. Note that to avoid over-fitting, the standard practice of placing a Laplace prior on the cardinality distribution can be applied, i.e., replacing the above equation by

ξ̂_k ∝ ε + Σ_{n=1}^N δ_k[|X_n|],

where ε is a small number. For a Poisson cardinality distribution parameterized by the rate ξ = ρ, the MLE is given by

ρ̂ = (1/N) Σ_{n=1}^N |X_n|.

It is also possible to derive MLEs for other families of cardinality distributions such as Panjer, multi-Bernoulli, etc.

Remark: Proposition 1 also extends to Bayesian learning for IID-clusters if the prior on (ξ, ϕ) separates into priors on ξ and ϕ. Following the arguments in the proof of Proposition 1, the maximum a posteriori (MAP) estimate of (ξ, ϕ) separates into MAP estimates of ξ and ϕ. Typically a (symmetric) Dirichlet distribution Dir(·|η/M, ..., η/M), with dispersion η, on the unit M-simplex can be used as a prior on the categorical cardinality distribution. The prior for ϕ depends on the form of the feature density p_ϕ (see also subsection 5.2 for conjugate priors of the Poisson model). Indeed, Bayesian learning for point process models can also be considered as a variation of the Bayesian point pattern clustering problem in Section 5.
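The separable MLE of Proposition 1 can be sketched for the categorical-cardinality, Gaussian-feature case; the helper fit_iid_cluster and the toy patterns below are hypothetical:

```python
import numpy as np

def fit_iid_cluster(patterns, max_card):
    """Separable MLE, as in Proposition 1, for a categorical cardinality
    distribution and a Gaussian feature density."""
    # Cardinality MLE: empirical frequency of each observed cardinality.
    counts = np.bincount([len(X) for X in patterns], minlength=max_card + 1)
    xi = counts / counts.sum()
    # Feature MLE: pool all features (the disjoint union) and fit a Gaussian.
    feats = np.concatenate([np.atleast_2d(X) for X in patterns if len(X) > 0])
    mu = feats.mean(axis=0)
    cov = np.cov(feats, rowvar=False, bias=True)  # MLE uses 1/N, not 1/(N-1)
    return xi, mu, cov

# Three toy 1-D patterns with cardinalities 2, 1 and 3.
patterns = [[[0.0], [2.0]], [[1.0]], [[1.0], [3.0], [-1.0]]]
xi, mu, cov = fit_iid_cluster(patterns, max_card=4)
print(xi)  # [0, 1/3, 1/3, 1/3, 0] -- empirical cardinality frequencies
print(mu)  # [1.0] -- mean of the six pooled features
```

The two estimates never interact: changing how the features are distributed across the patterns leaves ξ̂ unchanged, and vice versa.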

MODEL-BASED NOVELTY DETECTION
Novelty detection is the semi-supervised task of identifying observations that are significantly different from the rest of the data [10], [42]. In novelty detection there is no novel training data; only 'normal' training data is available. Hence it is not a special case of classification or clustering [43], [44], but a separate problem in its own right.
Similar to classification, novelty detection involves a training phase and a detection phase. Since novel training data is not available, input observations are ranked according to how well they fit the 'normal' training data and those not well-fitted are deemed novel or anomalous [43], [44]. The preferred measure of goodness of fit is the 'normal' likelihoods of the input data. To the best of our knowledge, there are no novelty detection solutions for point pattern data in the literature.
In this section we present a model-based solution to point pattern novelty detection. The training phase in novelty detection is the same as that for classification. However, in the detection phase, ranking by likelihood is not directly applicable to point pattern data, even though point process probability densities are unit-less and incorporate both feature and cardinality information. In subsection 4.1 we discuss why such probability densities are not suitable for ranking input point patterns, while in subsection 4.2 we propose a suitable ranking function for novelty detection.

Probability density and likelihood
This subsection presents an example to illustrate that the probability density of a point pattern does not necessarily indicate how likely it is. For this example, we reserve the term likelihood for the measure of how likely or probable a realization is.
Consider two IID-cluster models with different uniform feature densities and a common cardinality distribution, as shown in Fig. 3. Due to the uniformity of p_f, it follows from (10) that point patterns from each IID-cluster model with the same cardinality have the same probability density. Note from [15] that to sample from an IID-cluster model, we first sample the number of points from the cardinality distribution, and then sample the corresponding number of points independently from the feature distribution. For an IID-cluster model with uniform feature density, the joint distribution of the features is completely uninformative (total uncertainty), and so the likelihood of a point pattern should be proportional to the probability of its cardinality. If the probability density were an indication of how likely a point pattern is, then the plot of probability density against cardinality should resemble the cardinality distribution. However, this is not the case. Fig. 4 indicates that for the IID-cluster with 'short' feature density, the probability density tends to decrease with increasing cardinality (Fig. 4a). This phenomenon arises because the feature density given cardinality n is (1/20)^n, which vanishes faster than n! grows (for n ≤ 20). The converse is true for the IID-cluster with 'tall' feature density (Fig. 4b). Thus, point patterns with the highest/lowest probability density are not necessarily the most/least probable.
This problem arises from the non-uniformity of the reference measure. A measure µ is said to be uniform if, for any measurable region A with µ(A) < ∞, all points of A (except on a set of measure zero) are equi-probable under the probability distribution µ/µ(A). One example is the Lebesgue measure vol on R^n: given any bounded measurable region A, all realizations in A are equally likely under the probability distribution vol(·)/vol(A). The probability density f(X) = P(dX)/µ(dX) (as a Radon-Nikodym derivative) at a point X is the ratio of probability measure to reference measure on an infinitesimal neighbourhood of X. Hence, unless the reference measure is uniform, f(X) is not a measure of how likely X is. This is true even for probability densities on the real line. For example, the probability density of a zero-mean Gaussian distribution with unit variance relative to the (uniform) Lebesgue measure is the usual Gaussian curve shown in Fig. 5a, while its density relative to a zero-mean Gaussian distribution with variance 0.8 is shown in Fig. 5b, where the most probable point has the least probability density value. The reference measure µ defined by (4) is not uniform, because for a bounded region T ⊆ F(X) the probability distribution µ/µ(T) is not necessarily uniform (unless all points of T have the same cardinality). Hence, probability densities of input point patterns relative to µ are not indicative of how well they fit the 'normal' data model.
Remark: In novelty detection we are interested in the likelihood of the input point pattern, whereas in Bayesian classification we are interested in its likelihood ratio. The posterior class probability

p(y | X) = p(y) f(X | y) / ∫ p(y) f(X | y) dy
         = [ p(y) P(dX | y)/µ(dX) ] / ∫ p(y) ( P(dX | y)/µ(dX) ) dy
         = p(y) P(dX | y) / ∫ p(y) P(dX | y) dy

(using standard properties of the Radon-Nikodym derivative and relevant absolute continuity assumptions) is the ratio, at an infinitesimal neighbourhood dX, between the joint probability P(dX, y) and the probability P(dX), which is invariant to the choice of reference measure. In essence, the normalizing constant cancels out the influence of the reference measure, and hence problems with the non-uniformity of the reference measure do not arise.

Ranking functions
To the best of our knowledge, it is not known whether there exists a uniform reference measure on F(X) that dominates the probability distributions of interest (so that they admit densities). In this subsection, we propose a suitable point pattern ranking function for novelty detection by modifying the probability density. The probability density (6) is the product of the cardinality distribution p_c(|X|), the cardinality-conditioned feature (probability) density f_|X|(X), and a trans-dimensional weight |X|! U^|X|. Note that the cardinality distribution and the conditional joint feature density completely describe the point process. The conditional density f_|X|(X) enables the ranking of point patterns of the same cardinality, but cannot be used to rank across different cardinalities because it takes on different units of measurement. The weight |X|! U^|X| reconciles the differences in dimensionality and unit of measurement between the f_|X|(X) of different cardinalities. However, the example in subsection 4.1 demonstrates that weighting by |X|! U^|X| leads to probability densities that are inconsistent with likelihoods.
In the generalization of the maximum a posteriori (MAP) estimator to point patterns [28], Mahler circumvented such inconsistency by replacing |X|! U^|X| with c^|X|, where c is an arbitrary constant. More concisely, instead of maximizing the probability density f(X), Mahler proposed to maximize f(X) c^|X| / |X|!. Since c is a free parameter, the generalized MAP estimate depends on the choice of c.
Inspired by Mahler's generalized MAP estimator, we replace the weight |X|! U^|X| in the probability density by a general function of the cardinality, C(|X|), resulting in a ranking function of the form

r(X) = C(|X|) p_c(|X|) f_|X|(X).

The example in subsection 4.1 demonstrated that, as a function of cardinality, the ranking should be proportional to the cardinality distribution; otherwise unlikely samples can assume high ranking values. In general, the ranking function is not solely dependent on the cardinality, but also varies with the features. Nonetheless, the example suggests that the ranking function, on average, should be proportional to the cardinality distribution. Hence, we impose the following consistency requirement: for a given cardinality n, the expected ranking value is proportional to the probability of cardinality n, i.e.,

E[ r(X) | |X| = n ] ∝ p_c(n).   (25)

Proposition 2. For a point process with probability density (6), a ranking function consistent with the cardinality distribution, i.e., satisfying (25), is given by

r(X) = p_c(|X|) f_|X|(X) / ||f_|X|||_2^2,   (26)

where ||·||_2 denotes the L_2-norm.
Proof: Noting f(X | |X| = n) = n! U^n f_n(X) δ_n[|X|] from (6), and using the set integral (5), we have

E[r(X) | |X| = n] = (p_c(n) / ||f_n||_2^2) ∫ f_n(x_1, ..., x_n)^2 d(x_1, ..., x_n) = p_c(n).

Note that ||f_{|X|}||_2^2 has units of U^{-|X|}, which is the same as the unit of f_{|X|}(X), rendering the ranking function r unitless, thereby avoiding the unit of measurement inconsistency described in Section 1.
For an IID-cluster with feature density p_f, the ranking function reduces to

r(X) = p_c(|X|) ∏_{x∈X} p_f(x) / ||p_f||_2^2.

The feature density p_f in the example of subsection 4.1 is uniform, and so p_f / ||p_f||_2^2 = 1 on its support. Hence the ranking equals the cardinality distribution, as expected. Fig. 6 illustrates the effect of dividing a non-uniform feature density p_f by its energy ||p_f||_2^2: 'tall' densities become shorter and 'short' densities become taller, compensating for the effect of multiplying together many large/small numbers.
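As a concrete illustration, the IID-cluster ranking above can be sketched in a few lines. The Gaussian feature density, the Poisson cardinality choice, and all function names are illustrative assumptions, not part of the framework itself; the one closed-form fact used is that the squared L_2-norm of a Gaussian N(·; µ, Σ) on R^d is (4π)^{−d/2} |Σ|^{−1/2}.

```python
import numpy as np
from scipy.stats import multivariate_normal, poisson

def l2_norm_sq_gaussian(cov):
    """||p_f||_2^2 for a Gaussian density N(.; mu, cov):
    the integral of N(x; mu, cov)^2 dx equals (4*pi)^(-d/2) |cov|^(-1/2)."""
    cov = np.atleast_2d(cov)
    d = cov.shape[0]
    return (4.0 * np.pi) ** (-d / 2) / np.sqrt(np.linalg.det(cov))

def rank_iid_cluster(X, card_dist, mu, cov):
    """Ranking r(X) = p_c(|X|) * prod_{x in X} p_f(x) / ||p_f||_2^2
    for an IID-cluster model with Gaussian feature density p_f."""
    e = l2_norm_sq_gaussian(cov)
    feat = 1.0
    for x in X:
        feat *= multivariate_normal.pdf(x, mu, cov) / e
    return card_dist(len(X)) * feat
```

Rescaling the features (e.g., metres to centimetres) rescales p_f(x) and ||p_f||_2^2 by the same factor per point, so the ranking value is unchanged; this is precisely the unit invariance motivating the construction.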

MODEL-BASED CLUSTERING
The aim of clustering is to partition the dataset into groups so that members of a group are similar to each other whilst dissimilar to observations from other groups [12]. A partitioning of a given set of observations {X_1, ..., X_N} is often represented by the (latent) cluster assignment y_{1:N}, where y_n denotes the cluster label for the n-th observation. Clustering is an unsupervised learning problem since the labels are not included in the observations [45], [46]. Indeed, it can be regarded as classification without training and is a fundamental problem in data analysis. Comprehensive surveys on clustering can be found in [45], [47], [48]. At present, model-based point pattern clustering has not been investigated. To the best of our knowledge, there are two clustering algorithms for point patterns: the Bag-level MI Clustering (BAMIC) algorithm [49] and the Maximum Margin MI Clustering (M3IC) algorithm [50]. BAMIC adapts the k-medoids algorithm with the Hausdorff distance as a measure of dissimilarity between point patterns [49]. In M3IC, on the other hand, clustering is posed as a non-convex optimization problem, which is then relaxed and solved via a combination of the Constrained Concave-Convex Procedure and Cutting Plane methods [50]. While these algorithms are simple to use, they lack the ability to exploit statistical trends in the data, not to mention computational problems with high dimensional or large datasets [12].
In this section, we propose a model-based approach to the clustering problem for point pattern data. Mixture modeling is the most common probabilistic approach to clustering, where the aim is to estimate the cluster assignment y 1:N via likelihood or posterior inference [12]. The point process formalism enables direct extension of mixture models to point pattern data. In particular, the finite mixture point process model for problems with known number of clusters is presented in subsection 5.1, while the infinite mixture point process model for problems with unknown number of clusters is presented in subsection 5.2.

Finite Mixture Model
A finite mixture model assumes K underlying clusters labeled 1 to K, with prior probabilities π_1, ..., π_K, and characterized by the parameters θ_1, ..., θ_K in some space Θ. Let f(X_n | θ_k) ≜ f(X_n | y_n = k, θ_{1:K}) denote the likelihood of X_n given that cluster k generates the observation. Then

f(X_{1:N}, y_{1:N} | π_{1:K}, θ_{1:K}) = ∏_{n=1}^N π_{y_n} f(X_n | θ_{y_n}).    (28)

Marginalizing the joint distribution (28) over the cluster assignment y_{1:N} gives the data likelihood function

f(X_{1:N} | π_{1:K}, θ_{1:K}) = ∏_{n=1}^N Σ_{k=1}^K π_k f(X_n | θ_k).    (29)

Thus, in a finite mixture model, the likelihood of an observation is a mixture of K probability densities. Hence the application of the finite mixture model requires the number of clusters to be known a priori. The posterior probability of cluster label y_n = k (i.e., the probability that, given π_{1:K}, θ_{1:K} and X_n, cluster k generates X_n) is

p(y_n = k | X_n, π_{1:K}, θ_{1:K}) = π_k f(X_n | θ_k) / Σ_{j=1}^K π_j f(X_n | θ_j).    (30)

Under a mixture model formulation, clustering can be treated as an incomplete data problem since only the X_{1:N} of the complete data D = {(X_n, y_n)}_{n=1}^N is observed and the cluster assignment y_{1:N} is unknown or missing. We seek y_{1:N} and the mixture model parameter ψ ≜ (π_{1:K}, θ_{1:K}) that best explain the observed data X_{1:N} according to a given criterion such as ML or optimal Bayes. ML is intractable in general and often requires the Expectation-Maximisation (EM) algorithm [51], [52] to find approximate solutions. Optimal Bayes requires suitable priors for ψ. Typically the prior for π_{1:K} is a Dirichlet distribution Dir(· | η/K, ..., η/K) with dispersion η, while the prior for θ_{1:K} is model-specific, depending on the form of the likelihood f(X_n | θ_k). Computing the cluster label posterior p(y_{1:N} | X_{1:N}) or the joint posterior p(y_{1:N}, ψ | X_{1:N}) is intractable in general, and Markov Chain Monte Carlo methods, such as Gibbs sampling, are often needed [53], [54].
Next we detail an ML solution to point pattern clustering using EM with an IID-cluster mixture model. Instead of presenting a Bayesian solution for the finite mixture model, in subsection 5.2 we extend it to address unknown number of clusters, and develop a solution based on Gibbs sampling, which can be simplified to the finite mixture case.

EM clustering via IID-cluster mixture model
The EM algorithm maximizes the data likelihood (29) by generating a sequence of iterates {ψ^(i)}_{i=0}^∞ using the following two steps [51], [52]:

E-step: compute Q(ψ | ψ^(i−1)) = E[log f(X_{1:N}, y_{1:N} | ψ) | X_{1:N}, ψ^(i−1)];
M-step: set ψ^(i) = argmax_ψ Q(ψ | ψ^(i−1)).

The data likelihood increases after each EM iteration and consequently converges to a (local) maximum of (29) [51], [52]. In practice, the iteration is terminated at a user-defined number N_iter or when the increment in Q(ψ^(i) | ψ^(i−1)) falls below a given threshold. The optimal cluster label estimate is the mode of the cluster label posterior (30).
Following the arguments from [55], the M-step can be accomplished by separately maximizing Q(π_{1:K}, θ_{1:K} | ψ^(i−1)) over θ_1, ..., θ_K and π_{1:K}. Using a Lagrange multiplier with the constraint Σ_{k=1}^K π_k = 1 yields the optimal weights

π_k^(i) = (1/N) Σ_{n=1}^N p(y_n = k | X_n, ψ^(i−1)).

Noting that log(π_k f(X_n | θ_k)) is accompanied by the weight p(y_n = k | X_n, ψ^(i−1)), maximizing Q(π_{1:K}, θ_{1:K} | ψ^(i−1)) over θ_k is equivalent to ML estimation of θ_k with weighted data. However, the data-weighted MLE of θ_k depends on the specific form of f(· | θ_k) and is intractable in general. Fortunately, for the IID-cluster mixture model, with θ_k = (ξ_k, ϕ_k) denoting the parameters of the cardinality and feature distributions, tractable solutions are available. Similar to Proposition 1, the IID-cluster form allows the data-weighted MLE of θ_k to separate into data-weighted MLEs of ξ_k and ϕ_k. Writing w_{n,k} ≜ p(y_n = k | X_n, ψ^(i−1)), some examples are:
• For a categorical cardinality distribution with maximum cardinality M, where ξ_k = (ξ_{k,0}, ..., ξ_{k,M}) lies in the unit M-simplex, the iteration is ξ_{k,m}^(i) = Σ_n w_{n,k} δ_m[|X_n|] / Σ_n w_{n,k}.
• For a Poisson cardinality distribution, where ξ_k > 0 is the mean cardinality, the iteration is ξ_k^(i) = Σ_n w_{n,k} |X_n| / Σ_n w_{n,k}.
• For a Gaussian feature distribution, where ϕ_k = (µ_k, Σ_k) is the mean–covariance pair, the iteration is µ_k^(i) = Σ_n w_{n,k} Σ_{x∈X_n} x / Σ_n w_{n,k} |X_n| and Σ_k^(i) = Σ_n w_{n,k} Σ_{x∈X_n} (x − µ_k^(i))(x − µ_k^(i))^T / Σ_n w_{n,k} |X_n|.
• For a Gaussian mixture feature distribution, where ϕ_k is the Gaussian mixture parameter, ϕ_k^(i) can be determined by applying the standard EM algorithm on the weighted data.
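A minimal sketch of one such EM iteration, assuming a Poisson cardinality distribution, a single Gaussian feature density per component, and non-empty point patterns (function and variable names are ours, not from the framework):

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal, poisson

def em_step(Xs, pi, xi, mu, cov):
    """One EM iteration for a K-component IID-cluster mixture with Poisson
    cardinality and Gaussian feature density. The trans-dimensional weight
    depends only on |X_n|, so it cancels in the responsibilities."""
    N, K = len(Xs), len(pi)
    # E-step: responsibilities w[n, k] = p(y_n = k | X_n, psi)
    logw = np.empty((N, K))
    for n, X in enumerate(Xs):
        for k in range(K):
            lf = poisson.logpmf(len(X), xi[k])
            lf += multivariate_normal.logpdf(np.asarray(X), mu[k], cov[k]).sum()
            logw[n, k] = np.log(pi[k]) + lf
    w = np.exp(logw - logsumexp(logw, axis=1, keepdims=True))
    # M-step: data-weighted ML updates
    Nk = w.sum(axis=0)
    cards = np.array([len(X) for X in Xs], dtype=float)
    new_pi = Nk / N
    new_xi = (w.T @ cards) / Nk              # weighted mean cardinality
    new_mu, new_cov = [], []
    for k in range(K):
        tot = float(w[:, k] @ cards)         # total weighted point count
        m = sum(w[n, k] * np.asarray(Xs[n]).sum(axis=0) for n in range(N)) / tot
        S = sum(w[n, k] * sum(np.outer(x - m, x - m) for x in Xs[n])
                for n in range(N)) / tot
        new_mu.append(m)
        new_cov.append(S)
    return new_pi, new_xi, new_mu, new_cov
```

Iterating this function until the likelihood increment falls below a threshold implements the EM clustering described above; the mode of the final responsibilities gives the cluster label estimate.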

Infinite Mixture Model
For an unknown number of clusters, finite mixture models are no longer directly applicable. Bayesian non-parametric modeling (see e.g., [56], [57]) addresses the unknown number of clusters by modeling the set of mixture parameters as a point process. Thus, the observations and the clusters are all modeled as point processes.
In a finite mixture model, the number of components (and clusters) is fixed at K. The mixture parameter ψ = (π_{1:K}, θ_{1:K}) is a point in (R^+ × Θ)^K such that Σ_{i=1}^K π_i = 1. Under the Bayesian framework, it is further assumed that θ_{1:K} follows a given distribution on Θ^K, and that π_{1:K} follows a distribution on the unit (K − 1)-simplex, e.g., a Dirichlet distribution. An infinite mixture model addresses the unknown number of components by considering the mixture parameter Ψ as a point pattern in R^+ × Θ such that Σ_{(π,θ)∈Ψ} π = 1. Further, under the Bayesian non-parametric framework, we furnish Ψ with a prior distribution, thereby modeling the mixture parameter as a point process on R^+ × Θ. The simplest model would be the Poisson point process, but the resulting component weights do not necessarily sum to one. Nonetheless, these weights can be normalized to yield a tractable point process model for the mixture parameter [58], [59]. More concisely, let Ξ be a Poisson point process on R^+ × Θ with intensity measure η ω^{−1} e^{−ηω} dω G_0(dθ), i.e., the product of an improper gamma distribution and the base distribution G_0. Then the prior model for the mixture parameter is given by

Ψ = {(ν_Ξ^{−1} ω, θ) : (ω, θ) ∈ Ξ},    (33)

where ν_Ξ = Σ_{(ω,θ)∈Ξ} ω. Note that (33) is no longer a Poisson point process because each constituent element involves the sum ν_Ξ, thereby violating the independence condition.
To specify an explicit form for the prior distribution of the mixture parameter Ψ, note that each point (ν_Ξ^{−1} ω, θ) can be equivalently represented by an atom at θ with weight ν_Ξ^{−1} ω, and hence the point process (33) can be represented by the random atomic distribution G on Θ defined by

G = Σ_{(ω,θ)∈Ξ} ν_Ξ^{−1} ω δ_θ.

It was shown in [58] that G follows a Dirichlet process DP(η, G_0), with concentration parameter η and base distribution G_0. Noting that the cluster parameter θ_n for X_n can be regarded as a sample from G, the data generation process for this model can be summarized as follows:

G ∼ DP(η, G_0),  θ_n | G ∼ G,  X_n | θ_n ∼ f(· | θ_n),  n = 1, ..., N.

The cluster assignment and the mixture parameters, including the number of clusters, can be automatically learned from the data via posterior inference. Analogous to finite mixture models, computing the posteriors is intractable in general and often requires MCMC methods. Next we detail a (collapsed) Gibbs sampler to simulate the cluster label posterior p(y_{1:N} | X_{1:N}) for point pattern clustering. The sampler generates a Markov chain by sweeping through the labels, sampling each y_n in turn from p(y_n | y_{¬n}, X_{1:N}), the n-th conditional probability. After the so-called pre-convergence period, samples from the Markov chain are effectively distributed according to the cluster label posterior. However, the actual distribution of the samples depends on the starting value. In practice, the pre-convergence samples, known as burn-ins, are discarded, and the post-convergence samples are used for inference.
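For intuition, a realization of G can be simulated via Sethuraman's stick-breaking construction, an equivalent representation of DP(η, G_0); the truncation level is a practical assumption (the exact process has infinitely many atoms), and the function names are ours:

```python
import numpy as np

def stick_breaking_dp(eta, base_sampler, truncation, rng):
    """Sample a truncated realization G = sum_k w_k * delta_{theta_k}
    of DP(eta, G0) via stick-breaking: break off a Beta(1, eta) fraction
    of the remaining stick for each atom drawn from the base distribution."""
    beta = rng.beta(1.0, eta, size=truncation)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - beta[:-1])))
    weights = beta * remaining
    atoms = np.array([base_sampler(rng) for _ in range(truncation)])
    return weights, atoms
```

For a moderate truncation level the weights sum to one up to a negligible remainder, illustrating how the normalized weights of (33) form a proper mixing distribution.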

Gibbs Sampling for Poisson mixture model
While Gibbs sampling is efficient, it requires the conditionals to be easily computed and sampled. Using the notation ¬n ≜ {1, ..., N} − {n}, the n-th conditional for the cluster labels of the infinite point process mixture can be written as

p(y_n = k | y_{¬n}, X_{1:N}) = p(y_n = k | y_{¬n}, X_n, X_{¬n})
∝ f(X_n | y_n = k, y_{¬n}, X_{¬n}) p(y_n = k | y_{¬n}, X_{¬n})
= f(X_n | y_n = k, y_{¬n}, X_{¬n}) p(y_n = k | y_{¬n}),    (35)

where the last line follows from the fact that, given y_{¬n}, y_n is independent of X_{¬n}. Using the Polya urn characterization [60] of the Dirichlet process, we have

p(y_n = k | y_{¬n}) ∝ Σ_{j∈¬n} δ_k[y_j] for an existing cluster k, and ∝ η for a new cluster,

where the sum over ¬n is called the popularity of k. Further, note in the first term of (35) that, given y_n = k, y_{¬n} and X_{¬n}, X_n only depends on X_{¬n}^{(k)} ≜ {X_j ∈ X_{¬n} : y_j = k}, the set of observations in X_{¬n} that belong to cluster k (in fact its cardinality is the popularity of k). Hence

f(X_n | y_n = k, y_{¬n}, X_{¬n}) = f(X_n | y_n = k, X_{¬n}^{(k)}),    (37)

which is the predictive likelihood for cluster k of the point pattern X_n given X_{¬n}^{(k)}. In general the predictive likelihood, and consequently the conditionals, are intractable.
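The Polya urn prior conditional above can be sketched as follows; the function name and the "new" placeholder label are illustrative assumptions:

```python
import numpy as np

def polya_urn_probs(labels_rest, eta):
    """Prior conditional p(y_n = . | y_{not n}) from the Polya urn scheme:
    an existing cluster k is chosen with probability proportional to its
    popularity (its count in y_{not n}); a new cluster is opened with
    probability proportional to the concentration parameter eta."""
    ks, counts = np.unique(labels_rest, return_counts=True)
    unnorm = np.append(counts.astype(float), eta)   # last entry: new cluster
    return list(ks) + ["new"], unnorm / unnorm.sum()
```

In a collapsed Gibbs sweep, these prior probabilities would be multiplied by the predictive likelihoods (37) and renormalized before sampling each y_n.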
Fortunately, an analytic predictive likelihood is available for the (infinite) Poisson point process mixture, where each component likelihood is the Poisson point process density

f(X | θ_k) = e^{−ξ_k} ∏_{x∈X} ξ_k p_{ϕ_k}(x),    (38)

with θ_k = (ξ_k, ϕ_k) denoting the mean cardinality and feature density parameter. In the following, we propose a family of conjugate priors for Poisson point processes and exploit conjugacy to derive the predictive likelihood for the Poisson mixture model.
For an infinite Poisson mixture model with base measure G_0 given by (39), where H(dϕ | γ) is conjugate with respect to the feature density p_ϕ of the constituent Poisson point process components, the predictive likelihood (37), f(X_n | X_{¬n}^{(k)}), admits a closed form. Depending on the specific forms of H(dϕ | γ) and p_ϕ, the integral ∫ p_ϕ^X H(dϕ | γ_Z), with p_ϕ^X ≜ ∏_{x∈X} p_ϕ(x), can be evaluated analytically (see [53] for several examples). Consequently, given the hyper-parameters α, β, γ, η, the conditionals for the Gibbs sampler can be computed analytically.
Remark: The proposed Bayesian solution can be adapted for semi-supervised learning, where labeled training data are available only for certain clusters and the objective is to compute the posterior of the missing labels. This approach can also address the novelty detection problem in Section 4.

EXPERIMENTS
This section demonstrates the viability of the proposed framework with one of the simplest point processes: the Poisson model. A Poisson point process with Gaussian intensity is specified by the triple (ρ, µ, Σ), where ρ is the rate and µ, Σ are the mean and covariance of the feature density. The NB model is used as a performance benchmark since it has been used for this type of problem (see e.g., [3], [4], [5], [6], [13]) and assumes i.i.d. features like the Poisson model.
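For reference, sampling one point pattern from the triple (ρ, µ, Σ) is straightforward (a sketch; the function name is ours):

```python
import numpy as np

def sample_gaussian_poisson_pp(rho, mu, cov, rng):
    """Draw one point pattern from a Poisson point process with Gaussian
    intensity (rho, mu, Sigma): cardinality ~ Poisson(rho), features
    drawn i.i.d. from N(mu, Sigma)."""
    n = rng.poisson(rho)
    return rng.multivariate_normal(mu, cov, size=n)   # shape (n, d)
```

Repeated calls with class-specific triples generate the kind of simulated datasets used in the experiments below.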

Classification Experiments
This subsection presents two classification experiments, on simulated data and on the Texture images dataset [61]. In the training phase, ML is used to learn the parameters of the NB model and the Poisson model (using the technique outlined in subsection 3.1.4) from fully observed training data. For simplicity, we use a uniform class prior in the test phase.

Classification on simulated data
We consider three diverse scenarios, each comprising three classes simulated from Poisson point processes with Gaussian intensities shown in Fig. 7. In scenario (a), point patterns from each class are well-separated from other classes in feature, but significantly overlapping in cardinality (see Fig. 7a). In scenario (b), point patterns from each class are well-separated from other classes in cardinality, but significantly overlapping in feature (see Fig. 7b). Scenario (c) is a mix of (a) and (b): point patterns from Class 1 are well-separated from other classes in feature, but significantly overlapping with Class 2 in cardinality; and the point patterns from Classes 2 and 3 significantly overlap in feature, but are well-separated in cardinality (see Fig. 7c).
The fully observed training dataset, comprising 600 point patterns (200 per class), is used to learn the NB/Poisson model in which each class is modeled by a Gaussian density/intensity. In the test phase, 10 different test sets, each comprising 300 point patterns (100 per class), are used. The average classification performance is reported in Fig. 7. Observe that in scenario (a) both the NB and Poisson models perform equally well, while in scenarios (b) and (c), where the point patterns overlap in feature, the Poisson model outperforms the NB model since it exploits the cardinality information in the point patterns.

Classification on the Texture dataset
Three classes "T14 brick1", "T15 brick2", and "T20 upholstery" of the Texture images dataset [61] are considered. Each class comprises 40 images, with some examples shown in Fig. 8a. Each image is processed by the SIFT algorithm (using the VLFeat library [63]) to produce a point pattern of 128-D SIFT features, which is then compressed into a 2-D point pattern by Principal Component Analysis (PCA). Fig.  8b shows the superposition of the 2-D point patterns from the three classes along with their cardinality histograms.
A 4-fold cross validation scheme is used for performance evaluation. In each fold, the fully observed training dataset comprising 30 images per class is used to learn the NB/Poisson model in which each class is parameterized by a 3-component Gaussian mixture density/intensity. The test set comprises the remaining images (10 per class). Observe from Fig. 9 that the Poisson model outperforms NB, since it can exploit cardinality information from the data.

Novelty Detection Experiments
This subsection presents two novelty detection experiments on both simulated and real data using the Poisson model, to illustrate the effectiveness of the proposed ranking function against the NB likelihood and the standard probability density. As in the classification experiments, ML is used to learn the parameters of the 'normal' NB and Poisson models in the training phase. The novelty threshold is set at the 2nd 10-quantile (i.e., the 0.2 quantile) of the ranking values of the 'normal' training data. The detection performance measure is the F_1 score [62]:

F_1 = 2 × precision × recall / (precision + recall),

where precision is the proportion of correct decisions in the output of the detector, and recall is the proportion of correctly identified novelties in the test set. To ensure continuity of F_1, we define F_1(0, 0) ≜ 0, i.e., its limit at (0, 0).
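The F_1 computation, including the convention F_1(0, 0) = 0, can be written as (a trivial sketch; the function name is ours):

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall, with the convention
    F1(0, 0) = 0 (the limit at the origin) for continuity."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)
```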

Novelty detection on simulated data
We consider three simulated scenarios comprising 'normal' and novel point patterns generated from Poisson point processes with 2-D Gaussian intensities, as shown in Fig. 10. All scenarios have the same 'normal' point patterns, with cardinalities between 20 and 60. In scenario (a), novelties are well-separated from 'normal' data in feature, but overlapping in cardinality (see Fig. 10a). In scenario (b), novelties overlap with 'normal' data in feature, but only partially overlap in cardinality (see Fig. 10b). In scenario (c), we remove the high cardinality novelties from (b) (see Fig. 10c).
In the training phase, the same 300 'normal' point patterns for each scenario are used to learn the 'normal' NB/Poisson model, which consists of a Gaussian density/intensity. In the testing phase, 10 tests are run per scenario, with each test set comprising 100 'normal' point patterns and 100 novelties generated according to their respective models. Observe from Fig. 10a that in scenario (a) the NB likelihood, the Poisson probability density, and the Poisson ranking function all perform well. Fig. 10b shows good performance by the ranking function in scenario (b). The moderate performance of the NB likelihood and the probability density is inflated by erroneously ranking high cardinality point patterns lower than they should be, due to the multiplication of many small numbers without proper adjustment. Observe from Fig. 10c that, after removing the high cardinality novelties from the test set, only the ranking function performs well while the others fail. The boxplots for test data in Fig. 11c verify that only the proposed ranking function is consistent, whereas the NB likelihood and the probability density even rank novelties higher than 'normal' data.

Novelty detection on the Texture dataset
For this experiment, data from class "T14 brick1" of the Texture dataset from subsection 6.1.2, are considered 'normal' while novel data are taken from class "T20 upholstery".
A 4-fold cross validation scheme is used for performance evaluation. In each fold, training data comprising 30 'normal' images are used to learn the 'normal' NB/Poisson model with a 3-component Gaussian mixture density/intensity. The test set comprises the remaining 10 'normal' images and 10 novel images. The learned models are similar to those of class "T14 brick1" in Fig. 9. The novelty detection performance in Fig. 12 shows that ranking the data using the NB likelihood or the probability density fails to detect most novelties, whereas the proposed ranking function achieves a high F_1 score. Moreover, the box plots for test data in Fig. 13 confirm that only the proposed ranking function provides a consistent ranking.

Clustering Experiments
This subsection presents two clustering experiments with a known number of clusters using the EM clustering algorithm (outlined in subsection 5.1.1). As clustering performance measures, we use purity, normalized mutual information (NMI), the Rand index, and the F_1 score.

EM clustering on simulated data
This experiment uses the same simulated dataset described in section 6.1.1 but without labels. Since there are three clusters, we use a 3-component Poisson mixture model, where each constituent Poisson point process is parameterized by a Gaussian intensity. The clustering results in Fig. 14 show that the proposed point pattern clustering algorithm performs well on all three scenarios.

EM clustering on the Texture dataset
This experiment uses the Texture dataset described in section 6.1.2, but without labels. Since there are three clusters, we use a 3-component Poisson mixture model, where each constituent Poisson point process is parameterized by a 3-component Gaussian mixture intensity (similar to subsection 6.1.2). The M-step of the proposed EM algorithm is accomplished by applying the standard EM algorithm to find the data-weighted MLE of the Gaussian mixture parameter. The clustering results in Fig. 15 show that the proposed algorithm performs well on real data.

CONCLUSIONS
This article outlined a framework for model-based learning for point pattern data using point process theory. In particular, we demonstrated the use of point process models for various learning tasks. Our main aim is to introduce an available toolset that facilitates research in machine learning for point pattern data. While the utility of the framework was only demonstrated on representative learning tasks such as classification, novelty detection and clustering, the framework is flexible enough to accommodate other learning tasks. For tractability, the proposed algorithms are based on very simple models. Improved performance on real data can be achieved with more sophisticated models, albeit at higher computational costs. More complex datasets, where the i.i.d. assumption is no longer adequate, require sophisticated point process models, such as Gibbs processes, to capture interactions between the elements of the point patterns. Developing efficient techniques for learning such models is an active research area in statistics.

Figure 13. Boxplots of the NB likelihood, probability density, and ranking function for 'normal' and novel data in one fold of the Texture dataset.