Multivariate Poisson interpoint distances
Introduction
We are interested in the sampling distribution of the squared interpoint distances (IDs) from Poisson samples. Suppose $\mathbf{X}_i$ for $i=1,\dots,m$ is a sample of independent and identically distributed (i.i.d.) random vectors in $\mathbb{N}_0^d$ from a Poisson distribution with mean vector $\boldsymbol{\lambda}$, covariance $\Sigma_X$, and distribution function $F$, where $\mathbb{N}_0$ is the set of non-negative integers. The probability distribution of $\mathbf{X}_i$ is denoted by $P(\mathbf{X}_i=\mathbf{x})$, where $\mathbf{x}=(x_1,\dots,x_d)'$ with $x_k\in\mathbb{N}_0$ for $k=1,\dots,d$. Suppose $\mathbf{Y}_j$ for $j=1,\dots,n$ is a sample of i.i.d. random vectors from a Poisson distribution with mean vector $\boldsymbol{\theta}$, covariance $\Sigma_Y$, and distribution function $G$. We assume that the two samples are independent.
IDs are at the heart of several multivariate techniques such as clustering and classification. For example, Ripley (1976, 1977) considers the analysis of point patterns that follow Poisson distributions. Bonetti and Pagano (2005) consider discrete observations and use the asymptotic normality of the empirical DF of the dependent IDs, evaluated at a finite number of values, to detect spatial disease clusters. Tebaldi et al. (2011) implement a Stata package for the two-sample problem using IDs and permutation distributions. Selvin et al. (1993) use IDs to measure spatial clustering. IDs have also been used for testing the equality of distribution functions. More recently, Modarres (2014) investigates the IDs among the observations from a high dimensional multivariate Bernoulli distribution.
Classical multivariate statistical methodology is based on the multivariate normal distribution. Many multivariate techniques become ineffective or lose statistical power in high dimensions. Estimation of multivariate densities or distribution functions, even when a parametric model is assumed, is difficult. Methods based on interpoint distances side-step these problems because interpoint distances are always one dimensional, irrespective of the dimension $d$.
Applications of IDs include testing the hypothesis $H_0: F=G$ against general alternatives when $d$ is large (Liu and Modarres, 2010), testing for the equality of the mean vectors $H_0: \boldsymbol{\lambda}=\boldsymbol{\theta}$, classification of high dimensional Poisson observations, detection of multivariate hotspots, and a mixture distribution approach for detection of outliers. Some of these applications are further described below.
For testing $H_0: F=G$ when $d$ is large, consider independent random vectors $\mathbf{X}_1,\mathbf{X}_2$ i.i.d. from $F$ and $\mathbf{Y}_1,\mathbf{Y}_2$ i.i.d. from $G$. Let $D_{FF}$, $D_{GG}$ and $D_{FG}$ denote the distributions of $\|\mathbf{X}_1-\mathbf{X}_2\|$, $\|\mathbf{Y}_1-\mathbf{Y}_2\|$ and $\|\mathbf{X}_1-\mathbf{Y}_1\|$, respectively, where $\|\cdot\|$ is the Euclidean norm. Under mild conditions, Maa et al. (1996) prove that $D_{FF}$, $D_{GG}$ and $D_{FG}$ are identical if and only if $F=G$, for both continuous and discrete distributions. Let $\bar{D}_X$ represent the average of IDs from the $\mathbf{X}$ sample, $\bar{D}_Y$ the average of IDs from the $\mathbf{Y}$ sample, and $\bar{D}_{XY}$ the average of cross-sample IDs. To test $H_0$, the Biswas and Ghosh (2014) test statistic is $T=(\bar{D}_X-\bar{D}_{XY})^2+(\bar{D}_Y-\bar{D}_{XY})^2$. For small sample sizes, one can use the permutation principle to calculate the critical values and reject the hypothesis for large values of $T$. One can easily extend the test statistic for comparing $g$ groups by aggregating over pairwise comparisons. Study of distances within one sample and across two samples of observations allows one to obtain the distributions of $\bar{D}_X$, $\bar{D}_Y$ and $\bar{D}_{XY}$.
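As an illustrative sketch (not the authors' code), the statistic $T$ and its permutation critical values can be computed as follows; the sample sizes, dimension, mean vectors and number of permutations below are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def within_avg(X):
    # average squared Euclidean ID over all pairs within one sample
    D = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    return D[np.triu_indices(len(X), k=1)].mean()

def cross_avg(X, Y):
    # average squared Euclidean ID over all cross-sample pairs
    return ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1).mean()

def bg_statistic(X, Y):
    # T = (Dbar_X - Dbar_XY)^2 + (Dbar_Y - Dbar_XY)^2
    dxy = cross_avg(X, Y)
    return (within_avg(X) - dxy) ** 2 + (within_avg(Y) - dxy) ** 2

def perm_pvalue(X, Y, n_perm=500):
    # permutation null: reshuffle the pooled sample, recompute the statistic
    t_obs = bg_statistic(X, Y)
    Z, m = np.vstack([X, Y]), len(X)
    hits = 0
    for _ in range(n_perm):
        idx = rng.permutation(len(Z))
        hits += bg_statistic(Z[idx[:m]], Z[idx[m:]]) >= t_obs
    return (hits + 1) / (n_perm + 1)

# two high dimensional Poisson samples with different mean vectors
X = rng.poisson(1.0, size=(15, 20))
Y = rng.poisson(2.0, size=(15, 20))
p_value = perm_pvalue(X, Y)
```

With mean vectors this far apart the cross-sample average is well above the within-$\mathbf{X}$ average, so the permutation $p$-value is small and $H_0$ is rejected.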
Given the $\mathbf{X}$ and $\mathbf{Y}$ samples as training samples, suppose $\mathbf{z}$ is a high dimensional Poisson observation that we would like to allocate to population $\pi_x$ or $\pi_y$. Liao and Akritas (2007) define a test-based method of classification. We can construct a classification rule based on the average IDs. We first obtain $\bar{D}_X$ and $\bar{D}_Y$ from the training samples. Next, we assume $\mathbf{z}$ belongs to $\pi_x$ and compute the test statistic $T_1$ from the joint sample containing the elements of the $\mathbf{X}$ training sample and the new observation $\mathbf{z}$. We now test the hypothesis that $F=G$ using $m+1$ observations from the $\mathbf{X}$ sample and $n$ observations from the $\mathbf{Y}$ sample. Suppose we use the statistic $T_1$ and obtain a $p$-value of $p_1$. We also compute $T_2$, assuming $\mathbf{z}$ belongs to $\pi_y$. We next test the hypothesis that $F=G$ using $m$ observations from the $\mathbf{X}$ sample and $n+1$ observations from the $\mathbf{Y}$ sample. Suppose we use the statistic $T_2$ and obtain a $p$-value of $p_2$. Liao and Akritas (2007) allocate $\mathbf{z}$ to $\pi_x$ if $p_1\geq p_2$; otherwise, it is allocated to $\pi_y$.
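The full test-based rule requires two permutation tests. The sketch below is a deliberately simplified stand-in in the same spirit: it allocates $\mathbf{z}$ to the population whose training sample is closer on average in squared ID. The population means, sample sizes and the fixed observation $\mathbf{z}$ are hypothetical choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

def avg_sq_dist(z, S):
    # average squared Euclidean ID between a new point z and a sample S
    return float(((S - z) ** 2).sum(axis=1).mean())

def classify(z, X, Y):
    # nearest-sample rule on average squared IDs: a simplified stand-in
    # for the test-based allocation, not the Liao-Akritas procedure itself
    return "pi_x" if avg_sq_dist(z, X) <= avg_sq_dist(z, Y) else "pi_y"

d = 20
X = rng.poisson(1.0, size=(12, d))   # training sample from pi_x
Y = rng.poisson(4.0, size=(12, d))   # training sample from pi_y
z = np.full(d, 4)                    # a fixed new observation near the pi_y mean
label = classify(z, X, Y)
```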
A hotspot refers to a cluster of events in space and time with elevated responses, such as a disease outbreak. Two well-known hotspot detection techniques are the spatial scan statistic (Kulldorff, 1997) and the upper level set scan statistic (Patil and Taillie, 2004). The scanning region of the Euclidean space is partitioned into cells, and the Poisson distribution is often used to model the cell counts. Modarres and Patil (2007) suggest a Hotelling $T^2$ statistic for multivariate anomaly detection. However, when the number of variables exceeds the number of observations, the covariance matrix is singular and the statistic cannot be computed. Bonetti and Pagano (2005) use the distribution of IDs for hotspot detection.
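The singularity issue can be seen directly: with $n$ observations the sample covariance matrix has rank at most $n-1$, so for $d>n$ it cannot be inverted. A minimal check, with arbitrary $n$ and $d$:

```python
import numpy as np

rng = np.random.default_rng(2)

n, d = 10, 25                       # fewer observations than variables
X = rng.poisson(2.0, size=(n, d))
S = np.cov(X, rowvar=False)         # d x d sample covariance matrix

# rank(S) <= n - 1 < d, so S is singular and the inverse needed for
# Hotelling's T^2 does not exist
rank = int(np.linalg.matrix_rank(S))
```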
As a final application of IDs, consider the following mixture distribution approach for detection of outliers. In this case, the $\mathbf{X}$ sample represents the main group while the $\mathbf{Y}$ sample represents the outlier group of observations. Suppose $W$ represents a random ID among the $\binom{m+n}{2}$ pairwise distances. The distribution of $W$ is a mixture of the form $f_W(w)=p_1 f_X(w)+p_2 f_Y(w)+p_3 f_{XY}(w)$, where $f_X$, $f_Y$ and $f_{XY}$ are probability mass functions of within- and between-group IDs defined in the next sections. The mixture parameters can be estimated with $p_1=\binom{m}{2}/\binom{m+n}{2}$, $p_2=\binom{n}{2}/\binom{m+n}{2}$ and $p_3=mn/\binom{m+n}{2}$.
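Under uniform random pairing, the mixture weights are simply the proportions of within- and between-group pairs. A small sketch, with hypothetical group sizes:

```python
from math import comb

def mixture_weights(m, n):
    # probability that a uniformly chosen pair of observations from the
    # pooled sample is within-X, within-Y, or cross-sample
    total = comb(m + n, 2)
    return comb(m, 2) / total, comb(n, 2) / total, m * n / total

p1, p2, p3 = mixture_weights(20, 5)   # large main group, small outlier group
```

The three weights necessarily sum to one, since every pair is of exactly one type.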
The Poisson distribution is the limiting form of the binomial distribution as the number of trials $n$ tends to infinity, the probability of success $p$ tends to zero, and $np$ tends to the Poisson mean $\lambda$. The probability mass function (pmf) of a Poisson random variable $X$ with mean $\lambda$ is given by $P(X=x)=e^{-\lambda}\lambda^x/x!$ for $x\in\mathbb{N}_0$. The central moments can be obtained from the raw moments $E[X^r]=\sum_{k=0}^{r}S(r,k)\lambda^k$, where $S(r,k)$ are the Stirling numbers of the second kind (Abramowitz and Stegun, 1974). The distribution function of $X$ is $F(x)=\sum_{k=0}^{\lfloor x\rfloor}e^{-\lambda}\lambda^k/k!$, its probability generating function is $E[t^X]=e^{\lambda(t-1)}$, and its factorial moments are given by $E[X^{(r)}]=\lambda^r$, where $X^{(r)}=X(X-1)\cdots(X-r+1)$. For an excellent treatment of the Poisson distribution, see Johnson et al. (1992).
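These moment identities are easy to verify numerically. The sketch below checks the factorial moments $E[X^{(r)}]=\lambda^r$ and the probability generating function by direct summation against the pmf; the mean $\lambda=2.5$ and the truncation point are arbitrary choices:

```python
import math

lam = 2.5

def pmf(x, lam):
    # Poisson probability mass function P(X = x)
    return math.exp(-lam) * lam ** x / math.factorial(x)

def factorial_moment(r, lam, cutoff=150):
    # E[X(X-1)...(X-r+1)] by direct summation; should equal lam**r
    total = 0.0
    for x in range(r, cutoff):
        falling = 1.0
        for j in range(r):
            falling *= x - j
        total += falling * pmf(x, lam)
    return total

# probability generating function E[t^X] = exp(lam*(t - 1)), here at t = 0.5
pgf_direct = sum(pmf(x, lam) * 0.5 ** x for x in range(150))
pgf_closed = math.exp(lam * (0.5 - 1.0))
```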
Based on the trivariate reduction method, Dwass and Teicher (1957) propose the following multivariate Poisson distribution that allows for a constant correlation. Suppose $X_0,X_1,\dots,X_d$ are independent Poisson random variables with means $\lambda_0,\lambda_1,\dots,\lambda_d$, respectively. Let $W_k=X_k+X_0$ for $k=1,\dots,d$. It is not difficult to show that $\mathbf{W}=(W_1,\dots,W_d)'$ has a multivariate Poisson distribution with mean vector $(\lambda_0+\lambda_1,\dots,\lambda_0+\lambda_d)'$ and covariance matrix $\Sigma=(\sigma_{kl})$, where $\sigma_{kk}=\lambda_0+\lambda_k$ and $\sigma_{kl}=\lambda_0$ for $k\neq l$. The pmf of $\mathbf{W}$ is given by $P(\mathbf{W}=\mathbf{w})=e^{-(\lambda_0+\sum_{k=1}^{d}\lambda_k)}\sum_{i=0}^{s}\frac{\lambda_0^{i}}{i!}\prod_{k=1}^{d}\frac{\lambda_k^{w_k-i}}{(w_k-i)!}$, where $s=\min(w_1,\dots,w_d)$. When $\lambda_0=0$, the components of $\mathbf{W}$ are independent.
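The trivariate-reduction construction doubles as a simulation recipe: draw the common shock $X_0$ once and add it to each independent component. A sketch with arbitrary rates, checking the implied mean and covariance empirically:

```python
import numpy as np

rng = np.random.default_rng(3)

def rmvpois(size, lam0, lams):
    # common-shock construction W_k = X_k + X_0, giving
    # Var(W_k) = lam0 + lams[k] and Cov(W_k, W_l) = lam0 for k != l
    X0 = rng.poisson(lam0, size=(size, 1))
    Xk = rng.poisson(lams, size=(size, len(lams)))
    return Xk + X0

lams = np.array([0.5, 1.5, 2.0])
W = rmvpois(200_000, lam0=1.0, lams=lams)
emp_mean = W.mean(axis=0)           # should approach lam0 + lams
emp_cov = np.cov(W, rowvar=False)   # off-diagonals should approach lam0
```

Note that the shared shock forces all pairwise covariances to equal $\lambda_0\geq 0$, which is why this model allows only non-negative, constant correlation.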
Furthermore, we suppose $Y_0,Y_1,\dots,Y_d$ are independent Poisson random variables with means $\theta_0,\theta_1,\dots,\theta_d$, respectively. Let $V_k=Y_k+Y_0$ for $k=1,\dots,d$. It follows that $\mathbf{V}=(V_1,\dots,V_d)'$ has a multivariate Poisson distribution with mean vector $(\theta_0+\theta_1,\dots,\theta_0+\theta_d)'$ and covariance matrix $\Omega=(\omega_{kl})$, where $\omega_{kk}=\theta_0+\theta_k$ and $\omega_{kl}=\theta_0$ for $k\neq l$. When $\theta_0=0$, the components of $\mathbf{V}$ are independent. For a comprehensive treatment of the bivariate Poisson distribution and its multivariate extensions, see Kocherlakota and Kocherlakota (1992) and Johnson et al. (1997).
The constant correlation model allows only for a positive correlation. Extensions based on mixtures that allow for a flexible correlation structure and overdispersed marginal distributions are discussed in Chib and Winkelmann (2001) and Karlis and Xekalaki (2005), while Nikoloulopoulos and Karlis (2009) consider copula-based models for multivariate count data. A simple method of generating a multivariate Poisson vector uses the multivariate normal copula (Yahav and Shmueli, 2012). Karlis and Meligkotsidou (2007) use univariate Poisson random variables to define a $d$-dimensional Poisson random vector and propose the following general method to create multivariate Poisson distributions with a more flexible correlation structure.
Let $A$ be a $d\times q$ binary matrix with no duplicate columns, where $q\geq d$. Let $\mathbf{X}=A\mathbf{Z}$, where $\mathbf{Z}=(Z_1,\dots,Z_q)'$ and the $Z_i$ are independent $\mathrm{Poisson}(\beta_i)$ random variables, for $i=1,\dots,q$. The random vector $\mathbf{X}$ has a $d$-dimensional Poisson distribution with parameters $\boldsymbol{\beta}=(\beta_1,\dots,\beta_q)'$. The mean of $\mathbf{X}$ is $A\boldsymbol{\beta}$ and its covariance is $A\Sigma_Z A'$, where $\Sigma_Z=\mathrm{diag}(\beta_1,\dots,\beta_q)$. The vector $\mathbf{X}$ has univariate Poisson marginal distributions. Now, let $A=[I_d,\mathbf{1}_d]$. If $I_d$ is the identity matrix of order $d$ and $\mathbf{1}_d$ is a column vector of 1s, then one obtains the constant correlation matrix. Karlis and Meligkotsidou (2007) assume $A$ is a $d\times\binom{d}{2}$ binary matrix. Each column of $A$ has exactly 2 ones and $d-2$ zeros, and there are no duplicate columns. This multivariate Poisson model allows for different pairwise correlations.
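A sketch of this construction is below. As an assumption for illustration, the matrix $A$ combines the $d$ identity columns with all $\binom{d}{2}$ two-ones columns, and the rates $\boldsymbol{\beta}$ are arbitrary; the empirical covariance of $\mathbf{X}=A\mathbf{Z}$ is compared with $A\,\mathrm{diag}(\boldsymbol{\beta})\,A'$:

```python
import numpy as np
from itertools import combinations

def pair_matrix(d):
    # d x (d + d(d-1)/2) binary matrix: identity columns plus one column
    # per coordinate pair (exactly two ones, no duplicate columns)
    cols = [col for col in np.eye(d).T]
    for i, j in combinations(range(d), 2):
        c = np.zeros(d)
        c[i] = c[j] = 1.0
        cols.append(c)
    return np.column_stack(cols)

rng = np.random.default_rng(4)
d = 3
A = pair_matrix(d)                               # 3 x 6
beta = np.array([1.0, 1.0, 1.0, 0.5, 0.2, 0.8])  # one rate per column of A
Z = rng.poisson(beta, size=(200_000, len(beta)))
X = Z @ A.T                                      # X = A Z, Poisson marginals
theo_cov = A @ np.diag(beta) @ A.T               # covariance A diag(beta) A'
emp_cov = np.cov(X, rowvar=False)
```

Each off-diagonal entry of the covariance is the rate of the corresponding two-ones column, so the three pairwise correlations can all differ.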
We obtain the distribution of the squared Euclidean IDs from a Poisson distribution within one sample in Section 2. Section 3 finds the distribution of the IDs across two independent groups. Section 4 examines the average IDs and obtains their expected values, variances and covariances. A simulation study compares the performance of the Hotelling $T^2$ statistic with distance-based methods under two types of hypotheses in Section 5.
One sample IDs
In this section, we derive the distribution of the squared IDs among a group of observations from a Poisson distribution. The squared ID between $\mathbf{X}_i$ and $\mathbf{X}_j$ is defined as $D_{ij}=\|\mathbf{X}_i-\mathbf{X}_j\|^2=\sum_{k=1}^{d}(X_{ik}-X_{jk})^2$, where $\mathbf{X}_i=(X_{i1},\dots,X_{id})'$ for $i=1,\dots,m$.
One can show that the squared IDs satisfy (a) $D_{ij}=0$ if and only if $\mathbf{X}_i=\mathbf{X}_j$ and (b) $D_{ij}=D_{ji}$ for all $i$ and $j$. However, the triangle inequality $D_{ij}\leq D_{il}+D_{lj}$ for any $i$, $j$ and $l$ does not hold in general.
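The failure of the triangle inequality for squared distances is easiest to see with three collinear points; a minimal check:

```python
import numpy as np

def sq_id(x, y):
    # squared Euclidean interpoint distance between two count vectors
    x, y = np.asarray(x), np.asarray(y)
    return int(((x - y) ** 2).sum())

a, b, c = [0, 0, 0], [1, 1, 1], [2, 2, 2]

sym = sq_id(a, b) == sq_id(b, a)     # property (b): symmetry
zero_iff = sq_id(a, a) == 0          # property (a): zero for identical points
# the triangle inequality fails for squared distances on collinear points:
lhs, rhs = sq_id(a, c), sq_id(a, b) + sq_id(b, c)
```

Here $D(a,c)=12$ exceeds $D(a,b)+D(b,c)=6$, so squared IDs are not a metric.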
Two-sample IDs
The squared IDs between $\mathbf{X}_i$ and $\mathbf{Y}_j$ for $i=1,\dots,m$ and $j=1,\dots,n$ are defined as $E_{ij}=\|\mathbf{X}_i-\mathbf{Y}_j\|^2=\sum_{k=1}^{d}(X_{ik}-Y_{jk})^2$. Theorem 4. Suppose $\mathbf{X}_i$ for $i=1,\dots,m$ is a sample of independent and identically distributed random vectors from $F$ and $\mathbf{Y}_j$ for $j=1,\dots,n$ an independent random sample from $G$. The distribution of the squared IDs $E_{ij}$ between $\mathbf{X}_i$ and $\mathbf{Y}_j$ for $i=1,\dots,m$ and $j=1,\dots,n$ is given by
Average interpoint distances
We obtain the pairwise IDs $D_{ij}$ within the $\mathbf{X}$ sample, the IDs $D'_{ij}$ within the $\mathbf{Y}$ sample and the IDs $E_{ij}$ between the two samples. We denote the averages of the squared IDs, respectively, by $\bar{D}_X=\binom{m}{2}^{-1}\sum_{i<j}D_{ij}$, $\bar{D}_Y=\binom{n}{2}^{-1}\sum_{i<j}D'_{ij}$ and $\bar{D}_{XY}=(mn)^{-1}\sum_{i=1}^{m}\sum_{j=1}^{n}E_{ij}$.
Using the binomial expansion and the central moments of the Poisson distribution, one can show that the mixed moments of the average squared IDs are available in closed form.
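For independent Poisson coordinates ($\lambda_0=\theta_0=0$ in the constant-correlation model), $E[\bar{D}_X]=2\sum_k\lambda_k$ and $E[\bar{D}_{XY}]=\sum_k\big(\lambda_k+\theta_k+(\lambda_k-\theta_k)^2\big)$, since $E(X-Y)^2=\mathrm{Var}(X)+\mathrm{Var}(Y)+(EX-EY)^2$. The simulation below checks these expressions; the mean vectors and sample sizes are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(5)

lam = np.array([1.0, 2.0, 3.0])
theta = np.array([2.0, 2.0, 5.0])
m = n = 1000
X = rng.poisson(lam, size=(m, 3))
Y = rng.poisson(theta, size=(n, 3))

def within_avg(S):
    # average squared ID over all within-sample pairs; by the identity
    # sum_{i<j}(x_i - x_j)^2 = m * sum_i (x_i - xbar)^2, it equals
    # 2 * (sum of per-coordinate sample variances)
    return 2.0 * S.var(axis=0, ddof=1).sum()

cross_avg = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1).mean()

# theoretical values under independent Poisson coordinates
exp_within_x = 2.0 * lam.sum()                         # 2 * sum(lam) = 12
exp_within_y = 2.0 * theta.sum()                       # 2 * sum(theta) = 18
exp_cross = (lam + theta + (lam - theta) ** 2).sum()   # = 20
```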
Simulation
We examine the hypotheses $H_0: F=G$ and $H_0: \boldsymbol{\lambda}=\boldsymbol{\theta}$, where $F$ and $G$ are the distribution functions of the two samples, using the following statistics,
- •
Biswas and Ghosh (BG),
- •
total squared ID deviations (TD),
- •
the Hotelling $T^2$ test, and
- •
Euclidean distance between the centers (ED).
The test statistic of Biswas and Ghosh (2014) rejects both hypotheses for large values of $T$. We use the permutation principle to calculate the critical values. One can also base a test statistic on the total squared deviations of the within- and across-group average IDs from the
Acknowledgments
I would like to thank two anonymous referees whose helpful comments and suggestions improved the presentation of the article.
References
- Biswas and Ghosh (2014). A nonparametric two-sample test applicable to high dimensional data. J. Multivariate Anal.
- Liao and Akritas (2007). Test-based classification: A linkage between classification and statistical testing. Statist. Probab. Lett.
- Modarres (2014). On the interpoint distances of Bernoulli vectors. Statist. Probab. Lett.
- Modarres and Patil (2007). Hotspot detection with bivariate data. J. Statist. Plann. Inference
- Nikoloulopoulos and Karlis (2009). Finite normal mixture copulas for multivariate discrete data modeling. J. Statist. Plann. Inference
- Selvin et al. (1993). Interpoint squared distance as a measure of spatial clustering. Soc. Sci. Med.
- Abramowitz and Stegun (1974). Handbook of Mathematical Functions.
- Bonetti and Pagano (2005). The interpoint distance distribution as a descriptor of point patterns, with an application to spatial disease clustering. Stat. Med.
- Chib and Winkelmann (2001). Markov chain Monte Carlo analysis of correlated count data. J. Bus. Econom. Statist.
- Dwass and Teicher (1957). On infinitely divisible random vectors. Ann. Math. Statist.
- The frequency distribution of the difference between two independent variates following the same Poisson distribution. J. Roy. Statist. Soc. Ser. A
- Johnson et al. (1997). Discrete Multivariate Distributions.
- Johnson et al. (1992). Univariate Discrete Distributions.