Multivariate Poisson interpoint distances

https://doi.org/10.1016/j.spl.2016.01.025

Abstract

We study the properties of the squared interpoint distances (IDs) in samples taken from multivariate Poisson distributions. We obtain the distribution of the IDs within one sample and across two independent samples. We derive the means and covariances of the average IDs.

Introduction

We are interested in the sampling distribution of the squared IDs from Poisson samples. Suppose $X=\{X_i\}$, $i=1,\dots,n_x$, is a sample of independent and identically distributed (i.i.d.) random vectors in $I^p$ from a Poisson distribution with mean vector $\lambda_x=(\lambda_{(x)1},\dots,\lambda_{(x)p})$, covariance $\Sigma_x$, and distribution function $F$, where $I=\{0,1,2,\dots\}$ is the set of non-negative integers. The probability distribution of $X$ is denoted by $PO(\lambda_x)$, where $\lambda_{(x)t}\ge 0$ for $t=1,\dots,p$. Suppose $Y=\{Y_j\}$, $j=1,\dots,n_y$, is a sample of i.i.d. random vectors from a Poisson distribution with mean vector $\lambda_y=(\lambda_{(y)1},\dots,\lambda_{(y)p})$, covariance $\Sigma_y$, and distribution function $G$. We assume that the two samples are independent.

IDs are at the heart of several multivariate techniques such as clustering and classification. For example, Ripley (1976, 1977) considers the analysis of point patterns that follow Poisson distributions. Bonetti and Pagano (2005) consider discrete observations and use the asymptotic normality of the empirical distribution function of the dependent IDs, evaluated at a finite number of values, to detect spatial disease clusters. Tebaldi et al. (2011) implement a Stata package for the two-sample problem using IDs and permutation distributions. Selvin et al. (1993) use IDs to measure spatial clustering. IDs have also been used for testing the equality of distribution functions. More recently, Modarres (2014) investigates the IDs among observations from a high dimensional multivariate Bernoulli distribution.

Classical multivariate statistical methodology is based on the multivariate normal distribution. Many multivariate techniques become ineffective or lose statistical power in high dimensions. Estimation of multivariate densities or distribution functions, even when a parametric model is assumed, is difficult. Methods based on interpoint distances side-step these problems because interpoint distances are always one dimensional irrespective of the value of p.

Applications of IDs include testing the hypothesis $H_0:F=G$ against general alternatives when $p>\max(n_x,n_y)$ (Liu and Modarres, 2010), testing for the equality of the mean vectors $K_0:\lambda_x=\lambda_y$, classification of high dimensional Poisson observations, detection of multivariate hotspots, and a mixture distribution approach for the detection of outliers. Some of these applications are further described below.

For testing $H_0:F=G$ when $p>\max(n_x,n_y)$, consider independent random vectors $X_1,X_2$ i.i.d. from $F$ and $Y_1,Y_2$ i.i.d. from $G$. Let $D_{FF}$, $D_{GG}$ and $D_{FG}$ denote the distributions of $\|X_1-X_2\|$, $\|Y_1-Y_2\|$ and $\|X_1-Y_2\|$, respectively, where $\|X\|=(X'X)^{1/2}$ is the Euclidean norm of $X$. Under mild conditions, Maa et al. (1996) prove that $D_{FF}$, $D_{GG}$ and $D_{FG}$ are identical if and only if $F=G$, for both continuous and discrete distributions. Let $\bar d_{(x)}^2$ represent the average of the $\binom{n_x}{2}$ IDs from the $X$ sample, $\bar d_{(y)}^2$ the average of the $\binom{n_y}{2}$ IDs from the $Y$ sample, and $\bar d_{(xy)}^2$ the average of the $n_x n_y$ cross-sample IDs. To test $H_0:F=G$, Biswas and Ghosh (2014) propose the statistic $BG(x,y)=(\bar d_{(xy)}-\bar d_{(x)})^2+(\bar d_{(xy)}-\bar d_{(y)})^2$. For small sample sizes, one can use the permutation principle to calculate the critical values and reject the hypothesis for large values of $BG$. One can easily extend the test statistic to compare $g$ groups by aggregating $BG$ over the $g(g-1)/2$ pairwise comparisons. Studying the distances within one sample and across two samples allows one to obtain the distributions of $\bar d_{(x)}$, $\bar d_{(y)}$, and $\bar d_{(xy)}$.
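As an illustrative sketch (not the authors' code), the statistic and its permutation p-value can be computed in a few lines of Python; the function names are hypothetical.

```python
import numpy as np

def mean_within(X):
    """Average Euclidean ID over the C(n, 2) within-sample pairs."""
    n = X.shape[0]
    return np.mean([np.linalg.norm(X[i] - X[j])
                    for i in range(n) for j in range(i + 1, n)])

def mean_between(X, Y):
    """Average Euclidean ID over the n_x * n_y cross-sample pairs."""
    return np.mean([np.linalg.norm(x - y) for x in X for y in Y])

def bg_statistic(X, Y):
    """BG(x, y) = (dbar_xy - dbar_x)^2 + (dbar_xy - dbar_y)^2."""
    dxy = mean_between(X, Y)
    return (dxy - mean_within(X)) ** 2 + (dxy - mean_within(Y)) ** 2

def bg_permutation_pvalue(X, Y, n_perm=500, rng=None):
    """Permutation p-value; large BG values are evidence against F = G."""
    rng = np.random.default_rng(rng)
    pooled, nx = np.vstack([X, Y]), X.shape[0]
    observed = bg_statistic(X, Y)
    hits = 0
    for _ in range(n_perm):
        idx = rng.permutation(pooled.shape[0])
        hits += bg_statistic(pooled[idx[:nx]], pooled[idx[nx:]]) >= observed
    return (hits + 1) / (n_perm + 1)
```

The permutation step reshuffles the pooled sample, which is valid under $H_0:F=G$ because the pooled observations are then exchangeable.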

Given $\{X_i\}$ and $\{Y_j\}$ as training samples, suppose $Z$ is a high dimensional Poisson observation that we would like to allocate to population $\pi_x$ or $\pi_y$. Liao and Akritas (2007) define a test-based method of classification. We can construct a classification rule based on the average IDs. We first obtain $\bar d_{(x)}$ and $\bar d_{(y)}$ from the training samples. Next, we assume $Z$ belongs to $\pi_x$ and compute $\bar d_{(xz)}$, where $(xz)$ is the joint sample containing the elements of the training sample $X$ and the new observation $Z$. We then test the hypothesis $H_0:F=G$ using the $n_x+1$ observations of the augmented $X$ sample and the $n_y$ observations of the $Y$ sample. Suppose we use the $BG$ statistic and obtain a p-value of $\gamma_1$. We also compute $\bar d_{(yz)}$, assuming $Z$ belongs to $\pi_y$, and test $H_0:F=G$ using the $n_x$ observations of the $X$ sample and the $n_y+1$ observations of the augmented $Y$ sample. Suppose we use the $BG$ statistic and obtain a p-value of $\gamma_2$. Liao and Akritas (2007) allocate $Z$ to $\pi_x$ if $\gamma_1<\gamma_2$; otherwise, $Z$ is allocated to $\pi_y$.
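A minimal self-contained sketch of this allocation rule, assuming a BG-type permutation test as the underlying two-sample test; all names are illustrative and the implementation is not taken from Liao and Akritas (2007).

```python
import numpy as np

def avg_dist(A, B=None):
    """Average Euclidean ID within A (B is None) or between A and B."""
    if B is None:
        n = len(A)
        return np.mean([np.linalg.norm(A[i] - A[j])
                        for i in range(n) for j in range(i + 1, n)])
    return np.mean([np.linalg.norm(a - b) for a in A for b in B])

def bg(X, Y):
    dxy = avg_dist(X, Y)
    return (dxy - avg_dist(X)) ** 2 + (dxy - avg_dist(Y)) ** 2

def bg_pvalue(X, Y, n_perm=200, seed=0):
    """Permutation p-value of the BG statistic for H0: F = G."""
    rng = np.random.default_rng(seed)
    pooled, nx, obs = np.vstack([X, Y]), len(X), bg(X, Y)
    hits = 0
    for _ in range(n_perm):
        idx = rng.permutation(len(pooled))
        hits += bg(pooled[idx[:nx]], pooled[idx[nx:]]) >= obs
    return (hits + 1) / (n_perm + 1)

def allocate(X, Y, z):
    """Allocate the new observation z (shape (1, p)): pi_x if gamma_1 < gamma_2."""
    g1 = bg_pvalue(np.vstack([X, z]), Y)   # z provisionally placed in the X sample
    g2 = bg_pvalue(X, np.vstack([Y, z]))   # z provisionally placed in the Y sample
    return 'x' if g1 < g2 else 'y'
```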

A hotspot refers to a cluster of events in space and time with elevated responses, such as a disease outbreak. Two well-known hotspot detection techniques are the spatial scan statistic (Kulldorff, 1997) and the upper level set scan statistic (Patil and Taillie, 2004). The scanning region of the Euclidean space is partitioned into cells, and the Poisson distribution is often used to model the cell counts. Modarres and Patil (2007) suggest a Hotelling $T^2$ statistic for multivariate anomaly detection. However, when the number of variables exceeds the number of observations, the covariance matrix is singular and the $T^2$ statistic cannot be computed. Bonetti and Pagano (2005) use the distribution of IDs for hotspot detection.

As a final application of IDs, consider the following mixture distribution approach for the detection of outliers. In this case, the $X$ sample represents the main group while the $Y$ sample represents the group of outlying observations. Suppose $d^2$ represents a random ID among the $s=\binom{n_x+n_y}{2}$ pairwise distances. The distribution of $d^2$ is a mixture of the form $P(d^2=k)=\sum_{i=1}^3 \alpha_i q_i(k)$, where $\sum_{i=1}^3\alpha_i=1$ and $q_1$, $q_2$ and $q_3$ are the probability mass functions of the within- and between-group IDs defined in the next sections. The mixture parameters can be estimated by $\hat\alpha_1=\binom{n_x}{2}/s$, $\hat\alpha_2=\binom{n_y}{2}/s$, and $\hat\alpha_3=n_x n_y/s$.
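The mixture weights follow directly from the pair counts; a minimal sketch (hypothetical function name):

```python
from math import comb

def mixture_weights(nx, ny):
    """Estimated mixture weights (alpha_1, alpha_2, alpha_3) for the ID mixture."""
    s = comb(nx + ny, 2)                              # total number of pairwise IDs
    return comb(nx, 2) / s, comb(ny, 2) / s, nx * ny / s
```

By construction the three weights sum to one, since every pair in the pooled sample is within-$X$, within-$Y$, or cross-sample.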

The Poisson distribution is the limiting form of the binomial distribution as the number of trials $n$ tends to infinity, the probability of success $\theta$ tends to zero, and $n\theta$ tends to the Poisson mean $\lambda$. The probability mass function (pmf) of a Poisson random variable $Z$ with mean $\lambda>0$ is given by $P(Z=z)=\exp(-\lambda)\lambda^z/z!$. The moments about zero are obtained from $E(Z^k)=\sum_{t=1}^k \lambda^t S(k,t)$, where $S(k,t)$ are the Stirling numbers of the second kind (Abramowitz and Stegun, 1974). The distribution function of $Z$ is $F(z)=\sum_{i=0}^{\lfloor z\rfloor}\exp(-\lambda)\lambda^i/i!$, its probability generating function is $h(t)=\exp\{-\lambda(1-t)\}$, and its factorial moments are given by $E(Z_{(k)})=\lambda^k$, where $Z_{(k)}=Z(Z-1)\cdots(Z-k+1)$. For an excellent treatment of the Poisson distribution, see Johnson et al. (1992).
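The moment identity is easy to check numerically; the sketch below (illustrative names) reproduces the standard facts $E(Z)=\lambda$, $E(Z^2)=\lambda+\lambda^2$ and $E(Z^3)=\lambda+3\lambda^2+\lambda^3$.

```python
from math import comb, factorial

def stirling2(k, t):
    """Stirling number of the second kind S(k, t), by inclusion-exclusion."""
    return sum((-1) ** (t - j) * comb(t, j) * j ** k
               for j in range(t + 1)) // factorial(t)

def poisson_moment(lam, k):
    """k-th moment about zero of Z ~ PO(lam): E(Z^k) = sum_t lam^t S(k, t)."""
    return sum(lam ** t * stirling2(k, t) for t in range(1, k + 1))
```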

Based on the trivariate reduction method, Dwass and Teicher (1957) propose the following multivariate Poisson distribution that allows for a constant correlation. Suppose $Z,Z_1,\dots,Z_p$ are independent Poisson random variables with means $\lambda$ and $\lambda_1,\dots,\lambda_p$, respectively. Let $X_i=Z_i+Z$ for $i=1,\dots,p$. It is not difficult to show that $X=(X_1,\dots,X_p)$ has a multivariate Poisson distribution with mean vector $\lambda_x=(\lambda_{(x)1},\dots,\lambda_{(x)p})$ and covariance matrix $\Sigma_x$, where $E(X_r)=Var(X_r)=\lambda_{(x)r}=\lambda+\lambda_r$ and $Cov(X_r,X_s)=\lambda\ge 0$ for $r\ne s$. The pmf of $X$ is given by $$P(X_1=x_1,\dots,X_p=x_p)=\exp\Big\{-\Big(\lambda+\sum_{t=1}^p\lambda_t\Big)\Big\}\prod_{t=1}^p\frac{\lambda_t^{x_t}}{x_t!}\sum_{i=0}^{m}\prod_{j=1}^{p}\binom{x_j}{i}\,(i!)^{p-1}\Big(\frac{\lambda}{\prod_{k=1}^p\lambda_k}\Big)^i,$$ where $m=\min(x_1,\dots,x_p)$. When $\lambda=0$, the components of $X$ are independent.
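The construction is straightforward to simulate; the sketch below (hypothetical function name) draws vectors via $X_t=Z_t+Z$, so that the empirical moments can be compared with $E(X_r)=\lambda+\lambda_r$ and $Cov(X_r,X_s)=\lambda$.

```python
import numpy as np

def rmvpoisson(n, lam_common, lam_margins, rng=None):
    """Draw n vectors X with X_t = Z_t + Z, Z ~ PO(lam_common), Z_t ~ PO(lam_margins[t])."""
    rng = np.random.default_rng(rng)
    Z = rng.poisson(lam_common, size=(n, 1))                    # shared component
    Zt = rng.poisson(lam_margins, size=(n, len(lam_margins)))   # idiosyncratic parts
    return Zt + Z

X = rmvpoisson(200_000, 1.5, [2.0, 3.0], rng=0)
# Theory: E(X_1) = 1.5 + 2.0, E(X_2) = 1.5 + 3.0, Cov(X_1, X_2) = 1.5
```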

Similarly, suppose $Z',Z_1',\dots,Z_p'$ are independent Poisson random variables with means $\lambda'$ and $\lambda_1',\dots,\lambda_p'$, respectively. Let $Y_i=Z_i'+Z'$ for $i=1,\dots,p$. It follows that $Y=(Y_1,\dots,Y_p)$ has a multivariate Poisson distribution with mean vector $\lambda_y=(\lambda_{(y)1},\dots,\lambda_{(y)p})$ and covariance matrix $\Sigma_y$, where $E(Y_r)=Var(Y_r)=\lambda_{(y)r}=\lambda'+\lambda_r'$ and $Cov(Y_r,Y_s)=\lambda'\ge 0$ for $r\ne s$. When $\lambda'=0$, the components of $Y$ are independent. For a comprehensive treatment of the bivariate Poisson distribution and its multivariate extensions, see Kocherlakota and Kocherlakota (1992) and Johnson et al. (1997).

The constant correlation model allows only for a positive correlation. Extensions based on mixtures that allow for a flexible correlation structure and overdispersed marginal distributions are discussed in Chib and Winkelmann (2001) and Karlis and Xekalaki (2005), while Nikoloulopoulos and Karlis (2009) consider copula-based models for multivariate count data. A simple method of generating a multivariate Poisson vector uses the multivariate normal copula (Yakov and Shmueli, 2012). Karlis and Meligkotsidou (2007) use $q\ge p$ univariate Poisson random variables to define a $p$-dimensional Poisson random vector and propose the following general method to create multivariate Poisson distributions with a more flexible correlation structure.

Let $A$ be a $p\times q$ binary matrix with no duplicate columns, where $q\ge p$. Let $X=AZ$, where $Z=(Z_1,Z_2,\dots,Z_q)'$ and the $Z_t$ are independent $PO(\lambda_t)$ random variables for $t=1,\dots,q$. The random vector $X$ has a $p$-dimensional Poisson distribution with parameters $\lambda=(\lambda_1,\dots,\lambda_q)$. The mean of $X$ is $A\lambda$ and its covariance is $A\Sigma A'$, where $\Sigma=\mathrm{Diag}(\lambda_1,\dots,\lambda_q)$. The components of $X$ have univariate Poisson marginal distributions. Now let $A=[A_1,A_2]$. If $A_1$ is $I_p$, the identity matrix of order $p$, and $A_2=\mathbf{1}_p$ is a column vector of ones, one obtains the constant correlation model. Karlis and Meligkotsidou (2007) take $A_2$ to be a $p\times p(p-1)/2$ binary matrix in which each column has exactly two ones and $p-2$ zeros, with no duplicate columns. This multivariate Poisson model allows for different pairwise correlations.
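A small sketch of this construction for $p=3$ (illustrative names): the columns of $A_2$ enumerate the $p(p-1)/2$ pairs, and the moments follow from $E(X)=A\lambda$ and $Cov(X)=A\Sigma A'$.

```python
import numpy as np
from itertools import combinations

def build_A(p):
    """A = [I_p, A2]: each column of A2 has exactly two ones, no duplicates."""
    pairs = list(combinations(range(p), 2))       # the p(p-1)/2 index pairs
    A2 = np.zeros((p, len(pairs)), dtype=int)
    for c, (r, s) in enumerate(pairs):
        A2[r, c] = A2[s, c] = 1
    return np.hstack([np.eye(p, dtype=int), A2])

p = 3
A = build_A(p)                                    # shape p x q with q = p + p(p-1)/2
lam = np.array([1.0, 2.0, 0.5, 0.3, 0.4, 0.2])    # one rate per latent Z_t
mean_x = A @ lam                                  # E(X) = A lambda
cov_x = A @ np.diag(lam) @ A.T                    # Cov(X) = A Sigma A'
```

Here the off-diagonal entry $Cov(X_r,X_s)$ equals the rate of the single latent variable shared by components $r$ and $s$, which is what gives the flexible pairwise correlation structure.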

We obtain the distribution of the squared Euclidean IDs from a Poisson distribution within one sample in Section 2. Section 3 finds the distribution of the IDs across two independent groups. Section 4 examines the average IDs and obtains their expected values, variances and covariances. A simulation study compares the performance of the Hotelling $T^2$ statistic with distance methods under two types of hypotheses in Section 5.

Section snippets

One sample IDs

In this section, we derive the distribution of the squared IDs among a group of observations from a Poisson distribution. The squared ID between $X_i$ and $X_j$ is defined as $d_{(x)ij}^2=(X_i-X_j)'(X_i-X_j)=\sum_{t=1}^p(X_{it}-X_{jt})^2=\sum_{t=1}^p T_{(x)ij}^2(t)$, where $T_{(x)ij}(t)=X_{it}-X_{jt}$ for $1\le i<j\le n_x$.
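Computed directly from the definition, as a brief sketch (the function name is illustrative):

```python
import numpy as np

def squared_ids(X):
    """Return the C(n, 2) squared Euclidean interpoint distances of a sample."""
    n = X.shape[0]
    return np.array([np.sum((X[i] - X[j]) ** 2)
                     for i in range(n) for j in range(i + 1, n)])

X = np.array([[0, 1], [2, 1], [1, 3]])   # three points in I^2
d2 = squared_ids(X)                      # pairs (1,2), (1,3), (2,3)
```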

One can show that the squared IDs $d_{(x)ij}^2$ satisfy (a) $d_{(x)ij}^2=0$ if and only if $X_{it}=X_{jt}$ for $t=1,\dots,p$, and (b) $d_{(x)ij}^2=d_{(x)ji}^2$. However, the triangle inequality $d_{(x)ij}^2\le d_{(x)ik}^2+d_{(x)kj}^2$ for $k\ne i$ and $k\ne j$ does not hold in general. The IDs

Two-sample IDs

The squared IDs between $X_i$ and $Y_j$ for $i=1,\dots,n_x$ and $j=1,\dots,n_y$ are defined as $d_{(xy)ij}^2=(X_i-Y_j)'(X_i-Y_j)=\sum_{t=1}^p(X_{it}-Y_{jt})^2=\sum_{t=1}^p T_{(xy)ij}^2(t)$, where $T_{(xy)ij}(t)=X_{it}-Y_{jt}$.

Theorem 4

Suppose $X=\{X_i\}$, $i=1,\dots,n_x$, is a sample of independent and identically distributed random vectors from $PO(\lambda_x)$ and $Y=\{Y_j\}$, $j=1,\dots,n_y$, is an independent random sample from $PO(\lambda_y)$. Let $k=k_1^2+\dots+k_p^2$. The distribution of the squared IDs between $X_i$ and $Y_j$ for $i=1,\dots,n_x$ and $j=1,\dots,n_y$ is given by $P(d_{(xy)ij}^2=k)=\exp\{-2\,\mathrm{tr}(\Sigma_x+\Sigma_y)\}\sum_{k\in I^p}\prod_{t=1}^p(1+\eta_t^{k_t})\,\eta_t^{k_t}/\dots$

Average interpoint distances

We obtain the $n_x(n_x-1)/2$ pairwise IDs $\{d_{(x)ij}^2,\ 1\le i<j\le n_x\}$, the $n_y(n_y-1)/2$ IDs $\{d_{(y)ij}^2,\ 1\le i<j\le n_y\}$, and the $n_x n_y$ IDs $\{d_{(xy)ij}^2\ \text{for}\ i=1,\dots,n_x,\ j=1,\dots,n_y\}$ between the two samples. We denote the averages of the squared IDs, respectively, by $$\bar d_{(x)}^2=\frac{2}{n_x(n_x-1)}\sum_{i=1}^{n_x}\sum_{j=i+1}^{n_x}d_{(x)ij}^2,\qquad \bar d_{(y)}^2=\frac{2}{n_y(n_y-1)}\sum_{i=1}^{n_y}\sum_{j=i+1}^{n_y}d_{(y)ij}^2,\qquad \bar d_{(xy)}^2=\frac{1}{n_x n_y}\sum_{i=1}^{n_x}\sum_{j=1}^{n_y}d_{(xy)ij}^2.$$

Using the binomial expansion and the central moments of the Poisson distribution, one can show that the mixed moments of $X\sim PO(\lambda_x)$ are $E(X_1^{u_1}\cdots X_p^{u_p})=\sum_{j_1=0}^{u_1}\cdots\sum_{j_p=0}^{u_p}\dots$

Simulation

We examine the hypotheses $H_0:F=G$ and $K_0:\lambda_x=\lambda_y$, where $p>\max(n_x,n_y)$, using the following statistics:

  • Biswas and Ghosh (BG),

  • total squared ID deviations (TD),

  • the Hotelling T2 test, and

  • Euclidean distance between the centers (ED).

The test statistic of Biswas and Ghosh (2014) rejects both hypotheses for large values of $BG(x,y)$. We use the permutation principle to calculate the critical values. One can also base a test statistic on the total squared deviations of the within- and across-group means from the

Acknowledgments

I would like to thank two anonymous referees whose helpful comments and suggestions improved the presentation of the article.

References (27)

  • J.O. Irwin

    The frequency distribution of the difference between two independent variates following the same Poisson distribution

    J. Roy. Statist. Soc. Ser. A

    (1937)
  • N. Johnson et al.

    Discrete Multivariate Distributions

    (1997)
  • N. Johnson et al.

    Univariate Discrete Distributions

    (1992)