
Two-sample test for sparse high-dimensional multinomial distributions


Abstract

In this paper, we consider testing the equality of the probability vectors of two independent multinomial distributions in high dimension. The classical chi-square test may have drawbacks in this setting, since many of the cell counts may be zero or not large enough. We propose a new test and establish its asymptotic normality and asymptotic power function. Based on the asymptotic power function, we present an application of our result to a neighborhood-type test that has been studied previously, especially for the case of fairly small p values. To compare the proposed test with existing tests, we provide numerical studies, including simulations and real data examples.


References

  • Anders S, Huber W (2010) Differential expression analysis for sequence count data. Genome Biol 11(10):1–12

    Article  Google Scholar 

  • Anderson DA, McDonald LL, Weaver KD (1974) Tests on categorical data from the union-intersection principle. Ann Inst Stat Math 26:203–213

    Article  MathSciNet  MATH  Google Scholar 

  • Bai Z, Saranadasa H (1996) Effect of high dimension: by an example of a two sample problem. Stat Sin 6:311–329

    MathSciNet  MATH  Google Scholar 

  • Berger J, Delampady M (1987) Testing precise hypotheses. Stat Sci 2:317–352

    Article  MathSciNet  MATH  Google Scholar 

  • Berger J, Sellke T (1987) Testing a point null hypothesis: irreconcilability of p-values and evidence. J Am Stat Assoc 82:112–122

    MathSciNet  MATH  Google Scholar 

  • Cai TT, Liu WD, Xia Y (2014) Two-sample test of high dimensional means under dependence. J R Stat Soc Ser B Stat Methodol 76:349–372

    Article  MathSciNet  Google Scholar 

  • Chen SX, Qin YL (2010) A two sample test for high dimensional data with applications to gene-set testing. Ann Stat 38:808–835

    Article  MathSciNet  MATH  Google Scholar 

  • Choi S, Park J (2014) Plug-in tests for nonequivalence of means of independent normal populations. Biom J 56:1016–1034

    Article  MathSciNet  MATH  Google Scholar 

  • Dette H, Munk A (1998) Validation of linear regression models. Ann Stat 26:778–800

    Article  MathSciNet  MATH  Google Scholar 

  • Hall P, Jin J (2010) Innovated higher criticism for detecting sparse signals in correlated noise. Ann Stat 38:1686–1732

    Article  MathSciNet  MATH  Google Scholar 

  • Morris C (1975) Central limit theorems for multinomial sums. Ann Stat 3:165188

    Article  MathSciNet  Google Scholar 

  • Munk A, Paige R, Pang J, Patrangenaru V, Ruymgaat F (2008) The one- and multi-sample problem for functional data with application to projective shape analysis. J Multivar Anal 99:815–833

    Article  MathSciNet  MATH  Google Scholar 

  • Park J, Ayyala DN (2013) A test for the mean vector in large dimension and small samples. J Stat Plan Inference 143:929–943

    Article  MathSciNet  MATH  Google Scholar 

  • Park J, Sinha B, Shah A, Xu D, Lin J (2015) Likelihood ratio tests for interval hypotheses with applications. Commun Stat Theory Methods 44:2351–2370

    Article  MathSciNet  MATH  Google Scholar 

  • Solo V (1984) An alternative to significance tests. Technical Report 84-14, Department of Statistics, Purdue University, IN

  • Srivastava M (2009) A test for the mean vector with fewer observations than the dimension under non-normality. J Multivar Anal 100:518–532

    Article  MathSciNet  MATH  Google Scholar 

  • Srivastava M, Katayama S, Kano Y (2013) A two sample test in high dimensional data. J Multivar Anal 114:349–835

    Article  MathSciNet  MATH  Google Scholar 

  • Steck GP (1957) Limit theorems for conditional distributions. Univ Calif Publ Stat 2(12):237–284

    MathSciNet  MATH  Google Scholar 

  • Zelterman D (1987) Goodness of fit tests for large sparse multinomial distributions. J Am Stat Assoc 82:624–629

    Article  MathSciNet  MATH  Google Scholar 

Download references

Author information

Corresponding author

Correspondence to Junyong Park.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 234 KB)

Appendix: Proof of Lemma 1

We show the ratio consistency of \(\hat{\sigma }_k^2\). Using \(n_1 \asymp n_2\) and \(\sigma _k^2 \asymp n^{-2} ||\mathbf{P}_1 + \mathbf{P}_2||^2_2\), it is sufficient to show

$$\begin{aligned}&\frac{\hat{\sigma }_k^2 -\sigma _k^2}{\sigma _k^2} \asymp \frac{ \sum _{i=1}^{k}({\hat{p}}^2_{1i} -\frac{{\hat{p}}_{1i}}{n_1} -p_{1i}^2)}{||\mathbf{P}_1 + \mathbf{P}_2||^2_2} \nonumber \\&\quad + \,\frac{\sum _{i=1}^{k}{\hat{p}}_{1i}{\hat{p}}_{2i} -\sum _{i=1}^{k}p_{1i}p_{2i} }{||\mathbf{P}_1 + \mathbf{P}_2||^2_2} + \frac{ \sum _{i=1}^{k}({\hat{p}}^2_{2i} -\frac{{\hat{p}}_{2i}}{n_2} -p_{2i}^2)}{||\mathbf{P}_1 + \mathbf{P}_2||^2_2} \nonumber \\&\quad \overset{p}{\rightarrow }0. \end{aligned}$$
(25)
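As an illustration of (25), the following minimal sketch (not part of the paper; the dimension, sample sizes, and Dirichlet-generated probability vectors are hypothetical) draws two independent multinomial samples, forms \(\hat{p}_{ci} = N_{ci}/n_c\), and evaluates the three ratio terms in (25), each of which should be close to zero when \(k\), \(n_1\), and \(n_2\) are large.

```python
# A minimal sketch of (25), not from the paper: all sizes and probability
# vectors below are hypothetical.  Two independent multinomial samples are
# drawn, \hat p_{ci} = N_{ci}/n_c is formed, and the three ratio terms in
# (25) are evaluated; each should be close to zero for large k, n_1, n_2.
import numpy as np

rng = np.random.default_rng(0)
k, n1, n2 = 5000, 400, 500                    # hypothetical dimension and sample sizes
P1 = rng.dirichlet(np.full(k, 0.05))          # hypothetical sparse probability vectors
P2 = rng.dirichlet(np.full(k, 0.05))

p1_hat = rng.multinomial(n1, P1) / n1         # \hat p_{1i} = N_{1i}/n_1
p2_hat = rng.multinomial(n2, P2) / n2         # \hat p_{2i} = N_{2i}/n_2

denom = np.sum((P1 + P2) ** 2)                # ||P_1 + P_2||_2^2

term1 = np.sum(p1_hat**2 - p1_hat / n1 - P1**2) / denom
term2 = (np.sum(p1_hat * p2_hat) - np.sum(P1 * P2)) / denom
term3 = np.sum(p2_hat**2 - p2_hat / n2 - P2**2) / denom
print(term1, term2, term3)                    # each term in (25) should be near 0
```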

We first show the ratio consistency of \(\sum _{i=1}^{k}\left( {\hat{p}}_{1i}^2 -\frac{{\hat{p}}_{1i}}{n_1}\right) \) for \(\sum _{i=1}^{k}p_{1i}^2\); the case of the second group (\(\sum _{i=1}^{k}\left( {\hat{p}}_{2i}^2 -\frac{{\hat{p}}_{2i}}{n_2}\right) \) for \(\sum _{i=1}^{k}p_{2i}^2\)) can be proved similarly. Since \(E({\hat{p}}_{1i} ^2) = p_{1i}^2 + \frac{p_{1i}(1-p_{1i})}{n_1} = (1-\frac{1}{n_1})p_{1i}^2 + \frac{p_{1i}}{n_1}\), where \({\hat{p}}_{1i} = \frac{N_{1i}}{n_1}\), we have \(E \left( \frac{n_1}{n_1-1}({\hat{p}}_{1i}^2 - \frac{{\hat{p}}_{1i}}{n_1}) \right) = p_{1i}^2\), so \(\frac{n_1}{n_1-1}\sum _{i=1}^k({\hat{p}}_{1i}^2 - \frac{{\hat{p}}_{1i}}{n_1})\) is an unbiased estimator of \(\sum _{i=1}^k p_{1i}^2\). To show \(\frac{\frac{n_1}{n_1-1}\sum _{i=1}^k({\hat{p}}_{1i}^2 - \frac{{\hat{p}}_{1i}}{n_1}) - \sum _{i=1}^{k}p^2_{1i} }{||\mathbf{P}_1 + \mathbf{P}_2||^2_2} \overset{p}{\rightarrow }0\), it suffices to show that the following second moment converges to 0:

$$\begin{aligned}&\frac{E \left[ \left( \left( \frac{n_1}{n_1-1}\right) \sum _{i=1}^k \left( {\hat{p}}_{1i}^2 - \frac{{\hat{p}}_{1i}}{n_1}\right) -\sum _{i=1}^k p_{1i}^2 \right) ^2 \right] }{||\mathbf{P}_1 + \mathbf{P}_2||_2^{4}} \nonumber \\&\quad = \frac{Var\left( \left( \frac{n_1}{n_1-1}\right) \sum _{i=1}^k \left( {\hat{p}}_{1i}^2 - \frac{{\hat{p}}_{1i}}{n_1}\right) \right) }{||\mathbf{P}_1 + \mathbf{P}_2||_2^{4}} \nonumber \\&\quad \le \frac{8}{||\mathbf{P}_1 + \mathbf{P}_2||_2^{4}} \left( Var\left( \sum _{i=1}^k {\hat{p}}_{1i}^2\right) +Var\left( \sum _{i=1}^k \frac{{\hat{p}}_{1i}}{n_1}\right) \right) \nonumber \\&\quad =\frac{8}{||\mathbf{P}_1 + \mathbf{P}_2||_2^{4}} ((I) + (II)) \end{aligned}$$
(26)

where the last inequality in (26) follows from \(Var(X+Y) \le 2(Var(X)+Var(Y))\) and \(n_1/(n_1-1) \le 2\), so that \(\left( \frac{n_1}{n_1-1}\right) ^2 \le 4\). We decompose (I) into two parts:

$$\begin{aligned} (I)=Var\left( \sum _{i=1}^k {\hat{p}}_{1i}^2\right)= & {} \sum _{i=1}^k Var({\hat{p}}_{1i}^2) + \sum _{i \ne j} Cov\left( {\hat{p}}_{1i}^2, {\hat{p}}_{1j}^2\right) = (A) + (B). \end{aligned}$$

Using the results of Lemma S2 in the supplementary material, for some constants \(C_1\) and \(C_2\), we have

$$\begin{aligned} (A)= & {} \sum _{i=1}^k \left( E({\hat{p}}_{1i}^4) - (E({\hat{p}}_{1i}^2))^2 \right) \le C_1 \sum _{i=1}^k \left( \frac{p_{1i}^4}{n_1} + \frac{p_{1i}^3}{n_1} + \frac{p_{1i}^2}{n_1^2} +\frac{p_{1i}}{n_1^3} \right) \\ |(B)|= & {} \left| \sum _{i \ne j} \left( E({\hat{p}}_{1i}^2 {\hat{p}}_{1j}^2) -E({\hat{p}}_{1i}^2) E({\hat{p}}_{1j}^2) \right) \right| \\\le & {} C_2 \sum _{i\ne j} \left( \frac{p_{1i}^2 p_{1j}^2}{n_1} + \frac{p_{1i}^2 p_{1j}}{n_1^2} + \frac{p_{1i}p_{1j}^2}{n_1^2} +\frac{p_{1i}p_{1j}}{n_1^3} \right) . \end{aligned}$$

For all of the terms above, we can show \(\frac{(A)}{||\mathbf{P}_1 + \mathbf{P}_2||_2^{4}} \rightarrow 0\) and \(\frac{(B)}{||\mathbf{P}_1 + \mathbf{P}_2||_2^{4}} \rightarrow 0\) as follows. First, note that \( \frac{\max _i p_{ci}^2}{||\mathbf{P}_1 + \mathbf{P}_2||^2_2} \rightarrow 0\) since \( \frac{\max _i p_{ci}^2}{||\mathbf{P}_1 + \mathbf{P}_2||^2_2} \le \frac{\max _i p_{ci}}{||\mathbf{P}_c ||_2^2} \rightarrow 0\) by Condition 2 in Theorem 1. For (A), using \(\max _i p_{1i}^2 \le \max _i p_{1i} \rightarrow 0\) from result 2 of Lemma S2 in the supplementary material, we have

$$\begin{aligned} \frac{\sum _{i=1}^k p_{1i}^4}{n_1 ||\mathbf{P}_1 + \mathbf{P}_2||_2^{4}}&\le \frac{\max _i p_{1i}^2 \sum _{i=1}^k p_{1i}^2}{ n_1 ||\mathbf{P}_1 + \mathbf{P}_2||_2^{4} } \le \frac{\max _i p_{1i}^2}{n_1 ||\mathbf{P}_1 + \mathbf{P}_2||^2_2} \rightarrow 0,\\ \frac{\sum _{i=1}^k p_{1i}^3}{n_1 ||\mathbf{P}_1 + \mathbf{P}_2||_2^{4}}&\le \frac{\max _i p_{1i} \sum _{i=1}^k p_{1i}^2}{ n_1 ||\mathbf{P}_1 + \mathbf{P}_2||_2^{4} } \le \frac{\max _i p_{1i}}{n_1 ||\mathbf{P}_1 + \mathbf{P}_2||^2_2} \rightarrow 0,\\ \frac{\sum _{i=1}^k p_{1i}^2}{n_1^2 ||\mathbf{P}_1 + \mathbf{P}_2||_2^{4}}&\le \frac{1}{ n_1 (n_1 ||\mathbf{P}_1 + \mathbf{P}_2||^2_2) } \rightarrow 0,\\ \frac{\sum _{i=1}^k p_{1i}}{n_1^3 ||\mathbf{P}_1 + \mathbf{P}_2||_2^{4} }&\le \frac{1}{ n_1 (n_1||\mathbf{P}_1 + \mathbf{P}_2||^2_2)^2 } \rightarrow 0 \end{aligned}$$

where Condition 3 (\( n||\mathbf{P}_1 + \mathbf{P}_2||^2_2\ge \epsilon >0 \)) and \(n_1 \asymp n_2 \) are used in the last steps as \(n_1 \rightarrow \infty \). For (B), using \(\sum _{i\ne j} p_{1i}^2 p_{1j}^2 \le ||\mathbf{P}_1 + \mathbf{P}_2||^4_2\) and \(\sum _{i\ne j} p_{1i} p_{1j} \le \left( \sum _{i=1}^k p_{1i}\right) ^2 =1\), we have from Conditions 1–3 in Theorem 1

$$\begin{aligned} \frac{\sum _{i \ne j} p_{1i}^2 p_{1j}^2 }{n_1 ||\mathbf{P}_1 + \mathbf{P}_2||_2^{4}}\le & {} \frac{1}{n_1} \rightarrow 0,~~~~~~\frac{\sum _{i \ne j} p_{1i}^2 p_{1j} }{n_1^2 ||\mathbf{P}_1 + \mathbf{P}_2||_2^{4}} \rightarrow 0, \\ \frac{\sum _{i \ne j} p_{1i} p_{1j}^2 }{n_1^2 ||\mathbf{P}_1 + \mathbf{P}_2||_2^{4}}\le & {} \frac{||\mathbf{P}_1 + \mathbf{P}_2||^2_2}{n_1^2 ||\mathbf{P}_1 + \mathbf{P}_2||_2^{4} } \le \frac{1}{n_1 ||\mathbf{P}_1 + \mathbf{P}_2||^2_2} \rightarrow 0, \\ \frac{\sum _{i \ne j} p_{1i} p_{1j} }{n_1^3 ||\mathbf{P}_1 + \mathbf{P}_2||_2^{4}}\le & {} \frac{1}{n_1 (n_1||\mathbf{P}_1 + \mathbf{P}_2||^2_2)^2} \rightarrow 0. \end{aligned}$$

Similarly, for (II), we have

$$\begin{aligned} (II) = Var\left( \sum _{i=1}^k \frac{{\hat{p}}_{1i}}{n_1}\right)= & {} \sum _{i=1}^k Var\left( \frac{{\hat{p}}_{1i}}{n_1}\right) + \frac{1}{n_1^2} \sum _{i \ne j} Cov\left( {\hat{p}}_{1i}, {\hat{p}}_{1j}\right) \\= & {} \sum _{i=1}^k \frac{p_{1i}(1-p_{1i})}{n_1^3} - \sum _{i\ne j} \frac{p_{1i}p_{1j}}{n_1^3} \le \sum _{i=1}^{k}\frac{p_{1i}}{n_1^3} =\frac{1}{n_1^3}. \end{aligned}$$

Therefore, we have \(\frac{(II)}{||\mathbf{P}_1 + \mathbf{P}_2||_2^{4}} \le \frac{1}{n_1^3||\mathbf{P}_1 + \mathbf{P}_2||_2^{4}} \rightarrow 0\), which leads to

$$\begin{aligned} \frac{ \sum _{i=1}^{k}\left( {\hat{p}}^2_{1i} -\frac{{\hat{p}}_{1i}}{n_1}\right) -\sum _{i=1}^{k}p_{1i}^2}{||\mathbf{P}_1 + \mathbf{P}_2||^2_2} \overset{p}{\rightarrow }0. \end{aligned}$$
(27)

The ratio consistency of \(\sum _{i=1}^{k}\left( {\hat{p}}_{2i}^2 - \frac{{\hat{p}}_{2i}}{n_2}\right) \) for \(\sum _{i=1}^{k}p_{2i}^2\) can be proved in the same way.
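As a quick numerical check of the bias correction used above (a sketch under hypothetical sizes, not from the paper), one can verify by Monte Carlo that \(\frac{n_1}{n_1-1}\sum _{i=1}^k({\hat{p}}_{1i}^2 - {\hat{p}}_{1i}/n_1)\) centers on \(\sum _{i=1}^k p_{1i}^2\) with small spread, in line with (26) and (27).

```python
# A small Monte Carlo check (a sketch, not from the paper; sizes are
# hypothetical) that (n_1/(n_1-1)) * sum_i(\hat p_{1i}^2 - \hat p_{1i}/n_1)
# is unbiased for sum_i p_{1i}^2 and concentrates around it, in line with
# (26) and (27).
import numpy as np

rng = np.random.default_rng(1)
k, n1, reps = 2000, 300, 2000
P1 = rng.dirichlet(np.full(k, 0.1))           # hypothetical probability vector

estimates = np.empty(reps)
for r in range(reps):
    p1_hat = rng.multinomial(n1, P1) / n1     # \hat p_{1i} = N_{1i}/n_1
    estimates[r] = (n1 / (n1 - 1)) * np.sum(p1_hat**2 - p1_hat / n1)

print(np.sum(P1**2))                          # target: sum_i p_{1i}^2
print(estimates.mean())                       # Monte Carlo mean, close to the target
print(estimates.std())                        # small spread, reflecting the bound in (26)
```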

For \(\sum _{i=1}^{k}{\hat{p}}_{1i}{\hat{p}}_{2i}\), which is an unbiased estimator of \(\sum _{i=1}^{k}p_{1i}p_{2i}\) since the two samples are independent, we show

$$\begin{aligned}&\frac{E\left[ \left( \sum _{i=1}^{k}{\hat{p}}_{1i}{\hat{p}}_{2i}- \sum _{i=1}^{k}p_{1i}p_{2i}\right) ^2\right] }{||\mathbf{P}_1 + \mathbf{P}_2||_2^{4}} = \frac{\sum _{i=1}^k Var({\hat{p}}_{1i}{\hat{p}}_{2i}) + \sum _{i\ne j}\mathrm{Cov}({\hat{p}}_{1i}{\hat{p}}_{2i},{\hat{p}}_{1j}{\hat{p}}_{2j})}{||\mathbf{P}_1 + \mathbf{P}_2||_2^{4}} \\&\quad \asymp \frac{\sum _{i=1}^k p_{1i}^2 p_{2i}}{n ||\mathbf{P}_1 + \mathbf{P}_2||_2^{4} }+ \frac{\sum _{i=1}^k p_{1i} p_{2i}^2}{n||\mathbf{P}_1 + \mathbf{P}_2||_2^{4}} + \frac{1}{n^2 ||\mathbf{P}_1 + \mathbf{P}_2||^2_2} - \frac{\sum _{i\ne j}p_{1i}p_{2i}p_{1j}p_{2j}}{n||\mathbf{P}_1 + \mathbf{P}_2||_2^{4}} \\&\quad \asymp \frac{\max _i p_{1i} }{n||\mathbf{P}_1 + \mathbf{P}_2||^2_2} +\frac{\max _i p_{2i} }{n||\mathbf{P}_1 + \mathbf{P}_2||^2_2} + \frac{1}{n^2 ||\mathbf{P}_1 + \mathbf{P}_2||^2_2} - \frac{ (\mathbf{P}_1\cdot \mathbf{P}_2)^2 }{n||\mathbf{P}_1 + \mathbf{P}_2||_2^{4}} \rightarrow 0 \end{aligned}$$

where the last term converges to 0 since \(\frac{ (\mathbf{P}_1\cdot \mathbf{P}_2)^2 }{n||\mathbf{P}_1 + \mathbf{P}_2||_2^{4}} \le \frac{||\mathbf{P}_1 + \mathbf{P}_2||^2_2}{2n||\mathbf{P}_1 + \mathbf{P}_2||_2^{4}} = \frac{1}{2n||\mathbf{P}_1 + \mathbf{P}_2||^2_2} \rightarrow 0\) from Condition 3 in Theorem 1, using \(\mathbf{P}_1\cdot \mathbf{P}_2 \le \sum _{i=1}^k p_{1i} = 1\) and \(2\,\mathbf{P}_1\cdot \mathbf{P}_2 \le ||\mathbf{P}_1 + \mathbf{P}_2||^2_2\). Therefore

$$\begin{aligned} \frac{\sum _{i=1}^{k}{\hat{p}}_{1i}{\hat{p}}_{2i} - \sum _{i=1}^{k}p_{1i}p_{2i} }{ ||\mathbf{P}_1 + \mathbf{P}_2||^2_2} \overset{p}{\rightarrow }0 \end{aligned}$$
(28)
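Similarly, a small simulation (again a sketch with hypothetical sizes, not from the paper) illustrates (28): the cross term \(\sum _{i=1}^{k}{\hat{p}}_{1i}{\hat{p}}_{2i}\) is centered at \(\sum _{i=1}^{k}p_{1i}p_{2i}\), and its Monte Carlo variance is negligible relative to \(||\mathbf{P}_1 + \mathbf{P}_2||_2^4\).

```python
# A sketch illustrating (28), not from the paper (sizes hypothetical): the
# cross term sum_i \hat p_{1i}\hat p_{2i} is centered at sum_i p_{1i}p_{2i},
# and its Monte Carlo variance is negligible relative to ||P_1 + P_2||_2^4.
import numpy as np

rng = np.random.default_rng(2)
k, n1, n2, reps = 2000, 300, 350, 2000
P1 = rng.dirichlet(np.full(k, 0.1))
P2 = rng.dirichlet(np.full(k, 0.1))

cross = np.empty(reps)
for r in range(reps):
    p1_hat = rng.multinomial(n1, P1) / n1
    p2_hat = rng.multinomial(n2, P2) / n2
    cross[r] = np.sum(p1_hat * p2_hat)

denom4 = np.sum((P1 + P2) ** 2) ** 2          # ||P_1 + P_2||_2^4
print(np.sum(P1 * P2))                        # target: sum_i p_{1i} p_{2i}
print(cross.mean())                           # unbiased, close to the target
print(cross.var() / denom4)                   # small, consistent with (28)
```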

Combining (27) and (28) with the analogous result for the second group, we have (25), which leads to the ratio consistency of \(\hat{\sigma }_k^2\).

About this article

Cite this article

Plunkett, A., Park, J. Two-sample test for sparse high-dimensional multinomial distributions. TEST 28, 804–826 (2019). https://doi.org/10.1007/s11749-018-0600-8

