TwoSampleTest.HD: An R Package for the Two-Sample Problem with High-Dimensional Data

The two-sample problem refers to the comparison of two probability distributions via two independent samples. With high-dimensional data, this comparison is performed along a large number \(p\) of possibly correlated variables or outcomes. In genomics, for instance, the variables may represent gene expression levels at \(p\) locations, recorded for two (usually small) groups of individuals. In this paper we introduce TwoSampleTest.HD, a new R package to test for the equal distribution of the \(p\) outcomes. Specifically, TwoSampleTest.HD implements the tests recently proposed by Cousido-Rocha et al. (2019) for the low sample size, high-dimensional setting. These tests take the possible dependence among the \(p\) variables into account, and work for sample sizes as small as two. The tests are based on the distance between the empirical characteristic functions of the two samples, averaged along the \(p\) locations. Several options are available to estimate the variance of the test statistic under dependence. The package also provides individual permutation \(p\)-values, so feature discovery is possible when the null hypothesis of equal distribution is rejected. We illustrate the usage of the package through the analysis of simulated and real data, and compare the results with those of alternative approaches. In particular, we highlight the benefits of the implemented tests relative to ordinary multiple comparison procedures, and give practical recommendations.
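The idea of averaging a distance between empirical characteristic functions over the \(p\) variables can be sketched in base R. This is only an illustration under stated assumptions and not the package's implementation: the Gaussian weight with scale `s`, the function names `ecf_distance`, `global_stat` and `perm_pvalue`, and the use of a plain permutation \(p\)-value for the global statistic are all choices made here for the example. With a Gaussian weight, the weighted \(L_2\) distance between two empirical characteristic functions has a closed form in terms of pairwise kernel averages, which the sketch uses.

```r
# Illustrative sketch only: NOT the TwoSampleTest.HD implementation.
# Assumption: Gaussian weight with scale s, which gives a closed form for the
# weighted L2 distance between two empirical characteristic functions (ECFs).

# Weighted L2 distance between the ECFs of x and y (a single variable).
ecf_distance <- function(x, y, s = 1) {
  k <- function(a, b) exp(-s^2 * outer(a, b, "-")^2 / 2)
  mean(k(x, x)) + mean(k(y, y)) - 2 * mean(k(x, y))
}

# Global statistic: average of the per-variable distances.
# X and Y are p x n and p x m matrices (rows = variables, columns = individuals).
global_stat <- function(X, Y, s = 1) {
  mean(vapply(seq_len(nrow(X)),
              function(j) ecf_distance(X[j, ], Y[j, ], s),
              numeric(1)))
}

# Permutation p-value: whole individuals (columns) are relabelled, so the
# dependence among the p variables is left intact in every permutation.
perm_pvalue <- function(X, Y, B = 199, s = 1) {
  Z <- cbind(X, Y)
  n <- ncol(X)
  observed <- global_stat(X, Y, s)
  permuted <- replicate(B, {
    idx <- sample(ncol(Z))
    global_stat(Z[, idx[seq_len(n)], drop = FALSE],
                Z[, idx[-seq_len(n)], drop = FALSE], s)
  })
  (1 + sum(permuted >= observed)) / (B + 1)
}

# Simulated example: p = 200 variables, 4 individuals per group.
set.seed(1)
p <- 200
X <- matrix(rnorm(p * 4), nrow = p)            # group 1
Y <- matrix(rnorm(p * 4, mean = 1), nrow = p)  # group 2, shifted mean
perm_pvalue(X, Y)
```

With very small groups the number of distinct relabellings is limited, so a permutation \(p\)-value of this kind can only take a few values. This is one reason a test that calibrates the averaged statistic through a variance estimate under dependence, as described above, is attractive in the low sample size setting.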

Marta Cousido-Rocha (Instituto Español de Oceanografía (IEO, CSIC), Centro Oceanográfico de Vigo), Jacobo de Uña-Álvarez (CINBIO, Universidade de Vigo, SiDOR Research Group)
2023-12-18

Supplementary materials

Supplementary materials are available in addition to this article. They can be downloaded at RJ-2023-063.zip

References

Y. Benjamini and Y. Hochberg. Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society B, 57: 289–300, 1995.
M. Biswas and A. K. Ghosh. A nonparametric two-sample test applicable to high dimensional data. Journal of Multivariate Analysis, 123: 160–171, 2014.
M. Biswas, M. Mukhopadhyay and A. K. Ghosh. A distribution-free two-sample run test applicable to high-dimensional data. Biometrika, 101: 913–926, 2014.
D. Bosq. Nonparametric statistics for stochastic processes: Estimation and prediction. Second edition. Springer-Verlag, New York, 1998.
E. Carlstein. The use of subseries values for estimating the variance of a general statistic from a stationary sequence. Annals of Statistics, 14: 1171–1179, 1986.
M. Cousido-Rocha and J. de Uña-Álvarez. Equalden.HD: An R package for testing the equality of a high dimensional set of densities. Computer Methods and Programs in Biomedicine, 217: 106694, 2022. DOI https://doi.org/10.1016/j.cmpb.2022.106694.
M. Cousido-Rocha, J. de Uña-Álvarez and S. Döhler. Multiple comparison procedures for discrete uniform and homogeneous tests. Journal of the Royal Statistical Society: Series C (Applied Statistics), 71: 219–243, 2021. URL https://rss.onlinelibrary.wiley.com/doi/abs/10.1111/rssc.12529.
M. Cousido-Rocha, J. de Uña-Álvarez and J. Hart. A two-sample test for the equality of distributions for high-dimensional data. Journal of Multivariate Analysis, 174: 104537, 2019. URL https://www.sciencedirect.com/science/article/pii/S0047259X19300521.
H. Dehling, R. Fried, I. Garcia and M. Wendler. Change-point detection under dependence based on two-sample U-statistics. In D. Dawson, R. Kulik, M. Ould Haye, B. Szyszkowicz and Y. Zhao (eds), Asymptotic Laws and Methods in Stochastics. Fields Institute Communications, vol 76. Springer, New York, NY, 2015.
P. Doukhan. Mixing: Properties and examples. Springer-Verlag, New York, 1995.
S. Dudoit and M. J. van der Laan. Multiple testing procedures and applications to genomics. Springer, New York, 2007.
R. A. Fisher. Statistical methods for research workers. Fourth edition. Oliver & Boyd, Edinburgh, 1934.
J. D. Gibbons and S. Chakraborti. Nonparametric statistical inference. Third edition. Marcel Dekker, Inc, New York, 1992.
P. Hall and J. Jin. Properties of higher criticism under strong dependence. The Annals of Statistics, 36: 381–402, 2008.
I. Hedenfalk, D. Duggan, Y. Chen, M. Radmacher, M. Bittner, R. Simon, P. Meltzer, B. Gusterson, M. Esteller, O. Kallioniemi, et al. Gene-expression profiles in hereditary breast cancer. New England Journal of Medicine, 344: 539–548, 2001.
H. Levene. Robust tests for equality of variances. In I. Olkin (ed), Contributions to probability and statistics, pages 278–292. Stanford University Press, Palo Alto, Calif., 1960.
Z. Liu, X. Xia and W. Zhou. A test for equality of two distributions via jackknife empirical likelihood and characteristic functions. Computational Statistics and Data Analysis, 92: 97–114, 2015.
P. Martínez-Camblor and J. de Uña-Álvarez. Nonparametric k-sample tests: Density functions vs distribution functions. Computational Statistics and Data Analysis, 53: 3344–3357, 2009.
P. K. Mondal, M. Biswas and A. K. Ghosh. On high dimensional two-sample tests based on nearest neighbors. Journal of Multivariate Analysis, 141: 168–178, 2015.
M. H. Neumann and E. Paparoditis. On bootstrapping \(L_2\)-type statistics in density testing. Statistics & Probability Letters, 50: 137–147, 2000.
D. N. Politis and H. White. Automatic block-length selection for the dependent bootstrap. Econometric Reviews, 23: 53–70, 2004.
S. A. Stouffer, E. A. Suchman, L. C. DeVinney, S. A. Star and R. M. Williams. The American soldier: Adjustment during army life. Princeton University Press, England, 1949.
S. Wei, C. Lee, L. Wichers and J. S. Marron. Direction-projection-permutation for high-dimensional hypothesis tests. Journal of Computational and Graphical Statistics, 25: 549–569, 2016.
H. Zhang, J. Jin and Z. Wu. Distributions and power of optimal signal-detection statistics in finite case. IEEE Transactions on Signal Processing, 68: 1021–1033, 2020. DOI https://doi.org/10.1109/TSP.2020.2967179.

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Cousido-Rocha & Uña-Álvarez, "TwoSampleTest.HD: An R Package for the Two-Sample Problem with High-Dimensional Data", The R Journal, 2023

BibTeX citation

@article{RJ-2023-063,
  author = {Cousido-Rocha, Marta and Uña-Álvarez, Jacobo de},
  title = {TwoSampleTest.HD: An R Package for the Two-Sample Problem with High-Dimensional Data},
  journal = {The R Journal},
  year = {2023},
  note = {https://doi.org/10.32614/RJ-2023-063},
  doi = {10.32614/RJ-2023-063},
  volume = {15},
  issue = {3},
  issn = {2073-4859},
  pages = {79-92}
}