Abstract
In this paper we introduce a novel family of level sets semimetrics for density functions and address the subtleties involved in estimating and computing such semimetrics. Given data drawn from two unknown density functions, f and q, we consider different level set semimetrics in order to test the null hypothesis \(H_0: f=q\). The performance of this testing procedure is showcased in a Monte Carlo simulation study. Using the methods developed in the paper, we assess differences in gene expression profiles between two groups of patients with different respiratory recovery patterns in a clinical study, and find significant differences between the density profiles of the 15 top-ranked genes in the two groups.
Data Availability Statement
The bioinformatics dataset (and the corresponding source code to reproduce the results) is available from the corresponding author on request. We share the source code to replicate the numerical experiments in Section 3 as a supplementary file.
Notes
This ensures that, as the sample size increases, there is enough information to estimate the corresponding density level sets.
Formally, the Mean Integrated Square Error corresponding to \(\widehat{f}_n\) goes to zero at the rate \(n^{-4/(d+4)}\), which becomes slower as d increases (the same holds for \(\widehat{q}_m\)). This phenomenon is known as the curse of dimensionality. For further details on the consistency and convergence rates of multivariate kernel density estimators, see for instance Wand and Jones (1994), p. 100.
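To make the effect of the dimension on the rate \(n^{-4/(d+4)}\) concrete, the following minimal Python sketch evaluates the rate for several values of d and computes the sample size needed in dimension d to match the rate attained with a reference sample in one dimension. The function names `mise_rate` and `equivalent_sample_size` are illustrative, and constants in the Mean Integrated Square Error are ignored, so the numbers indicate orders of magnitude only.

```python
def mise_rate(n, d):
    """Rate n^(-4/(d+4)) at which the MISE decays (constants ignored)."""
    return n ** (-4.0 / (d + 4))


def equivalent_sample_size(n_ref, d, d_ref=1):
    """Sample size n in dimension d solving
    n^(-4/(d+4)) = n_ref^(-4/(d_ref+4)),
    i.e. n = n_ref^((d+4)/(d_ref+4)).
    """
    return n_ref ** ((d + 4) / (d_ref + 4))


if __name__ == "__main__":
    n_ref = 100  # hypothetical reference sample size in one dimension
    for d in (1, 2, 5, 10):
        print(
            f"d={d:2d}  rate at n={n_ref}: {mise_rate(n_ref, d):.4f}  "
            f"n matching the d=1 rate: {equivalent_sample_size(n_ref, d):.0f}"
        )
```

The required sample size grows as a power of the reference size with exponent \((d+4)/5\), which is why density estimation becomes impractical in high dimensions.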
The definition and computation of \(\widehat{\mu }(\mathcal {A}_i(q,\varvec{\nu }))\) are, of course, analogous to those of \(\widehat{\mu }(\mathcal {A}_i(f,\varvec{\nu }))\).
The number of null rejections over the 1,000 Monte Carlo simulations.
It is also worth mentioning the lack of reliable software for estimating densities from high-dimensional data.
To this aim, we consider the \(d=675\) ordered p-values, namely \(p_{(1)}\le p_{(2)}\le \dots \le p_{(d)}\), and reject the null hypotheses corresponding to the \(l^*\) smallest p-values, where \(l^* = \max \{l\in \{1,\dots ,d\}:p_{(l)}\le \frac{\alpha }{d}\frac{l}{\beta _d} \}\). We choose \(\alpha = 0.05\) (the upper bound for the false discovery rate) and \(\beta _d = \sum _{i=1}^d 1/i\), since we are considering a sequence of dependent tests.
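The step-up rule in the last note can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used in the paper: the function name `step_up_rejections` is hypothetical, and the sketch assumes that \(l^* = 0\) (no rejections) when no ordered p-value falls below its threshold.

```python
def step_up_rejections(p_values, alpha=0.05):
    """Return (l_star, rejected) for the step-up rule
    l* = max{ l : p_(l) <= (alpha / d) * (l / beta_d) },
    with beta_d = sum_{i=1}^d 1/i (correction for dependent tests).
    """
    d = len(p_values)
    beta_d = sum(1.0 / i for i in range(1, d + 1))

    # Indices of the p-values sorted in increasing order.
    order = sorted(range(d), key=lambda j: p_values[j])

    # Largest rank l whose ordered p-value is below its threshold.
    # Assumption: l_star stays 0 when the set is empty.
    l_star = 0
    for l, j in enumerate(order, start=1):
        if p_values[j] <= (alpha / d) * (l / beta_d):
            l_star = l

    # Reject the hypotheses attached to the l_star smallest p-values.
    rejected = [False] * d
    for j in order[:l_star]:
        rejected[j] = True
    return l_star, rejected


if __name__ == "__main__":
    # Toy p-values for illustration only (not data from the study).
    toy = [0.001, 0.2, 0.004, 0.9]
    print(step_up_rejections(toy, alpha=0.05))
```

Because the rule takes the maximum over l, every hypothesis ranked at or below \(l^*\) is rejected, even if one of its own ordered p-values exceeds the corresponding threshold.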
References
Banerjee A, Merugu S, Dhillon IS, Ghosh J (2005) Clustering with Bregman divergences. J Mach Learn Res 6:1705–1749
Benjamini Y, Hochberg Y (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. J Roy Stat Soc: Ser B (Methodol) 57(1):289–300
Cadre B (2006) Kernel estimation of density level sets. J Multivar Anal 97(4):999–1023
Chazal F, Fasy B, Lecci F, Michel B, Rinaldo A, Wasserman L (2017) Robust topological inference: Distance to a measure and kernel distance. J Mach Learn Res 18(1):5845–5884
Chen Y-C, Genovese CR, Wasserman L (2017) Density level sets: Asymptotics, inference, and visualization. J Am Stat Assoc 112(520):1684–1696
Devroye L, Wise GL (1980) Detection of abnormal behavior via nonparametric estimation of the support. SIAM J Appl Math 38(3):480–488
Deza MM, Deza E (2009) Encyclopedia of distances. Springer
Gabriel Martos NH (2018) bigdatadist: Distances for machine learning and statistics in the context of big data. R package version 1.1
Gibbs AL, Su FE (2002) On choosing and bounding probability metrics. Int Stat Rev 70(3):419–435
Giné E, Guillou A (2002) Rates of strong uniform consistency for multivariate kernel density estimators. In Annales de l’Institut Henri Poincare (B) Probability and Statistics 38:907–921. Elsevier
Gretton A, Borgwardt KM, Rasch MJ, Schölkopf B, Smola A (2012) A kernel two-sample test. J Mach Learn Res 13(1):723–773
Hartigan JA (1987) Estimation of a convex density contour in two dimensions. J Am Stat Assoc 82(397):267–270
Hayden D (2011) Peripheral blood leukocyte genomic response one day post traumatic injury may predict early respiratory recovery
Hayden D, Lazar P, Schoenfeld D, Inflammation and the Host Response to Injury Investigators et al (2009) Assessing statistical significance in microarray experiments using the distance between microarrays. PLoS One 4(6):e5838
Jones MC, Marron JS, Sheather SJ (1996) A brief survey of bandwidth selection for density estimation. J Am Stat Assoc 91(433):401–407
Lebanon G (2006) Metric learning for text documents. IEEE Trans Pattern Anal Mach Intell 28(4):497–508
Mielke PW, Berry KJ (2007) Permutation methods: a distance function approach. Springer Science & Business Media
Minas C, Curry E, Montana G (2013) A distance-based test of association between paired heterogeneous genomic data. Bioinformatics
Moguerza JM, Muñoz A (2006) Support vector machines with applications. Stat Sci p 322–336
Muñoz A, Moguerza JM (2006) Estimation of high-density regions using one-class neighbor machines. IEEE Trans Pattern Anal Mach Intell 28(3):476–480
Nguyen X, Wainwright MJ, Jordan MI (2010) Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Trans Inf Theory 56(11):5847–5861
Polonik W (1995) Measuring mass concentrations and estimating density contour clusters-an excess mass approach. Ann Stat p 855–881
Ramdas A, García Trillos N, Cuturi M (2017) On Wasserstein two-sample testing and related families of nonparametric tests. Entropy 19(2):47
Ryabko D, Mary J (2012) Reducing statistical time-series problems to binary classification. In Advances in Neural Information Processing Systems p 2060–2068
Schölkopf B, Platt JC, Shawe-Taylor J, Smola AJ, Williamson RC (2001) Estimating the support of a high-dimensional distribution. Neural Comput 13(7):1443–1471
Stone CJ (1980) Optimal rates of convergence for nonparametric estimators. Ann Stat p 1348–1360
Székely GJ, Rizzo ML (2004) Testing for equal distributions in high dimension. InterStat 5
Vert R, Vert J-P, Schölkopf B (2006) Consistency and convergence rates of one-class SVMs and related algorithms. J Mach Learn Res 7(5)
Wand MP, Jones MC (1994) Kernel smoothing. CRC Press
Ethics declarations
Ethical Statement
Authors have no affiliations with or involvement in any organization or entity with any financial interest or non-financial interest in the subject matter or materials discussed in this manuscript.
Conflict of Interest Statement
None of the authors have a conflict of interest.
Technical Appendix
The semimetric in Equation (1) is non-negative and symmetric by definition, and it also obeys the triangle inequality (see the Biotope transformation in Deza and Deza (2009), p. 118, for further details).
Proposition: The semimetric \(\mathrm {D}(f,q,\varvec{\nu },\mathbf {w})=\sum _{i=1}^{k}w_i \mathrm {d}\left( \mathcal {A}_i(f,\varvec{\nu }),\mathcal {A}_i(q,\varvec{\nu })\right)\) behaves as a proper metric when: (1) \(\mathrm {d}:\mathcal {X}\times \mathcal {X} \rightarrow \mathbb {R}^+\) is a metric between subsets of \(\mathcal {X}\), and (2) \(\varvec{\nu }_k\equiv \{\nu _0,\dots ,\nu _k\}\) is an asymptotically dense set in [0, 1]. Moreover, in the limit, \(\mathrm {D}(f,q,\varvec{\nu }_k,\mathbf {w})\) does not depend on the sequence \(\varvec{\nu }_k\); it depends only on \(\mathbf {w}\).
Proof: We need to show that \(\mathrm {D}(f,q,\varvec{\nu }_k,\mathbf {w}){\mathop {\longrightarrow }\limits ^{ k\rightarrow \infty }}0\) if and only if \(f=q\). Write \(d_i(f,q,\varvec{\nu }_k)\) for \(\mathrm {d}\left( \mathcal {A}_i(f,\varvec{\nu }_k),\mathcal {A}_i(q,\varvec{\nu }_k)\right)\) and consider the asymptotically dense set \(\varvec{\nu }_k = \{ \frac{i}{k}\}_{i=0}^k\). If \(f = q\), then for all \(k\in \mathbb {N}\) it holds that \(d_i(f,q,\varvec{\nu }_k)=0\) for \(i \in \{1,\dots ,k\}\), leading to \(\mathrm {D}(f,q,\varvec{\nu }_k,\mathbf {w})=0\). When \(f \ne q\), there exists a constant \(k_0\) such that for all \(k > k_0\), \(d_i(f,q,\varvec{\nu }_k) > 0\) for at least one \(i \in \{1,\dots ,k\}\); therefore \(\mathrm {D}(f,q,\varvec{\nu }_k,\mathbf {w})>0\) for \(k>k_0\), and then \(\lim \limits _{k \rightarrow \infty } \mathrm {D}(f,q,\varvec{\nu }_k,\mathbf {w})>0\). Notice also that, for any asymptotically dense set \(\varvec{\nu }_k\), \(\lim \limits _{k \rightarrow \infty } \mathrm {D}(f,q,\varvec{\nu }_k,\mathbf {w})=\mathrm {D}(f,q,\mathbf {w})\), since all asymptotically dense sequences \(\varvec{\nu }_k\) yield (asymptotically) the same collection of level sets used to compute \(\mathrm {D}(f,q,\mathbf {w})\).
Cite this article
Muñoz, A., Martos, G. & Gonzalez, J. Level Sets Semimetrics for Probability Measures with Applications in Hypothesis Testing. Methodol Comput Appl Probab 25, 21 (2023). https://doi.org/10.1007/s11009-023-09990-5