Level Sets Semimetrics for Probability Measures with Applications in Hypothesis Testing

Methodology and Computing in Applied Probability

Abstract

In this paper we introduce a novel family of level set semimetrics for density functions and address the subtleties entailed in estimating and computing such semimetrics. Given data drawn from f and q, two unknown density functions, we consider different level set semimetrics in order to test the null hypothesis \(H_0: f=q\). The performance of this testing procedure is showcased in a Monte Carlo simulation study. Using the methods developed in the paper, we assess differences in gene expression profiles between two groups of patients with different respiratory recovery patterns in a clinical study, and find significant differences between the density profiles of the 15 top-ranked genes in the two groups.

Data Availability Statement

The bioinformatics dataset (and the corresponding source code to reproduce the results) is available from the corresponding author on request. We share the source code to replicate the numerical experiments in Section 3 as a supplementary file.

Notes

  1. To ensure that, as the sample size increases, we have enough information to estimate the corresponding density level sets.

  2. Formally, the mean integrated squared error corresponding to \(\widehat{f}_n\) goes to zero at the rate \(n^{-4/(d+4)}\), which becomes slower as d increases (the same holds for \(\widehat{q}_m\)). This phenomenon is known as the curse of dimensionality; for further details on the consistency and the convergence rates of multivariate kernel density estimators see, for instance, Wand and Jones (1994), p. 100.

  3. The definition and computation of \(\widehat{\mu }(\mathcal {A}_i(q,\varvec{\nu }))\) is of course analogous to that of \(\widehat{\mu }(\mathcal {A}_i(f,\varvec{\nu }))\).

  4. The number of null rejections over the 1,000 Monte Carlo simulations.

  5. It is also worth mentioning the lack of reliable software to estimate densities with high-dimensional data.

  6. To this aim, we consider the \(d=675\) ordered p-values, namely \(p_{(1)}\le p_{(2)}\le \dots \le p_{(d)}\), and reject the null hypotheses corresponding to the \(l^*\) smallest p-values, where \(l^* = \max \{l\in \{1,\dots ,d\}:p_{(l)}\le \frac{\alpha }{d}\frac{l}{\beta _d} \}\). We choose \(\alpha = 0.05\) (the upper bound for the false discovery rate) and \(\beta _d = \sum _{i=1}^d 1/i\) since we are considering a sequence of dependent tests.
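
The step-up rule in Note 6 can be sketched in a few lines of Python. This is only an illustration of the stated threshold \(p_{(l)}\le \frac{\alpha }{d}\frac{l}{\beta _d}\); the function name `step_up_rejections` and the toy p-values are assumptions made for the example and are not taken from the paper or its supplementary code.

```python
import numpy as np

def step_up_rejections(p_values, alpha=0.05):
    """Return a boolean mask of rejected hypotheses (hypothetical helper).

    Rejects the l* smallest p-values, where l* is the largest l such that
    p_(l) <= (alpha / d) * (l / beta_d) and beta_d = sum_{i=1}^d 1/i.
    """
    p = np.asarray(p_values, dtype=float)
    d = p.size
    ranks = np.arange(1, d + 1)
    beta_d = np.sum(1.0 / ranks)             # harmonic correction for dependent tests
    thresholds = alpha * ranks / (d * beta_d)

    order = np.argsort(p)                    # indices of p-values in increasing order
    below = np.nonzero(p[order] <= thresholds)[0]

    reject = np.zeros(d, dtype=bool)
    if below.size > 0:
        l_star = below[-1] + 1               # largest rank meeting its threshold
        reject[order[:l_star]] = True        # reject every hypothesis up to that rank
    return reject

# Toy p-values (illustrative only).
mask = step_up_rejections([0.001, 0.8, 0.004, 0.5], alpha=0.05)
```

Because the rule is a step-up procedure, a p-value that misses its own threshold is still rejected whenever a larger ordered p-value meets its threshold.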

References

  • Banerjee A, Merugu S, Dhillon IS, Ghosh J (2005) Clustering with Bregman divergences. J Mach Learn Res 6:1705–1749

  • Benjamini Y, Hochberg Y (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. J Roy Stat Soc: Ser B (Methodol) 57(1):289–300

  • Cadre B (2006) Kernel estimation of density level sets. J Multivar Anal 97(4):999–1023

  • Chazal F, Fasy B, Lecci F, Michel B, Rinaldo A, Wasserman L (2017) Robust topological inference: Distance to a measure and kernel distance. J Mach Learn Res 18(1):5845–5884

  • Chen Y-C, Genovese CR, Wasserman L (2017) Density level sets: Asymptotics, inference, and visualization. J Am Stat Assoc 112(520):1684–1696

  • Devroye L, Wise GL (1980) Detection of abnormal behavior via nonparametric estimation of the support. SIAM J Appl Math 38(3):480–488

  • Deza MM, Deza E (2009) Encyclopedia of distances. Springer

  • Gabriel Martos NH (2018) bigdatadist: Distances for machine learning and statistics in the context of big data. R package version 1:1

  • Gibbs AL, Su FE (2002) On choosing and bounding probability metrics. Int Stat Rev 70(3):419–435

  • Giné E, Guillou A (2002) Rates of strong uniform consistency for multivariate kernel density estimators. In Annales de l’Institut Henri Poincare (B) Probability and Statistics 38:907–921. Elsevier

  • Gretton A, Borgwardt KM, Rasch MJ, Schölkopf B, Smola A (2012) A kernel two-sample test. J Mach Learn Res 13(1):723–773

  • Hartigan JA (1987) Estimation of a convex density contour in two dimensions. J Am Stat Assoc 82(397):267–270

  • Hayden D (2011) Peripheral blood leukocyte genomic response one day post traumatic injury may predict early respiratory recovery

  • Hayden D, Lazar P, Schoenfeld D, Inflammation and the Host Response to Injury Investigators et al (2009) Assessing statistical significance in microarray experiments using the distance between microarrays. PLoS One 4(6):e5838

  • Jones MC, Marron JS, Sheather SJ (1996) A brief survey of bandwidth selection for density estimation. J Am Stat Assoc 91(433):401–407

  • Lebanon G (2006) Metric learning for text documents. Pattern Analysis and Machine Intelligence, IEEE Transactions on 28(4):497–508

  • Mielke PW, Berry KJ (2007) Permutation methods: a distance function approach. Springer Science & Business Media

  • Minas C, Curry E, Montana G (2013) A distance-based test of association between paired heterogeneous genomic data. Bioinformatics

  • Moguerza JM, Muñoz A (2006) Support vector machines with applications. Stat Sci p 322–336

  • Muñoz A, Moguerza JM (2006) Estimation of high-density regions using one-class neighbor machines. Pattern Analysis and Machine Intelligence, IEEE Transactions on 28(3):476–480

  • Nguyen X, Wainwright MJ, Jordan MI (2010) Estimating divergence functionals and the likelihood ratio by convex risk minimization. Information Theory, IEEE Transactions on 56(11):5847–5861

  • Polonik W (1995) Measuring mass concentrations and estimating density contour clusters-an excess mass approach. Ann Stat p 855–881

  • Ramdas A, García Trillos N, Cuturi M (2017) On Wasserstein two-sample testing and related families of nonparametric tests. Entropy 19(2):47

  • Ryabko D, Mary J (2012) Reducing statistical time-series problems to binary classification. In Advances in Neural Information Processing Systems p 2060–2068

  • Schölkopf B, Platt JC, Shawe-Taylor J, Smola AJ, Williamson RC (2001) Estimating the support of a high-dimensional distribution. Neural Comput 13(7):1443–1471

  • Stone CJ (1980) Optimal rates of convergence for nonparametric estimators. Ann Stat p 1348–1360

  • Székely GJ, Rizzo ML (2004) Testing for equal distributions in high dimension. InterStat 5

  • Vert R, Vert J-P, Schölkopf B (2006) Consistency and convergence rates of one-class SVMs and related algorithms. J Mach Learn Res 7(5)

  • Wand MP, Jones MC (1994) Kernel smoothing. CRC Press

Author information

Corresponding author

Correspondence to Gabriel Martos.

Ethics declarations

Ethical Statement

Authors have no affiliations with or involvement in any organization or entity with any financial interest or non-financial interest in the subject matter or materials discussed in this manuscript.

Conflict of Interest Statement

None of the authors have a conflict of interest.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file1 (PDF 413 KB)

Technical Appendix

The semimetric in Equation (1) is non-negative and symmetric by definition, and it also obeys the triangle inequality (see the biotope transform in Deza and Deza (2009), p. 118, for further details).

Proposition: The semimetric \(\mathrm {D}(f,q,\varvec{\nu },\mathbf {w})=\sum _{i=1}^{k}w_i \mathrm {d}\left( \mathcal {A}_i(f,\varvec{\nu }),\mathcal {A}_i(q,\varvec{\nu })\right)\) behaves as a proper metric when: (1) \(\mathrm {d}:\mathcal {X}\times \mathcal {X} \rightarrow \mathbb {R}^+\) is a metric between subsets of \(\mathcal {X}\), and (2) \(\varvec{\nu }_k\equiv \{\nu _0,\dots ,\nu _k\}\) is an asymptotically dense set in [0, 1]. Moreover, in the limit \(\mathrm {D}(f,q,\varvec{\nu }_k,\mathbf {w})\) does not depend on the sequence \(\varvec{\nu }_k\) (it depends only on \(\mathbf {w}\)).

Proof: We need to show that \(\mathrm {D}(f,q,\varvec{\nu }_k,\mathbf {w}){\mathop {\longrightarrow }\limits ^{ k\rightarrow \infty }}0\) if and only if \(f=q\). Write \(d_i(f,q,\varvec{\nu }_k)\) for \(\mathrm {d}\left( \mathcal {A}_i(f,\varvec{\nu }_k),\mathcal {A}_i(q,\varvec{\nu }_k)\right)\) and consider the asymptotically dense set \(\varvec{\nu }_k = \{ \frac{i}{k}\}_{i=0}^k\). If \(f = q\), then for all \(k\in \mathbb {N}\) it holds that \(d_i(f,q,\varvec{\nu }_k)=0\) for \(i \in \{1,\dots ,k\}\), leading to \(\mathrm {D}(f,q,\varvec{\nu }_k,\mathbf {w})=0\). When \(f \ne q\), there exists a constant \(k_0\) such that for all \(k > k_0\), \(d_i(f,q,\varvec{\nu }_k) > 0\) for at least one \(i \in \{1,\dots ,k\}\); therefore \(\mathrm {D}(f,q,\varvec{\nu }_k,\mathbf {w})>0\) for \(k>k_0\), and then \(\lim \limits _{k \rightarrow \infty } \mathrm {D}(f,q,\varvec{\nu }_k,\mathbf {w})>0\). Notice also that for any asymptotically dense set \(\varvec{\nu }_k\) we have \(\lim \limits _{k \rightarrow \infty } \mathrm {D}(f,q,\varvec{\nu }_k,\mathbf {w})=\mathrm {D}(f,q,\mathbf {w})\), since all asymptotically dense sequences \(\varvec{\nu }_k\) yield (asymptotically) the same collection of level sets used to compute \(\mathrm {D}(f,q,\mathbf {w})\).
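
To make the proposition concrete, the following Python sketch evaluates a weighted sum of set distances between level-set bands for two known densities on a grid. Several ingredients are assumptions made only for illustration, since the excerpt does not reproduce Equation (1): the bands \(\mathcal {A}_i\) are taken to be the differences between consecutive highest-density regions with probability mass \(\nu _{i-1}\) and \(\nu _i\), the set distance \(\mathrm {d}\) is taken to be the biotope distance (measure of the symmetric difference over measure of the union), the measure is the count of grid cells, and the weights are uniform. All function names are hypothetical.

```python
import numpy as np

def level_set_bands(density, nus):
    """Boolean masks of grid cells in each band between consecutive levels.

    Assumption: the region with mass nu is built by adding grid cells from
    highest to lowest density until their cumulative probability reaches nu.
    """
    mass = density / density.sum()                 # cell probabilities on a uniform grid
    order = np.argsort(-mass, kind="stable")       # highest density first
    cum = np.clip(np.cumsum(mass[order]), 0.0, 1.0)
    cum_at_cell = np.empty_like(cum)
    cum_at_cell[order] = cum                       # cumulative mass once each cell is included
    return [(cum_at_cell > lo) & (cum_at_cell <= hi)
            for lo, hi in zip(nus[:-1], nus[1:])]

def biotope_distance(a, b):
    """Cell count of the symmetric difference divided by cell count of the union."""
    union = np.count_nonzero(a | b)
    return 0.0 if union == 0 else np.count_nonzero(a ^ b) / union

def level_set_semimetric(f, q, k=10, weights=None):
    """Weighted sum of band distances with nu_i = i / k (uniform weights by default)."""
    nus = np.linspace(0.0, 1.0, k + 1)
    w = np.full(k, 1.0 / k) if weights is None else np.asarray(weights, dtype=float)
    bands_f = level_set_bands(f, nus)
    bands_q = level_set_bands(q, nus)
    return float(sum(wi * biotope_distance(a, b)
                     for wi, a, b in zip(w, bands_f, bands_q)))

# Two Gaussian-shaped densities on a common grid (illustrative only).
x = np.linspace(-6.0, 6.0, 1201)
f = np.exp(-0.5 * x**2)
q = np.exp(-0.5 * (x - 1.0) ** 2)
same = level_set_semimetric(f, f)       # identical densities give zero
different = level_set_semimetric(f, q)  # shifted density gives a positive value
```

In this sketch the value is zero when the two densities coincide and positive when they differ on the grid, which mirrors the two directions of the proof; with estimated densities the bands would instead be built from kernel density estimates.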

About this article

Cite this article

Muñoz, A., Martos, G. & Gonzalez, J. Level Sets Semimetrics for Probability Measures with Applications in Hypothesis Testing. Methodol Comput Appl Probab 25, 21 (2023). https://doi.org/10.1007/s11009-023-09990-5

  • DOI: https://doi.org/10.1007/s11009-023-09990-5
