Level Sets Semimetrics for Probability Measures with Applications in Hypothesis Testing

Muñoz, Alberto; Martos, Gabriel; Gonzalez, Javier

doi:10.1007/s11009-023-09990-5

Published: 11 February 2023

Volume 25, article number 21 (2023)
Methodology and Computing in Applied Probability

Abstract

In this paper we introduce a novel family of level set semimetrics for density functions and address the subtleties entailed in the estimation and computation of such semimetrics. Given data drawn from f and q, two unknown density functions, we consider different level set semimetrics in order to test the null hypothesis \(H_0: f=q\). The performance of this testing procedure is showcased in a Monte Carlo simulation study. Using the methods developed in the paper, we assess differences in gene expression profiles between two groups of patients with different respiratory recovery patterns in a clinical study, and find significant differences between the density profiles of the 15 top-ranked genes corresponding to the two groups.

Data Availability Statement

The bioinformatics dataset (and the corresponding source code to reproduce the results) is available from the corresponding author on request. We share the source code to replicate the numerical experiments in Section 3 as a supplementary file.

Notes

This ensures that, as the sample size increases, there is enough information to estimate the corresponding density level sets.
Formally, the Mean Integrated Squared Error corresponding to \(\widehat{f}_n\) goes to zero at the rate \(n^{-4/(d+4)}\), which becomes slower as d increases (the same holds for \(\widehat{q}_m\)). This phenomenon is known as the curse of dimensionality; for further details on the consistency and the convergence rates of multivariate kernel density estimators see, for instance, Wand and Jones (1994), p. 100.
The definition and computation of \(\widehat{\mu }(\mathcal {A}_i(q,\varvec{\nu }))\) is of course analogous to that of \(\widehat{\mu }(\mathcal {A}_i(f,\varvec{\nu }))\).
The number of null rejections over the 1,000 Monte Carlo simulations.
It is also worth mentioning the lack of reliable software to estimate densities with high-dimensional data.
To this aim, we consider the \(d=675\) ordered p-values, namely \(p_{(1)}\le p_{(2)}\le \dots \le p_{(d)}\), and reject the null hypotheses corresponding to the \(l^*\) smallest p-values, where \(l^* = \max \{l\in \{1,\dots ,d\}:p_{(l)}\le \frac{\alpha }{d}\frac{l}{\beta _d} \}\). We choose \(\alpha = 0.05\) (the upper bound for the false discovery rate) and \(\beta _d = \sum _{i=1}^d 1/i\) since we are considering a sequence of dependent tests.
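
The rejection rule in the last note can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `step_up_rejections` and the toy p-values are assumptions introduced here, and the only ingredients taken from the note are the threshold \(\frac{\alpha }{d}\frac{l}{\beta _d}\) and the harmonic factor \(\beta _d\).

```python
import numpy as np

def step_up_rejections(p_values, alpha=0.05):
    """Boolean mask of rejected hypotheses (hypothetical helper name).

    Rejects the l* smallest p-values, where l* is the largest l such that
    p_(l) <= (alpha / d) * (l / beta_d) and beta_d = sum_{i=1}^d 1/i.
    """
    p = np.asarray(p_values, dtype=float)
    d = p.size
    ranks = np.arange(1, d + 1)
    beta_d = np.sum(1.0 / ranks)              # harmonic factor for dependent tests
    thresholds = alpha * ranks / (d * beta_d)
    order = np.argsort(p)                     # indices that sort the p-values
    below = np.nonzero(p[order] <= thresholds)[0]
    reject = np.zeros(d, dtype=bool)
    if below.size > 0:
        l_star = below[-1] + 1                # largest rank meeting its threshold
        reject[order[:l_star]] = True
    return reject

# Toy p-values (made up for illustration, not data from the paper).
toy_p = [0.5, 0.0001, 0.03, 0.004]
reject = step_up_rejections(toy_p, alpha=0.05)
print(reject)
```

With these four toy values the thresholds are 0.006, 0.012, 0.018 and 0.024, so only the two smallest p-values are rejected. In the setting of the note the same function would be called with the \(d=675\) p-values.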

References

Banerjee A, Merugu S, Dhillon IS, Ghosh J (2005) Clustering with Bregman divergences. J Mach Learn Res 6:1705–1749
Benjamini Y, Hochberg Y (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. J Roy Stat Soc: Ser B (Methodol) 57(1):289–300
Cadre B (2006) Kernel estimation of density level sets. J Multivar Anal 97(4):999–1023
Chazal F, Fasy B, Lecci F, Michel B, Rinaldo A, Wasserman L (2017) Robust topological inference: Distance to a measure and kernel distance. J Mach Learn Res 18(1):5845–5884
Chen Y-C, Genovese CR, Wasserman L (2017) Density level sets: Asymptotics, inference, and visualization. J Am Stat Assoc 112(520):1684–1696
Devroye L, Wise GL (1980) Detection of abnormal behavior via nonparametric estimation of the support. SIAM J Appl Math 38(3):480–488
Deza MM, Deza E (2009) Encyclopedia of distances. Springer
Gabriel Martos NH (2018) bigdatadist: Distances for machine learning and statistics in the context of big data. R package version 1:1
Gibbs AL, Su FE (2002) On choosing and bounding probability metrics. Int Stat Rev 70(3):419–435
Giné E, Guillou A (2002) Rates of strong uniform consistency for multivariate kernel density estimators. In Annales de l’Institut Henri Poincare (B) Probability and Statistics 38:907–921. Elsevier
Gretton A, Borgwardt KM, Rasch MJ, Schölkopf B, Smola A (2012) A kernel two-sample test. J Mach Learn Res 13(1):723–773
Hartigan JA (1987) Estimation of a convex density contour in two dimensions. J Am Stat Assoc 82(397):267–270
Hayden D (2011) Peripheral blood leukocyte genomic response one day post traumatic injury may predict early respiratory recovery
Hayden D, Lazar P, Schoenfeld D, Inflammation and the Host Response to Injury Investigators et al (2009) Assessing statistical significance in microarray experiments using the distance between microarrays. PLoS One 4(6):e5838
Jones MC, Marron JS, Sheather SJ (1996) A brief survey of bandwidth selection for density estimation. J Am Stat Assoc 91(433):401–407
Lebanon G (2006) Metric learning for text documents. Pattern Analysis and Machine Intelligence, IEEE Transactions on 28(4):497–508
Mielke PW, Berry KJ (2007) Permutation methods: a distance function approach. Springer Science & Business Media
Minas C, Curry E, Montana G (2013) A distance-based test of association between paired heterogeneous genomic data. Bioinformatics
Moguerza JM, Muñoz A (2006) Support vector machines with applications. Stat Sci p 322–336
Muñoz A, Moguerza JM (2006) Estimation of high-density regions using one-class neighbor machines. Pattern Analysis and Machine Intelligence, IEEE Transactions on 28(3):476–480
Nguyen X, Wainwright MJ, Jordan MI (2010) Estimating divergence functionals and the likelihood ratio by convex risk minimization. Information Theory, IEEE Transactions on 56(11):5847–5861
Polonik W (1995) Measuring mass concentrations and estimating density contour clusters-an excess mass approach. Ann Stat p 855–881
Ramdas A, García Trillos N, Cuturi M (2017) On Wasserstein two-sample testing and related families of nonparametric tests. Entropy 19(2):47
Ryabko D, Mary J (2012) Reducing statistical time-series problems to binary classification. In Advances in Neural Information Processing Systems p 2060–2068
Schölkopf B, Platt JC, Shawe-Taylor J, Smola AJ, Williamson RC (2001) Estimating the support of a high-dimensional distribution. Neural Comput 13(7):1443–1471
Stone CJ (1980) Optimal rates of convergence for nonparametric estimators. Ann Stat p 1348–1360
Székely GJ, Rizzo ML (2004) Testing for equal distributions in high dimension. InterStat 5
Vert R, Vert J-P, Schölkopf B (2006) Consistency and convergence rates of one-class SVMs and related algorithms. J Mach Learn Res 7(5)
Wand MP, Jones MC (1994) Kernel smoothing. CRC Press

Author information

Authors and Affiliations

Universidad Carlos III de Madrid, Calle Madrid, 126, Getafe (Madrid), 28903, Spain
Alberto Muñoz
Universidad Torcuato Di Tella, Av Figueroa Alcorta 7350, Buenos Aires, Argentina
Gabriel Martos
Microsoft Research Cambridge, 21 Station Road, Cambridge, CB1 2FB, UK
Javier Gonzalez

Corresponding author

Correspondence to Gabriel Martos.

Ethics declarations

Ethical Statement

Authors have no affiliations with or involvement in any organization or entity with any financial interest or non-financial interest in the subject matter or materials discussed in this manuscript.

Conflict of Interest Statement

None of the authors have a conflict of interest.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file1 (PDF 413 KB)

Technical Appendix

The semimetric in Equation (1) is non-negative and symmetric by definition, and it also obeys the triangle inequality (see the Biotope transformation in Deza and Deza (2009), p. 118, for further details).

Proposition: The semimetric \(\mathrm {D}(f,q,\varvec{\nu },\mathbf {w})=\sum _{i=1}^{k}w_i \mathrm {d}\left( \mathcal {A}_i(f,\varvec{\nu }),\mathcal {A}_i(q,\varvec{\nu })\right)\) behaves as a proper metric when: (1) \(\mathrm {d}:\mathcal {X}\times \mathcal {X} \rightarrow \mathbb {R}^+\) is a metric between subsets in \(\mathcal {X}\), and (2) \(\varvec{\nu }_k\equiv \{\nu _0,\dots ,\nu _k\}\) is an asymptotically dense set in [0, 1]. Moreover, in the limit \(\mathrm {D}(f,q,\varvec{\nu }_k,\mathbf {w})\) does not depend on the sequence \(\varvec{\nu }_k\) (it depends only on \(\mathbf {w}\)).

Proof: We need to show that \(\mathrm {D}(f,q,\varvec{\nu }_k,\mathbf {w}){\mathop {\longrightarrow }\limits ^{ k\rightarrow \infty }}0\) if and only if \(f=q\). Consider the asymptotically dense set \(\varvec{\nu }_k = \{ \frac{i}{k}\}_{i=0}^k\). If \(f = q\), then for all \(k\in \mathbb {N}\) it holds that \(d_i(f,q,\varvec{\nu }_k)=0\) for \(i \in \{1,\dots ,k\}\), leading to \(\mathrm {D}(f,q,\varvec{\nu }_k,\mathbf {w})=0\). When \(f \ne q\), there exists a constant \(k_0\) such that for all \(k > k_0\), \(d_i(f,q,\varvec{\nu }_k) > 0\) for at least one \(i \in \{1,\dots ,k\}\); therefore \(\mathrm {D}(f,q,\varvec{\nu }_k,\mathbf {w})>0\) for \(k>k_0\), and then \(\lim \limits _{k \rightarrow \infty } \mathrm {D}(f,q,\varvec{\nu }_k,\mathbf {w})>0\). Notice also that for any asymptotically dense set \(\varvec{\nu }_k\) we have \(\lim \limits _{k \rightarrow \infty } \mathrm {D}(f,q,\varvec{\nu }_k,\mathbf {w})=\mathrm {D}(f,q,\mathbf {w})\), since all asymptotically dense sequences \(\varvec{\nu }_k\) yield (asymptotically) the same collection of level sets to compute \(\mathrm {D}(f,q,\mathbf {w})\).
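
To make the weighted sum in the Proposition concrete, the following Python sketch evaluates it on a grid for two known univariate densities. It rests on assumptions that are not taken from this appendix: the sets \(\mathcal {A}_i\) are taken to be the regions where the density, rescaled by its maximum, falls in \((\nu _{i-1},\nu _i]\); the set distance is the biotope-type ratio of the measure of the symmetric difference to the measure of the union; the weights are uniform; and all function names are hypothetical. The paper's own definitions and estimators may differ.

```python
import numpy as np

def level_bands(density_values, nu):
    """One boolean mask per band (nu[i-1], nu[i]] of the density rescaled to [0, 1].

    Assumed definition of the sets A_i, used only for illustration.
    """
    scaled = density_values / density_values.max()
    return [(scaled > lo) & (scaled <= hi) for lo, hi in zip(nu[:-1], nu[1:])]

def biotope_distance(a, b):
    """|A symmetric-difference B| / |A union B| on a uniform grid (0 if both empty)."""
    union = np.count_nonzero(a | b)
    if union == 0:
        return 0.0
    return np.count_nonzero(a ^ b) / union

def level_set_semimetric(f_values, q_values, k, weights=None):
    """Weighted sum of set distances over the k bands of the grid nu_i = i / k."""
    nu = np.linspace(0.0, 1.0, k + 1)
    if weights is None:
        weights = np.full(k, 1.0 / k)         # uniform weights (an assumption)
    bands_f = level_bands(f_values, nu)
    bands_q = level_bands(q_values, nu)
    return float(sum(w * biotope_distance(a, b)
                     for w, a, b in zip(weights, bands_f, bands_q)))

def normal_pdf(x, mean, sd):
    return np.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

grid = np.linspace(-8.0, 8.0, 4001)
f_vals = normal_pdf(grid, 0.0, 1.0)
q_vals = normal_pdf(grid, 1.0, 1.0)           # same shape, shifted mean

d_same = level_set_semimetric(f_vals, f_vals, k=10)
d_diff = level_set_semimetric(f_vals, q_vals, k=10)
print(d_same, d_diff)
```

In this sketch identical densities give a value of exactly zero, while the shifted density gives a strictly positive value bounded by one, which is the behaviour the Proposition describes for the population quantity. With sample data the density values would have to be replaced by estimates, which this sketch does not attempt.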

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Muñoz, A., Martos, G. & Gonzalez, J. Level Sets Semimetrics for Probability Measures with Applications in Hypothesis Testing. Methodol Comput Appl Probab 25, 21 (2023). https://doi.org/10.1007/s11009-023-09990-5

Received: 30 March 2022
Revised: 17 October 2022
Accepted: 27 October 2022
Published: 11 February 2023
DOI: https://doi.org/10.1007/s11009-023-09990-5

Mathematics Subject Classification

62
