Short communication
Near universal consistency of the maximum pseudolikelihood estimator for discrete models

https://doi.org/10.1016/j.jkss.2017.10.001

Abstract

Maximum pseudolikelihood (MPL) estimators are useful alternatives to maximum likelihood (ML) estimators when likelihood functions are more difficult to manipulate than their marginal and conditional components. Furthermore, MPL estimators subsume a large number of estimation techniques including ML estimators, maximum composite marginal likelihood estimators, and maximum pairwise likelihood estimators. When considering only the estimation of discrete models (on a possibly countably infinite support), we show that a simple finiteness assumption on an entropy-based measure is sufficient for assessing the consistency of the MPL estimator. As a consequence, we demonstrate that the MPL estimator of any discrete model on a bounded support will be consistent. Our result is valid in parametric, semiparametric, and nonparametric settings.

Introduction

Let $X$ be a random variable that takes values over the domain $\mathbb{X}$. Suppose that $X$ arises from some probability model with density function $f_0(x)$. One of the core problem domains of machine learning, signal processing, and statistics is to devise an estimator $\hat f_n$ for $f_0$ that is in some sense close, using an independent and identically distributed (IID) sample $\mathbf{X}_n = \{X_i\}_{i=1}^n$ from a distribution with density $f_0$. In such estimation problems, the general measure of success is consistency (i.e., for $\hat f_n$ to approach $f_0$ in some sense, as $n$ goes to infinity). The importance and fundamental nature of consistency is expounded well in Vapnik (2000, Ch. 2).

Let $f = f(\cdot;\theta)$ be from a parametrically determined family of probability density functions (PDFs), where $\theta \in \Theta \subseteq \mathbb{R}^p$ for some $p \in \mathbb{N}$. Assuming that $f_0$ can be written as $f(\cdot;\theta_0)$, for some $\theta_0 \in \Theta$, the conditions for the consistency of the maximum likelihood (ML) estimator $\hat\theta_n$, which maximizes the log-likelihood function $\sum_{i=1}^n \log f(X_i;\theta)$, are well-established in foundational works such as Kiefer and Wolfowitz (1956) and Wald (1949). When we estimate $f_0$ using PDFs from some general (potentially nonparametric) class $\mathcal{F}$, conditions for the consistency of the ML estimator $\hat f_n = \arg\max_{f \in \mathcal{F}} \sum_{i=1}^n \log f(X_i)$ have also been established in works such as Patilea (2001) and van de Geer (1993); see also Gine and Nickl (2015, Sec. 7.2). Apart from the likelihood function, general consistency conditions for estimates obtained via extremum estimation (Amemiya, 1985, Ch. 4), empirical-risk minimization (Vapnik, 2000, Ch. 2), and minimum-contrast estimation (Bickel & Doksum, 2000, Sec. 2.2) have also been broadly studied.
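As a minimal numerical illustration of parametric ML estimation for a discrete model (a sketch of mine, not from the paper; the Poisson family, true rate, grid, and sample size are illustrative assumptions), one can maximize the log-likelihood over a grid of candidate rates and watch the estimate settle near the truth:

```python
import math
import random

def poisson_sample(lam, n, rng):
    # Inverse-CDF sampling from the Poisson(lam) distribution.
    out = []
    for _ in range(n):
        u = rng.random()
        k, p = 0, math.exp(-lam)
        cdf = p
        while u > cdf:
            k += 1
            p *= lam / k
            cdf += p
        out.append(k)
    return out

def log_likelihood(lam, xs):
    # sum_i log f(x_i; lam) for the Poisson PMF.
    return sum(x * math.log(lam) - lam - math.lgamma(x + 1) for x in xs)

rng = random.Random(0)
xs = poisson_sample(3.0, 5000, rng)  # true rate 3.0 (illustrative)

# ML estimate over a coarse grid of candidate rates.
grid = [0.1 * j for j in range(1, 101)]
lam_hat = max(grid, key=lambda lam: log_likelihood(lam, xs))
```

With $n = 5000$ draws the grid maximizer sits close to the true rate, in line with the classical consistency results cited above.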

The maximum pseudolikelihood (MPL) estimator is an extremum estimator that is related to the ML estimator. Let $\mathbb{X} \subseteq \mathbb{R}^q$ for $q \in \mathbb{N}$, and write $X = (X_1,\dots,X_q)$ and $X_i = (X_{i1},\dots,X_{iq})$, for each $i \in [n]$ ($[n] = \{1,\dots,n\}$). Define $\mathbb{S} = 2^{[q]} \setminus \{\mathbf{0}\}$ to be the power set of $[q]$ that excludes the zero-string $\mathbf{0}$ (i.e., the empty set), and define $f_S(x_S)$ to be the marginal PDF over the coordinates of $X$ that are in $S \in \mathbb{S}$. We write $X_S$ for the vector that contains only the coordinates that are in the set $S$. Next, define $\mathbb{T}$ to be the set of partitions of all subsets of $[q]$ into two non-empty sets. Referring to the two parts of a partition $T \in \mathbb{T}$ as left and right, we write $X_{T_{\mathrm{L}}}$ and $X_{T_{\mathrm{R}}}$ to indicate the vectors containing the elements of $X$ that are selected in the left and right set, respectively. Let $f_T(x_{T_{\mathrm{L}}} \mid x_{T_{\mathrm{R}}})$ be the conditional PDF of $X_{T_{\mathrm{L}}}$ given $X_{T_{\mathrm{R}}} = x_{T_{\mathrm{R}}}$, for each $T \in \mathbb{T}$. Note that we use the usual convention of upper case for random variables and lower case for realizations.
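To make the index sets concrete, the following enumeration sketch (mine; $q = 4$ and the treatment of splits as ordered left/right pairs are illustrative assumptions) builds $\mathbb{S}$ and $\mathbb{T}$ explicitly and checks their sizes against $2^q - 1$ and the Stirling numbers of the second kind $S(m, 2) = 2^{m-1} - 1$:

```python
from itertools import combinations
from math import comb

q = 4
coords = range(1, q + 1)

# S: all non-empty subsets of [q] = {1, ..., q}.
S = [set(c) for m in range(1, q + 1) for c in combinations(coords, m)]

# T: ordered (left, right) splits of every subset of [q] with at least
# two elements into two non-empty parts.
T = []
for m in range(2, q + 1):
    for c in combinations(coords, m):
        for k in range(1, m):
            for left in combinations(c, k):
                T.append((set(left), set(c) - set(left)))

def stirling2(m):
    # Stirling number of the second kind S(m, 2) = 2^(m - 1) - 1.
    return 2 ** (m - 1) - 1

# Each size-m subset admits 2 * S(m, 2) ordered two-part splits.
expected_T = sum(comb(q, m) * 2 * stirling2(m) for m in range(2, q + 1))
```

Running this for $q = 4$ gives $|\mathbb{S}| = 15 = 2^4 - 1$, and the enumerated splits agree with the Stirling-number count.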

For each pair of constant vectors $\mathbf{c} = (c_S)_{S \in \mathbb{S}} \in [0,\infty)^{|\mathbb{S}|}$ and $\mathbf{d} = (d_T)_{T \in \mathbb{T}} \in [0,\infty)^{|\mathbb{T}|}$, for which $\mathbf{c} \neq \mathbf{0}$ if $\mathbf{d} = \mathbf{0}$ and $\mathbf{d} \neq \mathbf{0}$ if $\mathbf{c} = \mathbf{0}$, we can define a pseudolikelihood (PL) function as
$$\mathcal{L}_f^{\mathbf{c},\mathbf{d}}(\mathbf{X}_n) = \sum_{i=1}^n \sum_{S \in \mathbb{S}} c_S \log f_S(X_{iS}) + \sum_{i=1}^n \sum_{T \in \mathbb{T}} d_T \log f_T(X_{iT_{\mathrm{L}}} \mid X_{iT_{\mathrm{R}}}), \quad (1)$$
where $X_{iS}$ is distributed with marginal PDF $f_S(x_S)$, and $X_{iT_{\mathrm{L}}}$ and $X_{iT_{\mathrm{R}}}$ are jointly distributed with PDF $f_T(x_{T_{\mathrm{L}}} \mid x_{T_{\mathrm{R}}}) f_{T_{\mathrm{R}}}(x_{T_{\mathrm{R}}})$, as per their counterparts without subscripts: $X_S$, $X_{T_{\mathrm{L}}}$, and $X_{T_{\mathrm{R}}}$; see Arnold and Strauss (1991) for a definition of the PL function that is compatible with Eq. (1). The PL functions of form (1) can also be viewed as the nonparametric generalization of the composite likelihood functions that were studied in Cox and Reid (2004) and Lindsay (1988). We note that $|\mathbb{S}| = 2^q - 1$ and that $|\mathbb{T}|$ can be counted using Stirling numbers of the second kind; see Charalambides (2002, Ch. 8) for details.
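For $q = 2$, a PL of form (1) can be evaluated with a short numerical sketch (mine, not from the paper; the joint PMF table, the weights $\mathbf{c}$ and $\mathbf{d}$, and the data are illustrative). The $S$-terms here use the two univariate marginals and the $T$-terms use the two full conditionals:

```python
import math

# Hypothetical joint PMF of (X1, X2) on {0, 1}^2.
joint = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}

def marginal(joint, coord):
    # Marginal PMF of coordinate `coord` (0 or 1).
    out = {}
    for x, p in joint.items():
        out[x[coord]] = out.get(x[coord], 0.0) + p
    return out

def conditional(joint, given):
    # Conditional PMF of the other coordinate given coordinate `given`,
    # keyed by the full pair (x1, x2).
    marg = marginal(joint, given)
    return {x: p / marg[x[given]] for x, p in joint.items()}

def pseudolikelihood(joint, data, c=(1.0, 1.0), d=(1.0, 1.0)):
    # PL of form (1): weighted marginal plus conditional log terms.
    m1, m2 = marginal(joint, 0), marginal(joint, 1)
    c1g2, c2g1 = conditional(joint, 1), conditional(joint, 0)
    total = 0.0
    for x in data:
        total += c[0] * math.log(m1[x[0]]) + c[1] * math.log(m2[x[1]])
        total += d[0] * math.log(c1g2[x]) + d[1] * math.log(c2g1[x])
    return total

data = [(0, 0), (1, 1), (0, 1)]
pl = pseudolikelihood(joint, data)
```

Setting some entries of `c` and `d` to zero recovers familiar special cases, e.g. a purely marginal composite likelihood or a purely conditional (Besag-style) pseudolikelihood.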

We say that
$$\tilde f_n \in \Big\{\tilde f \in \mathcal{F} : \mathcal{L}_{\tilde f}^{\mathbf{c},\mathbf{d}}(\mathbf{X}_n) = \max_{f \in \mathcal{F}} \mathcal{L}_f^{\mathbf{c},\mathbf{d}}(\mathbf{X}_n)\Big\}$$
is the MPL estimator over the functional class $\mathcal{F}$, provided that all of the necessary marginals $\tilde f_{nS}$ and conditionals $\tilde f_{nT}(\cdot \mid \cdot)$ that contribute to $\mathcal{L}_f^{\mathbf{c},\mathbf{d}}$ (i.e., $c_S \neq 0$ or $d_T \neq 0$) are compatible with $\tilde f_n$, in the sense that we can compute all $\tilde f_{nS}$ and $\tilde f_{nT}(\cdot \mid \cdot)$ from $\tilde f_n$ using the usual laws for deriving marginal and conditional PDFs; see Arnold, Castillo, and Sarabia (1999) and Joe (1997) for further details regarding the theory of compatibility when defining PL functions. When $\mathcal{F}$ is a parametric family such that $f_0 = f(\cdot;\theta_0)$, and when the marginal and conditional densities $f_S = f_S(\cdot;\theta)$ and $f_T(\cdot \mid \cdot) = f_T(\cdot \mid \cdot;\theta)$ (for $c_S \neq 0$ and $d_T \neq 0$) are all compatible with some density $f(\cdot;\theta)$, for $\theta \in \Theta$, the conditions for consistency of the parametric MPL estimator
$$\tilde\theta_n \in \Big\{\tilde\theta \in \Theta : \mathcal{L}_f^{\mathbf{c},\mathbf{d}}(\mathbf{X}_n;\tilde\theta) = \max_{\theta \in \Theta} \mathcal{L}_f^{\mathbf{c},\mathbf{d}}(\mathbf{X}_n;\theta)\Big\}$$
have been obtained by Arnold and Strauss (1991) and Lindsay (1988); see also Molenberghs and Verbeke (2005, Ch. 9). Here, $\mathcal{L}_f^{\mathbf{c},\mathbf{d}}(\mathbf{X}_n;\theta)$ is defined as per (1), with $f_S(\cdot;\theta)$ and $f_T(\cdot \mid \cdot;\theta)$ replacing $f_S$ and $f_T(\cdot \mid \cdot)$, respectively, for each $S$ and $T$.
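The parametric MPL estimator can be illustrated with a toy two-node binary model $f(x;\theta) \propto \exp(\theta x_1 x_2)$ on $\{0,1\}^2$, in the spirit of the fully visible Boltzmann machines treated by Hyvarinen (2006). The sketch below is mine: it uses a Besag-style PL with all $c_S = 0$ and both full-conditional $d_T$ weights equal to one, and the true parameter, seed, grid, and sample size are illustrative assumptions.

```python
import math
import random
from collections import Counter

def joint_pmf(theta):
    # f(x1, x2; theta) proportional to exp(theta * x1 * x2) on {0, 1}^2.
    w = {(a, b): math.exp(theta * a * b) for a in (0, 1) for b in (0, 1)}
    z = sum(w.values())
    return {x: v / z for x, v in w.items()}

def neg_pl(theta, cell_counts):
    # Negative Besag pseudolikelihood: both full-conditional log terms,
    # using P(x1 = 1 | x2) = sigmoid(theta * x2), and symmetrically for x2.
    total = 0.0
    for (x1, x2), c in cell_counts.items():
        for target, other in ((x1, x2), (x2, x1)):
            p1 = 1.0 / (1.0 + math.exp(-theta * other))
            total += c * math.log(p1 if target == 1 else 1.0 - p1)
    return -total

rng = random.Random(1)
pmf0 = joint_pmf(1.0)  # true parameter theta_0 = 1 (illustrative)
data = rng.choices(list(pmf0), weights=list(pmf0.values()), k=20000)
cell_counts = Counter(data)

# Grid-search MPL estimator over theta in [-2, 2].
grid = [0.01 * j - 2.0 for j in range(401)]
theta_tilde = min(grid, key=lambda t: neg_pl(t, cell_counts))
```

Because the two full conditionals are, by construction, compatible with the joint $f(\cdot;\theta)$, the grid maximizer converges on the true $\theta_0$ as $n$ grows, matching the parametric consistency conditions cited above.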

MPL estimation, as an alternative to ML estimation, was first introduced in Besag (1974). The use of MPL estimation has become ubiquitous in statistics and machine learning, with applications ranging from multivariate modeling of toxicology data (Geys, Molenberghs, & Ryan, 1999), to genetic mapping (Ahfock, Wood, Stephen, Cavanagh, & Huang, 2014), to the fitting of neural networks (Hyvarinen, 2006; Nguyen & Wood, 2016a, 2016b). Numerous other applications are presented in Arnold et al. (1999) and Molenberghs and Verbeke (2005). Recent reviews of MPL estimation appear in Varin, Reid, and Firth (2011) and Yi (2017).

We now concentrate our attention on the estimation of probability mass functions (PMFs) for random variables in $\mathbb{X} \subseteq \mathbb{Z}^q$, where $\mathbb{X}$ can potentially be countably infinite. When $q = 1$, Seo and Lindsay (2013) presented a simple entropy-based criterion for checking the consistency of the ML estimator for any PMF over $\mathbb{Z}$. Using their criterion, it is easy to observe that if $\mathbb{X}$ is bounded, then the criterion is always satisfied, which yields a universal consistency result. When $\mathbb{X}$ is unbounded, however, there may still exist classes that cannot be consistently estimated via MPL estimation.

Using the proof technique of Seo and Lindsay (2013), we derive a consistency theorem for the MPL estimation of PMFs for data in $\mathbb{X} \subseteq \mathbb{Z}^q$. Our result subsumes that of Seo and Lindsay (2013), since the ML estimator, when $q = 1$, is an MPL estimator. Our result is suitable for the parametric, semiparametric, and nonparametric estimation settings. However, in the case of parametric and semiparametric estimation, it can only identify the parameters of interest up to a set of maxima, unless the PL function is identifiable with respect to the parameter space. We proceed with the presentation of the main result in Section 2. Some example applications of the main result are provided in Section 3.


Main result

Assume from here on that $\mathbb{X} \subseteq \mathbb{Z}^q$. Let $X_S$, $X_{T_{\mathrm{L}}}$, and $X_{T_{\mathrm{R}}}$ be the coordinates of $X$ that correspond to the sets of coordinates in $S \in \mathbb{S}$, and in the left and right parts of $T \in \mathbb{T}$, respectively. Let the IID sample $\mathbf{X}_n$ arise from a distribution with PMF $f_0 \in \mathcal{F}$, where $\mathcal{F}$ is some class of PMFs, with marginals and conditionals $f_{0S}$ and $f_{0T}(\cdot \mid \cdot)$ for all $S$ and $T$. If all relevant marginal and conditional PMFs $\tilde f_{nS}$ and $\tilde f_{nT}(\cdot \mid \cdot)$ (i.e., $c_S \neq 0$ and $d_T \neq 0$) are compatible with the estimator $\tilde f_n$, and if $\tilde f_{nS}(x_S) \to f_{0S}(x_S)$ and $\tilde f_{nT}(x_{T_{\mathrm{L}}} \mid x_{T_{\mathrm{R}}}) \to f_{0T}(x_{T_{\mathrm{L}}} \mid x_{T_{\mathrm{R}}})$ with …
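The pointwise convergence conditions in the snippet above ($\tilde f_{nS} \to f_{0S}$ and $\tilde f_{nT} \to f_{0T}$) can be checked empirically for the plug-in (empirical-frequency) estimator. The sketch below is mine; the true joint PMF on $\{0,1\}^2$, seed, and sample size are illustrative.

```python
import random
from collections import Counter

# Hypothetical true joint PMF of (X1, X2) on {0, 1}^2.
f0 = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}

rng = random.Random(2)
n = 50000
data = rng.choices(list(f0), weights=list(f0.values()), k=n)

# Plug-in joint, marginal, and conditional PMF estimates.
counts = Counter(data)
f_hat = {x: counts[x] / n for x in f0}
m2_hat = {b: f_hat[(0, b)] + f_hat[(1, b)] for b in (0, 1)}
cond_hat = {x: f_hat[x] / m2_hat[x[1]] for x in f0}  # est. P(x1 | x2)

# Compare with the true marginal and conditional of X1 given X2.
m2_true = {b: f0[(0, b)] + f0[(1, b)] for b in (0, 1)}
cond_true = {x: f0[x] / m2_true[x[1]] for x in f0}
max_err = max(abs(cond_hat[x] - cond_true[x]) for x in f0)
```

On a bounded support such as this one, the empirical frequencies converge pointwise to every cell probability, so the marginal and conditional plug-in estimates inherit the convergence demanded by the theorem.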

Example applications

Example 1 Categorical Distribution

We begin with a toy example. Let $X_i$ be a random variable (for $i \in [n]$) and suppose that $P(X_i = e_k) = \pi_k > 0$, where $e_k$ is a $q$-dimensional vector with a one at the $k$th coordinate and zeros at all other coordinates ($k \in [q]$), and where $\sum_{k=1}^q \pi_k = 1$. We say that $X_i$ arises from a categorical distribution with parameter vector $\pi = (\pi_1,\dots,\pi_q)$, if its distribution can be characterized by the PMF
$$f(x_i;\pi) = \prod_{k=1}^q \pi_k^{\mathbb{I}(x_i = e_k)}.$$
The categorical distribution is the single-trial case of the multinomial distribution and has …
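For the categorical model, the ML estimate of $\pi$ is the vector of empirical category proportions, and on this bounded support consistency is automatic. A minimal sketch (mine; the true $\pi$, sample size, and seed are illustrative):

```python
import random
from collections import Counter

q = 3
pi0 = [0.2, 0.5, 0.3]  # hypothetical true parameter vector

rng = random.Random(3)
n = 10000
# Each draw selects the category k of a one-hot vector e_k with
# probability pi0[k].
labels = rng.choices(range(q), weights=pi0, k=n)

# ML estimate: empirical proportion of each category.
counts = Counter(labels)
pi_hat = [counts[k] / n for k in range(q)]
```

By the law of large numbers each proportion converges to its $\pi_k$, which is the bounded-support case covered by the universal consistency result of the abstract.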

Acknowledgments

The author is grateful to the Associate Editor and two reviewers for suggestions that have greatly improved the quality of the paper. The author is personally supported by Australian Research Council grant number DE170101134.

References (39)

  • Seo, B., & Lindsay, B. (2013). Nearly universal consistency of maximum likelihood in discrete models. Statistics & Probability Letters.
  • Ahfock, D., et al. (2014). Characterizing uncertainty in high-density maps for multiparental populations. Genetics.
  • Albert, I., et al. (2012). Dirichlet and multinomial distributions: properties and uses in JAGS (Technical report 2012-5).
  • Amemiya, T. (1985). Advanced econometrics.
  • Arnold, B. C., et al. (1999). Conditional specification of statistical models.
  • Arnold, B. C., et al. (1991). Pseudolikelihood estimation: some examples. Sankhya B.
  • Bauer, H. (2001). Measure and integration theory.
  • Bengio, Y. (2009). Learning deep architectures for AI. Foundations and Trends in Machine Learning.
  • Besag, J. (1974). Spatial interaction and the statistical analysis of lattice systems. Journal of the Royal Statistical Society, Series B.
  • Bickel, P. J., et al. (2000). Mathematical statistics: Basic ideas and selected topics, Vol. 1.
  • Bishop, C. M. (2006). Pattern recognition and machine learning.
  • Charalambides, C. A. (2002). Enumerative combinatorics.
  • Chung, K. L. (2001). A course in probability theory.
  • Cox, D. R., & Reid, N. (2004). A note on pseudolikelihood constructed from marginal densities. Biometrika.
  • Geys, H., et al. (1999). Pseudolikelihood modeling of multivariate outcomes in developmental toxicology. Journal of the American Statistical Association.
  • Gine, E., & Nickl, R. (2015). Mathematical foundations of infinite-dimensional statistical models.
  • Hyvarinen, A. (2006). Consistency of pseudolikelihood estimation of fully visible Boltzmann machines. Neural Computation.
  • Jacob, P., et al. (2012). Local smoothing with given marginals. Journal of Statistical Computation and Simulation.
  • Joe, H. (1997). Multivariate models and dependence concepts.