Short communication

Near universal consistency of the maximum pseudolikelihood estimator for discrete models
Introduction
Let X be a random variable that takes values over the domain 𝒳. Suppose that X arises from some probability model with density function f_0. One of the core problem domains of machine learning, signal processing, and statistics is to devise an estimator f̂_n for f_0 that is in some sense close, using an independent and identically distributed (IID) sample X_1, …, X_n from a distribution with density f_0. In such estimation problems, the general measure of success is consistency (i.e., for f̂_n to approach f_0, in some sense, as n goes to infinity). The importance and fundamental nature of consistency is expounded well in Vapnik (2000, Ch. 2).
Let f_θ be from a parametrically determined family of probability density functions (PDFs) {f_θ : θ ∈ Θ}, where Θ ⊆ ℝ^p for some p ∈ ℕ. Assuming that f_0 can be written as f_{θ_0}, for some θ_0 ∈ Θ, the conditions for the consistency of the maximum likelihood (ML) estimator θ̂_n, which maximizes the log-likelihood function ∑_{i=1}^n log f_θ(X_i), are well-established in foundational works such as Kiefer and Wolfowitz (1956) and Wald (1949). When we estimate f_0 using PDFs from some general (potentially nonparametric) class ℱ, conditions for the consistency of the ML estimator f̂_n have also been established in works such as Patilea (2001) and van de Geer (1993); see also Gine and Nickl (2015, Sec. 7.2). Apart from the likelihood function, general consistency conditions for estimates obtained via extremum estimation (Amemiya, 1985, Ch. 4), empirical-risk minimization (Vapnik, 2000, Ch. 2), and minimum-contrast estimation (Bickel & Doksum, 2000, Sec. 2.2) have also been broadly studied.
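As an illustrative aside (not part of the original analysis), the consistency of a parametric ML estimator can be checked numerically for a simple discrete family. The sketch below, under the assumption of a Poisson model with an illustrative true rate theta0 = 3, uses the fact that the Poisson log-likelihood is maximized by the sample mean, and shows the estimate approaching the truth as n grows.

```python
import numpy as np

rng = np.random.default_rng(0)
theta0 = 3.0  # assumed true Poisson rate (illustrative)

# The Poisson ML estimator of the rate is the sample mean, i.e. the
# maximizer of the log-likelihood sum_i log f_theta(X_i) over theta > 0.
for n in (10**2, 10**4, 10**6):
    x = rng.poisson(theta0, size=n)
    theta_hat = x.mean()
    print(n, theta_hat)
```

For n = 10^6 the estimate is within a few thousandths of theta0, consistent with the root-n rate suggested by classical theory.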
The maximum pseudolikelihood (MPL) estimator is an extremum estimator that is related to the ML estimator. Let X = (X_1, …, X_d) for d ∈ ℕ, and write x = (x_1, …, x_d), for each realization x of X. Define 𝕊 to be the power set of [d] = {1, …, d} that excludes the empty set (coded as the zero-string 0 when subsets are represented by binary strings), and for each s ∈ 𝕊 define f_s to be the marginal PDF over the coordinates of X that are in s. We write x_s as the vector that contains only the coordinates of x that are in the set s. Next, define ℙ to be the partitions of all subsets of [d] into two non-empty sets. Referring to the two parts of each partition as left and right, we write x_l and x_r to indicate the vectors containing the elements of x that are selected in the left and right set, respectively. Let f_{l|r}(x_l | x_r) be the conditional PDF of X_l given X_r = x_r, for each (l, r) ∈ ℙ. Note that we use the usual convention of upper case for random variables and lower case for realizations.
For each set of constants α_s and β_{l,r} for which α_s ≥ 0 if s ∈ 𝕊 and β_{l,r} ≥ 0 if (l, r) ∈ ℙ (with at least one constant strictly positive), we can define a pseudolikelihood (PL) function as

PL(f; x) = ∏_{s ∈ 𝕊} f_s(x_s)^{α_s} × ∏_{(l,r) ∈ ℙ} f_{l|r}(x_l | x_r)^{β_{l,r}},  (1)

where X_s is distributed with marginal PDF f_s, and X_l and X_r are distributed with marginal PDFs f_l and f_r, as per their counterparts without subscripts: X, x, and f; see Arnold and Strauss (1991) for a definition of the PL function that is compatible with Eq. (1). The PL functions of form (1) can also be viewed as the nonparametric generalization of the composite likelihood functions that were studied in Cox and Reid (2004) and Lindsay (1988). We note that the cardinalities of 𝕊 and ℙ can be counted using Stirling numbers of the second kind; see Charalambides (2002, Ch. 8) for details.
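To make the construction in (1) concrete, the following hedged sketch evaluates a log PL for a bivariate binary PMF using the full-conditional weights of Besag (1974), i.e. log PL(x) = log p(x_1 | x_2) + log p(x_2 | x_1); the joint table p and the evaluation point are illustrative assumptions, not values from the paper.

```python
import numpy as np

# Illustrative joint PMF table for (X1, X2) on {0,1}^2; p[x1, x2].
p = np.array([[0.3, 0.2],
              [0.1, 0.4]])

def log_pl(x1, x2):
    """Log pseudolikelihood built from the two full conditionals."""
    cond_1_given_2 = p[x1, x2] / p[:, x2].sum()  # p(x1 | x2)
    cond_2_given_1 = p[x1, x2] / p[x1, :].sum()  # p(x2 | x1)
    return np.log(cond_1_given_2) + np.log(cond_2_given_1)

print(log_pl(0, 0))  # log(0.3/0.4) + log(0.3/0.5)
```

This corresponds to the weight choice in (1) that sets every β_{l,r} for the two singleton-conditional terms to one and all other constants to zero.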
We say that f̂_n is the MPL estimator over the functional class ℱ, provided that all of the necessary marginals and conditionals that contribute to (1) (i.e., f_s for s ∈ 𝕊, or f_{l|r} for (l, r) ∈ ℙ) are compatible with ℱ, in the sense that we can compute all f_s and f_{l|r} from some f ∈ ℱ using the usual laws for deriving marginal and conditional PDFs; see Arnold, Castillo, and Sarabia (1999) and Joe (1997) for further details regarding the theory of compatibility when defining PL functions. When ℱ is a parametric family such that ℱ = {f_θ : θ ∈ Θ}, and when all of the marginal and conditional densities f_{s,θ} and f_{l|r,θ} (for s ∈ 𝕊 and (l, r) ∈ ℙ) are all compatible with some density f_θ, for θ ∈ Θ, the conditions for consistency for the parametric MPL estimator θ̂_n have been obtained by Arnold and Strauss (1991) and Lindsay (1988); see also Molenberghs and Verbeke (2005, Ch. 9). Here, θ̂_n is defined as the maximizer of (1), with f_{s,θ} and f_{l|r,θ} replacing f_s and f_{l|r}, respectively, for each s and (l, r).
MPL estimation, as an alternative to ML estimation, was first introduced in Besag (1974). The use of MPL estimation has become ubiquitous in statistics and machine learning, with applications from multivariate modeling of toxicology data (Geys, Molenberghs, & Ryan, 1999), to genetic mapping (Ahfock, Wood, Stephen, Cavanagh, & Huang, 2014), and fitting of neural networks (Hyvarinen, 2006; Nguyen & Wood, 2016a, 2016b). Numerous other applications are presented in Arnold et al. (1999) and Molenberghs and Verbeke (2005). Recent reviews of MPL estimation appear in Varin, Reid, and Firth (2011) and Yi (2017).
We now concentrate our attention on the estimation of probability mass functions (PMFs) for random variables in 𝕏, where 𝕏 can potentially be countably infinite. When d = 1, Seo and Lindsay (2013) presented a simple criterion for checking the consistency of the ML estimator for any PMF over 𝕏, via a simple entropy criterion. Using their criterion, it is then easy to observe that if 𝕏 is bounded, then the criterion is always satisfied, and this thus provides a universal consistency result. When 𝕏 is unbounded, however, there still may exist classes ℱ that cannot be consistently estimated via ML estimation.
Using the proof technique from Seo and Lindsay (2013), we derive an MPL estimator consistency theorem for the estimation of PMFs for data in 𝕏. Our result subsumes that of Seo and Lindsay (2013), since the ML estimator is itself a special case of the MPL estimator. Our result is suitable for the parametric, semiparametric, and nonparametric estimation settings. However, in the case of parametric and semiparametric estimation, it can only identify the parameters of interest up to a set of maxima, unless the PL function is identifiable with respect to the parameter space. We proceed with the presentation of the main result in Section 2. Some example applications of the main result are provided in Section 3.
Main result
Assume from here on that 𝕏 is discrete. Let x_s, x_l, and x_r be the coordinates of x that correspond to the sets of coordinates in s, l, and r, respectively, where s ∈ 𝕊 and (l, r) ∈ ℙ. Let the IID sample X_1, …, X_n arise from a distribution with PMF f_0 ∈ ℱ, where ℱ is some class of PMFs, with marginals and conditionals f_{0,s} and f_{0,l|r} for all s ∈ 𝕊 and (l, r) ∈ ℙ. If all relevant marginal and conditional PMFs f_s and f_{l|r} (i.e., those with α_s > 0 and β_{l,r} > 0) are compatible with the estimator f̂_n, and if …
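As a numerical companion to the flavor of the main result (this demonstration is not from the paper), the sketch below fits a two-site binary interaction model p(x; θ) ∝ exp(θ x_1 x_2) by maximizing the full-conditional pseudolikelihood over a grid; the true value θ_0 = 1.5, the sample size, and the grid are illustrative assumptions. Both full conditionals are logistic, p(x_1 = 1 | x_2) = exp(θ x_2)/(1 + exp(θ x_2)), which makes the PL easy to evaluate even though the joint normalizing constant is ignored.

```python
import numpy as np

rng = np.random.default_rng(2)
theta0 = 1.5  # assumed true interaction parameter (illustrative)

def joint(theta):
    """Joint PMF table on {0,1}^2 for p(x; theta) ∝ exp(theta * x1 * x2)."""
    w = np.array([[1.0, 1.0], [1.0, np.exp(theta)]])
    return w / w.sum()

# Draw an IID sample from the true joint PMF.
p0 = joint(theta0).ravel()
idx = rng.choice(4, size=20_000, p=p0)
x1, x2 = idx // 2, idx % 2

def neg_log_pl(theta):
    # Average negative log PL: -(log p(x1|x2) + log p(x2|x1)) / n.
    def log_cond(a, b):  # log p(a | b) under the logistic conditional
        eta = theta * b
        return a * eta - np.log1p(np.exp(eta))
    return -(log_cond(x1, x2) + log_cond(x2, x1)).mean()

# MPL estimate via a simple grid search over theta.
grid = np.linspace(0.0, 3.0, 301)
theta_hat = grid[np.argmin([neg_log_pl(t) for t in grid])]
print(theta_hat)
```

With a sample of this size, the MPL estimate lands close to θ_0, as consistency of the MPL estimator for compatible conditionals would suggest.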
Example applications
Example 1 (Categorical Distribution). We begin with a toy example. Let X be a random variable taking values in {e_1, …, e_g} (g ∈ ℕ), where e_k is a vector with a one at the k-th coordinate and zeros at all other coordinates (k ∈ [g]), and where π = (π_1, …, π_g) is a probability vector. We say that X arises from a categorical distribution with parameter vector π, if its distribution can be characterized by the PMF

f(x; π) = ∏_{k=1}^g π_k^{x_k}.

The categorical distribution is the single-trial case of the multinomial distribution and has …
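For this toy example, consistency can be observed directly in simulation. The hedged sketch below, using an illustrative true parameter vector π_0 = (0.5, 0.3, 0.2), exploits the standard fact that the ML estimator of π is the vector of empirical category frequencies.

```python
import numpy as np

rng = np.random.default_rng(1)
pi0 = np.array([0.5, 0.3, 0.2])  # assumed true parameter (illustrative)

# For the categorical PMF f(x; pi) = prod_k pi_k^{x_k}, the ML estimator
# of pi is the vector of empirical category frequencies.
n = 100_000
labels = rng.choice(3, size=n, p=pi0)
pi_hat = np.bincount(labels, minlength=3) / n
print(pi_hat)
```

At this sample size the componentwise error is on the order of the binomial standard error, roughly sqrt(π_k(1 − π_k)/n).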
Acknowledgments
The author is grateful to the Associate Editor and two reviewers for suggestions that have greatly improved the quality of the paper. The author is personally supported by Australian Research Council grant number DE170101134.
References

- Seo and Lindsay (2013). Nearly universal consistency of maximum likelihood in discrete models. Statistics & Probability Letters.
- Ahfock, Wood, Stephen, Cavanagh, and Huang (2014). Characterizing uncertainty in high-density maps for multiparental populations. Genetics.
- Dirichlet and multinomial distributions: properties and uses in JAGS. Technical report 2012-5 (2012).
- Amemiya (1985). Advanced econometrics.
- Arnold, Castillo, and Sarabia (1999). Conditional specification of statistical models.
- Arnold and Strauss (1991). Pseudolikelihood estimation: some examples. Sankhya B.
- Bauer (2001). Measure and integration theory.
- Bengio (2009). Learning deep architectures for AI. Foundations and Trends in Machine Learning.
- Besag (1974). Spatial interaction and the statistical analysis of lattice systems. Journal of the Royal Statistical Society, Series B.
- Bickel and Doksum (2000). Mathematical statistics: Basic ideas and selected topics, Vol. 1.