Noise contrastive estimation: Asymptotic properties, formal comparison with MC-MLE

Abstract: A statistical model is said to be un-normalised when its likelihood function involves an intractable normalising constant. Two popular methods for parameter inference for these models are MC-MLE (Monte Carlo maximum likelihood estimation) and NCE (noise contrastive estimation); both methods rely on simulating artificial data-points to approximate the normalising constant. While the asymptotics of MC-MLE have been established under general hypotheses (Geyer, 1994), this is not so for NCE. We establish consistency and asymptotic normality of NCE estimators under mild assumptions. We compare NCE and MC-MLE under several asymptotic regimes. In particular, we show that, when m → ∞ while n is fixed (m and n being respectively the number of artificial data-points and of actual data-points), the two estimators are asymptotically equivalent. Conversely, we prove that, when the artificial data-points are IID, and when n → ∞ while m/n converges to a positive constant, the asymptotic variance of a NCE estimator is always smaller than the asymptotic variance of the corresponding MC-MLE estimator. We illustrate the variance reduction brought by NCE through a numerical study.


Introduction
Consider a set of probability densities {f_θ : θ ∈ Θ} with respect to some measure μ, defined on a space X, such that

f_θ(x) = h_θ(x)/Z(θ),   Z(θ) = ∫_X h_θ(x) μ(dx),

where h_θ is non-negative and Z(θ) is a normalising constant. A model based on such a family of densities is said to be un-normalised if the function h_θ may be computed point-wise, but Z(θ) is not available (i.e. it may not be computed in a reasonable CPU time). Un-normalised models arise in several areas of machine learning and Statistics, such as deep learning (Salakhutdinov and Hinton, 2009), computer vision (Wang et al., 2013), image segmentation (Gu and Zhu, 2001), social network modelling (Caimo and Friel, 2011), and directional data modelling (Walker, 2011), among others. In most applications, data-points are assumed to be IID (independent and identically distributed); see however e.g. Mnih and Teh (2012) or Barthelmé and Chopin (2015) for applications of non-IID un-normalised models. In that spirit, we consider an un-normalised model of IID variables Y_1, ..., Y_n, with log-likelihood (divided by n):

ℓ_n(θ) = (1/n) Σ_{i=1}^n log h_θ(y_i) − log Z(θ).   (1)

The fact that Z(θ) is intractable precludes standard maximum likelihood estimation. Geyer (1994) wrote a seminal paper on un-normalised models, in which he proposed to estimate θ by maximising the function

ℓ^IS_{n,m}(θ) = (1/n) Σ_{i=1}^n log{h_θ(y_i)/h_ψ(y_i)} − log{(1/m) Σ_{j=1}^m h_θ(x_j)/h_ψ(x_j)},   (2)

where the x_j's are m artificial data-points generated from a user-chosen distribution P_ψ with density f_ψ(x) = h_ψ(x)/Z(ψ). The empirical average inside the second log is a consistent (as m → ∞) importance sampling estimate of Z(θ)/Z(ψ). Function ℓ^IS_{n,m} is thus an approximation of the log-likelihood ratio ℓ_n(θ) − ℓ_n(ψ), whose maximiser is the MLE.
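To make the importance sampling identity behind the second log concrete, here is a minimal sketch in a hypothetical one-dimensional model h_θ(x) = exp(−θx²/2), chosen because its normalising constant Z(θ) = √(2π/θ) is known, so the empirical average (1/m) Σ_j h_θ(x_j)/h_ψ(x_j) can be checked against the exact ratio Z(θ)/Z(ψ). The model and all numerical values are illustrative, not taken from the paper.

```python
import numpy as np

def h(theta, x):
    # un-normalised density h_theta(x) = exp(-theta * x^2 / 2),
    # whose (here known) normalising constant is Z(theta) = sqrt(2*pi/theta)
    return np.exp(-theta * x**2 / 2)

def is_ratio_estimate(theta, psi, m, rng):
    # draw artificial data-points x_j ~ P_psi = N(0, 1/psi)
    x = rng.normal(0.0, 1.0 / np.sqrt(psi), size=m)
    # importance weights h_theta(x_j) / h_psi(x_j); their average is a
    # consistent (as m -> infinity) estimate of Z(theta)/Z(psi)
    return np.mean(h(theta, x) / h(psi, x))

rng = np.random.default_rng(0)
est = is_ratio_estimate(theta=2.0, psi=1.0, m=200_000, rng=rng)
true = np.sqrt(1.0 / 2.0)   # exact ratio Z(2)/Z(1) = sqrt(psi/theta)
print(est, true)
```

For a well-chosen P_ψ (as here), the weights have a finite second moment and the estimate concentrates quickly; heavier-tailed targets would require a more careful choice of ψ.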
In many applications, the easiest way to sample from P_ψ is to use MCMC (Markov chain Monte Carlo). Geyer (1994) established the asymptotic properties of the MC-MLE estimates under general conditions; in particular, the x_j's are only required to be realisations of an ergodic process. This is remarkable, given that most of the theory on M-estimation (i.e. estimation obtained by maximising functions) is restricted to IID data.
More recently, Gutmann and Hyvärinen (2012) proposed an alternative approach to parameter estimation of un-normalised models, called noise contrastive estimation (NCE). It also relies on simulating artificial data-points x_1, ..., x_m from distribution P_ψ. The method consists in maximising the likelihood of a logistic classifier, where actual (resp. artificial) data-points are assigned label 1 (resp. 0). In symbols, the log-likelihood divided by n rewrites as:

ℓ^NCE_{n,m}(θ, ν) = (1/n) Σ_{i=1}^n log q_{θ,ν}(y_i) + (1/n) Σ_{j=1}^m log{1 − q_{θ,ν}(x_j)},   (3)

where q_{θ,ν}(x), the probability of label 1 for a value x, is defined through the odds-ratio function:

q_{θ,ν}(x)/{1 − q_{θ,ν}(x)} = (n/m) e^ν h_θ(x)/h_ψ(x).

The NCE estimator of θ is obtained by maximising function ℓ^NCE_{n,m}(θ, ν) with respect to both θ ∈ Θ and ν ∈ R. In particular, when the considered model is exponential, i.e. when h_θ(x) = exp{θ^T S(x)} for some statistic S, ℓ^NCE_{n,m} is the log-likelihood of a standard logistic regression, with covariate S(x). In that case, implementing NCE is particularly straightforward. This paper has two objectives: first, to establish the asymptotic properties of NCE when the artificial data-points are generated from an ergodic process (typically an MCMC sampler), in order to show that NCE is as widely applicable as MC-MLE; second, to compare the statistical efficiency of both methods.
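For an exponential model, this reduction to logistic regression can be sketched as follows. We use a hypothetical one-dimensional model h_θ(x) = exp(θx − x²/2) (so that f_θ is the N(θ, 1) density and S(x) = x), reference distribution P_ψ = N(0, 1), and fit the classifier by Newton-Raphson; the true value θ = 1.5 and the sample sizes are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
theta_star, n, m = 1.5, 20_000, 20_000

# actual data-points from f_theta = N(theta, 1), i.e. h_theta(x) = exp(theta*x - x^2/2)
y = rng.normal(theta_star, 1.0, n)
# artificial data-points from the reference distribution P_psi = N(0, 1)
x = rng.normal(0.0, 1.0, m)

# logistic regression: label 1 for actual points, 0 for artificial ones,
# with covariate S(x) = x and an intercept (which absorbs nu and log(n/m))
z = np.concatenate([y, x])
labels = np.concatenate([np.ones(n), np.zeros(m)])
X = np.column_stack([np.ones(n + m), z])

beta = np.zeros(2)
for _ in range(50):                          # Newton-Raphson iterations
    p = 1.0 / (1.0 + np.exp(-X @ beta))      # fitted label-1 probabilities
    grad = X.T @ (labels - p)
    hess = (X * (p * (1.0 - p))[:, None]).T @ X
    beta += np.linalg.solve(hess, grad)

theta_hat = beta[1]   # the slope on S(x) = x is the NCE estimate of theta
print(theta_hat)
```

Here the log odds equal log(n/m) + ν + θx − θ²/2 under the model, so the slope of the fitted classifier recovers θ, and the intercept collects the remaining constants.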
As a preliminary step, we replace the original log-likelihood by a function defined on the extended space Θ × R, called the Poisson transform:

ℓ_n(θ, ν) = ν + (1/n) Σ_{i=1}^n log h_θ(y_i) − e^ν Z(θ)/Z(ψ) + 1.   (4)

This function is so called as it corresponds to the log-likelihood (up to a linear transformation) of a Poisson process with an intensity depending on h_θ and ν; see Barthelmé and Chopin (2015) for details. The main property of this transformation is that it produces exactly the same MLE as the original likelihood: (θ̂_n, ν̂_n) maximises (4) if and only if θ̂_n maximises (1) and ν̂_n = log{Z(ψ)/Z(θ̂_n)}.
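The maximisation in ν for fixed θ can be checked numerically. The sketch below assumes the Poisson transform takes the form ν + (1/n) Σ_i log h_θ(y_i) − e^ν Z(θ)/Z(ψ) + 1 (our reconstruction, consistent with the stated property), and uses a hypothetical model h_θ(x) = exp(−θx²/2) whose constant Z(θ) = √(2π/θ) is tractable; for fixed θ, the grid argmax in ν should sit at log{Z(ψ)/Z(θ)}.

```python
import numpy as np

# un-normalised model h_theta(x) = exp(-theta * x^2 / 2), with known
# normalising constant Z(theta) = sqrt(2 * pi / theta) (illustrative choice)
def log_Z(theta):
    return 0.5 * np.log(2.0 * np.pi / theta)

theta, psi = 2.0, 1.0
ratio = np.exp(log_Z(theta) - log_Z(psi))        # exact Z(theta)/Z(psi)

# Poisson transform as a function of nu, for fixed theta (the data terms do
# not depend on nu, so they are omitted from the argmax computation)
nu_grid = np.linspace(-2.0, 2.0, 400_001)
values = nu_grid - np.exp(nu_grid) * ratio + 1.0
nu_hat = nu_grid[np.argmax(values)]

nu_star = log_Z(psi) - log_Z(theta)              # log{Z(psi)/Z(theta)}
print(nu_hat, nu_star)
```

Differentiating ν − e^ν Z(θ)/Z(ψ) in ν and setting the derivative to zero gives e^ν = Z(ψ)/Z(θ), which is exactly the stated maximiser.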
In the same way, we replace the MC-MLE log-likelihood by the function

ℓ^IS_{n,m}(θ, ν) = ν + (1/n) Σ_{i=1}^n log h_θ(y_i) − e^ν (1/m) Σ_{j=1}^m h_θ(x_j)/h_ψ(x_j) + 1,   (5)

which has the same maximiser (with respect to θ) as function (2). We thus obtain three objective functions defined with respect to the same parameter space, Θ × R. This will greatly facilitate our analysis.

The paper is organised as follows. In Section 2, we introduce the set-up and notations. In Section 3, we study the behaviour of the NCE estimator as m → ∞ (while n is kept fixed). We prove that the NCE estimator converges to the MLE at the same m^{−1/2} rate as the MC-MLE estimator, and that the difference between the two estimators converges faster, at rate m^{−1}. In Section 4, we let both m and n go to infinity while m/n → τ > 0. We obtain asymptotic variances for both estimators, which admit a simple and interpretable decomposition. Using this decomposition, we are able to establish that, when the artificial data-points are IID, the asymptotic variance of NCE is always smaller than the asymptotic variance of MC-MLE (for the same computational budget). Section 5 assesses this variance reduction in a numerical example. Section 6 discusses the practical implications of our results. All proofs are deferred to the appendix.

Set-up and notations
Unless explicitly stated, we consider Θ to be an open subset of R^d, with the natural topology associated to the Euclidean norm. We consider a parametric statistical model {P_θ^{⊗n} : θ ∈ Θ}, corresponding to n IID data-points lying in a space X ⊂ R^k, equipped with the corresponding Borel σ-field. We assume that the model is identifiable, and dominated by some measure μ, inducing the log-likelihood (1). From now on, we work directly with the "extended" versions of the approximate and exact log-likelihoods, i.e. functions (3), (4) and (5), which are functions of the extended parameter ξ = (θ, ν), with ξ ∈ Ξ = Θ × R. When convenient, we also write ℓ_n(ξ) for ℓ_n(θ, ν), and so on. An open ball in Ξ, centered on ξ and of radius ε, is denoted B(ξ, ε). We may also use this notation for balls in Θ.
The point of this paper is to study and compare the point estimates ξ̂^IS_{n,m} and ξ̂^NCE_{n,m}, which maximise functions (5) and (3). For the sake of generality, we allow these estimators to be approximate maximisers; i.e. we will refer to ξ̂^IS_{n,m} as an approximate MC-MLE estimator whenever

ℓ^IS_{n,m}(ξ̂^IS_{n,m}) ≥ sup_{ξ ∈ Ξ} ℓ^IS_{n,m}(ξ) − o(1),   (6)

and with a similar definition for ξ̂^NCE_{n,m}. The meaning of symbol o(1) in (6) depends on the asymptotic regime: in Section 3, n is kept fixed while m → ∞, hence o(1) means "converges to zero as m → ∞". In Section 4, both m and n go to infinity, and the meaning of o(1) must be adapted accordingly.
In both asymptotic regimes, the main assumption regarding the sampling process is as follows.
(X1) The artificial data-points are realisations of a P ψ -ergodic process (X j ) j≥1 .
By P_ψ-ergodicity, we mean that the following law of large numbers holds:

(1/m) Σ_{j=1}^m ϕ(X_j) → E_ψ[ϕ(X)] almost surely, as m → ∞,

for any measurable, real-valued function ϕ such that E_ψ[|ϕ(X)|] < +∞.
Assumption (X1) is mild. For instance, if the X j 's are generated by a MCMC algorithm, this is equivalent to assuming that the simulated chain is aperiodic and φ-irreducible, which is true for all practical MCMC samplers; see e.g. Roberts and Rosenthal (2004).
Finally, note that, although the notation P_ψ suggests that the distribution of the artificial data-points belongs to the considered parametric model, this is not compulsory. All our results hold provided that the model is dominated by P_ψ (i.e. P_θ ≪ P_ψ for every θ ∈ Θ).

Asymptotics of the Monte Carlo error
In this section, the analysis is conditional on the observed data: n and y_1, ..., y_n are fixed. The only source of randomness is the Monte Carlo error, and the quantity we seek to estimate is the (intractable) MLE. This regime was first studied for MC-MLE by Geyer (1994). For convenience, we suppose that the MLE exists and is unique; or equivalently, that ξ̂_n = (θ̂_n, ν̂_n) is the unique maximiser of ℓ_n.

Consistency
We are able to prove NCE consistency (towards the MLE) using the same approach as Geyer (1994) for MC-MLE. Our consistency result relies on the following assumptions:

(C1) The random sequence (ξ̂^NCE_{n,m})_{m≥1} is an approximate NCE estimator, which belongs to a compact set almost surely.

(H1) The maps θ → h_θ(x) are: 1. lower semi-continuous at each θ ∈ Θ, except for x in a P_ψ-null set that may depend on θ; 2. upper semi-continuous, for any x not in a P_ψ-null set (that does not depend on θ), and for x = y_i, i = 1, ..., n.

Theorem 1. Under assumptions (X1), (C1) and (H1), ξ̂^NCE_{n,m} → ξ̂_n almost surely as m → ∞.

This result is strongly linked to Theorems 1 and 4 of Geyer (1994), which state that θ̂^IS_{n,m} → θ̂_n as m → ∞ under essentially the same assumptions. These assumptions are very mild: they basically require continuity of the maps θ → h_θ(x), without any integrability condition. As noticed by Geyer (1994), since the proof does not require Θ to be a subset of R^d, the consistency of MC-MLE, as well as Theorem 1, holds more generally as soon as Θ is a separable metric space.

Asymptotic normality, comparison with MC-MLE
In order to compare the Monte Carlo error of MC-MLE and NCE estimators, we make the following extra assumptions:

(H2) The maps θ → h_θ(x) are twice continuously differentiable in a neighbourhood of θ̂_n, for P_ψ-almost every x, and for x = y_i, i = 1, ..., n.

(G1) Estimators ξ̂^IS_{n,m} and ξ̂^NCE_{n,m} converge to ξ̂_n almost surely, and are such that ∇ℓ^IS_{n,m}(ξ̂^IS_{n,m}) = o(m^{−1}) and ∇ℓ^NCE_{n,m}(ξ̂^NCE_{n,m}) = o(m^{−1}), almost surely.

(I1) For some ε > 0, the following integrability condition holds: E_ψ[sup_{ξ ∈ B(ξ̂_n, ε)} {e^ν h_θ(X)/h_ψ(X)}²] < +∞.

Measurability of the suprema in (H2) and (I1) is ensured by the lower semi-continuity of the first two differentials in a neighbourhood of θ̂_n. Assumption (H2) is a regularity condition that ensures in particular that the partition function θ → Z(θ) = ∫_X h_θ(x) μ(dx) is twice differentiable under the integral sign in a neighbourhood of θ̂_n. Following Theorem 1, Assumption (G1) holds trivially as soon as Assumptions (C1) and (H1) hold and ξ̂^IS_{n,m} and ξ̂^NCE_{n,m} are exact maximisers; in that case, the gradients are zero. The integrability assumption (I1) is the critical one. It is essentially a (locally uniform) second moment condition on the importance weights, with P_{θ̂_n} as the target distribution.

Theorem 2. Under assumptions (X1), (H2), (G1) and (I1),

m(ξ̂^NCE_{n,m} − ξ̂^IS_{n,m}) → H(ξ̂_n)^{−1} v(ξ̂_n), almost surely,   (7)

where H(ξ) = ∇²_ξ ℓ_n(ξ), and v(ξ) is a fixed vector, defined in the proof, that involves the second moment of the importance weights.

Before discussing the implications of Theorem 2, it is important to recall Geyer (1994)'s result about the asymptotic normality of MC-MLE, which relies on an extra assumption, denoted (N) below. As noticed by Geyer (1994), the asymptotics of MC-MLE are quite similar to the asymptotics of maximum likelihood, and it can be shown that, under assumptions (X1), (H2), (G1) and (N), √m(ξ̂^IS_{n,m} − ξ̂_n) converges in distribution to a Gaussian limit. Theorem 2 then shows that the difference between the two point estimates is O(m^{−1}), which is negligible relative to the O_P(m^{−1/2}) rate of convergence to ξ̂_n. This proves that, when n is fixed, both approaches are asymptotically equivalent when it comes to approximating the MLE.
In particular, Slutsky's lemma implies asymptotic normality of the NCE estimator with the same asymptotic variance as for MC-MLE.
Assumptions (H2) and (I1) admit a much simpler formulation when the model belongs to an exponential family. This is the point of the following Proposition.

Remark 2. Condition (N) requires a √m-CLT (central limit theorem) for a certain function ϕ of the artificial data-points. There is an extensive literature on CLTs for Markov chains; see e.g. Roberts and Rosenthal (2004) for a review. In particular, if (X_j)_{j≥1} is a geometrically ergodic Markov chain with stationary distribution P_ψ, then assumption (N) holds as soon as ϕ ∈ L^{2+δ}(P_ψ) for some δ > 0. This assumption is very similar to assumption (I1), especially when the model is exponential.
In practice, the implications of Theorem 2 must be considered cautiously, as the Euclidean norm of the limit in (7) will typically increase with n. For several well-known un-normalised models (e.g. Ising models, exponential random graph models), n is equal to one, in which case NCE and MC-MLE will always produce very close estimates. For other models however, it is known that the two estimators may behave differently, especially when the number of actual data-points is large and when simulations have a high computational cost (see Gutmann and Hyvärinen (2012)).
To investigate to what extent both approaches provide a good approximation of the true parameter value in these models, we will let both m and n go to infinity. As it turns out, this will also make it possible to draw a finer comparison between ξ̂^IS_{n,m} and ξ̂^NCE_{n,m} (and thus between θ̂^NCE_{n,m} and θ̂^IS_{n,m}). This is the point of the next section.

Asymptotics of the overall error
We now assume that the observations y_i are realisations of IID random variables Y_i, with probability density f_{θ*} for some true parameter θ* ∈ Θ, while the artificial data-points (X_j)_{j≥1} remain generated from a P_ψ-ergodic process. We also assume that (Y_i)_{i≥1} and (X_j)_{j≥1} are independent sequences; this regime was first studied for NCE by Gutmann and Hyvärinen (2012), although the X_j's were assumed IID in that paper.
This asymptotic regime has some drawbacks: it assumes that the model is well specified, and that P ψ is chosen independently from the data. This is rarely true in practice, as the user will generally try to choose P ψ as close as possible to the data distribution to reduce the mean square error (see Section 5). Nevertheless, allowing both m and n to go to infinity turns out to provide a better understanding of the asymptotic behaviours of NCE and MC-MLE, at least for situations where the number of actual data-points may be large.
We assume implicitly that m = m_n is a non-decreasing sequence of positive integers going to infinity when n does, while m_n/n → τ ∈ (0, +∞). Every limit as n goes to infinity should be understood accordingly. Finally, ξ* = (θ*, ν*) stands for the true extended parameter, where ν* = log{Z(ψ)/Z(θ*)}.

Consistency
Our results concerning the overall consistency (to ξ*, as both m and n → ∞) of MC-MLE and NCE rely on the following assumptions:

(C2) The random sequences (ξ̂^IS_{n,m})_{n≥1} and (ξ̂^NCE_{n,m})_{n≥1} are approximate MC-MLE and NCE estimators, and belong to a compact set almost surely.

(H3) The maps θ → h_θ(x) are continuous for P_ψ-almost every x, and, for any θ ∈ Θ, a locally uniform (over some ball B(θ, ε), ε > 0) integrability condition holds.

Theorem 3. Under assumptions (X1), (C2) and (H3), ξ̂^IS_{n,m} and ξ̂^NCE_{n,m} converge almost surely to ξ* as n, m → ∞, while m/n → τ.

Our proofs of NCE and MC-MLE consistency are mainly inspired by Wald (1949)'s famous proof of MLE consistency, for which the same integrability condition (H3) is required. It is noteworthy that, under this regime, MC-MLE and NCE consistency essentially rely on the same assumptions as MLE consistency. As noticed by Wald (1949), the proof does not require Θ to be a subset of R^d; Theorem 3 holds as soon as Θ is a metric space.

Proposition 2. If the parametric model is exponential, i.e. if h_θ(x) = exp{θ^T S(x)} for some measurable statistic S, then assumption (H3) always holds.

Asymptotic normality
To ensure the asymptotic normality of both NCE and MC-MLE estimates, we make the following assumption.
(X2) The sequence (X_j)_{j≥1} is a Harris ergodic Markov chain (that is, aperiodic, φ-irreducible and positive Harris recurrent; for definitions see Meyn and Tweedie (2012)), with stationary distribution P_ψ. The Markov kernel associated with the chain (X_j)_{j≥1}, denoted P(x, dy), is reversible (satisfies detailed balance) with respect to P_ψ, that is,

P_ψ(dx) P(x, dy) = P_ψ(dy) P(y, dx).   (8)

Moreover, the chain (X_j)_{j≥1} is geometrically ergodic, i.e. there is some ρ ∈ [0, 1) and a positive measurable function M such that, for P_ψ-almost every x,

‖P^n(x, ·) − P_ψ‖_TV ≤ M(x) ρ^n,   (9)

where P^n(x, dy) denotes the n-step Markov transition kernel corresponding to P, and ‖·‖_TV stands for the total variation norm. Under (X2), a CLT holds for suitable functions ϕ:

(1/√m) Σ_{j=1}^m {ϕ(X_j) − E_ψ[ϕ(X)]} → N(0, σ²(ϕ)) in distribution, with σ²(ϕ) = Var_ψ[ϕ(X_0)] + 2 Σ_{i≥1} Cov(ϕ(X_0), ϕ(X_i)).   (10)

In the equation above, Cov(ϕ(X_0), ϕ(X_i)) stands for the i-th lag autocovariance of the chain at stationarity; that is, with respect to the distribution defined by X_0 ∼ P_ψ and X_{i+1} | X_i ∼ P(X_i, ·). The sequence of artificial data-points (X_j)_{j≥1} is not assumed stationary: since the chain is Harris recurrent, (10) holds whenever X_1 = x, for any x ∈ X (see e.g. Roberts and Rosenthal (2004), especially Theorem 4 and Proposition 29). For convenience, we choose to assume that the kernel is reversible (which is true for any Metropolis-Hastings algorithm), but the reversibility assumption (8) is not compulsory, and may be replaced by slightly stronger integrability assumptions (see e.g. Roberts and Rosenthal (2004)); in particular, if reversibility is not assumed, then (10) holds whenever ϕ ∈ L^{2+δ}(P_ψ). The critical assumption is geometric ergodicity.
Geometric ergodicity is obviously stronger than assumption (X1) which only requires a law of large numbers to hold. Nevertheless, geometric ergodicity remains a state of the art condition to ensure CLT's for Markov chains (see e.g. Roberts and Rosenthal (2004) and Bradley et al. (2005)), while it can often be checked for practical MCMC samplers. We present assumption (X2) as a practical condition for ensuring CLT's when the artificial data-points are generated from a MCMC sampler, while it also covers the IID case without loss of generality. Note though that CLT's can hold under more general conditions, e.g. when the Markov chain satisfies polynomial ergodicity (Jones, 2004).
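The asymptotic variance appearing in (10) can be estimated from a simulated chain by truncating the autocovariance series. The following sketch does this for a random-walk Metropolis chain (reversible by construction) targeting P_ψ = N(0, 1), with ϕ(x) = x; the step size, chain length and truncation lag are illustrative choices, not taken from the paper.

```python
import numpy as np

def rw_metropolis(n_iter, step, rng):
    # random-walk Metropolis chain targeting N(0, 1); the kernel is
    # reversible with respect to its stationary distribution
    x = np.empty(n_iter)
    cur = 0.0
    for t in range(n_iter):
        prop = cur + step * rng.normal()
        # accept with probability min(1, exp((cur^2 - prop^2)/2))
        if np.log(rng.uniform()) < 0.5 * (cur**2 - prop**2):
            cur = prop
        x[t] = cur
    return x

rng = np.random.default_rng(2)
chain = rw_metropolis(100_000, 2.4, rng)
phi = chain - chain.mean()          # phi(x) = x, centred empirically

# sigma^2(phi) = gamma_0 + 2 * sum_i gamma_i, with the sum truncated at lag L
L = 200
gamma0 = np.mean(phi * phi)
acov = [np.mean(phi[:-i] * phi[i:]) for i in range(1, L + 1)]
sigma2 = gamma0 + 2.0 * np.sum(acov)
print(gamma0, sigma2)
```

Because the chain is positively autocorrelated, the estimated σ²(ϕ) exceeds the stationary variance γ₀, which is precisely the inflation that distinguishes the MCMC case from the IID case in the asymptotic variance formulas.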
Our asymptotic normality results rely on the following assumptions:

(G2) Estimators ξ̂^IS_{n,m} and ξ̂^NCE_{n,m} converge in probability to ξ*, and their gradients at the estimated value vanish fast enough, at rate o_P(n^{−1/2}).

(I3) At θ = θ*, the following integrability condition holds: E_ψ[{f_{θ*}(X)/f_ψ(X)}²] < +∞.

Remark 4. The second moment condition (I3) is critical. It basically forbids choosing P_ψ with too thin tails compared to P_{θ*}. Assumption (I3) is needed for establishing MC-MLE asymptotic normality, but not for NCE (inequality (21) shows that condition (H4) is enough). This already shows that, under the considered regime, NCE is more robust (to the choice of P_ψ) than MC-MLE.
Assumptions (H4) and (I3) admit a simpler formulation when the model is exponential, as shown by the following proposition.

Comparison of asymptotic variances
Theorem 5. If the artificial data-points (X_j)_{j≥1} are IID, then under assumptions (H4) and (I3),

V^NCE(ξ*) ≼ V^IS(ξ*),

where V^NCE(ξ*) and V^IS(ξ*) denote the asymptotic variance matrices of the NCE and MC-MLE estimators, and ≼ denotes the Loewner partial order on symmetric matrices.

Theorem 5 shows that, asymptotically, when m/n → τ > 0 and the artificial data-points are IID, the variance of a NCE estimator is always lower than the variance of the corresponding MC-MLE estimator. To our knowledge, this is the first theoretical result proving that NCE dominates MC-MLE in terms of mean square error. We did not manage, however, to extend this result to the Markov chain (correlated) case.
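This domination can be illustrated on a toy exponential model: below we compare, over 200 replications, the mean square errors of the NCE and MC-MLE estimators of θ in the hypothetical model h_θ(x) = exp(θx − x²/2) (so f_θ is N(θ, 1)), with IID artificial points from P_ψ = N(0, 1), τ = 1, and an arbitrary true value θ* = 1.5; all settings are illustrative, and MC-MLE is maximised by a simple grid search.

```python
import numpy as np

def mc_mle(y, x, grid):
    # theta maximising theta * mean(y) - log( mean( exp(theta * x_j) ) ):
    # for h_theta(x) = exp(theta*x - x^2/2) and psi = 0, the importance
    # weights h_theta/h_psi reduce to exp(theta * x)
    vals = grid * y.mean() - np.log(np.exp(np.outer(grid, x)).mean(axis=1))
    return grid[np.argmax(vals)]

def nce(y, x, n_newton=50):
    # logistic regression of the labels on S(x) = x, with an intercept
    z = np.concatenate([y, x])
    lab = np.concatenate([np.ones(len(y)), np.zeros(len(x))])
    X = np.column_stack([np.ones(len(z)), z])
    beta = np.zeros(2)
    for _ in range(n_newton):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        beta += np.linalg.solve((X * (p * (1 - p))[:, None]).T @ X,
                                X.T @ (lab - p))
    return beta[1]          # the slope is the NCE estimate of theta

rng = np.random.default_rng(4)
theta_star, n, m, R = 1.5, 500, 500, 200
grid = np.linspace(0.0, 3.0, 601)
est_is, est_nce = [], []
for _ in range(R):
    y = rng.normal(theta_star, 1.0, n)      # actual data-points
    x = rng.normal(0.0, 1.0, m)             # artificial data-points from P_psi
    est_is.append(mc_mle(y, x, grid))
    est_nce.append(nce(y, x))
mse_is = np.mean((np.array(est_is) - theta_star) ** 2)
mse_nce = np.mean((np.array(est_nce) - theta_star) ** 2)
print(mse_is, mse_nce)
```

Here P_ψ is far from the data distribution, so the importance weights exp(θx) are highly variable and MC-MLE is visibly noisier than NCE, in line with the theorem.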
This inequality holds for any fixed ratio τ ∈ (0, +∞) and any given sampling distribution P_ψ, but the sharpness of the bound remains unknown. Typically, the larger τ is, the closer the two variances will be, as the ratio τ f_ψ/(τ f_ψ + f_{θ*}) gets closer to one. The same happens when the sampling distribution P_ψ is close to the true data distribution P_{θ*}. Geyer (1994) noticed that MC-MLE performs better when P_ψ is close to P_{θ*}. The next proposition shows that, when P_ψ = P_{θ*}, both variances can be related to the variance of the MLE.
Proposition 4. If the artificial data-points are IID sampled from P_ψ = P_{θ*}, then under assumptions (H4) and (I3) we have:

It is straightforward to check that, under the usual conditions ensuring asymptotic normality of the MLE, the extended maximiser of the Poisson transform ℓ_n is also asymptotically normal, with variance V^MLE(ξ*). This proposition shows what we can expect from NCE and MC-MLE in an ideal scenario where the sampling distribution is the same as the true data distribution.

Numerical example
This section presents a numerical example that illustrates how the variance reduction brought by NCE may vary according to the sampling distribution P ψ and the ratio τ .
We consider observations IID distributed from the multivariate Gaussian distribution N_p(μ, Σ) truncated to (0, +∞)^p; that is, Y_1, ..., Y_n are IID with the following probability density with respect to Lebesgue's measure:

f_{μ,Σ}(y) = exp{−(1/2)(y − μ)^T Σ^{−1} (y − μ)} / {(2π)^{p/2} |Σ|^{1/2} P(W ∈ (0, +∞)^p)}, for y ∈ (0, +∞)^p, where W ∼ N_p(μ, Σ).

The probability P(W ∈ (0, +∞)^p) is intractable for almost every (μ, Σ). Numerical approximations of such probabilities quickly become inefficient as p increases.
It is well known that (truncated) Gaussian densities form an exponential family under the following parametrisation: for a given μ ∈ R^p and Σ ∈ S⁺⁺_p (the set of positive definite matrices of size p), define θ = (Σ^{−1}μ, triu(−(1/2)Σ^{−1})) and S(x) = (x, triu(xx^T)), where triu(·) denotes the upper triangular part. This parametrisation is minimal, and the natural parameter space is a convex open subset of R^q, where q = p + p(p + 1)/2. Indeed, under the exponential formulation, we have Θ = Θ_1 × Θ_2, where Θ_1 = R^p and Θ_2 is an open cone of R^{p(p+1)/2}, in bijection with S⁺⁺_p through the function triu(·). The observations are generated IID from P_θ (using rejection) for some true parameter θ = θ*, corresponding to a fixed (μ, Σ) in the usual Gaussian parametrisation. The artificial data-points are sampled IID from the distribution P_ψ associated with the density f_{μ,Σ}, with μ = 0_p and Σ = λI_p, for some λ > 0. (Since Σ is diagonal, we may sample the components independently; and since μ = 0_p, we may sample each component by taking the absolute value of a Gaussian variate.) The sample size is fixed to n = 1000, while m is chosen such that the ratio m/n equals τ ∈ {1, 5, 20, 100}. The distribution P_ψ is chosen as stated above, for λ ∈ [1.5, 20]. Figure 1 plots estimates and confidence intervals of the mean square error ratio (mean squared Euclidean-norm error of the estimator divided by the asymptotic variance of the MLE) of both estimators (NCE and MC-MLE), based on 1000 independent replications. (Regarding the denominator of this ratio, note that the variance of the MLE may be estimated by performing noise contrastive estimation with P_ψ = P_{θ*}; see Proposition 4.) To facilitate the direct comparison between NCE and MC-MLE, we also plot in Figure 2 estimates and confidence intervals of the MSE ratio of MC-MLE over NCE.
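The two sampling steps described above (rejection for the observations, componentwise absolute values for the artificial points) can be sketched as follows; the dimension p, the value of λ, and the (μ, Σ) passed to the rejection sampler are illustrative placeholders rather than the paper's actual settings.

```python
import numpy as np

def sample_reference(m, p, lam, rng):
    # P_psi: N_p(0, lam * I) truncated to (0, inf)^p; since the covariance is
    # diagonal and the mean is zero, each component is simply |N(0, lam)|
    return np.abs(rng.normal(0.0, np.sqrt(lam), size=(m, p)))

def sample_truncated(n, mu, Sigma, rng):
    # rejection sampler for N_p(mu, Sigma) truncated to (0, inf)^p:
    # draw from the untruncated Gaussian, keep draws with all-positive entries
    L = np.linalg.cholesky(Sigma)
    out = []
    while len(out) < n:
        w = mu + L @ rng.normal(size=len(mu))
        if np.all(w > 0):
            out.append(w)
    return np.array(out)

rng = np.random.default_rng(3)
x = sample_reference(1000, 3, 4.0, rng)          # artificial data-points
mu = np.ones(3)
Sigma = 0.5 * np.eye(3) + 0.5 * np.ones((3, 3))  # illustrative (mu, Sigma)
y = sample_truncated(500, mu, Sigma, rng)        # observations
print(x.shape, y.shape)
```

Rejection is practical here because, for the chosen (μ, Σ), the orthant probability is not too small; for strongly negative means or high dimensions, the acceptance rate would collapse.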
As expected from Theorem 5, this ratio is always greater than one; it becomes larger and larger as τ decreases, or as λ moves away from its optimal value (around 4). This suggests that NCE is more robust than MC-MLE to a poor choice of reference distribution (especially thin-tailed distributions, i.e. when λ goes to zero).
Finally, we discuss a technical difficulty related to the constrained nature of the parameter space Θ. In principle, both the NCE and the MC-MLE estimators should be obtained through constrained optimisation (i.e. as maximisers of their respective objective functions over Θ). However, it is much easier (here, and in many cases) to perform an unconstrained optimisation (over R^q). We must then check that the solution thus obtained fulfils the constraint that defines Θ (here, that the solution corresponds to a matrix Σ which is positive definite). Figure 3 plots estimates and confidence intervals of the probability that each estimator belongs to Θ. We see that NCE (when implemented without constraints) is much more likely to produce estimates that belong to Θ.
Note also that when the considered model is an exponential family (as is the case here), both functions ℓ^IS_{n,m} and ℓ^NCE_{n,m} are concave. This implies that, when the unconstrained maximiser of one of these functions does not fulfil the constraint that defines Θ, the constrained maximiser does not exist. (Any solution of the constrained optimisation program would lie on the boundary of the constrained set.)

Conclusion
The three practical conclusions we draw from our results are that: (a) NCE is as widely applicable as MC-MLE (including when the X j s are generated using MCMC); (b) NCE and MC-MLE are asymptotically equivalent (as m → ∞) when n is fixed; (c) NCE may provide lower-variance estimates than MC-MLE when n is large (provided that m = O(n)). The variance reduction seems to be more important when the ratio τ = m/n is small, or when the reference distribution (for generating the X j 's) is poorly chosen. Note that we proved (c) under the assumption that the X j 's are IID, but we conjecture it also holds when they are generated using MCMC. Proving this conjecture may be an interesting avenue for future research.
As mentioned in the introduction, another advantage of NCE is its ease of implementation. In particular, when the considered model is exponential, NCE boils down to performing a standard logistic regression. For all these reasons, it seems reasonable to recommend NCE over MC-MLE to perform inference for un-normalised models.
Acknowledgements. This work was supported by the French National Research Agency (ANR) as part of the Investissements d'Avenir program (ANR-11-LABEX-0047). We are grateful to Bernard Delyon for letting us include in the supplement an English translation of some technical results (and their proofs) on ergodic processes that he derived in lecture notes (available in French at https://perso.univ-rennes1.fr/bernard.delyon/param.pdf).

A.1. Technical lemmas
The following lemmas are prerequisites for the proofs of our main theorems. Most of them are 'classical' results, but for the sake of completeness, we provide the proofs of these lemmas in the supplement (Appendix B).
All these lemmas apply to a P_ψ-ergodic sequence of random variables (X_j)_{j≥1}. The first lemma is a slightly disguised version of the law of large numbers, combined with the monotone convergence of a sequence of test functions.

This result holds whether the expectation is finite or infinite.
The second lemma is a natural generalisation of Lemma 1 to dominated convergence.
Lemma 2. Let (f_m)_{m≥1}, f and g be measurable, real-valued functions, such that (f_m)_{m≥1} converges pointwise towards f; for any m ≥ 1, |f_m| ≤ g; and E_ψ[g(X)] < +∞. Then we have:

(1/m) Σ_{j=1}^m f_m(X_j) → E_ψ[f(X)] almost surely, as m → ∞.

The third lemma is a generalisation of Lemma 1 to the degenerate case where the expectation is infinite. In that case, Lemma 3 shows that the monotonicity assumption is unnecessary.
The fourth lemma is a uniform law of large numbers. It is well known in the IID case, but the result does not actually require the independence assumption.
We present a generalisation of this result to ergodic processes. The proof is due to Bernard Delyon, who made it available in an unpublished course in French (Delyon (2018)); we present an English translation of the proof in the supplement (Appendix B).

Lemma 4. Let K be a compact set, and ϕ(θ, x) a measurable function defined on K × X, whose values lie in R^p; suppose that the maps θ → ϕ(θ, x) are continuous for P_ψ-almost every x. Moreover, suppose that E_ψ[sup_{θ∈K} |ϕ(θ, X)|] < +∞. Then the function θ → E_ψ[ϕ(θ, X)] defined on K is continuous, and we have

sup_{θ∈K} |(1/m) Σ_{j=1}^m ϕ(θ, X_j) − E_ψ[ϕ(θ, X)]| → 0 almost surely, as m → ∞.

Consequently, if ( θ̃_m )_{m≥1} is a random sequence converging almost surely to some parameter θ ∈ Θ, then we have

(1/m) Σ_{j=1}^m ϕ(θ̃_m, X_j) → E_ψ[ϕ(θ, X)] almost surely, as m → ∞.

The fifth lemma is also a well-known result. It is often used to prove the weak convergence (usually asymptotic normality) of Z-estimators.
Lemma 5. Let (Ω, F, P) be a probability space, and let (ℓ_n(θ, ω))_{n≥1} be measurable real-valued functions defined on R^d × Ω. Let θ* ∈ R^d and ε > 0 be such that, for any n ≥ 1 and for P-almost every ω ∈ Ω, the map θ → ℓ_n(θ, ω) is C² on B(θ*, ε). Let ( θ̃_n )_{n≥1} be a random sequence converging in probability to θ*. Suppose also that:

The sixth lemma is a technical tool required for proving the asymptotic normality of NCE. It is particularly straightforward to prove in the IID case; we present a generalisation of this result to reversible, geometrically ergodic Markov chains.

A.2. Proof of Theorem 1
A standard approach to establish consistency of M-estimators is to prove some Glivenko-Cantelli result (uniform convergence); but, to the best of our knowledge, no such result exists under the general assumption that the underlying random variables (the X_j's in our case) are generated from an ergodic process. Instead, we follow Geyer (1994)'s approach, which relies on establishing that the function −ℓ^NCE_{n,m} epiconverges to −ℓ_n. Epiconvergence is essentially a one-sided locally uniform convergence that ensures the convergence of minimisers; for a succinct introduction to epiconvergence, see Appendix A of Geyer (1994) and Chapter 7 of Rockafellar and Wets (2009).
We follow closely Geyer (1994). In particular, Theorem 4 of Geyer (1994) shows that: if a sequence of functions ℓ_{n,m} hypoconverges to some function ℓ_n which has a unique maximiser θ̂_n, and if a random sequence ( θ̃_{n,m} )_{m≥1} is an approximate maximiser of ℓ_{n,m} which belongs to a compact set almost surely, then θ̃_{n,m} converges to θ̂_n almost surely. Consequently, to prove Theorem 1, we only have to prove that ℓ^NCE_{n,m} hypoconverges to ℓ_n (i.e. that −ℓ^NCE_{n,m} epiconverges to −ℓ_n); that is:

where N(θ, ν) denotes the set of neighbourhoods of the point (θ, ν). Since Ξ = Θ × R is a separable metric space, there exists a countable base B = {B_1, B_2, ...} for the considered topology. For any point (θ, ν), define the countable base of neighbourhoods N_c(θ, ν) = B ∩ N(θ, ν), which can replace N(θ, ν) in the infima of the preceding inequalities. Choose a countable dense subset Γ_c = {(θ_1, ν_1), (θ_2, ν_2), ...} as follows. For each k, let (θ_k, ν_k) be a point of B_k such that:

The proof is very similar to that of Theorem 1 of Geyer (1994). However, in this slightly different proof, we will need (13) and (14) to hold simultaneously with probability one, for any (θ, ν) ∈ Γ_c and any B ∈ B. For any fixed (θ, ν), Lemma 1 applies to the maps x → (1 + x/m)^m, and since any countable union of null sets is still a null set, convergence holds simultaneously for every element of Γ_c and B with probability one. One may note that the infima in the last equation are measurable under (H1) (in that case, an infimum over any set B ∈ B can be replaced by an infimum over the countable dense subset B ∩ Γ_c).

Lionel Riou-Durand and Nicolas Chopin
The inequality follows directly from the superadditivity of the supremum (and subadditivity of the infimum), and from the continuity and monotonicity of the maps x → (m/n) log(1 + (n/m)x). The last equality holds because the infimum over N(θ, ν) can be replaced by the infimum over the countable set A_c(θ, ν), the set of open balls centered on (θ, ν) of radius k^{−1}, k ≥ 1; this means the infimum is also the limit of a decreasing sequence, which can be split into three terms. The second term converges deterministically to zero, while convergences (13) and (14) apply to the first and third terms.
To conclude, apply the monotone convergence theorem to the remaining term: sup

A.3. Proof of Theorem 2
Define g_ξ(x) = log h_θ(x) + ν, and the following gradients (dropping n and m in the notation for convenience): Ψ^IS(ξ) = ∇_ξ ℓ^IS_{n,m}(ξ) and Ψ^NCE(ξ) = ∇_ξ ℓ^NCE_{n,m}(ξ). By the Taylor-Lagrange formula, for any component k,

where Ψ^IS_k(ξ) denotes the k-th component of Ψ^IS(ξ), and [ξ̂^IS_{n,m}; ξ̂^NCE_{n,m}] denotes the line segment in R^{d+1} which joins ξ̂^IS_{n,m} and ξ̂^NCE_{n,m}. By assumption (G1), the left hand side is o_P(m^{−1}). In matrix form, this yields:

Let us first prove the convergence of the Hessian matrix. Lemma 4 can be applied to each row of the following matrix-valued function, the uniform norm of which is P_ψ-integrable under (H2):

The convergences of the d + 1 rows of H^IS_m can be combined to obtain that H^IS_m → H(ξ̂_n) almost surely. It turns out that H(ξ̂_n) is invertible as soon as (H2) holds; this is the point of the following lemma. This implies in particular that H^IS_m is eventually invertible, with probability one.
The last two terms of the right hand side are residuals whose uniform norm over the ball B(θ_n, ε) we want to bound. The sup norm of the second term is eventually bounded by: The sup norm of the third term is eventually bounded by 1, and Lemma 2 applies under (I1) to the sequence (f_m)_{m≥1}, converging pointwise to 0 and dominated by the integrable function g(x) = sup_{ξ∈B(ξ_n,ε)} The limit of (m/n)Δ_m is thus dictated by the behaviour of the first term. We apply Lemma 4 to the following vector-valued function, whose uniform norm is integrable under (I1) and under the continuity of the deterministic part assumed in (H2): Combining these facts ensures that, on a set of probability one, we eventually have:

A.4. Proof of Theorem 3
The proof of MC-MLE consistency under the considered regime is a straightforward adaptation of Wald's proof of consistency for the MLE. We thus present in this appendix only the proof of NCE consistency, which is slightly more technical, although the sketch is similar. For the sake of completeness, a proof of MC-MLE consistency is presented in the supplement (Appendix B).

A.4.1. NCE consistency
For convenience, we analyse a slightly different objective function (sharing the same maximiser as NCE_{n,m}), defined as: We begin our proof with the following lemma.
The law of large numbers would apply directly if the sequence m_n/n were exactly equal to τ. To handle this technical issue, consider the two following inequalities. Note that for any a ≥ b > 0: log This yields a useful uniform bound for any a, b > 0: then the uniform bound (18) also ensures that: for any constant c > 0. The sequence can now easily be dominated and Lemma 2 applies; the first empirical average in (15) Finally, let us prove that (θ*, ν*) is the unique maximiser of M^NCE_τ. We have: by the log-sum inequality, which holds with equality if and only if e^ν h_θ(x) = e^{ν*} h_{θ*}(x) for P_{θ*}-almost every x. This occurs if and only if ν and θ are chosen such that f_{θ*}(x) = e^ν Z(ψ) h_θ(x). The model being identifiable, there is only one such choice for both the unnormalised density and the normalising constant: θ = θ* and ν = ν*.
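The log-sum inequality invoked above, and its equality case, can be checked numerically in the discrete form (a generic sketch, not tied to the paper's notation):

```python
import math
import random

# Discrete log-sum inequality:
#   sum_i a_i log(a_i / b_i) >= (sum_i a_i) * log(sum_i a_i / sum_i b_i),
# with equality iff a_i / b_i is constant in i.
random.seed(0)
a = [random.uniform(0.1, 2.0) for _ in range(5)]
b = [random.uniform(0.1, 2.0) for _ in range(5)]

lhs = sum(ai * math.log(ai / bi) for ai, bi in zip(a, b))
rhs = sum(a) * math.log(sum(a) / sum(b))
assert lhs >= rhs - 1e-12

# Equality case: b_i proportional to a_i.
b_prop = [2.5 * ai for ai in a]
lhs_eq = sum(ai * math.log(ai / bi) for ai, bi in zip(a, b_prop))
rhs_eq = sum(a) * math.log(sum(a) / sum(b_prop))
assert abs(lhs_eq - rhs_eq) < 1e-12
```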
We now prove that the NCE estimator converges almost surely to this unique maximiser. Let η > 0, and define K_η = {ξ ∈ K : d(ξ, ξ*) ≥ η}, where K is the compact set defined in (C2).
Under (H3), monotone convergence ensures that for any ξ ∈ K_η: Indeed, since the maps θ ↦ h_θ(x) are continuous, the two previous expectations (on the left hand side) are respectively bounded from above for ε small enough, and bounded from below for any ε. Thus, for any ξ ∈ K_η and any γ > 0, we can find ε_ξ > 0 such that, simultaneously: The compactness assumption ensures that there is a finite set {ξ_1, ..., This yields the following inequality: Choose any x for which the map θ ↦ h_θ(x) is continuous, and any ξ ∈ K_η. From the definition of ζ^(n)_β, the following convergence is immediate: Moreover, using inequalities (16) and (17), one can easily show that the sequence inf is dominated by a P_ψ-integrable function. Lemma 2 applies: In the last inequality, the right hand side is P_{θ*}-integrable under (H3), and the sequence (in the middle) converges pointwise towards its lower bound, whose negative part has either finite or infinite expectation. In both cases, either Lemma 2 or Lemma 3 applies and ensures that, almost surely: Combining these convergences simultaneously on a finite set, we get, almost surely: Since γ is arbitrarily small, this leads to the following inequality: This last inequality is the heart of the proof. To conclude, we only need to show that the right hand side is negative; this is the aim of the following lemma. The proof of Lemma 9 is straightforward. For the sake of completeness, we present a proof in the supplement (Appendix B).
Since an upper semi-continuous function achieves its maximum on any compact set, this lemma proves in particular that sup Thus inequality (19) implies that we can always find some α < 0 such that, eventually, sup Combining these facts shows that, with probability one, we eventually have: This is enough to prove strong consistency. Indeed, with probability one, ξ̂^NCE_{n,m} eventually escapes from K_η (otherwise there would be a contradiction with the inequality above). Since the sequence belongs to K by assumption, it has no choice but to stay eventually in the ball of radius η. Thus, with probability one, for any η > 0, we eventually have d(ξ̂^NCE_{n,m}, ξ*) < η. This is the definition of almost sure convergence.

A.5. Proof of Theorem 4
The proof of MC-MLE asymptotic normality is entirely classical. We thus present in this appendix only the proof of NCE asymptotic normality, which follows the same sketch but is slightly more technical. For the sake of completeness, a proof of MC-MLE asymptotic normality is presented in the supplement (Appendix B).

A.5.1. NCE asymptotic normality
We first show that the study can be reduced to the following random sequences: Finally, Lemma 5 applies:

A.6. Proof of Theorem 5
For convenience, we will use some shorthand notation. Define the real-valued measurable functions Q = f_θ/f_ψ and R = τ f_ψ/(τ f_ψ + f_θ). Note the relationship QR = τ(1 − R). In the following, assume that ξ = ξ*, and for any measurable function ϕ, note that E_{θ*}[ϕ] stands for the expectation of ϕ(X) where X ∼ P_{θ*}, and that ∇∇^T g_ξ stands for the measurable matrix-valued function x ↦ ∇_ξ g_ξ(x)(∇_ξ g_ξ(x))^T. We begin with the following computations: Fortunately, the expressions of the asymptotic variances simplify, as shown by the following lemma.
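The relationship QR = τ(1 − R) is a one-line algebraic identity; a quick numerical sketch, with f_theta and f_psi standing for arbitrary positive density values at a fixed point:

```python
import random

# Numerical check of the identity Q * R = tau * (1 - R), where
#   Q = f_theta / f_psi  and  R = tau * f_psi / (tau * f_psi + f_theta).
# Indeed Q * R = tau * f_theta / (tau * f_psi + f_theta) = tau * (1 - R).
random.seed(1)
for _ in range(1000):
    f_theta = random.uniform(0.01, 10.0)
    f_psi = random.uniform(0.01, 10.0)
    tau = random.uniform(0.01, 10.0)
    Q = f_theta / f_psi
    R = tau * f_psi / (tau * f_psi + f_theta)
    assert abs(Q * R - tau * (1 - R)) < 1e-9
```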
Lemma 10. Let Z be any real-valued, non-negative measurable function such that E_{θ*}[∇∇^T g_ξ Z] is finite and invertible. Then: The proof of Lemma 10 follows from a direct block matrix computation. For the sake of completeness, we present a proof in the supplement (Appendix B).
Let M be defined as in Lemma 10; matrix calculations yield Summing up these expressions, we finally get Now, to compare these variances, the idea is the following: (x, y) ↦ x²/y is a convex function on R × (0, ∞), so Jensen's inequality ensures that, for any random variables X, Y such that the following expectations exist, we have Here the variances are matrices, but it turns out that a generalisation of Jensen's inequality to the Loewner partial order on matrices can be used. We introduce the following notations: Proof. We just have to prove that ϕ is convex with respect to ⪯, i.e. that for any λ ∈ [0, 1], and any (A_1, B_1), (A_2, B_2) ∈ Δ_{n,m}: Indeed, if this convexity relationship on matrices is satisfied, then for any x ∈ R^m the real-valued map q : (A, B) ↦ x^T ϕ(A, B) x is necessarily convex on Δ_{n,m}. Consequently, Jensen's inequality applies, i.e. for any random pair (A, B) ∈ Δ_{n,m} a.s. and any x ∈ R^m we have which is the claim of the lemma. Now, to prove that ϕ is convex with respect to ⪯, we use a property of the generalised Schur complement of positive semi-definite matrices (see Boyd and Vandenberghe (2004), p. 651): let A ∈ S_n, B ∈ M_{n,m}, C ∈ S_m, and consider the block symmetric matrix Then we have This leads to a straightforward proof of the convexity of ϕ. To our knowledge, the following trick is due to Ando (1979), whose original proof was restricted to positive definite matrices; we use the generalised Schur complement to extend this result to any (A, B) ∈ Δ_{n,m}. Let λ ∈ [0, 1], and (A_1, B_1), (A_2, B_2) ∈ Δ_{n,m}. The sum of two positive semi-definite matrices is positive semi-definite, thus we have which is the same as Consequently, the generalised Schur complement of this last block matrix is also positive semi-definite, i.e.
which proves the convexity of ϕ with respect to ⪯, and thus the claim of the lemma.
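The joint convexity established above, in the spirit of Ando (1979), can be probed numerically in the simplified case where B is positive definite and ϕ(A, B) = A B^{-1} A (the lemma itself is stated with pseudo-inverses); a sketch on 2×2 matrices:

```python
import random

# Numerical sketch of joint convexity of phi(A, B) = A B^{-1} A in the
# Loewner order, restricted to 2x2 symmetric A and positive definite B:
# for any x, x^T phi(conv. comb.) x <= conv. comb. of the x^T phi(.,.) x.

def mat_mul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def inv2(B):
    # Explicit inverse of a 2x2 matrix.
    det = B[0][0] * B[1][1] - B[0][1] * B[1][0]
    return [[B[1][1] / det, -B[0][1] / det],
            [-B[1][0] / det, B[0][0] / det]]

def quad_form(A, B, x):
    # Returns x^T (A B^{-1} A) x.
    M = mat_mul(mat_mul(A, inv2(B)), A)
    return sum(x[i] * M[i][j] * x[j] for i in range(2) for j in range(2))

def rand_sym():
    a, b, c = (random.uniform(-1, 1) for _ in range(3))
    return [[a, b], [b, c]]

def rand_pd():
    # G^T G + I is symmetric positive definite.
    G = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(2)]
    S = mat_mul([[G[0][0], G[1][0]], [G[0][1], G[1][1]]], G)
    return [[S[0][0] + 1, S[0][1]], [S[1][0], S[1][1] + 1]]

random.seed(2)
for _ in range(200):
    A1, A2, B1, B2 = rand_sym(), rand_sym(), rand_pd(), rand_pd()
    lam = random.random()
    x = [random.uniform(-1, 1) for _ in range(2)]
    A = [[lam * A1[i][j] + (1 - lam) * A2[i][j] for j in range(2)] for i in range(2)]
    B = [[lam * B1[i][j] + (1 - lam) * B2[i][j] for j in range(2)] for i in range(2)]
    lhs = quad_form(A, B, x)
    rhs = lam * quad_form(A1, B1, x) + (1 - lam) * quad_form(A2, B2, x)
    assert lhs <= rhs + 1e-9
```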
Finally, we compare the asymptotic variances of the two estimators. Note that for any (A, B) ∈ S_n × S^{++}_n, and for every x ∈ R^n, we have Indeed, if A is positive semi-definite, then for some integer k we can find P ∈ M_{k,n} such that A = P^T P; moreover, B being symmetric, we have Inequality (22) is a direct application of Lemma 11 (take B = ∇∇^T g_ξ and A = BR; note that (A, B) ∈ Δ_{d+1,d+1} almost surely, and use basic properties of the pseudo-inverse).
Moreover, since (X_j)_{j≥1} is P_ψ-ergodic, the law of large numbers applies (even if the expectations are infinite, since the f_k and f are non-negative): Thus, there is a set of probability one on which, for every k ∈ N, Since the inequality holds for any k ∈ N, it also holds for the supremum over k: Finally, the monotone convergence theorem yields Consequently, the lower and upper limits are both equal to E_ψ[f(X)] almost surely.
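The truncation argument above can be summarised as follows (a sketch, with f_k denoting a non-decreasing sequence of integrable functions increasing pointwise to f):

```latex
% For each fixed k, the ergodic law of large numbers gives, almost surely,
%   liminf_m (1/m) sum_{j<=m} f(X_j) >= lim_m (1/m) sum_{j<=m} f_k(X_j)
%                                     = E_psi[f_k(X)].
% Taking the supremum over k and applying monotone convergence:
\liminf_{m \to \infty} \frac{1}{m} \sum_{j=1}^{m} f(X_j)
  \;\ge\; \sup_{k \ge 1} \mathbb{E}_{\psi}\!\left[ f_k(X) \right]
  \;=\; \mathbb{E}_{\psi}\!\left[ f(X) \right] \quad \text{a.s.}
```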

B.1.2. Proof of Lemma 2
Since (X_j)_{j≥1} is P_ψ-ergodic and f is dominated by the integrable function g, the law of large numbers applies to f. Thus, we only need to prove that To do so, use the fact that Finally, g is integrable, thus the remainder converges almost surely to zero:

B.1.3. Proof of Lemma 3
Since g is integrable and (X_j)_{j≥1} is P_ψ-ergodic, we have Thus we only need to show that Define h_m = g − sup_{k≥m} f_k, an increasing sequence of non-negative functions converging pointwise to g − f. Lemma 1 applies whether g − f is integrable or not: The following inequality shows that the expectation is indeed infinite:

B.1.4. Proof of Lemma 4
To begin, note that the measurability of the supremum is ensured by the lower semi-continuity of the maps θ ↦ ϕ(θ, x) on a set of probability one that does not depend on θ.
For every θ ∈ K, consider the following function: Dominated convergence implies that f_θ(η) converges to zero as η goes to zero. This is enough to ensure the continuity of the map θ ↦ E_ψ[ϕ(θ, X)], because of the following inequality: Let ε > 0. For every θ ∈ K, we can find η_{(θ,ε)} > 0 small enough that f_θ(η_{(θ,ε)}) < ε. Note that the open balls centered on θ ∈ K of radius η_{(θ,ε)} form an open cover of K, from which we can extract a finite subcover thanks to the compactness assumption. Thus we can build a finite set {φ_1, ..., φ_I} ⊂ K (the centers of the balls) such that Now, for any θ ∈ K, define i_θ as the smallest integer i ∈ {1, ..., I} such that θ ∈ B_i, and consider the following equality: The last three terms are functions of θ whose uniform norm we want to bound.
To sum up, we have just proven that for any ε > 0, almost surely, Since ε is arbitrarily small, we get the first claim of the lemma. Now, if θ_m → θ̄, then eventually ‖θ_m − θ̄‖ ≤ ε with probability one. This yields the following inequality for m large enough: The first term converges to zero, since the first claim of the lemma applies to the compact closure of B(θ̄, ε). The continuity of the map θ ↦ E_ψ[ϕ(θ, X)] ensures that the second term also goes to zero, proving the second claim of the lemma.

B.3.1. MC-MLE consistency
The following proof is a straightforward adaptation of Wald's proof of consistency for the MLE (Wald, 1949). The sketch of the proof is mainly inspired by Geyer (2012), which has the merit of giving a very accessible presentation of this technical proof.
To begin, define the opposite of the Kullback-Leibler divergence: Since the model is identifiable, λ has a unique maximiser, achieved at θ*. It may be −∞ for some values of θ, but this poses no problem in the following proof.

B.4. Proofs related to exponential families
The following calculations are entirely classical. For the sake of completeness, we present the few tricks required to prove Propositions 1, 2 and 3. To begin, define b(x) = sgn(S(x)), the vector composed of the signs of the components of S(x). Note that for any θ ∈ Θ, the following supremum is necessarily achieved on the boundary of the 1-ball, in the direction of the sign vector: sup_{‖φ−θ‖_1 ≤ ε} exp(φ^T S(x)) = exp((θ + ε b(x))^T S(x)).
Since ‖S(x)‖_1 = b(x)^T S(x), we have (for the 1-norm, for instance): which proves the claim of Proposition 2, since For Propositions 1 and 3, also use the fact that ‖S(x)‖_1 = b(x)^T S(x) and that y ≤ e^y for any y ∈ R. We have Equations (26) and (27) can be combined as follows: Choosing θ = θ̂_n in the preceding inequalities yields Proposition 1, while choosing θ = θ* yields Proposition 3.
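The bound behind these propositions, stated here as an inequality, can be checked numerically: for any φ with ‖φ − θ‖_1 ≤ ε, one has φ^T S ≤ θ^T S + ε max_i |S_i| ≤ θ^T S + ε b^T S, since b^T S = ‖S‖_1. A sketch with random draws:

```python
import math
import random

# Numerical check of the upper bound used for Proposition 2: with
# b = sgn(S), every phi in the l1-ball of radius eps around theta satisfies
#   exp(phi^T S) <= exp((theta + eps * b)^T S).
random.seed(3)
d = 4
theta = [random.uniform(-1, 1) for _ in range(d)]
S = [random.uniform(-2, 2) for _ in range(d)]
b = [1.0 if s >= 0 else -1.0 for s in S]
eps = 0.5

bound = math.exp(sum((t + eps * bi) * s for t, bi, s in zip(theta, b, S)))

for _ in range(1000):
    # Draw a random point phi with ||phi - theta||_1 <= eps.
    delta = [random.uniform(-1, 1) for _ in range(d)]
    norm1 = sum(abs(v) for v in delta)
    scale = eps * random.random() / norm1
    phi = [t + scale * v for t, v in zip(theta, delta)]
    assert math.exp(sum(p * s for p, s in zip(phi, S))) <= bound + 1e-12
```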