Article

Efficiency Bound of Local Z-Estimators on Discrete Sample Spaces

by
Takafumi Kanamori
Department of Computer Science and Mathematical Informatics, Nagoya University, Furocho, Chikusaku, Nagoya 464-8603, Japan
Entropy 2016, 18(7), 273; https://doi.org/10.3390/e18070273
Submission received: 14 June 2016 / Revised: 16 July 2016 / Accepted: 20 July 2016 / Published: 23 July 2016

Abstract:
Many statistical models over discrete sample spaces face the computational difficulty of the normalization constant, which renders the maximum likelihood estimator impractical. In order to circumvent this computational difficulty, alternative estimators such as the pseudo-likelihood and composite likelihood, which require only a local computation over the sample space, have been proposed. In this paper, we present a theoretical analysis of such localized estimators. The asymptotic variance of a localized estimator depends on the neighborhood system on the sample space. We investigate the relation between the neighborhood system and the estimation accuracy of localized estimators. Moreover, we derive the efficiency bound. The theoretical results are applied to investigate the statistical properties of existing estimators and some extended ones.

1. Introduction

For many statistical models on a discrete sample space, the computation of the normalization constant is often intractable. As a result, the maximum likelihood estimator (MLE) is not of practical use for estimating probability distributions, although the MLE has desirable theoretical properties such as statistical consistency and efficiency under some regularity conditions [1].
In order to circumvent the computation of the normalization constant, alternative estimators that require only a local computation over the sample space have been proposed. In this paper, estimators based on this concept are called localized estimators. Examples of localized estimators include pseudo-likelihood [2], composite likelihood [3,4], ratio matching [5,6], proper local scoring rules [7,8], and many others. These estimators are used for discrete statistical models such as conditional random fields [9], Boltzmann machines [10], restricted Boltzmann machines [11], discrete exponential family harmoniums [12], and Ising models [13].
In this paper, we present a theoretical analysis of localized estimators using the standard tools of statistical asymptotic theory. In our analysis, a class of localized estimators including the pseudo-likelihood and composite likelihood is treated as an M-estimator or Z-estimator, which are extensions of the MLE [1]. Localized estimators require local computation around a neighborhood of observed points. Hence, the asymptotic variance of a localized estimator depends on the size of the neighborhood. We investigate the relation between the estimation accuracy and the neighborhood system. A similar result is given in [14], in which the asymptotic variances of specific composite likelihoods are compared. In our approach, we consider a stochastic variant of localized estimators and derive a general result that a larger neighborhood leads to a more efficient estimator under a simple condition. The pseudo-likelihood and composite likelihood are obtained as the expectation of a stochastic localized estimator. We derive the exact efficiency bound for the expected localized estimator. As far as we know, the derivation of the efficiency bound is a new result for this class of localized estimators, although upper and lower bounds have been proposed [14] for some localized estimators.
The rest of the paper is organized as follows. In Section 2, we introduce basic concepts such as the pseudo-likelihood, composite likelihood, and Z-estimators. Section 3 defines the stochastic local Z-estimator associated with a neighborhood system over the discrete sample space. In Section 4, we study the relation between the neighborhood system and the asymptotic efficiency of the stochastic local Z-estimator. In Section 5, we define the local Z-estimator as the expectation of the stochastic local Z-estimator and present its efficiency bound. The theoretical results are applied to study asymptotic properties of existing estimators and some extended ones. Finally, Section 6 concludes the paper with a discussion.

2. Preliminaries

M-estimators and Z-estimators were proposed as extensions of the MLE. In practice, M-estimators and Z-estimators are often computationally demanding due to the normalization constant in statistical models. To circumvent this computational difficulty, localized estimators have been proposed. We introduce some existing localized estimators, particularly those on discrete sample spaces. In later sections, we consider statistical properties of a localized variant of Z-estimators.
Let us summarize the notation used throughout the paper. Let $\mathbb{R}$ be the set of all real numbers. The discrete sample space is denoted as $\mathcal{X}$. The statistical model $p_\theta(x)$ for $x \in \mathcal{X}$ with the parameter $\theta \in \Theta \subset \mathbb{R}^d$ is also expressed as $p(x;\theta)$. A vector $a$ usually denotes a column vector, and $(\cdot)^T$ denotes the transpose of a vector or matrix. For a linear space $T$ and an integer $d$, $(T)^d$ denotes the $d$-fold product space of $T$, and an element $c \in (T)^d$ is expressed as $c = (c_1,\ldots,c_d)$. The product space of two mutually orthogonal subspaces $T_1$ and $T_2$ is denoted as $T_1 \oplus T_2$. For a function $f(\theta)$ of the parameter $\theta \in \mathbb{R}^d$, $\nabla f$ denotes the gradient vector $(\partial f/\partial\theta_1, \ldots, \partial f/\partial\theta_d)^T$. The indicator function is denoted as $1[A]$, which takes the value 1 if $A$ is true and 0 otherwise.

2.1. M- and Z-Estimators

Suppose that the samples $x_1,\ldots,x_m$ are i.i.d. distributed from the probability $p(x)$ over the discrete sample space $\mathcal{X}$. A statistical model $p_\theta(x) = p(x;\theta)$ with the parameter $\theta \in \Theta \subset \mathbb{R}^d$ is assumed to estimate $p(x)$. In this paper, our concern is the statistical efficiency of estimators. Hence, we suppose that the statistical model includes $p(x)$.
The MLE is commonly used to estimate $p(x)$. It uses the negative log-likelihood of the model, $-\log p_\theta(x)$, as the loss function, and the estimator is given by the minimizer of its empirical mean, $-\frac{1}{m}\sum_{i=1}^m \log p_\theta(x_i)$.
Generally, the estimator obtained as the minimizer of a loss function is referred to as an M-estimator; the MLE is an example. When the loss function is differentiable, the gradient of the loss function vanishes at the estimated parameter. Instead of minimizing a loss function, solving a system of equations also provides an estimator of the parameter. Such an estimator is called a Z-estimator [1]. For the MLE, the system of equations is given as
$$\frac{1}{m}\sum_{i=1}^{m} \nabla \log p_\theta(x_i) = 0,$$
where $0 \in \mathbb{R}^d$ is the null vector. The gradient $\nabla \log p_\theta(x)$ is known as the score function of the model $p_\theta(x)$. In this paper, the score function is denoted as
$$u_\theta(x) = \nabla \log p_\theta(x).$$
In general, the Z-estimator is defined as the solution of the system of equations
$$\frac{1}{m}\sum_{i=1}^{m} f_\theta(x_i) = 0,$$
where the $\mathbb{R}^d$-valued function $f_\theta(x) = f(x;\theta)$ is referred to as the identification function [15,16]. For an M-estimator, the identification function is the gradient of the loss function. In general, however, the identification function itself need not be expressed as the gradient of a loss function; it may fail to be integrable. With some abuse of terminology, the identification function $f_\theta(x)$ is also called a Z-estimator.
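As a numerical illustration of the estimating-equation view, the following sketch solves a one-dimensional Z-estimation problem. The Bernoulli model, the damped fixed-point solver, and all names here are our own toy choices, not part of the paper.

```python
import numpy as np

def z_estimate(f, samples, theta0, lr=0.05, tol=1e-12, max_iter=5000):
    """Solve the estimating equation (1/m) sum_i f(x_i; theta) = 0
    by a damped fixed-point iteration (a minimal 1-D sketch)."""
    theta = theta0
    for _ in range(max_iter):
        g = np.mean(f(samples, theta))
        if abs(g) < tol:
            break
        theta += lr * g
    return theta

# Toy example: Bernoulli model p_theta(x) = theta^x (1 - theta)^(1 - x).
# With the score as identification function, the root of the empirical
# estimating equation is the MLE, i.e., the sample mean.
rng = np.random.default_rng(0)
xs = rng.binomial(1, 0.3, size=10_000).astype(float)
score = lambda x, t: x / t - (1.0 - x) / (1.0 - t)
theta_hat = z_estimate(score, xs, theta0=0.5)
print(abs(theta_hat - xs.mean()) < 1e-6)   # -> True
```

For this model the Z-estimator and the M-estimator (minimizing the negative log-likelihood) coincide, since the score is the gradient of the loss.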

2.2. Localized Estimators

Below, let us introduce some localized estimators. The statistical model defined on the discrete set $\mathcal{X}$ is denoted by
$$p_\theta(x) = \frac{\tilde{p}_\theta(x)}{Z_\theta} \qquad (1)$$
for $x \in \mathcal{X}$, where $Z_\theta$ is the normalization constant at the parameter $\theta$, i.e.,
$$Z_\theta = \sum_{x \in \mathcal{X}} \tilde{p}_\theta(x).$$
Throughout the paper, we assume $p_\theta(x) > 0$ for all $x \in \mathcal{X}$ and all $\theta \in \Theta \subset \mathbb{R}^d$.
Example 1 (Pseudo-likelihood). 
Suppose that $\mathcal{X}$ is expressed as the product space $\mathcal{X} = \mathcal{X}_1 \times \cdots \times \mathcal{X}_n$, where $\mathcal{X}_1,\ldots,\mathcal{X}_n$ are finite sets such as $\{0,1\}$. For $x = (x_1,\ldots,x_n) \in \mathcal{X}$, let $x_{-k}$ be the $(n-1)$-dimensional vector obtained by dropping the $k$-th element of $x$. The loss function of the pseudo-likelihood, $S_{\mathrm{PS}}$, is defined as the negative log-likelihood of the conditional probability $p_\theta(x_k \mid x_{-k})$ derived from $p_\theta(x)$, i.e.,
$$S_{\mathrm{PS}}(x, p_\theta) = -\sum_{k=1}^{n} \log p_\theta(x_k \mid x_{-k}) = -\sum_{k=1}^{n} \Big\{ \log \tilde{p}_\theta(x) - \log \Big( \sum_{x_k \in \mathcal{X}_k} \tilde{p}_\theta(x) \Big) \Big\}. \qquad (2)$$
The pseudo-likelihood does not require the normalization constant, and it enjoys statistical consistency of the parameter estimation [2,17]. The identification function of the corresponding Z-estimator is obtained as the gradient vector of the loss function (2).
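The locality of the pseudo-likelihood loss can be made concrete with a small sketch. The fully visible Boltzmann-machine-type model below (parameters `W`, `b`) is our own toy example; the point is that each term of the loss sums the unnormalized model over a single coordinate only, so $Z_\theta$ never appears.

```python
import numpy as np
from itertools import product

# Toy model on X = {0,1}^n: ptilde_theta(x) = exp(x'Wx/2 + b'x).
rng = np.random.default_rng(1)
n = 3
W = rng.normal(size=(n, n)); W = (W + W.T) / 2; np.fill_diagonal(W, 0.0)
b = rng.normal(size=n)
log_ptilde = lambda x: 0.5 * x @ W @ x + b @ x

def pseudo_loss(x):
    """Pseudo-likelihood loss S_PS(x, p_theta): each term only sums
    ptilde over the k-th coordinate, so Z_theta never appears."""
    loss = 0.0
    for k in range(n):
        logw = []
        for zk in (0.0, 1.0):                 # X_k = {0, 1}
            z = x.copy(); z[k] = zk
            logw.append(log_ptilde(z))
        # -log p_theta(x_k | x_{-k})
        loss -= log_ptilde(x) - np.log(np.sum(np.exp(logw)))
    return loss

x = np.array([1.0, 0.0, 1.0])
print(pseudo_loss(x))
```

Each of the $n$ terms costs $O(|\mathcal{X}_k|)$ evaluations of $\tilde p_\theta$, versus the $O(|\mathcal{X}|) = O(2^n)$ sum required by $Z_\theta$.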
Example 2 (Composite likelihood). 
The composite likelihood was proposed as an extension of the pseudo-likelihood [3]. Suppose that $\mathcal{X}$ is the product space of Example 1. For an index subset $A \subset \{1,\ldots,n\}$, let $x_A = (x_i)_{i \in A}$ be the corresponding subvector of $x \in \mathcal{X}$. For each $\ell = 1,\ldots,M$, suppose that $A_\ell$ and $B_\ell$ are a pair of disjoint subsets of $\{1,\ldots,n\}$, and let $C_\ell$ be the complement of their union, i.e., $C_\ell = (A_\ell \cup B_\ell)^c$. Given positive constants $\gamma_1,\ldots,\gamma_M$, the loss function of the composite likelihood, $S_{\mathrm{CL}}$, is defined as
$$S_{\mathrm{CL}}(x, p_\theta) = -\sum_{\ell=1}^{M} \gamma_\ell \log p_\theta(x_{A_\ell} \mid x_{B_\ell}) = -\sum_{\ell=1}^{M} \gamma_\ell \log \bigg\{ \frac{\sum_{x_{C_\ell}} \tilde{p}_\theta(x)}{\sum_{x_{B_\ell^c}} \tilde{p}_\theta(x)} \bigg\}.$$
The composite likelihood with the subsets $A_\ell = \{\ell\}$, $B_\ell = A_\ell^c$, and the positive constants $\gamma_\ell = 1$ for $\ell = 1,\ldots,n$ yields the pseudo-likelihood. Like the pseudo-likelihood, the composite likelihood is statistically consistent under some regularity conditions [4].
Originally, the pseudo- and composite likelihoods were proposed to deal with spatial data [2,3]. As a generalization of these estimators, a localized variant of scoring rules works efficiently for the statistical analysis of discrete spatial data [18].
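A short sketch of the composite-likelihood loss, again on a toy Boltzmann-type model of our own (not from the paper): for each block pair $(A_\ell, B_\ell)$, the numerator marginalizes the coordinates in $C_\ell$ and the denominator those in $B_\ell^c$, using only the unnormalized model.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(2)
n = 3
W = rng.normal(size=(n, n)); W = (W + W.T) / 2; np.fill_diagonal(W, 0.0)
b = rng.normal(size=n)
log_ptilde = lambda x: 0.5 * x @ W @ x + b @ x

def marginal(x, free):
    """Sum of ptilde over all {0,1}-completions of coordinates in `free`."""
    total = 0.0
    for vals in product((0.0, 1.0), repeat=len(free)):
        z = x.copy()
        for i, v in zip(free, vals):
            z[i] = v
        total += np.exp(log_ptilde(z))
    return total

def composite_loss(x, blocks, gammas):
    """S_CL(x, p_theta) for blocks[l] = (A_l, B_l); C_l is the rest."""
    loss = 0.0
    for (A, B), g in zip(blocks, gammas):
        C = [i for i in range(n) if i not in A and i not in B]
        loss -= g * np.log(marginal(x, C) / marginal(x, C + list(A)))
    return loss

# With A_l = {l}, B_l = A_l^c and gamma_l = 1, S_CL reduces to the
# pseudo-likelihood, as stated in the text.
x = np.array([0.0, 1.0, 1.0])
ps_blocks = [([k], [i for i in range(n) if i != k]) for k in range(n)]
print(composite_loss(x, ps_blocks, [1.0] * n))
```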

3. A Stochastic Variant of Z-Estimators

In this section, we define a stochastic variant of Z-estimators. For the discrete sample space $\mathcal{X}$, suppose that the neighborhood system $\mathcal{N}$ is defined as a subset of the power set $2^{\mathcal{X}}$, i.e., $\mathcal{N}$ is a family of subsets of $\mathcal{X}$. Let us define the neighborhood system at $x \in \mathcal{X}$ by $\mathcal{N}_x = \{ e \in \mathcal{N} \mid x \in e \}$. We assume that $\mathcal{N}_x$ is non-empty for any $x$. In some classes of localized estimators, the neighborhood system is expressed using an undirected graph on $\mathcal{X}$ [7]. In our setup, the neighborhood system is not necessarily expressed by an undirected graph, and we allow the neighborhood system to possess multiple neighborhoods at each point $x$.
Let us define the stochastic Z-estimator. A conditional probability of the set $e \in \mathcal{N}$ given $x \in \mathcal{X}$ is denoted as $q(e \mid x)$. We assume $q(e \mid x) = 0$ if $e \notin \mathcal{N}_x$ throughout the paper. Given a sample $x$, we randomly generate a neighborhood $e$ from the conditional probability $q(e \mid x)$. Using i.i.d. copies of $(x, e)$, we estimate $p(x)$. Here, the statistical model $p_\theta(x)$ of the form (1) is used. We use the Z-estimator $f_\theta(x,e) = f(x,e;\theta) \in \mathbb{R}^d$ to estimate the parameter $\theta \in \Theta \subset \mathbb{R}^d$. The $k$-th element of $f_\theta(x,e)$ is denoted as $f_{\theta,k}(x,e)$ or $f_k(x,e;\theta)$ for $k = 1,\ldots,d$. The expectation under the probability $p_\theta(x)\, q(e \mid x)$ is written as $E_{\theta,q}[\cdot]$. Suppose that the equality
$$E_{\theta,q}[f_\theta] = 0 \qquad (3)$$
holds for all $\theta \in \Theta$. In addition, we assume that the vectors $E_{\theta,q}[\nabla f_{\theta,k}]$, $k = 1,\ldots,d$, are linearly independent, meaning that $f_\theta$ depends substantially on the parameter $\theta$ [19]. The solution of the system of equations
$$\frac{1}{m}\sum_{i=1}^{m} f_\theta(x_i, e_i) = 0 \qquad (4)$$
produces a statistically consistent estimator under some regularity conditions [1]. In the parameter estimation of the model $p_\theta(x)$, the stochastic Z-estimator is defined as the solution of (4) using an identification function satisfying (3). As shown in Section 5, stochastic Z-estimators are useful for investigating statistical properties of the standard pseudo-likelihood and composite likelihood in Examples 1 and 2.
The computational tractability of the stochastic Z-estimator is not necessarily guaranteed. The MLE, using the score function $f_\theta(x,e) = u_\theta(x)$, is regarded as a stochastic Z-estimator for any $q(e \mid x)$, and it may not be computationally tractable because of the normalization constant. As a class of computationally efficient estimators, let us define the stochastic local Z-estimator as the stochastic Z-estimator using $f_\theta(x,e)$ satisfying
$$E_{\theta,q}[f_\theta \mid e] = 0 \qquad (5)$$
for any neighborhood $e \in \mathcal{N}$, where $E_{\theta,q}[\cdot \mid e]$ is the conditional expectation given $e$. The conditional probability $p(x \mid e)$ derived from $p_\theta(x)\, q(e \mid x)$ can take a positive value only when $x \in e$. Hence, $f_\theta(x,e)$ depends only on the neighborhood of $x$, and its computation will be tractable.
Example 3 (Stochastic pseudo-likelihood). 
Let us define the stochastic variant of the pseudo-likelihood estimator. On the sample space $\mathcal{X} = \mathcal{X}_1 \times \cdots \times \mathcal{X}_n$, the neighborhood system $\mathcal{N}_x$ at $x = (x_1,\ldots,x_n) \in \mathcal{X}$ is defined as $\mathcal{N}_x = \{ e_{x,k} \mid k = 1,\ldots,n \}$, where $e_{x,k} \subset \mathcal{X}$ is given as
$$e_{x,k} = \{ (x_1,\ldots,x_{k-1},z_k,x_{k+1},\ldots,x_n) \mid z_k \in \mathcal{X}_k \}.$$
In order to estimate the parameters in complex models such as conditional random fields and Boltzmann machines, the union $\bigcup_{e \in \mathcal{N}_x} e$ is often used as the neighborhood at $x$ [2,9,17]. Define the conditional probability $q(e \mid x)$ on $\mathcal{N}_x$ by $q(e_{x,k} \mid x) = q_k$, $k = 1,\ldots,n$, where $q_1,\ldots,q_n$ are positive numbers satisfying $\sum_{k=1}^n q_k = 1$. The identification function of the stochastic pseudo-likelihood is defined by
$$f_\theta(x,e) = \nabla \log \frac{p_\theta(x)}{\sum_{z \in e} p_\theta(z)}$$
for $e \in \mathcal{N}_x$. Then, $f_\theta(x, e_{x,k})$ is equal to $\nabla \log p_\theta(x_k \mid x_{-k})$. The conditional probability $p(x \mid e)$ derived from $p_\theta(x)\, q(e \mid x)$ is given as
$$p(x \mid e_{x,k}) = \frac{p_\theta(x)\, q(e_{x,k} \mid x)}{\sum_{z \in e_{x,k}} p_\theta(z)\, q(e_{x,k} \mid z)} = \frac{p_\theta(x)\, q_k}{\sum_{z \in e_{x,k}} p_\theta(z)\, q_k} = p_\theta(x_k \mid x_{-k}),$$
where we used the equality $e_{x,k} = e_{z,k}$ for $z \in e_{x,k}$. Hence, the equality (5) holds for any $(q_k)_{k=1,\ldots,n}$. When $q_k$ depends on $x$, $p(x \mid e_{x,k})$ differs from $p_\theta(x_k \mid x_{-k})$ in general.
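The vanishing conditional expectation in (5) can be checked numerically. The sketch below uses a toy 3-bit Boltzmann-type model of our own (parameters `W`, `b`) in which the gradient of $\log p_\theta(x_k \mid x_{-k})$ with respect to $b$ has a single nonzero entry, $x_k - p_\theta(x_k = 1 \mid x_{-k})$.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 3
W = rng.normal(size=(n, n)); W = (W + W.T) / 2; np.fill_diagonal(W, 0.0)
b = rng.normal(size=n)
eta = lambda x: 0.5 * x @ W @ x + b @ x      # log ptilde_theta(x)

def cond1(x, k):
    """p_theta(x_k = 1 | x_{-k}), computed from unnormalized weights only."""
    z0, z1 = x.copy(), x.copy()
    z0[k], z1[k] = 0.0, 1.0
    w0, w1 = np.exp(eta(z0)), np.exp(eta(z1))
    return w1 / (w0 + w1)

def f_ident(x, k):
    """f_theta(x, e_{x,k}) = grad_b log p_theta(x_k | x_{-k}):
    only the k-th coordinate is nonzero for this toy model."""
    g = np.zeros(n)
    g[k] = x[k] - cond1(x, k)
    return g

# E_{theta,q}[f | e_{x,k}] under p(x | e_{x,k}) = p_theta(x_k | x_{-k})
# should vanish for every neighborhood, as (5) requires.
x = np.array([1.0, 1.0, 0.0])
ok = True
for k in range(n):
    z0, z1 = x.copy(), x.copy()
    z0[k], z1[k] = 0.0, 1.0
    p1 = cond1(x, k)
    cond_mean = (1.0 - p1) * f_ident(z0, k) + p1 * f_ident(z1, k)
    ok = ok and np.allclose(cond_mean, 0.0)
print(ok)   # -> True
```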
Example 4 (Stochastic variant of composite likelihood). 
Let us introduce a stochastic variant of the composite likelihood on the sample space $\mathcal{X} = \mathcal{X}_1 \times \cdots \times \mathcal{X}_n$; the notation of Example 2 is used below. Define $e_{x,\ell}$, $\ell = 1,\ldots,M$, by the subset $e_{x,\ell} = \{ y \in \mathcal{X} \mid y_{B_\ell} = x_{B_\ell} \}$, and the neighborhood system $\mathcal{N}_x$ by $\mathcal{N}_x = \{ e_{x,\ell} \mid \ell = 1,\ldots,M \}$. We assume that the map from $\ell$ to $B_\ell$ is one-to-one; in other words, the disjoint subsets $A_\ell, B_\ell, C_\ell$ can be specified from the neighborhood $e_{x,\ell}$. Suppose that the conditional probability $q(e \mid x)$ on $\mathcal{N}_x$ is defined as $q(e_{x,\ell} \mid x) = q_\ell$ for $\ell = 1,\ldots,M$, where $q_1,\ldots,q_M$ are positive numbers satisfying $\sum_{\ell=1}^M q_\ell = 1$. As in Example 3, we see that the conditional probability $p(x \mid e_{x,\ell})$ defined from $p_\theta(x)\, q(e \mid x)$ is given as $p_\theta(x_{A_\ell}, x_{C_\ell} \mid x_{B_\ell})$. Let us consider the identification function
$$f_\theta(x, e_{x,\ell}) = \nabla \log \frac{\sum_{x_{C_\ell}} p_\theta(x)}{\sum_{x_{B_\ell^c}} p_\theta(x)}, \qquad (6)$$
which is nothing but $\nabla \log p_\theta(x_{A_\ell} \mid x_{B_\ell})$. Then, (5) holds under the joint probability $p_\theta(x)\, q(e \mid x)$. Indeed, we have
$$E_{\theta,q}[f_\theta \mid e_{x,\ell}] = \sum_{x_{A_\ell}, x_{C_\ell}} p_\theta(x_{A_\ell}, x_{C_\ell} \mid x_{B_\ell})\, \nabla \log p_\theta(x_{A_\ell} \mid x_{B_\ell}) = 0$$
for any $(q_\ell)_{\ell=1,\ldots,M}$. In this paper, the Z-estimator using (6) is called the reduced stochastic composite likelihood (reduced-SCL). The stochastic composite likelihood (SCL) proposed in [20] is a randomized extension of the above $f_\theta(x,e)$. Let $z = (z_1,\ldots,z_M)$ be a binary random vector taking values in $\{0,1\}^M$, and let $\alpha_1,\ldots,\alpha_M$ be positive constants. The SCL is defined as the Z-estimator obtained from
$$f(x, z; \theta) = \sum_{\ell=1}^{M} \alpha_\ell\, z_\ell\, \nabla \log p(x_{A_\ell} \mid x_{B_\ell}; \theta).$$
The statistical consistency and asymptotic normality of the SCL are shown in [20].

4. Neighborhood Systems and Asymptotic Variances

We consider the relation between neighborhood systems and statistical properties of stochastic local Z-estimators.

4.1. Tangent Spaces of Statistical Models

At the beginning, let us introduce some geometric concepts to investigate statistical properties of localized estimators. These concepts are borrowed from information geometry [21]. For the neighborhood system $\mathcal{N}$ with the conditional probability $q(e \mid x)$, let us define the linear space $T_{\theta,q}$ as
$$T_{\theta,q} = \{ a : \mathcal{X} \times \mathcal{N} \to \mathbb{R} \mid E_{\theta,q}[a] = 0,\ a(x,e) = 0 \text{ if } q(e \mid x) = 0 \}.$$
The inner product for $a_1, a_2 \in T_{\theta,q}$ is defined as $E_{\theta,q}[a_1 a_2]$. A geometric meaning of $T_{\theta,q}$ is the tangent space of the statistical model $\{ p_\theta(x)\, q(e \mid x) \mid \theta \in \Theta \}$. For any $a \in T_{\theta,q}$ and sufficiently small $\varepsilon > 0$, the perturbation of $p_\theta(x)\, q(e \mid x)$ in the direction $a(x,e)$ leads to the probability function $p_\theta(x)\, q(e \mid x)(1 + \varepsilon\, a(x,e))$. Each element of the score function, $u_{\theta,j}(x)$, $j = 1,\ldots,d$, is a member of $T_{\theta,q}$ when regarded as $u_{\theta,j}(x) \cdot 1[q(e \mid x) \neq 0]$.
Let us consider the stochastic Z-estimator derived from $f_\theta = (f_{\theta,1},\ldots,f_{\theta,d})$ satisfying $f_\theta \in (T_{\theta,q})^d$ for any $\theta$; it leads to a Fisher consistent estimator. Stochastic local Z-estimators use an identification function in the linear subspace
$$T_{\theta,q}^{L} = \{ f \in T_{\theta,q} \mid E_{\theta,q}[f \mid e] = 0 \text{ for all } e \in \mathcal{N} \}.$$
The orthogonal complement of $T_{\theta,q}^{L}$ in $T_{\theta,q}$ is denoted as $T_{\theta,q}^{E}$, which is given as
$$T_{\theta,q}^{E} = \{ f \in T_{\theta,q} \mid f(x,e) \text{ does not depend on } x \}.$$
Indeed, the orthogonality of $T_{\theta,q}^{L}$ and $T_{\theta,q}^{E}$ is confirmed by
$$E_{\theta,q}[a(x,e)\, b(e)] = E_{\theta,q}\big[ b(e)\, E_{\theta,q}[a(x,e) \mid e] \big] = 0$$
for any $a \in T_{\theta,q}^{L}$ and any $b \in T_{\theta,q}^{E}$. In addition, any $f \in T_{\theta,q}$ can be decomposed as
$$f = \big(f - E_{\theta,q}[f \mid e]\big) + E_{\theta,q}[f \mid e]$$
such that $f - E_{\theta,q}[f \mid e] \in T_{\theta,q}^{L}$ and $E_{\theta,q}[f \mid e] \in T_{\theta,q}^{E}$.
The efficient score $u_\theta^{I} = (u_{\theta,k}^{I})_{k=1,\ldots,d}$ is defined as the projection of each element of the score $u_\theta$ onto $T_{\theta,q}^{L}$, i.e.,
$$u_\theta^{I}(x,e) = u_\theta(x) - E_{\theta,q}[u_\theta \mid e] = \nabla \log \tilde{p}(x;\theta) - E_{\theta,q}[\nabla \log \tilde{p}(x;\theta) \mid e].$$
The efficient score is computationally tractable when the size of the neighborhood $e$ is not of exponential order but of linear or low-degree polynomial order in $n$, where $n$ is the dimension of $x$. The trade-off between computational and statistical efficiency is presented in Theorems 1 and 2 in Section 4.2.
Another expression of the efficient score is
$$u_\theta^{I}(x,e) = \nabla \log p(x \mid e; \theta, q),$$
where the conditional probability $p(x \mid e; \theta, q)$ is defined from $p_\theta(x)\, q(e \mid x)$. The above equality is obtained by
$$u_\theta^{I}(x,e) = \nabla \log p_\theta(x) - \sum_{x' \in e} \frac{p_\theta(x')\, q(e \mid x')}{\sum_{x'' \in e} p_\theta(x'')\, q(e \mid x'')}\, \nabla \log p_\theta(x') = \nabla \log \big( p_\theta(x)\, q(e \mid x) \big) - \nabla \log \sum_{x' \in e} p_\theta(x')\, q(e \mid x') = \nabla \log \frac{p_\theta(x)\, q(e \mid x)}{\sum_{x' \in e} p_\theta(x')\, q(e \mid x')} = \nabla \log p(x \mid e; \theta, q),$$
where we used the fact that $q(e \mid x)$ does not depend on $\theta$.
We define $T_{\theta,q}^{I}$ as the subspace of $T_{\theta,q}^{L}$ spanned by $\{ u_{\theta,k}^{I} \mid k = 1,\ldots,d \}$, and let $T_{\theta,q}^{A}$ be the orthogonal complement of $T_{\theta,q}^{I}$ in $T_{\theta,q}^{L}$. As a result, we obtain
$$T_{\theta,q} = T_{\theta,q}^{L} \oplus T_{\theta,q}^{E}, \qquad T_{\theta,q}^{L} = T_{\theta,q}^{I} \oplus T_{\theta,q}^{A}.$$
We describe statistical properties of stochastic local Z-estimators using the above tangent spaces.

4.2. Asymptotic Variance of Stochastic Local Z-Estimators

Under a fixed conditional probability $q(e \mid x)$, we derive the asymptotically efficient stochastic local Z-estimator in the same way as in semiparametric estimation [19,22]. In addition, we consider the monotonicity of the efficiency with respect to the size of the neighborhood. Given i.i.d. samples $(x_i, e_i)$, $i = 1,\ldots,m$, generated from $p(x)\, q(e \mid x)$, the estimator $\hat{\theta}$ of the parameter in the model (1) is obtained by solving the system of equations (4), where $f_\theta \in (T_{\theta,q})^d$ for any $\theta \in \Theta$. Suppose that the true probability function $p(x)$ is realized by $p_\theta(x)$ in the model (1). As shown in [1], the standard asymptotic theory yields that the asymptotic variance of the above Z-estimator is given as
$$\lim_{m \to \infty} m \cdot V[\hat{\theta}] = E_{\theta,q}[f_\theta u_\theta^T]^{-1}\, E_{\theta,q}[f_\theta f_\theta^T]\, E_{\theta,q}[u_\theta f_\theta^T]^{-1}. \qquad (9)$$
The derivation of the asymptotic variance is presented in the Appendix for completeness of the presentation.
We consider the asymptotic variance of the stochastic local Z-estimators. A simple expression of the asymptotic variance is obtained using the efficient score $u_\theta^{I}$. Without loss of generality, the identification function of the stochastic local Z-estimator, $f_\theta \in (T_{\theta,q}^{L})^d$, is expressed as
$$f_\theta(x,e) = u_\theta^{I}(x,e) + a_\theta(x,e),$$
where $a_\theta \in (T_{\theta,q}^{A})^d$. The reason is briefly shown below. Suppose that $f_\theta(x,e)$ is decomposed as $f_\theta(x,e) = B_\theta u_\theta^{I}(x,e) + a_\theta(x,e)$, where $B_\theta$ is a $d \times d$ matrix that does not depend on $x$ and $e$. The condition that the matrix $E_{\theta,q}[\nabla f_\theta]$ is invertible ensures that $B_\theta$ is invertible, since
$$E_{\theta,q}\!\left[\frac{\partial}{\partial \theta_i} f_{\theta,k}\right] = \sum_j \frac{\partial B_{\theta,kj}}{\partial \theta_i} E_{\theta,q}\big[u_{\theta,j}^{I}\big] + \sum_j B_{\theta,kj}\, E_{\theta,q}\!\left[\frac{\partial}{\partial \theta_i} u_{\theta,j}^{I}\right] + E_{\theta,q}\!\left[\frac{\partial}{\partial \theta_i} a_{\theta,k}\right] = -\sum_j B_{\theta,kj}\, E_{\theta,q}\big[u_{\theta,i} u_{\theta,j}^{I}\big] - E_{\theta,q}\big[u_{\theta,i} a_{\theta,k}\big] = -\sum_j B_{\theta,kj}\, E_{\theta,q}\big[u_{\theta,i}^{I} u_{\theta,j}^{I}\big]$$
holds, where the first term vanishes because $E_{\theta,q}[u_{\theta,j}^{I}] = 0$. In the above equalities, we use the formula
$$E_{\theta,q}\!\left[\frac{\partial}{\partial \theta_i} f_\theta\right] = -E_{\theta,q}\big[u_{\theta,i} f_\theta\big]$$
for $f_\theta \in T_{\theta,q}$, which is obtained by differentiating the identity $E_{\theta,q}[f_\theta] = 0$. Clearly, $f_\theta(x,e)$ provides the same estimator as $B_\theta^{-1} f_\theta(x,e)$. See [19] for details of the standard form of Z-estimators.
Theorem 1. 
Let us define the $d \times d$ matrix $G_\theta^{I}$ by
$$G_\theta^{I} = E\big[u_\theta^{I} (u_\theta^{I})^T\big].$$
Then, for a fixed conditional probability $q(e \mid x)$, the asymptotic variance of any stochastic local Z-estimator $\hat{\theta}$ satisfies the inequality
$$\lim_{m \to \infty} m \cdot V[\hat{\theta}] \succeq (G_\theta^{I})^{-1}$$
in the sense of non-negative definiteness. Equality is attained by the Z-estimator using $u_\theta^{I}$.
Proof. 
Let us compute each matrix in (9). According to the above argument, without loss of generality, we assume $f_\theta = u_\theta^{I} + a_\theta$ for $a_\theta \in (T_{\theta,q}^{A})^d$. The matrix $E_{\theta,q}[u_\theta f_\theta^T]$ is then expressed as
$$E_{\theta,q}[u_\theta f_\theta^T] = E_{\theta,q}\big[(u_\theta^{I} + E_{\theta,q}[u_\theta \mid e])(u_\theta^{I} + a_\theta)^T\big] = E_{\theta,q}\big[u_\theta^{I} (u_\theta^{I})^T\big] = G_\theta^{I}$$
due to $u_\theta^{I} \in (T_{\theta,q}^{I})^d$, $E[u_\theta \mid e] \in (T_{\theta,q}^{E})^d$, and $a_\theta \in (T_{\theta,q}^{A})^d$. Let us define $A = E[a_\theta a_\theta^T]$. Then, we have
$$E_{\theta,q}[f_\theta f_\theta^T] = G_\theta^{I} + A.$$
As a result, we obtain
$$\lim_{m \to \infty} m \cdot V[\hat{\theta}] = (G_\theta^{I})^{-1}(G_\theta^{I} + A)(G_\theta^{I})^{-1} = (G_\theta^{I})^{-1} + (G_\theta^{I})^{-1} A\, (G_\theta^{I})^{-1} \succeq (G_\theta^{I})^{-1}.$$
When $f_\theta = u_\theta^{I}$, the matrix $A$ becomes the null matrix and the minimum asymptotic variance is attained. ☐
The minimum variance of stochastic local Z-estimators is attained by the efficient score. This conclusion agrees with the result on asymptotically efficient estimators in semiparametric models including nuisance parameters [19,22].
Remark 1. 
Let us consider the relation between the stochastic pseudo-likelihood $\nabla \log p_\theta(x_k \mid x_{-k})$ and the efficient score $u_\theta^{I}(x, e_{x,k})$. Suppose that the neighborhood system $\mathcal{N}_x$ and the conditional distribution $q(e \mid x)$ on $\mathcal{N}_x$ are defined as shown in Example 3. Then, we have $u_\theta^{I}(x, e_{x,k}) = \nabla \log p_\theta(x_k \mid x_{-k})$. Likewise, we find that the reduced-SCL, $\nabla \log p_\theta(x_{A_\ell} \mid x_{B_\ell})$, is equivalent to the efficient score under the setup in Example 4 when the index subset $A_\ell$ is defined as $B_\ell^c$.

4.3. Monotonicity of Asymptotic Efficiency

As described in [23], for the composite likelihood estimator with the index pairs $(A_\ell, B_\ell)$, $\ell = 1,\ldots,M$, it is widely believed that by increasing the size of $A_\ell$ (and correspondingly decreasing the size of $B_\ell = A_\ell^c$), one can capture more dependency relations in the model and increase the accuracy. For stochastic local Z-estimators, we can obtain the exact relation between the neighborhood system and asymptotic efficiency.
Let us consider two stochastic local Z-estimators; one is defined by $q(e \mid x)$ on the neighborhood system $e \in \mathcal{N}_x$, and the other is given by $q'(e' \mid x)$ on the neighborhood system $e' \in \mathcal{N}'_x$. The efficient scores are respectively written as $u_\theta^{I}(x,e)$ for $q(e \mid x)$ and $u_\theta^{I\prime}(x,e')$ for $q'(e' \mid x)$. In addition, let us define $G_\theta^{I} = E_{\theta,q}[u_\theta^{I} (u_\theta^{I})^T]$ and $G_\theta^{I\prime} = E_{\theta,q'}[u_\theta^{I\prime} (u_\theta^{I\prime})^T]$.
Theorem 2. 
Let $p(x, e, e')$ be the joint probability of $(x, e, e') \in \mathcal{X} \times \mathcal{N} \times \mathcal{N}'$, and suppose that the probability functions $q(e \mid x)$, $q'(e' \mid x)$, and $p_\theta(x)$ are obtained as marginals and conditionals of $p(x, e, e')$. We assume that
$$E\big[E[u_\theta \mid e] \,\big|\, e'\big] = E[u_\theta \mid e'] \qquad (10)$$
holds under the probability distribution $p(x, e, e')$. Then, we have
$$(G_\theta^{I})^{-1} \succeq (G_\theta^{I\prime})^{-1}, \qquad (11)$$
i.e., the efficiency bound of $\mathcal{N}'_x$ and $q'(e' \mid x)$ is smaller than or equal to that of $\mathcal{N}_x$ and $q(e \mid x)$.
Proof. 
We use the basic formula of the conditional variance
$$V[X] = V\big[E[X \mid Z]\big] + E\big[V[X \mid Z]\big] \succeq V\big[E[X \mid Z]\big] \qquad (12)$$
for random variables $X$ and $Z$. The above formula is applied to the score $u_\theta(x)$ and the efficient score $u_\theta^{I}(x,e)$. Note that $E[V[u_\theta \mid e]] = E[u_\theta^{I} (u_\theta^{I})^T] = G_\theta^{I}$ holds. Then, we have
$$V[u_\theta] = V\big[E[u_\theta \mid e]\big] + E\big[V[u_\theta \mid e]\big] = V\big[E[u_\theta \mid e]\big] + E\big[u_\theta^{I} (u_\theta^{I})^T\big] = V\big[E[u_\theta \mid e]\big] + G_\theta^{I} = V\big[E[u_\theta \mid e']\big] + G_\theta^{I\prime}.$$
The last equality comes from the fact that the score $u_\theta(x)$ is common to both setups. Since the equality (10) holds, applying the formula (12) again with $X = E[u_\theta \mid e]$ and $Z = e'$ yields
$$V\big[E[u_\theta \mid e]\big] = V\big[E[u_\theta \mid e']\big] + E\big[V[E[u_\theta \mid e] \mid e']\big] \succeq V\big[E[u_\theta \mid e']\big].$$
Thus, we obtain
$$G_\theta^{I} = V[u_\theta] - V\big[E[u_\theta \mid e]\big] \preceq V[u_\theta] - V\big[E[u_\theta \mid e']\big] = G_\theta^{I\prime}.$$
As a result, we have (11). ☐
A similar inequality is derived in [24] for the mutual Fisher information. The mutual Fisher information is closer to $V[E[u_\theta \mid e]]$ than to $G_\theta^{I}$. Theorem 13 of [24] corresponds to the one-dimensional version of the inequality $V[E[u_\theta \mid e]] \succeq V[E[u_\theta \mid e']]$.
Let us show an example satisfying (10). We define two neighborhood systems $\mathcal{N} = \{\mathcal{N}_x \mid x \in \mathcal{X}\}$ and $\mathcal{N}' = \{\mathcal{N}'_x \mid x \in \mathcal{X}\}$ such that, for any $e \in \mathcal{N}_x$, there exists $e' \in \mathcal{N}'_x$ satisfying $e \subset e'$. For the joint probability $p(x, e, e')$, suppose that $x$ and $e'$ are conditionally independent given $e$, and that the conditional probability $r(e' \mid e)$ derived from $p(x, e, e')$ is equal to zero unless $e \subset e'$. Under these conditions, $q'(e' \mid x)$ derived from $p(x, e, e')$ takes the value 0 if $e' \notin \mathcal{N}'_x$. The conditional independence ensures that $p(x, e, e')$ is expressed as $p(x, e, e') = p_\theta(x)\, q(e \mid x)\, r(e' \mid e) = p(x \mid e)\, r(e' \mid e)\, q(e)$. Hence, the conditional probability $p(x \mid e')$ is expressed as $\sum_{e \in \mathcal{N}} p(x \mid e)\, p(e \mid e')$. Thus, we obtain
$$E\big[E[u_\theta \mid e] \,\big|\, e'\big] = \sum_{e \in \mathcal{N}} \sum_{x \in \mathcal{X}} u_\theta(x)\, p(x \mid e)\, p(e \mid e') = \sum_{x \in \mathcal{X}} u_\theta(x) \sum_{e \in \mathcal{N}} p(x \mid e)\, p(e \mid e') = \sum_{x \in \mathcal{X}} u_\theta(x)\, p(x \mid e').$$
As a result, the better efficiency bound is obtained by the larger neighborhood. A similar result is presented in [25] for composite likelihood estimators. The relation between the result in [25] and ours is explained in Section 5.3 of this paper.
Example 5. 
Let $\mathcal{N}_x$ be a neighborhood system at $x$ endowed with the conditional distribution $q(e \mid x)$. Another neighborhood system is defined as $\mathcal{N}'_x = \{\mathcal{X}\}$ for all $x$, with $q'(e' \mid x) = 1$ for $e' = \mathcal{X}$. Let us define $p(x, e, e') = p_\theta(x)\, q(e \mid x)$ for $e' = \mathcal{X}$ and $p(x, e, e') = 0$ otherwise. Since $e'$ always takes the value $\mathcal{X}$, $x$ and $e'$ are conditionally independent given $e$. Thus, we have $G_\theta^{I} \preceq G_\theta^{I\prime}$. Indeed, $G_\theta^{I\prime}$ is the Fisher information matrix of the model $p_\theta(x)$.
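This dominance can be illustrated numerically. The 2-bit Boltzmann-type model, the fixed coupling `W`, and the parameterization by `b` below are our own toy choices: for the stochastic pseudo-likelihood with uniform $q$, the matrix $G^{I}$ is dominated by the Fisher information in the positive semidefinite order.

```python
import numpy as np
from itertools import product

n = 2
W = np.array([[0.0, 0.8], [0.8, 0.0]])       # fixed coupling (toy choice)
b = np.array([0.3, -0.5])
X = np.array(list(product([0, 1], repeat=n)), dtype=float)
logw = np.array([0.5 * x @ W @ x + b @ x for x in X])
p = np.exp(logw - logw.max()); p /= p.sum()

# Score with respect to b: u(x) = x - E[x]; Fisher = E[u u^T].
U = X - p @ X
fisher = (U * p[:, None]).T @ U

def cond1(x, k):
    """p_theta(x_k = 1 | x_{-k}) from unnormalized weights."""
    z0, z1 = x.copy(), x.copy()
    z0[k], z1[k] = 0.0, 1.0
    w0 = np.exp(0.5 * z0 @ W @ z0 + b @ z0)
    w1 = np.exp(0.5 * z1 @ W @ z1 + b @ z1)
    return w1 / (w0 + w1)

# Efficient score u^I(x, e_{x,k}) has only its k-th entry nonzero,
# x_k - p(x_k = 1 | x_{-k}); with q(e_{x,k} | x) = 1/n, G^I is diagonal.
G = np.zeros((n, n))
for k in range(n):
    G[k, k] = sum(pi * (x[k] - cond1(x, k)) ** 2 for x, pi in zip(X, p)) / n

print(np.linalg.eigvalsh(fisher - G).min() >= -1e-12)   # -> True
```

Here `fisher - G` equals $V[E[u_\theta \mid e]]$, a covariance matrix, which is why the difference is positive semidefinite.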
We compare the stochastic pseudo-likelihood and the reduced-SCL. Let $\mathcal{N}_x = \{e_{x,k} \mid k = 1,\ldots,n\}$ be the neighborhood system defined in Example 3, and let $\mathcal{N} = \bigcup_{x \in \mathcal{X}} \mathcal{N}_x$. The conditional distribution on $\mathcal{N}_x$ is given by $q(e_{x,k} \mid x) = q_k$, $k = 1,\ldots,n$. As shown in Remark 1, the corresponding efficient score is nothing but the stochastic pseudo-likelihood, i.e., $u_\theta^{I}(x, e_{x,k}) = \nabla \log p_\theta(x_k \mid x_{-k})$. Let us define another neighborhood system $\mathcal{N}'_x$ in the same way as Example 4. For the subsets $B_\ell \subset \{1,\ldots,n\}$ and $A_\ell = B_\ell^c$, $\ell = 1,\ldots,M$, we define $e'_{x,\ell} = \{ y \in \mathcal{X} \mid y_{B_\ell} = x_{B_\ell} \}$ and $\mathcal{N}'_x = \{ e'_{x,\ell} \mid \ell = 1,\ldots,M \}$. Let $\mathcal{N}'$ be $\bigcup_{x \in \mathcal{X}} \mathcal{N}'_x$. The conditional distribution on $\mathcal{N}'_x$ is given as $q'(e'_{x,\ell} \mid x) = q'_\ell$ for $\ell = 1,\ldots,M$. Then, the efficient score associated with $\mathcal{N}'$ and $q'$ is equal to the reduced-SCL, i.e., $u_\theta^{I\prime}(x, e'_{x,\ell}) = \nabla \log p_\theta(x_{A_\ell} \mid x_{B_\ell})$. As a direct conclusion of Theorem 2 and the above argument on the conditional independence between $x$ and $e' \in \mathcal{N}'$ given $e \in \mathcal{N}$, we obtain the following corollary.
Corollary 1. 
We define $\mathcal{N}'_e$ for $e \in \mathcal{N}$ by $\mathcal{N}'_e = \{ e' \in \mathcal{N}' \mid e \subset e' \}$. Let $r(e' \mid e)$ be a conditional probability on $\mathcal{N}'_e$ given $e \in \mathcal{N}$, where $r(e' \mid e) = 0$ is assumed for $e' \notin \mathcal{N}'_e$. If the equality $q'_\ell = \sum_{k=1}^{n} q_k\, r(e'_{x,\ell} \mid e_{x,k})$ holds, the reduced-SCL with $\mathcal{N}'$ and $q'$ is more efficient than the stochastic pseudo-likelihood with $\mathcal{N}$ and $q$.
Example 6. 
Suppose that the size of $\mathcal{N}'_e$ is the same for all $e \in \mathcal{N}$, and that the size of the set $\{ e \in \mathcal{N} \mid x \in e \subset e' \}$ is the same for any $x \in \mathcal{X}$ and $e' \in \mathcal{N}'$ such that $x \in e'$. Let $q(e \mid x)$ (resp. $q'(e' \mid x)$) be the uniform distribution on $\mathcal{N}_x$ (resp. $\mathcal{N}'_x$). Then, the reduced-SCL is more efficient than the stochastic pseudo-likelihood. Indeed, the assumption ensures that the sum $\sum_{e \in \mathcal{N}} q(e \mid x)\, r(e' \mid e)$ does not depend on $x$ and $e'$. Thus, the uniform distribution $q'(e' \mid x)$ meets the condition of the above corollary. For example, let $B_1,\ldots,B_M$ be all subsets of size $n-2$ in $\{1,\ldots,n\}$. Then, we have $M = n(n-1)/2$. The size of $\mathcal{N}'_e$ is $n-1$, and the size of $\{ e \in \mathcal{N} \mid x \in e \subset e' \}$ is equal to 2.

5. Local Z-Estimators and Efficiency Bounds

In this section, we define the local Z-estimator as the expectation of a stochastic local Z-estimator, and derive its efficiency bound.

5.1. Local Z-Estimators

Computationally tractable estimators such as the pseudo-likelihood and composite likelihood are obtained as the expectation of an identification function in $T_{\theta,q}^{L}$. Let us define the local Z-estimator as the Z-estimator using
$$\bar{f}_\theta(x) = E_{\theta,q}[f_\theta \mid x],$$
where $f_\theta \in (T_{\theta,q}^{L})^d$. The conditional expectation given $x$ is regarded as the projection onto the subspace $T_{\theta,q}^{X}$, which is defined as
$$T_{\theta,q}^{X} = \{ f \in T_{\theta,q} \mid f(x,e) \text{ does not depend on } e \}.$$
Let $\Pi_X$ be the projection operator onto $T_{\theta,q}^{X}$, and let $\Pi_X^{\perp}$ be the projection onto the orthogonal complement of $T_{\theta,q}^{X}$. Then, one can prove $\Pi_X[f] = E[f \mid x]$ and $\Pi_X^{\perp}[f] = f - E[f \mid x]$ for $f \in T_{\theta,q}$. When the number of elements in the neighborhood system $\mathcal{N}_x$ is moderate, the computation of the local Z-estimator is tractable.
Below, we show that some estimators are expressed as the local Z-estimator.
Example 7 (Pseudo-likelihood and composite likelihood). 
In the setup of Example 3, the conditional expectation of the efficient score, $E_{\theta,q}[u_\theta^{I} \mid x]$, yields the pseudo-likelihood when $q(e \mid x)$ is the uniform distribution on $\mathcal{N}_x$. In the setup of Example 4, let us assume $A_\ell = B_\ell^c$ and $q(e_{x,\ell} \mid x) = q_\ell$. Then, the conditional expectation of the efficient score $u_\theta^{I}(x, e_{x,\ell}) = \nabla \log p_\theta(x_{A_\ell} \mid x_{B_\ell})$ yields
$$E_{\theta,q}[u_\theta^{I} \mid x] = \sum_{\ell=1}^{M} q_\ell\, \nabla \log p_\theta(x_{A_\ell} \mid x_{B_\ell}),$$
which is the general form of the composite likelihood in Example 2 with $\gamma_\ell = q_\ell$.

5.2. Efficiency Bounds

We derive the efficiency bound of the local Z-estimator. Without loss of generality, the local Z-estimator $\bar{f}_\theta(x) \in (T_{\theta,q}^{X})^d$ is represented as
$$\bar{f}_\theta(x) = E[f_\theta \mid x], \qquad f_\theta = u_\theta^{I} + a_\theta \in (T_{\theta,q}^{L})^d, \qquad a_\theta \in (T_{\theta,q}^{A})^d.$$
Under the model $p_\theta(x)$, we calculate the asymptotic variance (9) of the local Z-estimator $\hat{\theta}$ using $\bar{f}_\theta(x)$. The matrix $E_{\theta,q}[u_\theta \bar{f}_\theta^T]$ in (9) is given as
$$E_{\theta,q}[u_\theta \bar{f}_\theta^T] = E_{\theta,q}\big[u_\theta (u_\theta^{I} + a_\theta)^T\big] = E\big[u_\theta^{I} (u_\theta^{I})^T\big] = G_\theta^{I}.$$
Hence, we have
lim m m · V [ θ ^ ] = ( G θ I ) 1 E θ , q [ f ¯ θ f ¯ θ T ] ( G θ I ) 1 .
Here, the expectation $E_{\theta,q}[\bar{f}_\theta \bar{f}_\theta^T]$ can be written as the expectation under $p_\theta(x)$, i.e., $E_\theta[\bar{f}_\theta \bar{f}_\theta^T]$, since $u_\theta$ and $\bar{f}_\theta$ depend only on $x$. The orthogonal decomposition $f_\theta = \bar{f}_\theta + \Pi_X^{\perp}[f_\theta]$ leads to
$$(G_\theta^I)^{-1} E_{\theta,q}[f_\theta f_\theta^T] (G_\theta^I)^{-1} = (G_\theta^I)^{-1} E_{\theta,q}[\bar{f}_\theta \bar{f}_\theta^T] (G_\theta^I)^{-1} + (G_\theta^I)^{-1} E_{\theta,q}\big[\Pi_X^{\perp}[f_\theta]\, \Pi_X^{\perp}[f_\theta]^T\big] (G_\theta^I)^{-1} \succeq (G_\theta^I)^{-1} E_{\theta,q}[\bar{f}_\theta \bar{f}_\theta^T] (G_\theta^I)^{-1},$$
meaning that the asymptotic variance of the stochastic local Z-estimator using $f_\theta(x,e)$ is larger than or equal to that of the local Z-estimator using $\bar{f}_\theta(x)$.
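The inequality above ultimately rests on the fact that conditioning can only shrink the second-moment matrix: $E[\bar{f}_\theta \bar{f}_\theta^T] \preceq E[f_\theta f_\theta^T]$, because the cross terms between $\bar{f}_\theta$ and the residual vanish. A quick numerical check on a made-up finite joint distribution (our own toy example, not data from the paper):

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy joint p(x, e): 4 x-states, 3 e-states; f is a 2-dimensional function.
p = rng.random((4, 3))
p /= p.sum()
f = rng.standard_normal((2, 4, 3))

p_x = p.sum(axis=1)
fbar = np.einsum('xe,dxe->dx', p, f) / p_x          # E[f | x], per component

M_full = np.einsum('xe,dxe,cxe->dc', p, f, f)       # E[f f^T]
M_bar = np.einsum('x,dx,cx->dc', p_x, fbar, fbar)   # E[fbar fbar^T]

# The gap equals E[(f - fbar)(f - fbar)^T], hence positive semidefinite.
gap = M_full - M_bar
print(np.linalg.eigvalsh(gap))   # all eigenvalues >= 0 (up to rounding)
```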
We consider the optimal choice of $a_\theta \in (T_{\theta,q}^A)^d$ in $\bar{f}_\theta(x) = E_{\theta,q}[u_\theta^I + a_\theta \mid x]$. Let us define the subspace $T_{\theta,q}^{XA}$ as $\Pi_X T_{\theta,q}^A = \{ \Pi_X[a] \mid a \in T_{\theta,q}^A \}$, and let $\Pi_X^A$ be the projection operator onto $T_{\theta,q}^{XA}$. Then, we define $v_{\theta,j}^I(x) \in T_{\theta,q}^X$ as the projection of $u_{\theta,j}^I(x,e) \in T_{\theta,q}^I$ onto the orthogonal complement of $T_{\theta,q}^{XA}$ in $T_{\theta,q}^X$, i.e.,
$$v_{\theta,j}^I = (\Pi_X - \Pi_X^A)[u_{\theta,j}^I]$$
for $j = 1, \ldots, d$. In this paper, we call $v_\theta^I = (v_{\theta,1}^I, \ldots, v_{\theta,d}^I)^T$ the local efficient score.
Theorem 3. 
Let us define the $d \times d$ matrix $H_\theta^I$ as $E_{\theta,q}[v_\theta^I (v_\theta^I)^T]$. Then, the efficiency bound of the local Z-estimator $\hat{\theta}$ is given as
$$\lim_{m \to \infty} m \cdot V[\hat{\theta}] \succeq (G_\theta^I)^{-1} H_\theta^I (G_\theta^I)^{-1}.$$
The equality is attained by the local Z-estimator using the local efficient score $v_\theta^I = (\Pi_X - \Pi_X^A)[u_\theta^I]$.
Proof. 
$\bar{f}_\theta(x) = E_{\theta,q}[u_\theta^I + a_\theta \mid x]$ has the orthogonal decomposition $v_\theta^I + b_\theta$, where $b_\theta \in (T_{\theta,q}^{XA})^d$. Hence, we obtain $E_{\theta,q}[\bar{f}_\theta \bar{f}_\theta^T] \succeq E_{\theta,q}[v_\theta^I (v_\theta^I)^T] = H_\theta^I$ and
$$(G_\theta^I)^{-1} E_{\theta,q}[\bar{f}_\theta \bar{f}_\theta^T] (G_\theta^I)^{-1} \succeq (G_\theta^I)^{-1} H_\theta^I (G_\theta^I)^{-1}.$$
The left-hand side of the above inequality is the asymptotic variance of the local Z-estimator. The equality is attained by the local Z-estimator using $v_\theta^I$. ☐
We consider the relation between the local efficient score $v_\theta^I(x)$ and the score $u_\theta(x)$. We define $T_{\theta,q}^{ML}$ as the subspace spanned by the scores $u_{\theta,j}(x)$, $j = 1, \ldots, d$. For any $a \in T_{\theta,q}^A$, we have
$$E_{\theta,q}\big[u_{\theta,j}\, E_{\theta,q}[a \mid x]\big] = E_{\theta,q}[u_{\theta,j}\, a] = 0, \quad j = 1, \ldots, d,$$
meaning that $T_{\theta,q}^{ML}$ and $T_{\theta,q}^{XA}$ are orthogonal to each other. Hence, $T_{\theta,q}^X$ is decomposed into
$$T_{\theta,q}^X = T_{\theta,q}^{XA} \oplus T_{\theta,q}^{ML} \oplus T_{\theta,q}^{XC},$$
where $T_{\theta,q}^{XC}$ is the orthogonal complement of $T_{\theta,q}^{XA} \oplus T_{\theta,q}^{ML}$ in $T_{\theta,q}^X$. Eventually, the subspaces of $T_{\theta,q}$ satisfy the following relations:
$$T_{\theta,q} = T_{\theta,q}^E \oplus T_{\theta,q}^I \oplus T_{\theta,q}^A, \qquad T_{\theta,q}^X = (\Pi_X T_{\theta,q}^A) \oplus T_{\theta,q}^{ML} \oplus T_{\theta,q}^{XC}.$$
Let us define $T_{\theta,q}^{XI}$ as the subspace spanned by the local efficient scores $v_{\theta,j}^I(x)$, $j = 1, \ldots, d$. Under a mild assumption, $T_{\theta,q}^{XI}$ and $T_{\theta,q}^{ML}$ have the same dimension. Since $v_\theta^I(x)$ is orthogonal to $\Pi_X T_{\theta,q}^A$, $T_{\theta,q}^{XI}$ is included in $T_{\theta,q}^{ML} \oplus T_{\theta,q}^{XC}$. Hence, $T_{\theta,q}^{XC}$ is interpreted as the subspace expressing the information loss caused by the localization of the score $u_\theta$.

5.3. Relation to Existing Works

5.3.1. Comparison of Local Z-Estimators

We compare the asymptotic variances of two local Z-estimators that are connected to composite likelihoods.
One estimator is defined from a neighborhood system $N'$ that consists of singletons $N'_x = \{e'_x\}$, $x \in \mathcal{X}$. Here, we assume that $e'_x = e'_{x'}$ holds for $x' \in e'_x$ and that $\bigcup_{x \in \mathcal{X}} e'_x = \mathcal{X}$. Such a neighborhood system $N'$ is called an equivalence class [25]. An equivalence class corresponds to a partition of the sample space. The conditional probability $q'(e'|x)$ takes the value 1 for $e' = e'_x$ and 0 otherwise. Let $u_\theta^{I'}(x, e')$ be the efficient score defined from $N'$ and $q'(e'|x)$, and let $\bar{u}_\theta^{I'}(x) = E_{\theta,q'}[u_\theta^{I'} \mid x]$ be the corresponding local Z-estimator.
The other localized estimator is defined from a neighborhood system $N$ that consists of $N_x$, $x \in \mathcal{X}$, where $N_x$ is not necessarily a singleton. Suppose that $e \subset e'_x$ holds for any $e \in N_x$. The conditional probability $q(e|x)$ is defined as $q(e|x) = r(e|e'_x)$, where $r(e|e'_x)$ is a conditional probability of $e \in N_x$ given $e'_x$. The corresponding efficient score is denoted by $u_\theta^I(x,e)$, and let us define $\bar{u}_\theta^I(x) = E_{\theta,q}[u_\theta^I \mid x]$ as the local Z-estimator associated with $N$ and $q(e|x)$.
From the definition, the joint probability $p_\theta(x)\, q'(e'|x)\, r(e|e')$ has the marginals $q'(e'|x)$ and $q(e|x)$, and we see that $x$ and $e$ are conditionally independent given $e'$. Hence, Theorem 2 guarantees the inequality $(G_\theta^I)^{-1} \preceq (G_\theta^{I'})^{-1}$.
The efficient score $u_\theta^{I'}(x, e')$ can take a non-zero value only when $e' = e'_x$. Hence, $u_\theta^{I'}(x, e')$ is regarded as a function of $x$, i.e., $\bar{u}_\theta^{I'}(x) = u_\theta^{I'}(x, e'_x) \in (T_{\theta,q'}^X)^d$, and the asymptotic variance of the local Z-estimator obtained from $\bar{u}_\theta^{I'}(x)$ is $(G_\theta^{I'})^{-1}$. On the other hand, the asymptotic variance of the local Z-estimator derived from $\bar{u}_\theta^I(x)$ is less than or equal to $(G_\theta^I)^{-1}$ due to (13). Therefore, $\bar{u}_\theta^I$ with $N$ and $q$ provides a more efficient estimator than $\bar{u}_\theta^{I'}$ with $N'$ and $q'$.
Liang and Jordan presented a similar result in [25]. In their setup, the larger neighborhood $N'_x$ is a singleton $\{e'_x\}$, and the smaller one, $N_x$, can contain multiple neighborhoods at each $x$. In such a case, a similar relation holds, i.e., the estimator with $N$ is more efficient. However, their approach is different from ours. In [25], the randomness is introduced over the patterns of the partition of $\mathcal{X}$. Moreover, their identification function corresponding to our $\bar{u}_\theta^I(x)$ is decomposed into two terms: one is the term conditioned on the partition, and the other is its orthogonal complement. On the other hand, our approach uses the decomposition of $u_\theta^I(x,e)$ into $\bar{u}_\theta^I(x)$ and its orthogonal complement. In their analysis, the simplified expression of the asymptotic variance shown in (9) and the standard expression of the identification function, $f(x,e) = u_\theta^I(x,e) + a(x,e)$, are not used. Hence, the evaluation of the asymptotic variance yields a rather complex dependency on the estimator. As a result, their approach does not show the efficiency bound, though the asymptotic variance of the composite likelihood for exponential families is presented under the misspecified setup.

5.3.2. Closed Exponential Families

The so-called closed exponential family has an interesting property from the viewpoint of localized estimators, as presented in [26]. Let $p_\theta(x) = \exp\{\theta^T t(x) - c(\theta)\}$ be the exponential family defined for $x = (x_1, \ldots, x_n) \in \mathcal{X} = \mathcal{X}_1 \times \cdots \times \mathcal{X}_n$ with the parameter $\theta \in \Theta \subset \mathbb{R}^d$. The function $t(x) \in \mathbb{R}^d$ is referred to as the sufficient statistic. Given disjoint index subsets $A, B \subset \{1, \ldots, n\}$, let $t_B(x)$ be all elements of $t(x)$ that depend only on $x_B$, and $t_{A,B}(x)$ be the other elements. Hence, $t_B(x)$ is expressed as $t_B(x_B)$. The parameter $\theta$ is correspondingly decomposed into $\theta = (\theta_{A,B}, \theta_B)$. Thus, we have $\theta^T t(x) = \theta_{A,B}^T t_{A,B}(x) + \theta_B^T t_B(x_B)$. The exponential family $p_\theta(x)$ is called a closed exponential family when the marginal distribution of $x_B$ is expressed as an exponential family with the sufficient statistic $t_B(x_B)$.
We consider the composite likelihood of the closed exponential family. For the pairs of disjoint index subsets $\{A_\ell, B_\ell\}$, $\ell = 1, \ldots, M$, suppose that any element of $t(x)$ is included in $t_{A_\ell,B_\ell}(x)$ for at least one $\ell$. Then, the local Z-estimator using the composite likelihood $\sum_{\ell=1}^{M} \log p_\theta(x_{A_\ell} | x_{B_\ell})$ is identical to the MLE [26]. Hence, the composite likelihood of the closed exponential family attains the efficiency bound of the MLE.
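As a minimal numerical illustration (our own toy example, not one from [26]), consider the two-spin Ising model $p_\theta(x_1, x_2) \propto \exp(\theta x_1 x_2)$ with $x_i \in \{-1, +1\}$, which can be viewed as a closed exponential family with sufficient statistic $x_1 x_2$. The composite likelihood $\log p_\theta(x_1 | x_2) + \log p_\theta(x_2 | x_1)$ (here, the pseudo-likelihood) then has the same maximizer as the full likelihood, which the sketch below checks by a grid search:

```python
import numpy as np

rng = np.random.default_rng(4)
s = rng.choice([-1.0, 1.0], size=200)   # observed products x1 * x2

# MLE: E_theta[x1 x2] = tanh(theta), so the MLE solves tanh(theta) = mean(s).
theta_mle = np.arctanh(s.mean())

# Composite (pseudo-) likelihood per sample:
#   log p(x1 | x2) + log p(x2 | x1) = -2 * log(1 + exp(-2 * theta * s))
grid = np.linspace(-2.0, 2.0, 40001)
pll = -2.0 * np.log1p(np.exp(-2.0 * grid[:, None] * s[None, :])).sum(axis=1)
theta_pll = grid[pll.argmax()]

print(theta_mle, theta_pll)   # agree up to the grid step of 1e-4
```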
For the general statistical model $p_\theta(x)$, let us restate the above result in terms of the tangent spaces in $T_{\theta,q}$. Let us decompose $p_\theta(x)$ into
$$p_\theta(x) = p(x_{A_\ell} | x_{B_\ell}; \theta)\, p(x_{B_\ell}; \theta).$$
We assume that for any index subset $B_\ell$, all elements of $\nabla \log p(x_{B_\ell}; \theta)$ are included in $T_{\theta,q}^{ML}$, which is spanned by the elements of $u_\theta(x) = \nabla \log p_\theta(x)$. Then, $\nabla \log p(x_{A_\ell} | x_{B_\ell}; \theta)$ also lies in $(T_{\theta,q}^{ML})^d$. Thus, $\nabla \log p(x_{A_\ell} | x_{B_\ell}; \theta)$ is expressed as $C_\ell \nabla \log p_\theta(x)$ using a $d \times d$ matrix $C_\ell$. If $\sum_{\ell=1}^{M} C_\ell$ is invertible, the local Z-estimator obtained by $\sum_{\ell=1}^{M} \nabla \log p_\theta(x_{A_\ell} | x_{B_\ell})$ is identical to the MLE. In this case, $\Pi_X T_{\theta,q}^I = T_{\theta,q}^{ML}$, i.e., $T_{\theta,q}^{XI} = T_{\theta,q}^{ML}$ holds. Therefore, there is no information loss caused by the localization. The matrix $C_\ell$ for the closed exponential family is given as the projection matrix onto the subspace spanned by $t_{A_\ell,B_\ell}(x) - E_\theta[t_{A_\ell,B_\ell} | x_{B_\ell}]$, which is included in $T_{\theta,q}^{ML}$. The above result implies that the tangent space $T_{\theta,q}^{XC}$ expressing the information loss will be related to the score of the marginal distribution, $\nabla \log p_\theta(x_{B_\ell})$.

6. Conclusions

In this paper, we investigated statistical properties of stochastic local Z-estimators and local Z-estimators. The class of local Z-estimators includes the pseudo-likelihood and the composite likelihood. For stochastic local Z-estimators, we established the exact relation between neighborhood systems and the efficiency bound under a simple and general condition. In addition, the efficiency bound of local Z-estimators was presented.
Future work includes the study of a more general class of localized estimators. Indeed, local Z-estimators do not include the class of proper local scoring rules [7]. It is worthwhile to derive the efficiency bound for more general localized estimators. Exploring applications of the efficiency bound is another interesting direction of our study. In our setup, the local efficient score, expressed as a projection of the score, attains the efficiency bound among local Z-estimators. An important open problem is to develop a computationally tractable method to obtain the projection onto the tangent subspaces.

Conflicts of Interest

The author declares no conflict of interest.

Appendix A. Asymptotic Variance of Stochastic Local Z-Estimators

Given i.i.d. samples $(x_i, e_i)$, $i = 1, \ldots, m$, from $p_\theta(x) q(e|x)$, we estimate the parameter $\theta$ using the stochastic local Z-estimator obtained by
$$\frac{1}{m} \sum_{i=1}^{m} f(x_i, e_i; \hat{\theta}) = 0,$$
where the identification function satisfies $f_\theta \in (T_{\theta,q})^d$ for any $\theta \in \Theta \subset \mathbb{R}^d$. The Taylor expansion around the true parameter $\theta$ yields
$$\frac{1}{m} \sum_{i=1}^{m} f(x_i, e_i; \theta) + \frac{1}{m} \sum_{i=1}^{m} \nabla f(x_i, e_i; \theta)(\hat{\theta} - \theta) + O(\|\hat{\theta} - \theta\|^2) = 0,$$
where the element $(\nabla f)_{ij}$ is given as $\partial f_i / \partial \theta_j$. As $m$ tends to infinity, the asymptotic distribution of $\hat{\theta}$ is given as the multivariate normal distribution,
$$-E_{\theta,q}[\nabla f_\theta]\, \sqrt{m}(\hat{\theta} - \theta) \rightsquigarrow N_d\big(0,\, E_{\theta,q}[f_\theta f_\theta^T]\big).$$
Since $E_{\theta,q}[f_\theta] = 0$ holds for any $\theta$, the derivative $\nabla E_{\theta,q}[f_\theta]$ is the null matrix. This fact yields
$$E_{\theta,q}[\nabla f_\theta] = -E_{\theta,q}[f_\theta \nabla \log(p_\theta q)^T] = -E_{\theta,q}[f_\theta u_\theta^T].$$
Hence, the asymptotic distribution of $\sqrt{m}(\hat{\theta} - \theta)$ is the $d$-dimensional normal distribution with mean $0$ and variance $E_{\theta,q}[f_\theta u_\theta^T]^{-1} E_{\theta,q}[f_\theta f_\theta^T] E_{\theta,q}[u_\theta f_\theta^T]^{-1}$.
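The sandwich formula above can be checked by simulation. The sketch below is our own toy example with no local structure: the scalar identification function $f(x; \theta) = x - \theta$ for Bernoulli data, where $E[\partial f / \partial \theta] = -1$ and $E[f^2] = \theta_0(1 - \theta_0)$, so the sandwich variance reduces to $\theta_0(1 - \theta_0)$:

```python
import numpy as np

rng = np.random.default_rng(2)
theta0, m, trials = 0.3, 2000, 20000

# f(x; theta) = x - theta with x ~ Bernoulli(theta0): the Z-estimator
# solving sum_i (x_i - theta) = 0 is the sample mean.
estimates = rng.binomial(m, theta0, size=trials) / m
empirical = m * estimates.var()

# Sandwich value: E[df/dtheta] = -1 and E[f^2] = theta0 * (1 - theta0),
# so (-1)^{-1} * theta0 * (1 - theta0) * (-1)^{-1} = theta0 * (1 - theta0).
sandwich = theta0 * (1.0 - theta0)

print(empirical, sandwich)   # both close to 0.21
```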

References

1. Van der Vaart, A.W. Asymptotic Statistics; Cambridge University Press: Cambridge, UK, 2000.
2. Besag, J. Spatial interaction and the statistical analysis of lattice systems. J. R. Stat. Soc. Ser. B 1974, 36, 192–236.
3. Lindsay, B.G. Composite likelihood methods. Contemp. Math. 1988, 80, 220–239.
4. Varin, C.; Reid, N.; Firth, D. An overview of composite likelihood methods. Stat. Sin. 2011, 21, 5–42.
5. Hyvärinen, A. Connections between score matching, contrastive divergence, and pseudolikelihood for continuous-valued variables. IEEE Trans. Neural Netw. 2007, 18, 1529–1531.
6. Hyvärinen, A. Some extensions of score matching. Comput. Stat. Data Anal. 2007, 51, 2499–2512.
7. Dawid, A.P.; Lauritzen, S.; Parry, M. Proper local scoring rules on discrete sample spaces. Ann. Stat. 2012, 40, 593–608.
8. Kanamori, T.; Takenouchi, T. Graph-Based Composite Local Bregman Divergences on Discrete Sample Spaces. 2016; arXiv:1604.06568.
9. Lafferty, J.D.; McCallum, A.; Pereira, F.C.N. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning, ICML '01, San Francisco, CA, USA, 28 June–1 July 2001; Morgan Kaufmann Publishers Inc.: San Francisco, CA, USA, 2001; pp. 282–289.
10. Ackley, D.H.; Hinton, G.E.; Sejnowski, T.J. A learning algorithm for Boltzmann machines. Cognit. Sci. 1985, 9, 147–169.
11. Smolensky, P. Information processing in dynamical systems: Foundations of harmony theory. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1; MIT Press: Cambridge, MA, USA, 1986; pp. 194–281.
12. Welling, M.; Rosen-Zvi, M.; Hinton, G.E. Exponential family harmoniums with an application to information retrieval. In Advances in Neural Information Processing Systems 17; Saul, L.K., Weiss, Y., Bottou, L., Eds.; MIT Press: Cambridge, MA, USA, 2005; pp. 1481–1488.
13. Ising, E. Beitrag zur Theorie des Ferromagnetismus. Zeitschrift für Physik 1925, 31, 253–258. (In German)
14. Marlin, B.; de Freitas, N. Asymptotic efficiency of deterministic estimators for discrete energy-based models: Ratio matching and pseudolikelihood. In Uncertainty in Artificial Intelligence (UAI); AUAI Press: Corvallis, OR, USA, 2011.
15. Gneiting, T. Making and evaluating point forecasts. J. Am. Stat. Assoc. 2011, 106, 746–762.
16. Steinwart, I.; Pasin, C.; Williamson, R.C.; Zhang, S. Elicitation and identification of properties. In Proceedings of the 27th Conference on Learning Theory, COLT 2014, Barcelona, Spain, 13–15 June 2014; pp. 482–526.
17. Hyvärinen, A. Consistency of pseudolikelihood estimation of fully visible Boltzmann machines. Neural Comput. 2006, 18, 2283–2292.
18. Dawid, A.; Musio, M. Estimation of spatial processes using local scoring rules. AStA Adv. Stat. Anal. 2013, 97, 173–179.
19. Amari, S.; Kawanabe, M. Information geometry of estimating functions in semi-parametric statistical models. Bernoulli 1997, 3, 29–54.
20. Dillon, J.V.; Lebanon, G. Stochastic composite likelihood. J. Mach. Learn. Res. 2010, 11, 2597–2633.
21. Cichocki, A.; Amari, S. Families of alpha-, beta- and gamma-divergences: Flexible and robust measures of similarities. Entropy 2010, 12, 1532–1568.
22. Bickel, P.J.; Klaassen, C.A.J.; Ritov, Y.; Wellner, J.A. Efficient and Adaptive Estimation for Semiparametric Models; Springer-Verlag: New York, NY, USA, 1998.
23. Asuncion, A.U.; Liu, Q.; Ihler, A.T.; Smyth, P. Learning with blocks: Composite likelihood and contrastive divergence. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, Sardinia, Italy, 13–15 May 2010; Volume 9, pp. 33–40.
24. Zegers, P. Fisher information properties. Entropy 2015, 17, 4918–4939.
25. Liang, P.; Jordan, M.I. An asymptotic analysis of generative, discriminative, and pseudolikelihood estimators. In Proceedings of the 25th International Conference on Machine Learning, ICML '08, Helsinki, Finland, 5–9 July 2008; ACM: New York, NY, USA, 2008; pp. 584–591.
26. Mardia, K.V.; Kent, J.T.; Hughes, G.; Taylor, C.C. Maximum likelihood estimation using composite likelihoods for closed exponential families. Biometrika 2009, 96, 975–982.

Share and Cite

MDPI and ACS Style

Kanamori, T. Efficiency Bound of Local Z-Estimators on Discrete Sample Spaces. Entropy 2016, 18, 273. https://doi.org/10.3390/e18070273
