Nearest-neighbor Entropy Estimators with Weak Metrics

The problem of improving the accuracy of nonparametric entropy estimation for a stationary ergodic process is considered. New weak metrics are introduced, and relations between metrics, measures, and entropy are discussed. Based on weak metrics, a new nearest-neighbor entropy estimator is constructed; it has a parameter with which the estimator is optimized to reduce its bias. It is shown that the estimator's variance is upper-bounded by a nearly optimal Cramér-Rao lower bound.


1. Introduction. This study is concerned with improving the accuracy of estimation of the entropy (entropy rate) of information sources with a finite state space whose statistical characterization is unknown. Sequences of symbols, or strings, drawn from some finite alphabet appear in many applications where objects can be encoded into strings in natural ways. Such sequences are often viewed as realizations of stochastic processes, also known as "information sources". An important quantity characterizing an information source is its entropy (entropy rate). For a comprehensive review of previous work on entropy estimation see, for example, [6]. Most widely used are the so-called "nonparametric" entropy estimators. However, an analytical evaluation of the accuracy of those estimators is very difficult, and few results are known. Thus, in most published work on nonparametric entropy estimation, only asymptotic convergence to the entropy is proved and tested by computer simulation.
For a given data sample of size $n$, the most important characterization of an estimator $h_n$ is its efficiency (accuracy), or $L_2$-error $\mathbb{E}(h_n - h)^2$, where $h$ is the entropy of the source and $n$ is the number of observations. We recall the relation
$$\mathbb{E}(h_n - h)^2 = \bigl(\mathbb{E}(h_n - h)\bigr)^2 + \mathrm{Var}(h_n),$$
where the quantity $\mathbb{E}(h_n - h)$ is called the bias.
Most known nonparametric entropy estimators are based either on Lempel-Ziv compression or the nearest-neighbor method, see, for instance, their review in [6].

EVGENIY TIMOFEEV AND ALEXEI KALTCHENKO
Those estimators are shown to converge almost everywhere (for the most general results, see [6], [8]). Unfortunately, due to a slow convergence (slower than $O(1/\log n)$), their accuracy is not good for many practical applications with a relatively small sample size $n$ ($\log_2 n \le 30\text{--}40$). This motivates a search for estimators with a more rapid convergence.
Lempel-Ziv estimators are very hard to analyze, and evaluating their accuracy analytically is difficult. For example, the initial motivation for paper [1] was the desire to obtain asymptotic properties for an entropy estimation algorithm due to Ziv [13], but the authors show that calculation of the bias and the variance is very difficult. To date, there is no published work on a Lempel-Ziv estimator's bias or variance.
From now on, we will focus on nearest-neighbor estimators and briefly state the most important published results. We point out that it is often more convenient to estimate, instead of the entropy, its inverse quantity $1/h$. Two modifications of Grassberger's estimator [5] are proposed in [11]. In this paper's notation, they are written as $r_n^{(k,m)}(\rho)/\log n$ (see (7)) and $\eta_n^{(k,m)}(\rho)$ (see (16)), where $\rho$ is a metric.
For the estimator $r_n^{(k,m)}(\rho)/\log n$, $L_1$-convergence and a variance bound $O(n^{-c})$ are shown in [11] under a certain restriction on source measures. For metric (3) specifically, this measure restriction is relaxed (see (8)), and convergence almost everywhere is established in [6]. It is also shown in [6] that the variance bound $O(n^{-c})$ holds for any $c < 1$.
For the estimator $\eta_n^{(k,m)}(\rho)$, $L_1$-convergence is established in [11] under certain restrictions on metrics and source measures.
Computer simulation in [6] showed that estimator (16) with metric (3) is more efficient than the estimator $r_n^{(k,m)}(\rho)/\log n$. However, in a subsequent work [7], it was established for symmetric Bernoulli measures that the estimator's bias is a periodic function with a period proportional to $\log n$. In computer simulation, such a bias was difficult to detect because its amplitude was less than $10^{-6}$ for sources with a small entropy ($h < 3$).
In [12], the bias was also explicitly calculated for Markov measures and metric (3). This bias was equal to zero if the logarithms of the transition probabilities were rationally incommensurable. Otherwise, the bias was a periodic function with a period proportional to $\log n$. This result demonstrates a new obstacle in an estimator's analytical evaluation, namely, an estimator's bias can be a discontinuous function of the measure parameters.
The objective of this research is to construct a new estimator, based on an existing nearest-neighbor estimator and its modifications, that achieves efficiency $O(n^{-c})$ for some measures, where $c > 0$ is a constant. The main idea of this construction is as follows. A nearest-neighbor estimator is based on some metric. We introduce a wider class of so-called "weak" metrics [2], for which the triangle inequality holds with some constant $C > 1$. The new estimator then has a parameter which is a non-decreasing function. We expect that this function can be selected so as to reduce the bias. Specifically, we introduce a class of functions with one parameter, which we optimize to reduce the bias. It is shown that for symmetric Bernoulli measures there exists a parameter value for which the bias is asymptotically zero.
Our paper is organized as follows:
• In Section 3, we introduce new weak metrics and discuss a connection between metrics, measures, and entropy.
• In Section 4, we discuss a nearest-neighbor statistic and show that the statistic's variance is upper-bounded by $O(n^{-1})$ for a large class of measures and weak metrics.
• In Section 5, we introduce our new nearest-neighbor estimator (based on the statistic of Section 4) and its modifications and prove that this estimator is unbiased for symmetric Bernoulli measures.
2. Notation and Definitions. For our purposes, an information source, or stationary process, is a shift-invariant ergodic measure $\mu$ on the space $\Omega = A^{\mathbb{N}}$ of right-sided infinite sequences drawn from a finite alphabet $A$. Thus, an infinite random sequence generated by $\mu$ is viewed as a point in $\Omega$ chosen randomly with respect to $\mu$ and is denoted by $\xi = (\xi_1, \xi_2, \ldots)$. For a cylinder centered at $x \in \Omega$, $s = 1, 2, \ldots$, we use the notation
$$C_s(x) = \{\, y \in \Omega : y_i = x_i, \ 1 \le i \le s \,\}.$$
Let $\rho$ be a metric on $\Omega$.
We denote an open ball of radius $r$ centered at $x$ by $B(x, r, \rho) = \{\, y \in \Omega : \rho(x, y) < r \,\}$. To simplify the notation, it is convenient to write $B(x, r)$ for $B(x, r, \rho)$.
Let $\xi = (\xi_1, \xi_2, \ldots)$ be a point in $\Omega$ chosen randomly with respect to $\mu$. Recall that the entropy $h$ (entropy rate) of a measure $\mu$ is defined as
$$h = -\lim_{s \to \infty} \frac{1}{s}\, \mathbb{E} \log \mu\bigl(\{\, y \in \Omega : y_i = \xi_i, \ 1 \le i \le s \,\}\bigr);$$
here and throughout the paper, all logarithms are to base $e$, i.e., natural.
Problem Statement: Let $\mu$ be a shift-invariant ergodic probability measure on $\Omega = A^{\mathbb{N}}$. Let $\xi_0, \xi_1, \ldots, \xi_n$ be independent random variables taking values in $\Omega$ and identically distributed with the common law $\mu$. We want to estimate the entropy of the measure $\mu$.
In particular, for $\lambda(t) = 0$, $0 \le t < \infty$, we obtain the well-known metric (3). We stress that metric (2) is bi-Lipschitz equivalent to metric (3); therefore, according to [2], $\rho$ is a weak metric (or near-metric), i.e., the triangle inequality holds with some constant $C > 1$.
While each point $x$ has infinitely many coordinates, for any practical estimate calculation we need to limit the number of coordinates used. We do so by introducing a truncation of the metric that uses only the first $m$ coordinates of the points.
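As an illustration of such a truncation, here is a minimal Python sketch. It assumes the longest-common-prefix form $\rho(x, y) = e^{-k(x,y)}$ for the underlying metric, where $k(x, y)$ is the index of the first coordinate at which $x$ and $y$ disagree; the function names and this exact form of metric (3) are illustrative assumptions, not the paper's definitions:

```python
import math

def prefix_metric(x, y, m):
    """m-truncated version of a longest-prefix metric:
    rho_m(x, y) = exp(-k), where k is the 1-based index of the first
    coordinate where x and y disagree, capped at m.
    (Hypothetical form of metric (3); the paper's lambda-weighted
    weak metrics (2) generalize it.)"""
    for i in range(m):
        if x[i] != y[i]:
            return math.exp(-(i + 1))
    return math.exp(-m)  # points agree on all m observed coordinates

def in_ball(x, y, r, m):
    """Membership test for the open ball B(x, r) under the truncated metric."""
    return prefix_metric(x, y, m) < r
```

Only the first $m$ coordinates of each point are ever inspected, which is exactly what the truncation is meant to guarantee.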

4. Nearest-neighbor statistics. In this section, we consider a nonparametric statistic $r_n^{(k,m)}(\rho)$. This statistic is based on a sample of $n + 1$ independent points $\xi_0, \ldots, \xi_n$ in the space $\Omega$, chosen randomly with respect to $\mu$, and on the metric $\rho$ on $\Omega$, and is defined as follows:
where $\rho$ is metric (2) and $\min^{(k)}\{X_1, \ldots, X_N\}$ denotes the $k$-th smallest element of $\{X_1, \ldots, X_N\}$. We stress that this statistic uses only the first $m$ coordinates of the points $\xi_0, \ldots, \xi_n$.
Theorem 1 and Proposition 8 of [11] imply the following statement:
Proposition 2. Let $\xi_0, \ldots, \xi_n$ be $n + 1$ independent points in the space $\Omega$ chosen randomly with respect to $\mu$, and let $k = O(\log n)$; then the following limit holds:
Lemma 4.1. Let a measure $\mu$ satisfy the following condition:
then there exist constants $c_1, c_2$ such that the following inequality holds:
Proof. Arguing as in the proof of Theorem 1 in [11], we obtain
From an identity, we get
$$\iint \log r \; \mu(B(x, r, \rho))^{k-1} \bigl(1 - \mu(B(x, r, \rho))\bigr)^{n-k} \, d_r \mu(B(x, r, \rho)) \, d\mu(x).$$
Therefore, we have
Calculating the above integral, we obtain
If we set $c_2 > 1/a$, then inequality (9) follows from the above equality.
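To make the statistic concrete, the following Python sketch computes a plausible form of $r_n^{(k,m)}(\rho)$: the sample average of $-\log$ of the $k$-th smallest truncated distance from each point to the rest of the sample. Since definition (7) is not reproduced above, both this form and the metric used here are assumptions for illustration:

```python
import math

def truncated_prefix_metric(x, y, m):
    """Illustrative m-truncated metric: exp(-(first 1-based index where x, y disagree))."""
    for i in range(m):
        if x[i] != y[i]:
            return math.exp(-(i + 1))
    return math.exp(-m)

def kth_smallest(values, k):
    """min^{(k)}: the k-th smallest element (k = 1 gives the minimum)."""
    return sorted(values)[k - 1]

def r_statistic(points, k, m, metric=truncated_prefix_metric):
    """Sample average of -log(k-th nearest-neighbor distance) over the
    n + 1 points, using only their first m coordinates -- an assumed
    form of the statistic r_n^{(k,m)}(rho)."""
    total = 0.0
    for i, x in enumerate(points):
        dists = [metric(x, y, m) for j, y in enumerate(points) if j != i]
        total += -math.log(kth_smallest(dists, k))
    return total / len(points)
```

Because the assumed metric never vanishes (it is at least $e^{-m}$), the logarithm is always well defined even when two sample points coincide on their first $m$ coordinates.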
We introduce a function $f : \Omega^{n+1} \to \mathbb{R}$ defined as
In order to apply McDiarmid's method, we need to show that $f$ satisfies the inequality
$$\sup_{x_0, \ldots, x_n, y \in \Omega} \bigl| f(x_0, \ldots, x_i, \ldots, x_n) - f(x_0, \ldots, x_{i-1}, y, x_{i+1}, \ldots, x_n) \bigr| \le c_i$$
for all $0 \le i \le n$.
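For reference, McDiarmid's bounded-differences inequality, on which this method rests: if $f$ changes by at most $c_i$ when its $i$-th argument is changed, then for independent $\xi_0, \ldots, \xi_n$ and any $t > 0$,

```latex
\Pr\Bigl( \bigl| f(\xi_0,\dots,\xi_n) - \mathbb{E} f(\xi_0,\dots,\xi_n) \bigr| \ge t \Bigr)
  \;\le\; 2 \exp\!\left( -\frac{2t^2}{\sum_{i=0}^{n} c_i^2} \right).
```

Thus a bound on the differences $c_i$ immediately yields concentration of $f(\xi_0, \ldots, \xi_n)$ around its mean, which is how the variance bound is obtained.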
We now prove this inequality. For brevity, we introduce the following notation: let $X'$ denote the sample $X$ with its $i$-th point replaced by $y$, and let $J = \{\, j \ne i : g_j(X) \ne g_j(X') \,\}$, so that $g_j(X) = g_j(X')$ for $j \notin J$, $j \ne i$. Let us prove that $|J| \le km$.
Substituting (13) into the above inequality, we get
Substituting $r_n^{(k,m)}(\rho) = f(\xi_0, \ldots, \xi_n)$, we get (15).

5. Entropy Estimator. In this section, we consider a nonparametric estimator $\eta_n^{(k,m)}(\rho)$ for the inverse entropy $1/h$, where the metric $\rho$ is defined in (2). This estimator is based on a sample of $n + 1$ independent points $\xi_0, \ldots, \xi_n$ in the space $\Omega$, chosen randomly with respect to $\mu$, and metric (2) on $\Omega$, and is defined as follows:
where $r_n^{(k,m)}(\rho)$ is defined in (7).
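As an end-to-end illustration, the following Python sketch applies the ratio $r_n^{(k,m)}(\rho)/\log n$ (the first of the two estimators from the Introduction) to simulated data from a symmetric Bernoulli source on $\{0, 1\}$. The metric and the form of the statistic are illustrative assumptions; the estimate approaches $1/h = 1/\log 2 \approx 1.44$ only slowly, consistent with the slow convergence discussed in the Introduction:

```python
import math
import random

def truncated_prefix_metric(x, y, m):
    """Illustrative m-truncated metric: exp(-(first 1-based index where x, y disagree))."""
    for i in range(m):
        if x[i] != y[i]:
            return math.exp(-(i + 1))
    return math.exp(-m)

def estimate_inverse_entropy(points, k, m):
    """Assumed form of r_n^{(k,m)}(rho) / log(n) as an estimate of 1/h."""
    total = 0.0
    for i, x in enumerate(points):
        dists = [truncated_prefix_metric(x, y, m)
                 for j, y in enumerate(points) if j != i]
        total += -math.log(sorted(dists)[k - 1])
    r = total / len(points)   # the statistic r_n^{(k,m)}(rho)
    n = len(points) - 1       # the sample consists of n + 1 points
    return r / math.log(n)

random.seed(0)
n, m, k = 500, 60, 1
# n + 1 i.i.d. points from the symmetric Bernoulli measure on {0, 1}
points = [tuple(random.randint(0, 1) for _ in range(m)) for _ in range(n + 1)]
estimate = estimate_inverse_entropy(points, k, m)
print(estimate)  # slowly approaches 1/log 2 as n grows; biased upward at moderate n
```

For moderate $n$ the upward bias of order $1/\log n$ dominates, which is precisely the kind of bias the parametrized weak metrics of this paper are designed to reduce.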

Applying Theorem 4.2 and an inequality for the variance, we obtain the following statement:
Thus, we have just bounded the estimator's variance. We note that a calculation of the estimator's bias is much more complicated, and we will do it for an asymmetric Bernoulli measure.
Proposition 4. Let $\mu$ be an asymmetric Bernoulli measure, and let the function $\lambda(t)$ of (2) be such that the following identity holds for $0 < \beta < 1$:
Then, for $\beta = 1/|A|$, we have
Proof. We introduce the notation
Clearly, for a symmetric Bernoulli measure (with equiprobable symbols), $F$ does not depend on $x$ and satisfies an equation
If we replace $\beta^{-\log r}$ by $x$, we obtain
Calculating the integral (see [4, 4.253.1]), we get
where $H_n$ are the harmonic numbers, $H_n = \sum_{s=1}^{n} \frac{1}{s}$.
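The harmonic numbers appearing in this expression are straightforward to compute; a quick Python check of the definition $H_n = \sum_{s=1}^{n} 1/s$:

```python
from fractions import Fraction

def harmonic(n):
    """H_n = sum_{s=1}^{n} 1/s, computed exactly as a rational number."""
    return sum(Fraction(1, s) for s in range(1, n + 1))
```

For example, $H_4 = 1 + 1/2 + 1/3 + 1/4 = 25/12$.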
6. Conclusion. In this work, we have introduced a new nearest-neighbor entropy estimator based on a new large family of weak metrics. The estimator has a parameter with which it is optimized to reduce its bias. We have calculated the estimator's variance and shown that it is upper-bounded by a nearly optimal Cramér-Rao lower bound. We have explicitly calculated the estimator's bias for a special case, symmetric Bernoulli measures. In subsequent work, we expect to calculate the bias for the general case as well as to develop an efficient algorithmic implementation of the estimator.