Discrete minimax estimation with trees

We propose a simple recursive data-based partitioning scheme which produces piecewise-constant or piecewise-linear density estimates on intervals, and show how this scheme can determine the optimal $L_1$ minimax rate for some discrete nonparametric classes.

1. Introduction

Some of the most commonly studied nonparametric classes of densities are the monotone densities on $[0,1]$, the $L$-Lipschitz densities for a given constant $L > 0$, and the log-concave densities, to name a few. By minimax estimation, we mean density estimation in the minimax sense, i.e., we are interested in the existence of a density estimate whose error is as small as possible even in the worst case over the class.
There is a long line of work in the statistics literature about density estimation, and a growing interest coming from the theoretical computer science and machine learning communities; for a selection of new and old books on this topic, see [10,11,13,18,26,27]. The study of nonparametric density estimation began as early as the 1950s, when Grenander [16] described and studied properties of the maximum likelihood estimate of an unknown density taken from the class of bounded monotone densities on $[0,1]$. This estimator and this class received much further treatment over the years, in particular by Prakasa Rao [23], Groeneboom [17], and Birgé [6,7,8], who identified the optimal $L_1$-error minimax rate, up to a constant factor. Since then, countless more nonparametric classes have been studied, and many different all-purpose methods have been developed to obtain minimax results about these classes: for the construction of density estimates, see, e.g., the maximum likelihood estimate, skeleton estimates, kernel estimates, and wavelet estimates, to name a few; and for minimax rate lower bounds, see, e.g., the methods of Assouad, Fano, and Le Cam [10,11,13,29].
One very popular style of density estimate is the histogram, in which the support of the random data is partitioned into bins, where each bin receives a weight proportional to the number of data points contained within, and such that the estimate is constant with the given weight along each bin. Then, the selection of the bins themselves becomes critical in the construction of a good histogram estimate. Birgé [7] showed how histograms with carefully chosen exponentially increasing bin sizes will have $L_1$-error within a constant factor of the optimal minimax rate for the class of bounded non-increasing densities on $[0,1]$. In general, the right choice of an underlying partition for a histogram estimate is not obvious.
In this work, we devise a recursive data-based approach for determining the partition of the support for a histogram estimate of discrete non-increasing densities. We also use a similar approach to build a piecewise-linear estimator for discrete non-increasing convex densities; see Anevski [1], Jongbloed [20], and Groeneboom, Jongbloed, and Wellner [19] for works concerning the maximum likelihood and minimax estimation of continuous non-increasing convex densities. Both of our estimators are minimax-optimal, i.e., their minimax $L_1$-error is within a constant factor of the optimal rate. Recursive data-based partitioning schemes have been extremely popular in density estimation since the 1970s, beginning with Gessaman [15], Chen and Zhao [9], Lugosi and Nobel [22], and countless others, with great interest coming from the machine learning and pattern recognition communities [12]. Still, it seems that most of the literature involving recursive data-based partitions is not especially concerned with the rate of convergence of density estimates, but rather with other properties such as consistency under different recursive schemes. Moreover, most of the density estimation literature is concerned with the estimation of continuous probability distributions. In discrete density estimation, not all of the constructions or methods used to develop arguments for analogous continuous classes will neatly apply, and in some cases, there are discrete phenomena that call for a different approach.

2. Preliminaries and summary
Let $\mathcal{F}$ be a given class of probability densities with respect to a base measure $\mu$ on the measurable space $(\mathcal{X}, \Sigma)$, and let $f \in \mathcal{F}$. If $X$ is a random variable taking values in $(\mathcal{X}, \Sigma)$, we write $X \sim f$ to mean that
$$\Pr\{X \in A\} = \int_A f \,\mathrm{d}\mu \quad \text{for each } A \in \Sigma.$$
We write $X_1, \dots, X_n \overset{\mathrm{i.i.d.}}{\sim} f$ to mean that $X_i \sim f$ for each $1 \le i \le n$, and that $X_1, \dots, X_n$ are mutually independent.
Typically in density estimation, either $\mathcal{X} = \mathbf{R}^d$, $\Sigma$ is the Borel $\sigma$-algebra, and $\mu$ is the Lebesgue measure, or $\mathcal{X}$ is countable, $\Sigma = \mathcal{P}(\mathcal{X})$, and $\mu$ is the counting measure. The former case is referred to as the continuous setting, and the latter case as the discrete setting, where $f$ is more often called a probability mass function in the literature. Throughout this paper, we will only be concerned with the discrete setting, and even so, we still refer to $\mathcal{F}$ as a class of densities, and to $f$ as a density itself. Plainly, in this case, $X \sim f$ signifies that
$$\Pr\{X \in A\} = \sum_{x \in A} f(x) \quad \text{for each } A \in \mathcal{P}(\mathcal{X}).$$
Let $f \in \mathcal{F}$ be unknown. Given the $n$ samples $X_1, \dots, X_n \overset{\mathrm{i.i.d.}}{\sim} f$, our goal is to create a density estimate $\hat{f}_n \colon \mathcal{X}^n \to \mathbf{R}^{\mathcal{X}}$, such that the probability measures corresponding to $f$ and $\hat{f}_n(X_1, \dots, X_n)$ are close in total variation (TV) distance, where for any probability measures $\mu, \nu$, their TV-distance is defined as
$$\mathrm{TV}(\mu, \nu) = \sup_{A \in \Sigma} |\mu(A) - \nu(A)|. \tag{1}$$
The TV-distance has several equivalent definitions; importantly, if $\mu$ and $\nu$ are probability measures with corresponding densities $f$ and $g$, then
$$\mathrm{TV}(\mu, \nu) = \frac{1}{2}\|f - g\|_1, \tag{2}$$
where for any function $h \colon \mathcal{X} \to \mathbf{R}$, we define the $L_1$-norm of $h$ as
$$\|h\|_1 = \sum_{x \in \mathcal{X}} |h(x)|;$$
the TV-distance also admits the coupling characterization
$$\mathrm{TV}(\mu, \nu) = \inf\big\{\Pr\{X \neq Y\} : X \sim \mu, \ Y \sim \nu\big\}, \tag{3}$$
where the infimum is over all couplings of $\mu$ and $\nu$. (In the continuous case, the sum in the $L_1$-norm is simply replaced with an integral.) In view of the relation between the TV-distance and the $L_1$-norm in (2), we will abuse notation and write $\mathrm{TV}(f, g) = \|f - g\|_1/2$.
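For concreteness, the identity (2) is easy to exercise numerically. The following is a minimal Python sketch (our own illustration, not part of the original text; the function name `tv_distance` and the use of NumPy are choices made for this sketch):

```python
import numpy as np

def tv_distance(f, g):
    """TV distance between two discrete densities on the same finite
    support, computed via the L1 identity TV(f, g) = ||f - g||_1 / 2."""
    f = np.asarray(f, dtype=float)
    g = np.asarray(g, dtype=float)
    return 0.5 * np.abs(f - g).sum()

# Example: TV between a uniform and a decreasing density on {1, ..., 4}.
u = np.full(4, 1 / 4)
t = np.array([0.4, 0.3, 0.2, 0.1])
print(tv_distance(u, t))  # 0.2
```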
There are various possible measures of dissimilarity between probability distributions which can be considered in density estimation, e.g., the Hellinger distance, Wasserstein distance, $L_p$-distance, $\chi^2$-divergence, Kullback-Leibler divergence, or any number of other divergences; see Sason and Verdú [25] for a survey on many such functions and the relations between them. Here, we focus on the TV-distance due to its several appealing properties, such as being a metric, enjoying the natural probabilistic interpretation of (1), and having the coupling characterization (3).
If $\hat{f}_n$ is a density estimate, we define the risk of the estimator $\hat{f}_n$ with respect to the class $\mathcal{F}$ as
$$\mathcal{R}_n(\hat{f}_n, \mathcal{F}) = \sup_{f \in \mathcal{F}} \mathbf{E}\big[\mathrm{TV}(f, \hat{f}_n(X_1, \dots, X_n))\big],$$
where the expectation is over the $n$ i.i.d. samples from $f$, and any possible randomization of the estimator. From now on, we will omit the dependence of $\hat{f}_n$ on $X_1, \dots, X_n$ unless it is not obvious. The minimax risk or minimax rate for $\mathcal{F}$ is the smallest risk over all possible density estimates,
$$\mathcal{R}_n(\mathcal{F}) = \inf_{\hat{f}_n} \mathcal{R}_n(\hat{f}_n, \mathcal{F}).$$
We can now state our results precisely. Let $k \in \mathbf{N}$, and let $\mathcal{F}_k$ be the class of non-increasing densities on $\{1, \dots, k\}$, i.e., the set of all probability vectors $f \colon \{1, \dots, k\} \to \mathbf{R}$ for which
$$f(x) \ge f(x+1) \quad \text{for all } x \in \{1, \dots, k-1\}. \tag{4}$$
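For a fixed $f$ and a concrete estimator, the inner expectation above can be approximated by simulation. The sketch below is our own hypothetical illustration (the helper name `empirical_risk` and its interface are not from the paper); the risk over a class is then the supremum of this quantity over the class, and the minimax rate is its infimum over estimators.

```python
import numpy as np

def empirical_risk(f, estimator, n, trials=200, seed=0):
    """Monte Carlo approximation of E[TV(f, f_hat_n)] for a fixed
    density f on {1, ..., k}, averaging over repeated n-sample draws."""
    rng = np.random.default_rng(seed)
    f = np.asarray(f, dtype=float)
    k = len(f)
    total = 0.0
    for _ in range(trials):
        x = rng.choice(np.arange(1, k + 1), size=n, p=f)  # n i.i.d. samples
        f_hat = np.asarray(estimator(x, k), dtype=float)
        total += 0.5 * np.abs(f_hat - f).sum()            # TV via (2)
    return total / trials
```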
Theorem 2.1. For sufficiently large $n$,
$$\mathcal{R}_n(\mathcal{F}_k) = \Theta\left(\min\left\{\sqrt{\frac{k}{n}},\ \left(\frac{\log(k/n^{1/3} + 1)}{n}\right)^{1/3},\ 1\right\}\right).$$
Let $\mathcal{G}_k$ be the class of all non-increasing convex densities on $\{1, \dots, k\}$, so each $f \in \mathcal{G}_k$ satisfies (4) and
$$f(x) - 2f(x+1) + f(x+2) \ge 0 \quad \text{for all } x \in \{1, \dots, k-2\}.$$
Theorem 2.2. For sufficiently large $n$,
$$\mathcal{R}_n(\mathcal{G}_k) = \Theta\left(\min\left\{\sqrt{\frac{k}{n}},\ \left(\frac{\log(k/n^{1/5} + 1)}{n}\right)^{2/5},\ 1\right\}\right).$$
We emphasize here that the above results give upper and lower bounds on the minimax rates $\mathcal{R}_n(\mathcal{F}_k)$ and $\mathcal{R}_n(\mathcal{G}_k)$ which are within universal constant factors of one another, for the entire range of $k$ and $n$.
Our upper bounds will crucially rely on the next results, which allow us to relate the minimax rate of a class to an old and well-studied combinatorial quantity called the Vapnik-Chervonenkis (VC) dimension [28]: for $\mathcal{A} \subseteq \mathcal{P}(\mathcal{X})$ a family of subsets of $\mathcal{X}$, the VC-dimension of $\mathcal{A}$, denoted by $\mathrm{VC}(\mathcal{A})$, is the size of the largest set $X \subseteq \mathcal{X}$ such that for every $Y \subseteq X$, there exists $B \in \mathcal{A}$ such that $X \cap B = Y$. See, e.g., the book of Devroye and Lugosi [13] for examples and applications of the VC-dimension in the study of density estimation.

Theorem 2.3 (Devroye, Lugosi [13]). Let $\mathcal{F}$ be a class of densities supported on $\mathcal{X}$, and let $\mathcal{F}_\Theta = \{f_{n,\theta} : \theta \in \Theta\}$ be a class of density estimates such that each $f_{n,\theta}$ is itself a density on $\mathcal{X}$. Define the Yatracos class of $\mathcal{F}_\Theta$ as
$$\mathcal{A}_\Theta = \big\{\{x \in \mathcal{X} : f_{n,\theta}(x) > f_{n,\theta'}(x)\} : \theta \neq \theta' \in \Theta\big\}.$$
For $f \in \mathcal{F}$, let $\mu$ be the probability measure corresponding to $f$. Let also $\mu_n$ be the empirical measure based on $X_1, \dots, X_n$, i.e., $\mu_n(A) = \frac{1}{n}\#\{1 \le i \le n : X_i \in A\}$. Then, there is an estimate $\psi_n$ for which
$$\mathrm{TV}(f, \psi_n) \le 3 \min_{\theta \in \Theta} \mathrm{TV}(f, f_{n,\theta}) + 2 \sup_{A \in \mathcal{A}_\Theta} |\mu(A) - \mu_n(A)| + \frac{3}{2n}.$$
The estimate $\psi_n$ in Theorem 2.3 is called the minimum distance estimate in [13]; we omit the details of its construction, though we emphasize that if computing $\int_A f_{n,\theta} \,\mathrm{d}\mu$ takes one unit of computation for any $\theta$ and $A$, then selecting $\psi_n$ takes time polynomial in the size of $\mathcal{A}_\Theta$, which is often exponential in the quantities of interest.
Theorem 2.4 (Devroye, Lugosi [13]). Let $\mathcal{F}, \mathcal{X}, f, \mu, \mu_n$ be as in Theorem 2.3, and let $\mathcal{A} \subseteq \mathcal{P}(\mathcal{X})$. Then, there is a universal constant $c > 0$ for which
$$\mathbf{E}\left[\sup_{A \in \mathcal{A}} |\mu(A) - \mu_n(A)|\right] \le c \sqrt{\frac{\mathrm{VC}(\mathcal{A})}{n}}.$$
Corollary 2.6. Let $\mathcal{F}, \mathcal{A}_\Theta, f_{n,\theta}, f, \mu, \mu_n$ be as in Theorem 2.3. Then, there is a universal constant $c > 0$ for which
$$\mathbf{E}\big[\mathrm{TV}(f, \psi_n)\big] \le 3\, \mathbf{E}\left[\min_{\theta \in \Theta} \mathrm{TV}(f, f_{n,\theta})\right] + c \sqrt{\frac{\mathrm{VC}(\mathcal{A}_\Theta)}{n}}.$$
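When $\Theta$ is finite, the minimum distance selection of Theorem 2.3 can be carried out directly. The following Python sketch is our own illustration of this selection step, under the assumption that the candidates are densities on $\{1, \dots, k\}$ given as vectors; the name `minimum_distance_select` is hypothetical.

```python
import numpy as np
from itertools import combinations

def minimum_distance_select(candidates, samples, k):
    """Pick, from a finite list of candidate densities on {1, ..., k},
    the one minimizing the maximal discrepancy to the empirical measure
    over the Yatracos sets {x : f(x) > g(x)} for candidate pairs (f, g)."""
    n = len(samples)
    emp = np.bincount(samples, minlength=k + 1)[1:] / n  # empirical pmf
    cands = [np.asarray(c, dtype=float) for c in candidates]
    if len(cands) == 1:
        return cands[0]
    yatracos = []  # boolean masks, two per unordered candidate pair
    for f, g in combinations(cands, 2):
        yatracos.append(f > g)
        yatracos.append(g > f)
    def discrepancy(f):
        return max(abs(f[A].sum() - emp[A].sum()) for A in yatracos)
    return min(cands, key=discrepancy)
```

Note that the number of Yatracos sets grows quadratically with the number of candidates, which is the source of the computational cost mentioned after Theorem 2.3.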

3. Non-increasing densities
This section is devoted to presenting a proof of the upper bound of Theorem 2.1. The lower bound is proved in Appendix A.1 using a careful but standard application of Assouad's Lemma [2]. Part of our analysis in proving Theorem 2.1 will involve the development of an explicit efficient estimator for a density in $\mathcal{F}_k$.

3.1. A greedy tree-based estimator
Suppose that $k$ is a power of two. This assumption can only, at worst, smudge some constant factors in the final minimax rate. Using the samples $X_1, \dots, X_n \overset{\mathrm{i.i.d.}}{\sim} f \in \mathcal{F}_k$, we will recursively construct a rooted ordered binary tree $T$ which determines a partition of the interval $\{1, \dots, k\}$, from which we can build a histogram estimate $\hat{f}_n$ for $f$. Specifically, let $\rho$ be the root of $T$, where $I_\rho = \{1, \dots, k\}$. We say that $\rho$ covers the interval $I_\rho$. Then, for every node $u$ in $T$ covering the interval $I_u$, where $I_v$ and $I_w$ are the first and second halves of $I_u$, we verify the condition
$$N_v - N_w > \sqrt{N_v + N_w}, \tag{5}$$
where $N_v, N_w$ are the number of samples which fall into the intervals $I_v, I_w$, i.e.,
$$N_z = \#\{1 \le i \le n : X_i \in I_z\} \quad \text{for } z \in \{v, w\}.$$
The inequality (5) is referred to as the greedy splitting rule. If (5) is satisfied, then create nodes $v, w$ covering $I_v$ and $I_w$ respectively, and add them to $T$ as left and right children of $u$. If not, make $u$ a leaf in $T$. After applying this procedure, one obtains a (random) tree $T$ with leaves $L$, and the set $\{I_u : u \in L\}$ forms a partition of the support $\{1, \dots, k\}$. Let $\hat{f}_n$ be the histogram estimate based on this partition, i.e.,
$$\hat{f}_n(x) = \frac{N_u}{n |I_u|} \quad \text{for } x \in I_u, \ u \in L.$$
The density estimate $\hat{f}_n$ will be called the greedy tree-based estimator. See Figure 1 for a typical plot of $\hat{f}_n$, and a visualization of the tree $T$.

Remark 3.1. Intuitively, we justify the rule (5) as follows: we expect that $N_v$ is at least as large as $N_w$ by monotonicity of the density $f$, and the larger the difference $|N_v - N_w|$, the finer a partition around $I_v$ and $I_w$ should be to minimize the error of a piecewise-constant estimate of $f$. However, even if $N_v$ and $N_w$ were equal in expectation, we expect with positive probability that $N_v$ may deviate from $N_w$ on the order of a standard deviation, i.e., on the order of $\sqrt{N_v + N_w}$, and this determines the threshold for splitting.
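The construction above is straightforward to implement. The following Python sketch is our own rendering of the procedure, using the form of the splitting rule (5) as reconstructed above ($N_v - N_w > \sqrt{N_v + N_w}$); names such as `greedy_tree_histogram` are our own.

```python
import numpy as np

def greedy_tree_histogram(samples, k):
    """Greedy tree-based estimator: recursively halve {1, ..., k}
    (k a power of two) while the empirical counts of the two halves
    differ by more than the square root of their sum, then return the
    histogram that is constant on each leaf interval."""
    n = len(samples)
    counts = np.bincount(samples, minlength=k + 1)[1:]  # counts[x-1] = #{i : X_i = x}
    f_hat = np.empty(k, dtype=float)

    def split(lo, hi):  # current interval is {lo, ..., hi}, 1-indexed
        width = hi - lo + 1
        mid = lo + width // 2 - 1              # last atom of the left half
        N_v = counts[lo - 1 : mid].sum()       # left-half count
        N_w = counts[mid : hi].sum()           # right-half count
        if width > 1 and N_v - N_w > np.sqrt(N_v + N_w):  # rule (5)
            split(lo, mid)
            split(mid + 1, hi)
        else:  # leaf: constant value N_u / (n * |I_u|)
            f_hat[lo - 1 : hi] = (N_v + N_w) / (n * width)

    split(1, k)
    return f_hat
```

This function has the interface `estimator(samples, k)` used in the simulation sketch of Section 2, so the two can be combined to probe the estimator's risk empirically.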
Remark 3.2. One could argue that any good estimate of a non-increasing density should itself be non-increasing, and the estimate $\hat{f}_n$ does not have this property. This can be rectified using a method of Birgé [7], who described a transformation of piecewise-constant density estimates which does not increase risk with respect to non-increasing densities. Specifically, suppose that the estimate $\hat{f}_n$ is not non-increasing. Then, there are consecutive intervals $I_v, I_w$ such that $\hat{f}_n$ has constant value $y_v$ on $I_v$ and $y_w$ on $I_w$, and $y_v < y_w$. Let the transformed estimate be constant on $I_v \cup I_w$, with value
$$\frac{y_v |I_v| + y_w |I_w|}{|I_v| + |I_w|},$$
i.e., the average value of $\hat{f}_n$ on $I_v \cup I_w$. Iterate the above transformation until a non-increasing estimate is obtained. It can be proven that this results in a unique estimate $\tilde{f}_n$, regardless of the order of merged intervals, and that $\mathrm{TV}(f, \tilde{f}_n) \le \mathrm{TV}(f, \hat{f}_n)$ for any non-increasing $f$.
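This merging procedure is, in effect, a pool-adjacent-violators pass over the histogram values. A minimal Python sketch of the transformation (our own illustration; the function name `monotonize` is hypothetical):

```python
import numpy as np

def monotonize(f_hat):
    """Birgé-style monotone correction: repeatedly merge adjacent
    violating stretches, replacing them by their average value, until
    the estimate is non-increasing (a pool-adjacent-violators pass)."""
    blocks = []  # each entry: [constant value, length of stretch]
    for y in f_hat:
        blocks.append([y, 1])
        # merge while the newest block exceeds the one before it
        while len(blocks) > 1 and blocks[-2][0] < blocks[-1][0]:
            v2, l2 = blocks.pop()
            v1, l1 = blocks.pop()
            blocks.append([(v1 * l1 + v2 * l2) / (l1 + l2), l1 + l2])
    return np.concatenate([np.full(l, v) for v, l in blocks])
```

The stack-based merging visits each value once, and the final result does not depend on the order of merges, consistent with the uniqueness claim above.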

3.2. An idealized tree-based estimator
Instead of analyzing the greedy tree-based estimator $\hat{f}_n$ of the preceding section, we will fully analyze an idealized version. Indeed, in (5), the quantities $N_z$ are distributed as $\mathrm{Binomial}(n, f_z)$ for $z \in \{v, w\}$, where we define
$$f_z = \sum_{x \in I_z} f(x).$$
If we replace the quantities in (5) with their expectations, we obtain the idealized splitting rule
$$f_v - f_w > \sqrt{\frac{f_v + f_w}{n}}, \tag{6}$$
where we note that $f_v \ge f_w$, since $f$ is non-increasing. Using the same procedure as in the preceding section, replacing the splitting rule with (6), we obtain a deterministic tree $T^* = T^*(f)$ with leaves $L^*$, and we set
$$f^*_n(x) = \frac{f_u}{|I_u|} \quad \text{for } x \in I_u, \ u \in L^*,$$
i.e., $f^*_n$ is constant and equal to the average value of $f$ on each interval $I_u$ for $u \in L^*$. We call $f^*_n$ the idealized tree-based estimate. See Figure 2 for a visualization of $f^*_n$ and $T^*$. It may be instructive to compare Figure 2 to Figure 1. Of course, $T^*$ and $f^*_n$ both depend intimately upon knowledge of the density $f$; in practice, we only have access to the samples $X_1, \dots, X_n \overset{\mathrm{i.i.d.}}{\sim} f$, and the density $f$ itself is unknown. In particular, we cannot practically use $f^*_n$ as an estimate for the unknown $f$. Importantly, as we will soon show, we can still use $f^*_n$ along with Corollary 2.6 to get a minimax rate upper bound for $\mathcal{F}_k$.
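For intuition, the idealized tree is simply the deterministic object obtained by running the same recursion on the true interval masses. The sketch below is our own illustration under the reconstruction of rule (6) given above (zero-indexed intervals; `idealized_tree_estimate` is a hypothetical name):

```python
import numpy as np

def idealized_tree_estimate(f, n):
    """Idealized tree-based estimate f*_n: the splitting rule uses the
    true interval masses f_v, f_w instead of empirical counts, so the
    tree is deterministic given f (an analysis device, not an estimator)."""
    f = np.asarray(f, dtype=float)
    k = len(f)  # assumed a power of two
    f_star = np.empty(k, dtype=float)

    def split(lo, hi):  # half-open interval [lo, hi), 0-indexed
        width = hi - lo
        mid = lo + width // 2
        f_v, f_w = f[lo:mid].sum(), f[mid:hi].sum()
        if width > 1 and f_v - f_w > np.sqrt((f_v + f_w) / n):  # rule (6)
            split(lo, mid)
            split(mid, hi)
        else:  # leaf: average value of f on the interval
            f_star[lo:hi] = (f_v + f_w) / width

    split(0, k)
    return f_star
```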
Proposition 3.3. Let $f \in \mathcal{F}_k$, and let $f^*_n$ be the idealized tree-based estimate. Then,
$$\mathrm{TV}(f, f^*_n) \le \sqrt{\frac{|\{u \in L^* : |I_u| > 1\}|}{n}}.$$

Proof. Writing out the TV-distance explicitly, we have
$$\mathrm{TV}(f, f^*_n) = \frac{1}{2} \sum_{u \in L^*} \sum_{x \in I_u} |\bar{f}_u - f(x)|,$$
where $\bar{f}_u = f_u / |I_u|$ is the average value of $f$ on $I_u$. Let $u \in L^*$, and define $A_u = \sum_{x \in I_u} |\bar{f}_u - f(x)|$. If $|I_u| = 1$, then $A_u = 0$, so assume that $|I_u| > 1$. In this case, let $I_v$ and $I_w$ be the left and right halves of the interval $I_u$. In order to compute $A_u$, we refer to Figure 3. We view $A_u$ as the positive area between the curve $f$ and the line $\bar{f}_u$; in Figure 3, this is the patterned area. Let $\bar{f}_v$ and $\bar{f}_w$ be the average value of $f$ on $I_v$ and $I_w$ respectively. Then, $B_v$ is the positive area between $f$ and $\bar{f}_v$ on $I_v$, which is represented as the gray area on $I_v$ in Figure 3, and $B_w$ is the positive area between $f$ and $\bar{f}_w$ on $I_w$, the gray area on $I_w$ in Figure 3. The points $x_v, x_u, x_w$ are the unique points for which $f(x_z) = \bar{f}_z$, for $z \in \{v, u, w\}$.
Since the gray area on $I_v$ to the left of $x_v$ is equal to the gray area on $I_v$ to the right of $x_v$, and the gray area on $I_w$ to the left of $x_w$ is equal to the gray area on $I_w$ to the right of $x_w$, then
$$A_u \le B_v + B_w + (f_v - f_w).$$
Then
$$A_u \le 2(f_v - f_w) \le 2\sqrt{\frac{f_v + f_w}{n}} = 2\sqrt{\frac{f_u}{n}},$$
where this last line follows from the splitting rule (6), which fails at $u$ since $u \in L^*$ and $|I_u| > 1$. So,
$$\mathrm{TV}(f, f^*_n) = \frac{1}{2} \sum_{u \in L^*} A_u \le \sum_{\substack{u \in L^* \\ |I_u| > 1}} \sqrt{\frac{f_u}{n}} \le \sqrt{\frac{|\{u \in L^* : |I_u| > 1\}|}{n}},$$
where the last line follows from the Cauchy-Schwarz inequality, since $\sum_{u \in L^*} f_u = 1$.

Proposition 3.4. Suppose that $n \ge 64$ and $2n^{1/3} \le k < n^{1/3} 2^n$. Then,
$$|L^*| = O\Big(n^{1/3} \big(1 + \log_2(k/n^{1/3})\big)^{2/3}\Big).$$

Proof. Note that $T^*$ has height at most $\log_2 k$. Let $U_j$ be the set of nodes at depth $j$ in $T^*$ which have at least one leaf as a child, for $0 \le j \le \log_2 k$, and label the children of the nodes in $U_j$ in order of appearance from right to left in $T^*$ as $u_1, u_2, \dots, u_{2|U_j|}$. Since none of the nodes in $U_j$ are themselves leaves, then by (6), and in particular since $f_{u_1} \ge 0$, then $f_{u_2} > \sqrt{f_{u_2}/n}$, so that $f_{u_2} > 1/n$. In general,
$$f_{u_{i+1}} \ge f_{u_i}, \qquad f_{u_{2i}} > f_{u_{2i-1}} + \sqrt{\frac{f_{u_{2i}}}{n}}, \tag{7}$$
and this recurrence relation can be solved to obtain that $f_{u_i} = \Omega(i^2/n)$. Let $L_j$ be the set of leaves at level $j$ in $T^*$. The leaves at level $j+1$ form a subsequence $v_1, v_2, \dots, v_{|L_{j+1}|}$ of $u_1, u_2, \dots, u_{2|U_j|}$. Write $q_j$ for the total probability mass of $f$ held in the leaves $L_j$, i.e.,
$$q_j = \sum_{u \in L_j} f_u.$$
By (7), $|L_{j+1}| = O\big((n q_{j+1})^{1/3}\big)$. Summing over all leaves except on the last level, and using the facts that $n \ge 64$ and $2n^{1/3} \le k < n^{1/3} 2^n$,
$$|\{u \in L^* : |I_u| > 1\}| = O\Big(\sum_j (n q_j)^{1/3}\Big).$$
By Hölder's inequality, summing $(n q_j)^{1/3}$ over the at most $1 + \log_2(k/n^{1/3})$ relevant levels costs a factor of at most $(1 + \log_2(k/n^{1/3}))^{2/3}$, since $\sum_j q_j \le 1$, so finally
$$|L^*| \le n^{1/3} + 4 \log_2(k/n^{1/3}) + 5 n^{1/3} \log_2^{2/3}(k/n^{1/3}) = O\Big(n^{1/3}\big(1 + \log_2(k/n^{1/3})\big)^{2/3}\Big).$$

Proof of the upper bound in Theorem 2.1. The case $k \ge n^{1/3} 2^n$ is trivial, and follows simply because the TV-distance is always upper bounded by 1. Suppose next that $2n^{1/3} > k$. In this regime, we can use a histogram estimator for $f$ with bins of size 1 for each element of $\{1, \dots, k\}$. It is well known that the risk of this estimator is on the order of $\sqrt{k/n}$ [13].
Finally, suppose that $2n^{1/3} \le k < n^{1/3} 2^n$. Let $\mathcal{F}_\Theta$ be the class of all piecewise-constant probability densities on $\{1, \dots, k\}$ which have $\ell = |\{u \in L^* : |I_u| > 1\}|$ parts; in particular, $f^*_n \in \mathcal{F}_\Theta$. Let $\mathcal{A}_\Theta$ be the Yatracos class of $\mathcal{F}_\Theta$,
$$\mathcal{A}_\Theta = \big\{\{x : f_{n,\theta}(x) > f_{n,\theta'}(x)\} : \theta \neq \theta' \in \Theta\big\}.$$
Each set in $\mathcal{A}_\Theta$ is a union of at most $2\ell$ intervals, so that $\mathrm{VC}(\mathcal{A}_\Theta) = O(\ell)$. Combining Corollary 2.6 with Propositions 3.3 and 3.4, we see that for sufficiently large $n$, there is a universal constant $c_3 > 0$ such that
$$\mathcal{R}_n(\mathcal{F}_k) \le c_3 \left(\frac{1 + \log_2(k/n^{1/3})}{n}\right)^{1/3}.$$

4. Non-increasing convex densities
Recall that $\mathcal{G}_k$ is the class of non-increasing convex densities supported on $\{1, \dots, k\}$. Then, $\mathcal{G}_k$ forms a subclass of $\mathcal{F}_k$, which we considered in Section 3. This section is devoted to extending the techniques of Section 3 in order to obtain a minimax rate upper bound on $\mathcal{G}_k$. Again, the lower bound is proved using standard techniques in Appendix A.2.
In this section, we assume that $k$ is a power of three. In order to prove the upper bound of Theorem 2.2, we construct a ternary tree just as in Section 3, now with a ternary splitting rule: if a node $u$ has children $v, w, r$ in order from left to right, we split and recurse if
$$f_v - 2 f_w + f_r > \sqrt{\frac{f_u}{n}}. \tag{8}$$
Here we obtain a tree $T^\dagger = T^\dagger(f)$ with leaves $L^\dagger$. If $u \in L^\dagger$, write $v, w, r$ for the three thirds of $I_u$ from left to right, and let $m_z$ be the midpoint of $I_z$ for $z \in \{v, w, r\}$. We let the estimate $f^\dagger_n$ on $I_u$ be composed of the secant line passing through the points $(m_v, \bar{f}_v)$ and $(m_r, \bar{f}_r)$. The estimate $f^\dagger_n$ is again called the idealized tree-based estimate for $f$.

Proposition 4.1. Let $f \in \mathcal{G}_k$, and let $f^\dagger_n$ be the idealized tree-based estimate. Then, there is a universal constant $c > 0$ such that
$$\mathrm{TV}(f, f^\dagger_n) \le c \sqrt{\frac{|\{u \in L^\dagger : |I_u| > 1\}|}{n}}.$$
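As in Section 3.2, the idealized ternary construction can be written out directly. The following Python sketch is our own illustration, using the reconstruction of rule (8) above ($f_v - 2f_w + f_r > \sqrt{f_u/n}$), zero-indexed intervals, and a hypothetical function name:

```python
import numpy as np

def ternary_secant_estimate(f, n):
    """Idealized ternary estimate for a non-increasing convex f on
    {1, ..., k}, k a power of three: split an interval into thirds when
    the second difference of the third-masses exceeds sqrt(f_u / n);
    otherwise fit the secant line through the midpoints of the outer
    thirds, at the average value of f on each third."""
    f = np.asarray(f, dtype=float)
    k = len(f)
    est = np.empty(k, dtype=float)

    def split(lo, hi):  # half-open interval [lo, hi), 0-indexed
        width = hi - lo
        t = width // 3
        f_v = f[lo : lo + t].sum()
        f_w = f[lo + t : lo + 2 * t].sum()
        f_r = f[lo + 2 * t : hi].sum()
        f_u = f_v + f_w + f_r
        if width > 1 and f_v - 2 * f_w + f_r > np.sqrt(f_u / n):  # rule (8)
            split(lo, lo + t)
            split(lo + t, lo + 2 * t)
            split(lo + 2 * t, hi)
        elif width == 1:
            est[lo] = f[lo]
        else:  # secant line through the outer-third midpoints
            m_v = lo + (t - 1) / 2.0           # midpoint of left third
            m_r = lo + 2 * t + (t - 1) / 2.0   # midpoint of right third
            fbar_v, fbar_r = f_v / t, f_r / t  # average values of f there
            slope = (fbar_r - fbar_v) / (m_r - m_v)
            xs = np.arange(lo, hi)
            est[lo:hi] = fbar_v + slope * (xs - m_v)

    split(0, k)
    return est
```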
Before proving this, we first note that by convexity of $f$, the slope of the secant line passing through $(m_w, \bar{f}_w)$ and $(m_r, \bar{f}_r)$ is at least the slope of the secant line passing through $(m_v, \bar{f}_v)$ and $(m_w, \bar{f}_w)$. Equivalently,
$$\bar{f}_v - 2 \bar{f}_w + \bar{f}_r \ge 0.$$
Proof of Proposition 4.1. As in the proof of Proposition 3.3, we have
$$\mathrm{TV}(f, f^\dagger_n) = \frac{1}{2} \sum_{u \in L^\dagger} A_u, \quad \text{where } A_u = \sum_{x \in I_u} |f^\dagger_n(x) - f(x)|.$$
We refer to Figure 4 for a visualization of the quantity $A_u$, which is depicted as the patterned area.
Let $B_v$, $B_w$, and $B_r$ denote the gray area on $I_v$, $I_w$, and $I_r$ respectively. It can be shown that
$$A_u \le B_v + B_w + B_r + c' (f_v - 2 f_w + f_r)$$
for a universal constant $c' > 0$. The result then follows from the splitting rule (8) and the Cauchy-Schwarz inequality.

5. Discussion
The techniques of this paper seem to naturally extend to higher dimensions. Take, for instance, the class of block-decreasing densities, whose minimax rate was identified by Biau and Devroye [5]. This is the class of densities supported on $[0,1]^d$, bounded by some constant $B \ge 0$, such that each density is non-increasing in each coordinate if all other coordinates are held fixed. In order to estimate such a density, one could devise an oriented binary splitting rule analogous to (6) and carry out a similar analysis as performed in Section 3.2. Furthermore, we expect that there are many other classes of one-dimensional densities whose optimal minimax rate could be identified using our approach, like the class of $\ell$-monotone densities on $\{1, \dots, k\}$, where a function $f$ is called $\ell$-monotone if it is non-negative and if $(-1)^j f^{(j)}$ is non-increasing and convex for all $j \in \{0, \dots, \ell - 2\}$ if $\ell \ge 2$, and where $f$ is non-negative and non-increasing if $\ell = 1$. This paper tackles the cases of $\ell = 1$ and $\ell = 2$. See Balabdaoui and Wellner [3,4] for texts concerning the density estimation of $\ell$-monotone densities. Our approach also likely can be applied to the class of all log-concave discrete distributions, where we recall that $f \colon \{1, \dots, k\} \to [0, \infty)$ is log-concave if $f(x)^2 \ge f(x-1) f(x+1)$ for all $1 < x < k$. We note here that the optimal minimax rate for the class of $d$-dimensional log-concave continuous densities is unknown as of the time of writing [14,21,24].
Appendix A: Lower bounds

Lemma A.1 (Assouad's Lemma [2,13]). Let $\mathcal{F}$ be a class of densities supported on the set $\mathcal{X}$. Let $A_0, A_1, \dots, A_r$ be a partition of $\mathcal{X}$, and let $g_{ij} \colon A_i \to \mathbf{R}$, for $0 \le i \le r$ and $j \in \{0, 1\}$, be some collection of functions. For $\theta = (\theta_1, \dots, \theta_r) \in \{0,1\}^r$, define the function $f_\theta \colon \mathcal{X} \to \mathbf{R}$ by
$$f_\theta(x) = \begin{cases} g_{i \theta_i}(x) & \text{if } x \in A_i \text{ for some } 1 \le i \le r, \\ g_{00}(x) & \text{if } x \in A_0, \end{cases}$$
such that each $f_\theta$ is a density on $\mathcal{X}$. Let $\zeta_i \in \{0,1\}^r$ agree with $\theta$ on all bits except for the $i$-th bit. Then, suppose that there are $\alpha > 0$ and $0 < \beta < 1$ such that for every $\theta \in \{0,1\}^r$ and every $1 \le i \le r$,
$$\sum_{x \in A_i} |f_\theta(x) - f_{\zeta_i}(x)| \ge \alpha \quad \text{and} \quad \mathrm{TV}\big(f_\theta^{\otimes n}, f_{\zeta_i}^{\otimes n}\big) \le \beta.$$
Let $\mathcal{H}$ be the hypercube of densities $\mathcal{H} = \{f_\theta : \theta \in \{0,1\}^r\}$. If $\mathcal{H} \subseteq \mathcal{F}$, then
$$\mathcal{R}_n(\mathcal{F}) \ge \frac{r \alpha}{8} (1 - \beta).$$

A.1. Proof of the lower bound in Theorem 2.1.
Suppose first that $e^8 n^{1/3} \le k \le n^{1/3} e^n$. Let $A_1, \dots, A_r$ be consecutive intervals of even cardinality, starting from the leftmost atom 1. Split each $A_i$ into two equal parts, $A_i'$ and $A_i''$. Let $\varepsilon \in (0, 1/\sqrt{2})$, and set
$$g_{i0}(x) = \begin{cases} \dfrac{1 + \varepsilon}{r |A_i|} & \text{if } x \in A_i', \\[1mm] \dfrac{1 - \varepsilon}{r |A_i|} & \text{if } x \in A_i'', \end{cases} \qquad g_{i1}(x) = \begin{cases} \dfrac{1 - \varepsilon}{r |A_i|} & \text{if } x \in A_i', \\[1mm] \dfrac{1 + \varepsilon}{r |A_i|} & \text{if } x \in A_i'', \end{cases}$$
so that each $A_i$ carries total mass $1/r$. It is clear that each $f_\theta$ is a density. In order for each $f_\theta$ to be monotone, we require that
$$\frac{1 - \varepsilon}{r |A_i|} \ge \frac{1 + \varepsilon}{r |A_{i+1}|} \quad \text{for each } 1 \le i < r,$$
and in particular it suffices that $|A_{i+1}| \ge |A_i| e^{4 \varepsilon}$. Pick $a_i = 2 e^{4 \varepsilon i}$. Let $|A_i|$ be the smallest even integer at least equal to $a_i$, so that $a_i \le |A_i| \le a_i + 2$, and thus
$$\sum_{i=1}^r |A_i| \le 2r + \frac{e^{4 \varepsilon r}}{2 \varepsilon}.$$
Since the support of our densities is $\{1, \dots, k\}$, we ask that this last upper bound not exceed $k$. We can guarantee this in particular with a choice of $r$ and $\varepsilon$ for which
$$2r \le \frac{k}{2} \quad \text{and} \quad \frac{e^{4 \varepsilon r}}{2 \varepsilon} \le \frac{k}{2}. \tag{11}$$
On the other hand, for each $1 \le i \le r$,
$$\sum_{x \in A_i} |f_\theta(x) - f_{\zeta_i}(x)| = \frac{2 \varepsilon}{r}.$$
Pick
$$\varepsilon = \left(\frac{\log(k/n^{1/3})}{8n}\right)^{1/3}, \qquad n \varepsilon^2 \le r \le \frac{\log(k/n^{1/3})}{4 \varepsilon},$$
where the requirement $r \ge n \varepsilon^2$ ensures that $\mathrm{TV}(f_\theta^{\otimes n}, f_{\zeta_i}^{\otimes n})$ is bounded away from 1. Note that $k \le n^{1/3} e^n$ now implies that $\varepsilon \in (0, 1/\sqrt{2})$. With this choice, Lemma A.1 implies that
$$\mathcal{R}_n(\mathcal{F}_k) = \Omega\left(\left(\frac{\log(k/n^{1/3})}{n}\right)^{1/3}\right). \tag{12}$$
So we need only verify that these choices of $\varepsilon$ and $r$ are compatible with (11).
Since $k \ge e^8 n^{1/3}$, then there is an integer choice of $r$ in the range
$$n \varepsilon^2 \le r \le \frac{\log(k/n^{1/3})}{4 \varepsilon}.$$
In particular, we can verify that $2r \le \big(n \log^2(k/n^{1/3})\big)^{1/3} \le \frac{k}{2}$, since $2 \log^{2/3} x \le x$ for all $x \ge 0$. Moreover, since $k \ge e^8 n^{1/3}$, then $\varepsilon \ge \frac{1}{2 n^{1/3}}$, so that
$$\frac{e^{4 \varepsilon r}}{2 \varepsilon} \le n^{1/3} e^{4 \varepsilon r} \le n^{1/3} e^{\frac{1}{2} \log(k/n^{1/3})} = n^{1/3} \sqrt{\frac{k}{n^{1/3}}} \le \frac{k}{2},$$
where this last inequality holds since $\sqrt{x} \le x/2$ for all $x \ge e^8$, so that (11) holds and (12) is proved. When $k \ge n^{1/3} e^n$, we argue by inclusion that
$$\mathcal{R}_n(\mathcal{F}_k) \ge \mathcal{R}_n(\mathcal{F}_{k'}) = \Omega(1), \quad k' = \lfloor n^{1/3} e^n \rfloor,$$
since $\mathcal{F}_{k'} \subseteq \mathcal{F}_k$ for $k' \le k$. The only remaining case is $k \le e^8 n^{1/3}$. In this case, we offer a different construction. Now, each $A_i$ will have size 2 for $1 \le i \le r$, where $r = \lfloor k/2 \rfloor$. Fix $a, b \in \mathbf{R}$ to be specified later, and set
$$g_{i0}(x) = \begin{cases} a - ib + \varepsilon b & \text{if } x \text{ is the first atom of } A_i, \\ a - ib - \varepsilon b & \text{if } x \text{ is the second atom of } A_i, \end{cases}$$
with $g_{i1}$ swapping these two values, for some $0 \le \varepsilon \le 1$. Since each $f_\theta$ must be a density, we need that its values are non-negative and that its total mass is 1. Both of these conditions will be satisfied if we pick
$$b = \frac{\varepsilon}{2 r^2}, \quad \text{and} \quad a = b + \frac{1 + \varepsilon}{2r}.$$
Furthermore, the largest probability of an atom here is at most $2/k$ for $k \ge 2$. Then, for $1 \le i \le r$, we can compute
$$\sum_{x \in A_i} |f_\theta(x) - f_{\zeta_i}(x)| = 2 \varepsilon b,$$
and an appropriate choice of $\varepsilon$ in Lemma A.1 yields $\mathcal{R}_n(\mathcal{F}_k) = \Omega(\sqrt{k/n})$ in this regime.

A.2. Proof of the lower bound in Theorem 2.2.
Let $A_1, \dots, A_r$ be the partition in Lemma A.1, for an integer $r \ge 1$ to be specified. Let $j_i$ be the smallest element of $A_i$, and suppose that each $|A_i|$ is chosen to be a positive multiple of 3. We will define the functions $f_\theta$ based on parameters $\beta_i, \Delta_i \in \mathbf{R}$, to be specified. Let $g_{i0}$ linearly interpolate between the points $(j_i, \beta_i)$ and $(j_i + |A_i|/3, \beta_i - \Delta_i)$ on $\{j_i, j_i + 1, \dots, j_i + |A_i|/3 - 1\}$, and between the points $(j_i + |A_i|/3, \beta_i - \Delta_i)$ and $(j_i + |A_i|, \beta_{i+1})$ on $\{j_i + |A_i|/3, \dots, j_i + |A_i| - 1\}$. Let $g_{i1}$ linearly interpolate between the points $(j_i, \beta_i)$ and $(j_i + 2|A_i|/3, \beta_i - 2\Delta_i)$ on $\{j_i, j_i + 1, \dots, j_i + 2|A_i|/3 - 1\}$, and between the points $(j_i + 2|A_i|/3, \beta_i - 2\Delta_i)$ and $(j_i + |A_i|, \beta_{i+1})$ on $\{j_i + 2|A_i|/3, \dots, j_i + |A_i| - 1\}$. Then, each $f_\theta$ will be non-increasing as long as
$$\beta_i \ge \beta_{i+1} \quad \text{for each } 1 \le i \le r, \tag{13}$$
together with the analogous condition within each $A_i$. Each $f_\theta$ will be convex as long as the largest slope on $A_i$ is at most the smallest slope on $A_{i+1}$ for each $1 \le i \le r$. (14)
Now, pick $\beta_i = (1 - \varepsilon)^{i-1} \beta$ for some $\beta \in \mathbf{R}$, $\varepsilon \in (0, 1)$ to be specified, for which (13) is immediately satisfied. The condition (14) is then equivalent to
$$\frac{|A_{i+1}|}{|A_i|} \ge 1 + \varepsilon.$$
Pick $|A_1| = 3$. It is sufficient to make the choice $a_i = 3 (1 + \varepsilon)^{i-1}$, and to let $|A_i|$ be the smallest integer multiple of 3 at least as large as $a_i$, so that $a_i \le |A_i| \le a_i + 3$. If $\varepsilon \le 1/2$, then
$$\sum_{i=1}^r |A_i| \le 3r + \frac{3 e^{2 \varepsilon r}}{2 \varepsilon}.$$
Since the support of our densities is $\{1, \dots, k\}$, this upper bound must not exceed $k$, so we impose that
$$3r \le \frac{k}{2}, \quad \text{and} \quad \frac{3 e^{2 \varepsilon r}}{2 \varepsilon} \le \frac{k}{2}.$$