A recursive procedure for density estimation on the binary hypercube

This paper describes a recursive estimation procedure for multivariate binary densities (probability distributions of vectors of Bernoulli random variables) using orthogonal expansions. For $d$ covariates, there are $2^d$ basis coefficients to estimate, which renders conventional approaches computationally prohibitive when $d$ is large. However, for a wide class of densities that satisfy a certain sparsity condition, our estimator runs in probabilistic polynomial time and adapts to the unknown sparsity of the underlying density in two key ways: (1) it attains near-minimax mean-squared error for moderate sample sizes, and (2) the computational complexity is lower for sparser densities. Our method also allows for flexible control of the trade-off between mean-squared error and computational complexity.


Introduction
This paper considers the problem of estimating a multivariate binary density from a number of independent observations. That is, we have $n$ observations of the form $X_i \in \{0,1\}^d$ which are independent and identically distributed (i.i.d.) samples from a population with a probability density $f$ (with respect to the counting measure on the $d$-dimensional binary hypercube $\{0,1\}^d$). We wish to estimate $f$ on the basis of these observations. Multivariate binary data arise in a variety of applications:

• Biostatistics. Each $X_i$ could represent a biochemical profile of a bacterial strain, where every component is a "yes-no" indicator of the presence of a particular biochemical marker [18,38]. A recent paper [31] proposed a methodology for representing gene expression data using binary vectors. More classical scenarios include recording the occurrence of a given symptom or a medical condition in a patient over time [12] or outcomes of a series of medical tests [1].
• Quantitative methods in social sciences. Each $X_i$ could represent a respondent in a survey or a panel, where every component is a "yes-no" answer to a question [6], describe the voting record of a legislator, or correspond to co-occurrences of events in social networks [32].

• Artificial intelligence. Each $X_i$ could represent a user query to a search engine or a database, where every component corresponds to the presence or absence of a particular keyword [13], or an image stored on a website like Flickr, where every component corresponds to a user-supplied tag from a given list.
Many situations involving multivariate binary data have the following features: (1) the number of covariates (or the dimension of the hypercube) $d$ is such that the number of possible values each observation could take ($2^d$) is much larger than the sample size $n$; (2) there is a "clustering effect" in the population, meaning that the shape of the underlying density is influenced mainly by a small number of constellations of the $d$ covariates. For example, a particular class of bacterial strains may be reliably identified by looking at a particular subset of the biomarkers; there may be several such classes in the population of interest, each associated with a distinct subset of biomarkers. Similarly, when working with panel data, it may be the case that the answers to some specific subset of questions are highly correlated among a particular group of the panel participants, while the responses of these participants to other questions are nearly random; moreover, there may be several such distinct groups in the panel.
These considerations call for a density estimation procedure that can effectively cope with "thin" samples (i.e., those samples for which $n < 2^d$) in terms of both estimation error and computational complexity, and at the same time automatically adapt to the possible clustering in the population, in the sense described above. We take the minimax point of view, where we assume that the unknown density $f$ comes from a particular function class $\mathcal{F}$ and seek an estimator that exactly or approximately attains the minimax mean-squared error
$$R^*_n(\mathcal{F}) \triangleq \inf_{\hat f_n} \sup_{f \in \mathcal{F}} \mathbb{E}\|\hat f_n - f\|^2_{L_2},$$
where the infimum is over all estimators based on $n$ i.i.d. samples from $f$. We will choose the class $\mathcal{F}$ to model the "constellation effects" via a certain sparsity condition. Our choice of the $L_2$ risk, as opposed to other measures of risk more commonly used in density estimation, such as Hellinger, Kullback-Leibler or total variation risks, is dictated by the fact that the sparsity condition mentioned above is most naturally stated in a Hilbert space framework, which in turn facilitates the design of our estimation procedure, as well as the derivation of both upper and lower bounds on $R^*_n(\mathcal{F})$. We refer the reader to several other works on density estimation that use the $L_2$ risk [24,7,10,40,20,19,11]. We also note that, because the $L_1$ norm (which for probability densities gives the total variation risk) dominates the $L_2$ norm on the hypercube, and because the Kullback-Leibler divergence dominates the squared total variation distance [8], lower bounds on the squared $L_2$ risk automatically translate into lower bounds on the squared $L_1$ (total variation) risk and on the Kullback-Leibler risk.
Because of the host of applications in which multivariate binary data naturally arise, several authors have investigated algorithms for estimating their probability densities (see, e.g., [1,28,24,7]). However, existing approaches either have very slow rates of error convergence or are computationally prohibitive when the number of covariates is very large. For example, the kernel density estimation scheme proposed by Aitchison and Aitken [1] has computational complexity $O(nd)$ (where $n$ is the number of observations and $d$ the number of covariates), yet its squared $L_2$ error decays at the rate $O(n^{-4/(4+d)})$ [33], which is disastrously slow for large $d$. In contrast, orthogonal series methods, which can potentially achieve near-minimax error decay rates, require the estimation of $2^d$ basis coefficients and do not easily admit computationally tractable estimation methods for very large $d$. For instance, using the Fast Walsh-Hadamard Transform to estimate the coefficients of a density in the Walsh basis (see below) using $n$ samples requires $O(nd2^d)$ operations (see Appendix B in [28] and references therein).
In this paper we present a computationally tractable orthogonal series estimation method based on recursive block thresholding of empirical Walsh coefficients. In particular, the proposed method entails recursively examining empirical estimates of whole blocks of the $2^d$ Walsh coefficients. At each stage of the algorithm, the overall weight of the basis coefficients computed at previous stages is used to decide which remaining coefficients are most likely to be significant or insignificant, and computing resources are allocated accordingly. It is shown that this decision is accurate with high probability, so that insignificant coefficients are not estimated, while the significant coefficients are. This approach is similar in spirit to the algorithm of Goldreich and Levin [17], originally developed for applications to cryptography and later adapted by Kushilevitz and Mansour [22,25] to the problem of learning Boolean functions using membership queries. Although there are significant differences between the problems of density estimation and function learning, which are reflected in our estimation procedure, our algorithm inherits the computational tractability of the Goldreich-Levin scheme: in particular, it runs in probabilistic polynomial time.
The proposed estimator adapts to unknown sparsity of the underlying density in two distinct ways. First, it is near-minimax optimal for "moderate" sample sizes $d \ll n \ll 2^{2d/p}$, with an $L_2$ error decay rate of $O(2^{-d}(d/n)^{2r/(2r+1)})$, where $p \in (0,1]$ is a measure of sparsity and $r = 1/p - 1/2$. Moreover, the computational complexity of our algorithm is automatically lower for sparser densities. Sparsity has recently been recognized as a crucial enabler of accurate estimation in "big-$d$, small-$n$" problems [4,5]. Specifically for densities on the binary hypercube, sparsity in the Walsh basis has a natural qualitative interpretation: the shape of the density is influenced mainly by a small number of constellations of the covariates. For example, if the components of a multidimensional binary vector represent positive/negative outcomes in a series of medical tests, it is often the case that the outcomes of certain small constellations of tests play the determining role in the diagnosis. There are several different series expansions on the binary hypercube presented in the literature, including the Rademacher-Walsh orthogonal series (see Appendix A in [28] and references therein) and the Bahadur expansion [2,18]. We focus in this paper on the Walsh system, which is derived from Fourier analysis on finite groups (see, e.g., Chap. 4 of Tao and Vu [36]), for two reasons. First, the coefficients of a particular function in the Walsh system give us information about the influence of the various subsets of the $d$ variables on the value of the function [34,9]. Second, the Walsh functions of a length-$d$ vector can be factorized into products of Walsh functions of multiple shorter vectors with lengths summing to $d$; this is detailed in Section 2.2. This factorization is central to the efficiency of the proposed coefficient estimation method.
The Walsh system is widely used in the context of learning Boolean functions [25], as well as in harmonic analysis of real-valued functions on the binary hypercube [34,9].

Organization of the paper
The remainder of the paper is organized as follows. Section 2 contains the preliminaries on notation, the Walsh system, and sparsity classes on the binary hypercube. Next, in Section 3 we describe the motivation behind the thresholding methods in orthogonal series estimation on the binary hypercube, introduce our recursive thresholded estimator, and analyze its MSE and computational complexity. The theorems of Section 3 are proved in Section 4. Some illustrative simulation results are given in Section 5. The contributions of the paper are summarized in Section 6. Finally, some technical results are relegated to the appendices.

Notation
The basic set $\{0,1\}$ will be denoted by $B$. For any integer $k > 1$, the components of binary strings $x \in B^k$ will be denoted by $x^{(j)}$, $1 \le j \le k$: for any $x \in B^k$, we have $x = (x^{(1)}, \ldots, x^{(k)})$. We will use juxtaposition to denote concatenation of strings: if $y \in B^k$ and $z \in B^l$, then $yz \in B^{k+l}$ is the string $x = (y^{(1)}, \ldots, y^{(k)}, z^{(1)}, \ldots, z^{(l)})$. For any $0 \le k \le d$, we will define the prefix mapping $\pi_k : B^d \to B^k$ and the suffix mapping $\sigma_k : B^d \to B^{d-k}$ by
$$\pi_k(x) \triangleq (x^{(1)}, \ldots, x^{(k)}), \qquad \sigma_k(x) \triangleq (x^{(k+1)}, \ldots, x^{(d)}), \tag{1}$$
so that $x = \pi_k(x)\sigma_k(x)$ for any $x \in B^d$ (note that both $\pi_0$ and $\sigma_d$ return the empty string). Whenever we deal with vectors whose components are indexed by the elements of $B^k$ for some $k$, we will always assume that the components are arranged according to the lexicographic ordering of the binary strings in $B^k$. Given two real numbers $a, b$, we let $a \wedge b$ denote $\min\{a,b\}$. Also, throughout the paper, $C$ is used to denote a generic constant whose value may change from line to line; specific absolute constants will be denoted by $C_1$, $C_2$, etc. Throughout the paper, we let $M \equiv 2^d$.
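To fix the string conventions, here is a minimal Python sketch (the names `prefix`, `suffix`, and `B` are ours) of the prefix and suffix mappings and the lexicographic enumeration of $B^k$:

```python
import itertools

# Binary strings are modeled as tuples of 0/1; juxtaposition is tuple concatenation.
def prefix(x, k):
    """pi_k(x): the first k components of x."""
    return x[:k]

def suffix(x, k):
    """sigma_k(x): the remaining components of x."""
    return x[k:]

def B(k):
    """B^k in lexicographic order, the convention used for coefficient vectors."""
    return list(itertools.product([0, 1], repeat=k))

x = (1, 0, 1, 1)
# x = pi_k(x) sigma_k(x) for every 0 <= k <= d; pi_0 and sigma_d are empty strings.
for k in range(len(x) + 1):
    assert prefix(x, k) + suffix(x, k) == x
assert prefix(x, 0) == () and suffix(x, len(x)) == ()
assert B(2) == [(0, 0), (0, 1), (1, 0), (1, 1)]  # lexicographic ordering
print("notation checks passed")
```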

The Walsh system
For any integer $k \ge 1$, denote by $\mu_k$ the counting measure on $B^k$ and endow the space of functions $f : B^k \to \mathbb{R}$ with the structure of the real Hilbert space $L_2(\mu_k)$ via the standard inner product
$$\langle f, g \rangle \triangleq \sum_{x \in B^k} f(x) g(x).$$
The Walsh system (see references in the Introduction) in $L_2(\mu_k)$ is an orthonormal basis $\Phi_k = \{\varphi_s : s \in B^k\}$, so that every $f \in L_2(\mu_k)$ has the expansion
$$f = \sum_{s \in B^k} \theta_s \varphi_s, \qquad \theta_s \triangleq \langle f, \varphi_s \rangle, \quad s \in B^k.$$
To keep the notation simple, we will not explicitly mark the underlying dimension when working with the Walsh functions. When $k = d$, we will write $\Phi$ instead of $\Phi_d$.
For any $k$, the Walsh system $\Phi_k$ is a tensor product basis induced by
$$\varphi_0(x) \triangleq \frac{1}{\sqrt{2}}, \qquad \varphi_1(x) \triangleq \frac{(-1)^x}{\sqrt{2}} \tag{2}$$
for any $x \in B$. That is, for any $k \ge 1$, any $\varphi_s \in \Phi_k$ has the form
$$\varphi_s(x) = \prod_{j=1}^k \varphi_{s^{(j)}}\big(x^{(j)}\big),$$
which means that
$$\varphi_s(x) = 2^{-k/2} (-1)^{\sum_{j=1}^k s^{(j)} x^{(j)}}.$$
This generalizes to the following useful factorization property of the Walsh functions: for any $k \ge 1$ and any $l \le k$, we have
$$\varphi_s(x) = \varphi_{\pi_{l,k}(s)}\big(\pi_{l,k}(x)\big)\, \varphi_{\sigma_{l,k}(s)}\big(\sigma_{l,k}(x)\big), \tag{3}$$
where $\pi_{l,k}$ and $\sigma_{l,k}$ denote the prefix and the suffix mappings defined on $B^k$ analogously to (1). This means that, for products of functions on disjoint subsets of the $d$ variables, the Fourier-Walsh coefficients also have the product form.
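These properties are easy to verify numerically. The following Python sketch (the helper names are ours) implements $\varphi_s(x) = 2^{-k/2}(-1)^{\langle s, x \rangle}$ and checks orthonormality and the factorization property on $B^3$:

```python
import itertools
import math

def walsh(s, x):
    """Walsh function phi_s(x) = 2^{-k/2} * (-1)^{<s,x>} on B^k.
    s and x are tuples of 0/1 of equal length k (possibly empty)."""
    dot = sum(si * xi for si, xi in zip(s, x))
    return (-1) ** dot / math.sqrt(2 ** len(s))

k = 3
B_k = list(itertools.product([0, 1], repeat=k))

# Orthonormality under the counting measure: <phi_s, phi_t> = 1{s = t}.
for s in B_k:
    for t in B_k:
        ip = sum(walsh(s, x) * walsh(t, x) for x in B_k)
        assert abs(ip - (1.0 if s == t else 0.0)) < 1e-12

# Factorization: phi_s(x) = phi_{prefix(s)}(prefix(x)) * phi_{suffix(s)}(suffix(x)).
for l in range(k + 1):
    for s in B_k:
        for x in B_k:
            assert abs(walsh(s, x) - walsh(s[:l], x[:l]) * walsh(s[l:], x[l:])) < 1e-12

print("orthonormality and factorization verified")
```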

Sparsity and weak-p balls
Our interest lies with functions whose Fourier-Walsh representations satisfy a certain sparsity constraint. Given a function $f$ on $B^d$, let $\theta(f)$ denote the vector of its Fourier-Walsh coefficients. We will assume that the components of $\theta(f)$ decay according to a power law. Formally, let $\theta_{(1)}, \ldots, \theta_{(M)}$, where $M = 2^d$, be the components of $\theta(f)$ arranged in decreasing order of magnitude:
$$|\theta_{(1)}| \ge |\theta_{(2)}| \ge \cdots \ge |\theta_{(M)}|.$$
Given some $0 < p < \infty$, we say that $\theta(f)$ belongs to the Marcinkiewicz, or weak-$\ell_p$, ball of radius $R$ [3,21], and write $\theta(f) \in w\ell_p(R)$, if
$$|\theta_{(k)}| \le R\, k^{-1/p}, \qquad 1 \le k \le M. \tag{4}$$
It is not hard to show that the Fourier-Walsh coefficients of any probability density on $B^d$ are bounded by $1/\sqrt{M}$. With this in mind, let us define the function class
$$\mathcal{F}_d(p) \triangleq \big\{ f : B^d \to \mathbb{R} \; : \; \theta(f) \in w\ell_p(1/\sqrt{M}) \big\}. \tag{5}$$
We are particularly interested in the case $0 < p \le 1$. We will need approximation properties of weak-$\ell_p$ balls as listed, e.g., in [5]. The basic fact is that the power-law condition (4), particularized to the elements of $\mathcal{F}_d(p)$, is equivalent to the concentration estimate
$$\big|\{ s \in B^d : |\theta_s| \ge \lambda \}\big| \le M^{-p/2} \lambda^{-p}, \tag{6}$$
valid for all $\lambda > 0$. Additionally, for any $1 \le k \le M$, let $\theta_k(f)$ denote the vector $\theta(f)$ with $\theta_{(k+1)}, \ldots, \theta_{(M)}$ set to zero. Then it follows from (4) that
$$\|\theta(f) - \theta_k(f)\|^2 \le C M^{-1} k^{-2r}, \tag{7}$$
where $r \triangleq 1/p - 1/2$, and $C$ is some constant that depends only on $p$. Hence, given any $f \in \mathcal{F}_d(p)$ and denoting by $f_k$ the function obtained from it by truncating all but the $k$ largest Fourier-Walsh coefficients, we get from Parseval's identity that
$$\|f - f_k\|^2_{L_2(\mu_d)} \le C M^{-1} k^{-2r}. \tag{8}$$
Thus, the assumption that $f$ belongs to the sparsity class $\mathcal{F}_d(p)$ for some $p$ can be interpreted qualitatively as saying that the behavior of $f$ is strongly influenced by a small number of subsets of the $d$ covariates. The number of these influential subsets decreases as $p \to 0$.
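As a toy numerical check (our own, with coefficients placed exactly on the boundary of the weak-$\ell_p$ ball), the tail energy left after keeping the $k$ largest coefficients scales like $k^{-2r}$, so doubling $k$ should shrink it by roughly $2^{2r}$:

```python
import math

p = 0.5
r = 1.0 / p - 0.5            # r = 1/p - 1/2 = 1.5, so 2r = 3
M = 2 ** 12
R = 1.0 / math.sqrt(M)       # ball radius appropriate for densities on B^d

# Extremal power-law coefficients: |theta_(j)| = R * j^{-1/p}.
theta_sorted = [R * (j + 1) ** (-1.0 / p) for j in range(M)]

def tail_energy(k):
    """Squared L2 error incurred by keeping only the k largest coefficients."""
    return sum(t * t for t in theta_sorted[k:])

# tail_energy(k) ~ C * M^{-1} * k^{-2r}: doubling k divides it by about 2^{2r} = 8.
for k in (8, 16, 32):
    ratio = tail_energy(k) / tail_energy(2 * k)
    assert 6.5 < ratio < 9.0, ratio
print("tail energy decays like k^(-2r)")
```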

Density estimation via recursive Walsh thresholding
Let $X_1, \ldots, X_n$ be independent random variables in $B^d$ with common unknown density $f$. We wish to estimate $f$ on the basis of this sample. For densities defined on Euclidean space, nonparametric estimators based on hard or soft thresholding of empirically estimated coefficients of the target density in a suitably chosen basis (e.g., a wavelet basis) attain near-minimax rates of convergence of the squared-error risk over rich classes of densities [10,19]. Thresholding is a means of controlling the bias-variance trade-off. Several authors have investigated the use of term-by-term thresholding rules for density estimation on the binary hypercube. There, one begins by computing the empirical estimates
$$\hat\theta_s \triangleq \frac{1}{n} \sum_{i=1}^n \varphi_s(X_i), \qquad s \in B^d, \tag{9}$$
of the Fourier-Walsh coefficients of $f$, and then forming the thresholded estimator
$$\hat f \triangleq \sum_{s \in B^d} \hat\theta_s \, I_{\{T(\hat\theta_s) > \lambda\}} \varphi_s, \tag{10}$$
where $T(\cdot)$ is some real-valued statistic and $I_{\{\cdot\}}$ is the indicator function. Based on the observation that the variance of $\hat\theta_s$ is
$$\operatorname{Var}(\hat\theta_s) = \frac{1/M - \theta_s^2}{n} \le \frac{1}{Mn}, \tag{11}$$
while the squared bias incurred by omitting the term $\theta_s \varphi_s$ from the estimator (10) is $\theta_s^2$, Ott and Kronmal [28] considered the ideal thresholded estimator
$$f^* \triangleq \sum_{s \in B^d} \hat\theta_s \, I_{\{\theta_s^2 \ge (1/M - \theta_s^2)/n\}} \varphi_s. \tag{12}$$
Clearly, $f^*$ is impractical because the thresholding criterion depends on the unknown coefficients $\theta_s$. Instead, Ott and Kronmal [28] suggested that one could mimic the ideal estimator (12) by replacing $\theta_s^2$ in the thresholding criterion by the unbiased estimator $(n\hat\theta_s^2 - 1/M)/(n-1)$, leading to the practical estimator
$$\hat f_{\mathrm{WT}} \triangleq \sum_{s \in B^d} \hat\theta_s \, I_{\big\{\frac{n\hat\theta_s^2 - 1/M}{n-1} \,\ge\, \frac{1/M - \hat\theta_s^2}{n}\big\}} \varphi_s, \tag{13}$$
where WT stands for "Walsh thresholding." This estimator was further improved by Liang and Krishnaiah [24] and Chen, Krishnaiah and Liang [7], who replaced the hard thresholding rule in (13) with shrinkage rules. The main disadvantage of such termwise thresholding is the need to compute empirical estimates of all $M = 2^d$ Fourier-Walsh coefficients. While this is not an issue when $d$ is comparable to $\log n$, it is clearly impractical when $d \gg \log n$.
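As a concrete illustration of term-by-term thresholding, the following Python sketch estimates all $M = 2^d$ empirical coefficients by brute force for a small $d$ and applies a hard threshold in the spirit of (13). The cutoff $1/(Mn)$ and the toy product density are illustrative choices of ours, not the exact rule of [28]:

```python
import itertools
import math
import random

random.seed(0)
d, n = 6, 2000
M = 2 ** d
B_d = list(itertools.product([0, 1], repeat=d))

def walsh(s, x):
    """phi_s(x) = 2^{-k/2} * (-1)^{<s,x>}."""
    return (-1) ** sum(a * b for a, b in zip(s, x)) / math.sqrt(2 ** len(s))

# Illustrative sparse density: independent bits, two of them heavily biased.
q = [0.9, 0.9, 0.5, 0.5, 0.5, 0.5]
X = [tuple(int(random.random() < qj) for qj in q) for _ in range(n)]

# Empirical Fourier-Walsh coefficients, brute force over all M of them.
theta_hat = {s: sum(walsh(s, x) for x in X) / n for s in B_d}

# Keep s when the unbiased estimate of theta_s^2 exceeds the variance proxy 1/(M*n).
kept = {s: t for s, t in theta_hat.items()
        if (n * t * t - 1.0 / M) / (n - 1) >= 1.0 / (M * n)}

f_WT = {x: sum(t * walsh(s, x) for s, t in kept.items()) for x in B_d}
print(len(kept), "of", M, "coefficients retained")
```

Since $\hat\theta_{0\cdots0} = 1/\sqrt{M}$ always survives this cutoff, the resulting estimate automatically sums to one over $B^d$.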
In order to alleviate this difficulty, we will consider a recursive thresholding approach, which will allow us to reject whole groups of empirical coefficients based on efficiently implementable thresholding rules. The main idea behind this approach is motivated by the following argument. Given any $1 \le k \le d$ and any $u \in B^k$, define
$$W_u \triangleq \sum_{v \in B^{d-k}} \theta_{uv}^2.$$
This leads to the following observation: for any $\lambda > 0$,
$$W_u < \lambda \quad \Longrightarrow \quad \theta_{uv}^2 < \lambda \text{ for every } v \in B^{d-k}.$$
The usefulness of this observation for our purposes comes from the fact that we can represent the strings $s \in B^d$, and hence the elements of the Walsh system in $L_2(\mu_d)$, by the leaves of a complete binary tree of depth $d$. Suppose we wanted to pick out only those coefficients of $f$ whose squared magnitude exceeds some threshold $\lambda$. If we knew that $W_u \le \lambda$ for some $u \in B^k$, then this would tell us that the square of every coefficient corresponding to a leaf descending from $u$ does not exceed $\lambda$. Hence, we could start at the root of the tree and at each internal node $u$ that has not yet been visited check whether $W_u \ge \lambda$; if not, then we would delete $u$ and all of its descendants from the tree without having to compute explicitly the corresponding coefficients. At the end of the process (i.e., when we get to the leaves), we will be left only with those $s \in B^d$ for which $\theta_s^2 \ge \lambda$. If $f \in \mathcal{F}_d(p)$ for some $p$, then the resulting squared $L_2$ error will be
$$\sum_{s : \theta_s^2 < \lambda} \theta_s^2 \le C M^{-1/(2r+1)} \lambda^{2r/(2r+1)},$$
where $r = 1/p - 1/2$, as before. We will follow this reasoning in constructing our density estimator. We begin by developing a suitable estimator for $W_u$. To do that, we shall rely on the following lemma (see Appendix A.1 for the proof):

Lemma 1. For any density $f$ on $B^d$, any $1 \le k \le d$, and any $u \in B^k$, we have
$$W_u = \mathbb{E}\Big[ \varphi_u\big(\pi_k(X)\big)\, \varphi_u\big(\pi_k(X')\big)\, I_{\{\sigma_k(X) = \sigma_k(X')\}} \Big],$$
where $X$ and $X'$ are independent with common density $f$, and $W_u \le 2^{-k}$.

This lemma suggests that, for each $1 \le k \le d$ and each $u \in B^k$, an empirical estimate of $W_u$ can be obtained by
$$\hat W_u \triangleq \frac{1}{n^2} \sum_{i=1}^n \sum_{j=1}^n \varphi_u\big(\pi_k(X_i)\big)\, \varphi_u\big(\pi_k(X_j)\big)\, I_{\{\sigma_k(X_i) = \sigma_k(X_j)\}}. \tag{14}$$
Although this is a biased estimator, it has the following useful property (see Appendix A.2 for the proof):
$$\hat W_u = \sum_{v \in B^{d-k}} \hat\theta_{uv}^2, \tag{15}$$
where each $\hat\theta_{uv}$ is an empirical estimate of $\theta_{uv}$ computed according to (9).
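The agreement between (14) and (15) can be checked numerically. The sketch below (our own Python rendering of the two formulas; `walsh`, `direct`, and `W_hat` are our names) computes both on random data and confirms that they coincide:

```python
import itertools
import math
import random

random.seed(1)
d, k, n = 8, 3, 200

def walsh(s, x):
    """phi_s(x) = 2^{-k/2} * (-1)^{<s,x>}."""
    return (-1) ** sum(a * b for a, b in zip(s, x)) / math.sqrt(2 ** len(s))

X = [tuple(random.randint(0, 1) for _ in range(d)) for _ in range(n)]
u = (1, 0, 1)

# Direct form (15): sum over all 2^{d-k} suffixes v of theta_hat_{uv}^2.
direct = 0.0
for v in itertools.product([0, 1], repeat=d - k):
    th = sum(walsh(u + v, x) for x in X) / n
    direct += th * th

# Pairwise "collision" form (14): no enumeration of suffixes at all.
W_hat = 0.0
for xi in X:
    for xj in X:
        if xi[k:] == xj[k:]:  # suffixes collide
            W_hat += walsh(u, xi[:k]) * walsh(u, xj[:k])
W_hat /= n * n

assert abs(W_hat - direct) < 1e-10
print("pairwise and direct estimates of W_u agree")
```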
Another advantage of computing $\hat W_u$ indirectly via (14), rather than directly via (15), is that, while the latter requires $O(2^{d-k} n)$ operations, the former requires only $O(n^2 d)$ operations. This can amount to significant computational savings when $k < d - \log(nd)$. When $k \ge d - \log(nd)$, it becomes more efficient to use the direct estimator (15). Now that we have a way of estimating $W_u$, we can define our density estimation procedure. Provided the threshold scales appropriately with the sample size, we will be able to achieve a good balance between the estimation error (variance) and the approximation error (squared bias) and attain near-minimax rates of convergence. In our analysis, we shall actually consider a more flexible strategy: for every $1 \le k \le d$, we shall compare the estimate $\hat W_u$ of $W_u$ to a threshold that depends not only on the sample size $n$, but also on $k$. More specifically, we will let
$$\lambda_{k,n} \triangleq \frac{\alpha_k}{n}, \tag{16}$$
where the sequence $\alpha = \{\alpha_k\}_{k=1}^d$ is positive and nonincreasing in $k$. In particular, this set-up covers the following two extreme cases:

1. $\alpha_k = \mathrm{const}$ for all $k$: this covers the standard case of always comparing against the same threshold (that depends on $n$);
2. $\alpha_k = \mathrm{const} \cdot 2^{d-k}$: this corresponds to thresholding not the sum of (a particular subset of) the coefficients, but their average.
As we shall see, this k-dependent scaling will allow us to flexibly trade off the expected L 2 error and the computational complexity of the resulting estimator.
Now we describe our density estimator. Given the sequence $\lambda = \{\lambda_{k,n}\}_{k=1}^d$, define the set
$$A(\lambda) \triangleq \big\{ s \in B^d : \hat W_{\pi_k(s)} \ge \lambda_{k,n} \text{ for all } 1 \le k \le d \big\} \tag{17}$$
and consider the density estimate
$$\hat f_{\mathrm{RWT}} \triangleq \sum_{s \in A(\lambda)} \hat\theta_s \varphi_s, \tag{18}$$
where RWT stands for "recursive Walsh thresholding." To implement this estimator on a computer, we call the routine RecursiveWalsh, shown as Algorithm 1, with $u = \varnothing$ (the empty string, corresponding to the root of the tree) and with the desired threshold sequence $\lambda$. The factors of $1/2$ in the updates for $\hat W_{u0}$ and $\hat W_{u1}$ arise because of the factorization property (3) of the Walsh basis functions: for any $k \ge 0$ and any $s, x, \tilde x \in B^{k+1}$ we have
$$\varphi_s(x)\varphi_s(\tilde x) = \frac{1}{2}\,(-1)^{s^{(k+1)}\left(x^{(k+1)} + \tilde x^{(k+1)}\right)}\, \varphi_{\pi_k(s)}\big(\pi_k(x)\big)\, \varphi_{\pi_k(s)}\big(\pi_k(\tilde x)\big).$$
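For illustration, here is a simplified Python sketch of the recursive descent. It departs from Algorithm 1 in that it recomputes $\hat W_u$ from scratch at each node (grouping samples by suffix rather than propagating the factor-$1/2$ updates), and the constant threshold $\lambda_{k,n} = 2/(Mn)$ and the toy product density are illustrative choices of ours:

```python
import math
import random

random.seed(2)
d, n = 6, 2000
M = 2 ** d

def walsh(s, x):
    """phi_s(x) = 2^{-k/2} * (-1)^{<s,x>}."""
    return (-1) ** sum(a * b for a, b in zip(s, x)) / math.sqrt(2 ** len(s))

# Sparse target density: independent bits, the first two heavily biased.
q = [0.95, 0.95, 0.5, 0.5, 0.5, 0.5]
X = [tuple(int(random.random() < qj) for qj in q) for _ in range(n)]

def W_hat(u):
    """Collision estimate of W_u; grouping samples by suffix makes it O(n) per node."""
    k = len(u)
    groups = {}
    for x in X:
        groups.setdefault(x[k:], []).append(walsh(u, x[:k]))
    return sum(sum(g) ** 2 for g in groups.values()) / (n * n)

lam = 2.0 / (M * n)   # constant threshold schedule (illustrative)
kept = {}

def recursive_walsh(u):
    if W_hat(u) < lam:
        return                      # prune u and its entire subtree
    if len(u) == d:                 # leaf: retain the empirical coefficient
        kept[u] = sum(walsh(u, x) for x in X) / n
        return
    recursive_walsh(u + (0,))
    recursive_walsh(u + (1,))

recursive_walsh(())                 # start at the root (the empty string)
print(len(kept), "coefficients retained out of", M)
```

The retained dictionary defines the estimate $\hat f_{\mathrm{RWT}}(x) = \sum_{s \in \text{kept}} \hat\theta_s \varphi_s(x)$.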
Our first main result is that, with appropriately tuned thresholds, the estimator (18) adapts to unknown sparsity of the Fourier-Walsh representation of $f$:

Theorem 1. There exist absolute constants $C_1, C_2 > 0$, such that the following holds. Suppose the threshold sequence $\lambda = \{\lambda_{k,n}\}_{k=1}^d$ is chosen in such a way that
$$\lambda_{k,n} = \frac{\alpha_k}{n} \quad \text{with} \quad \alpha_k \ge \frac{C_1 d}{M} \quad \text{for all } k. \tag{19}$$
Then for all $0 < p \le 1$,
$$\mathbb{E}\big\|\hat f_{\mathrm{RWT}} - f\big\|^2_{L_2(\mu_d)} \le C M^{-1} \left(\frac{d}{n}\right)^{2r/(2r+1)} \tag{20}$$
for all $n \le 2rM^{2r+1}$, where, as before, $r = 1/p - 1/2$. If, instead, $\alpha_k \ge C_2 d \log n / M$ for all $k$, then for all $0 < p \le 1$,
$$\mathbb{E}\big\|\hat f_{\mathrm{RWT}} - f\big\|^2_{L_2(\mu_d)} \le C M^{-1} \left(\frac{d \log n}{n}\right)^{2r/(2r+1)} \tag{21}$$
for all $n$.
Remark 1. Positivity and normalization issues. As is the case with orthogonal series estimators, $\hat f_{\mathrm{RWT}}$ may not necessarily be a bona fide density. In particular, there may be some $x \in B^d$ such that $\hat f_{\mathrm{RWT}}(x) < 0$, and it may happen that $\int \hat f_{\mathrm{RWT}} \, d\mu_d \ne 1$. In principle, this can be handled by clipping the negative values at zero and renormalizing; this procedure can only improve the expected $L_2$ error. In practice this may be computationally expensive when $d$ is very large. If the estimate is suitably sparse, however, the renormalization can be carried out approximately using Monte Carlo estimates of the appropriate sums. Moreover, in many applications the scaling of the density is not important.
Remark 2. Logarithmic factors in the risk bound. For each $0 < p \le 1$, the bound (20) will hold for all $n \le 2rM^{2r+1}$. It follows from Theorem 2 below that $\hat f_{\mathrm{RWT}}$ with thresholds $\lambda_1 = \ldots = \lambda_d \sim d/(Mn)$ is minimax-optimal for each $0 < p \le 1$, assuming that the sample size $n$ satisfies $d \equiv \log_2 M \ll n \ll M^{2/p}$. For very small sample sizes $n \ll \log M$ and for very large sample sizes $n \ge M^{2/p}$, $\hat f_{\mathrm{RWT}}$ will be suboptimal. With a more conservative choice of thresholds, $\lambda_1 = \ldots = \lambda_d \sim d \log n/(Mn)$, the bound (21) will hold for all values of $n$. In particular, in this case the minimax rate will be attained, up to logarithmic factors, for all values of $p \in (0,1)$ simultaneously in the moderate sample regime $\log M \ll n \ll M^{2/p}$.

Our second main result is a lower bound on the minimax $L_2$ risk attainable by any estimator over $\mathcal{F}^{+,1}_d(p)$. It shows, in particular, that our recursive estimator $\hat f_{\mathrm{RWT}}$ is minimax for "moderate" sample sizes $\log M \ll n \ll M^{2/p}$. For large sample sizes, $n \gg M^{2/p}$, $\hat f_{\mathrm{RWT}}$ is no longer optimal: in particular, it is outperformed by both the simple histogram estimator and the unthresholded orthogonal series estimator, both of which attain the optimal $O(1/n)$ risk. The precise statement is as follows:

Theorem 2. Consider the problem of estimating an unknown $f \in \mathcal{F}^{+,1}_d(p)$ from $n$ i.i.d. samples $X_1, \ldots, X_n$. Then the following statements hold:

1. Suppose that $C_0 \log M \le n \le M^{2/p}$ for a sufficiently large absolute constant $C_0 > 0$. Then there exists a constant $C > 0$, such that
$$\inf_{\hat f_n} \sup_{f \in \mathcal{F}^{+,1}_d(p)} \mathbb{E}\big\|\hat f_n - f\big\|^2_{L_2(\mu_d)} \ge C M^{-1}\left(\frac{\log M}{n}\right)^{2r/(2r+1)}, \tag{22}$$
where, as before, $M = 2^d$.

2. Suppose that $n \ge M^{2/p}$ and $M \ge 4$. Then there exists an absolute constant $C > 0$, such that
$$\inf_{\hat f_n} \sup_{f \in \mathcal{F}^{+,1}_d(p)} \mathbb{E}\big\|\hat f_n - f\big\|^2_{L_2(\mu_d)} \ge \frac{C}{n}. \tag{23}$$

Our third, and final, main result bounds the running time of the algorithm used for computing $\hat f_{\mathrm{RWT}}$:

Theorem 3. Given any $\delta \in (0,1)$, provided each $\alpha_k$ is chosen so that
$$2^{k+1} \exp\left(-\frac{2^k n\, a_{k,n}^2}{C_1}\right) \le \frac{\delta}{d}, \tag{24}$$
where
$$a_{k,n} \triangleq \frac{1}{5}\sqrt{\frac{\alpha_k}{n}} - C_2\sqrt{\frac{2}{2^k n}}$$
and $C_1, C_2 > 0$ are certain absolute constants, Algorithm 1 runs in time
$$O\left(d\, n^{2+p/2}\, M^{-p/2}\, K(\alpha, p)\right) \tag{25}$$
with probability at least $1 - \delta$, where $K(\alpha, p) \triangleq \sum_{k=1}^d \alpha_k^{-p/2}$ and, as before, $M = 2^d$.
Remark 3. Trade-off between time complexity and MSE. By controlling the rate at which the sequence $\alpha_k$ decays with $k$, we can trade off MSE against complexity. Consider the following two extreme cases: (1) $\alpha_1 = \ldots = \alpha_d \sim 1/M$ and (2) $\alpha_k \sim 2^{d-k}/M$. The first case, which reduces to term-by-term thresholding, achieves the same bias-variance trade-off as the Ott-Kronmal estimator [28]. However, it has $K(\alpha, p) = O(M^{p/2} d)$, resulting in $O(d^2 n^{2+p/2})$ complexity. The second case, which leads to a very severe estimator that will tend to reject a lot of coefficients, has MSE of $O(n^{-2r/(2r+1)} M^{-1/(2r+1)})$, but $K(\alpha, p) = O(M^{p/2})$, leading to a considerably better $O(d n^{2+p/2})$ complexity. From the computational viewpoint, it is preferable to use rapidly decaying thresholds. However, this reduction in complexity will be offset by a corresponding increase in MSE. In fact, using RWT with exponentially decaying $\alpha_k$'s in practice is not advisable, as its low complexity is mainly due to the fact that it will tend to reject even the big coefficients very early on, especially when $d$ is large. To achieve a good balance between complexity and MSE, a moderately decaying threshold sequence might be best, e.g., $\alpha_k \sim (d-k+1)^m/M$ for some $m \ge 1$. As $p \to 0$, the effect of $\lambda$ on complexity becomes negligible, and the complexity tends to $O(n^2 d)$.
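The quantity $K(\alpha, p)$ for the two extreme schedules can be compared in a few lines (a toy computation of ours):

```python
d, p = 20, 0.5
M = 2 ** d

# Schedule 1 (term-by-term): alpha_k = 1/M  ->  K = d * M^{p/2}.
K1 = sum((1.0 / M) ** (-p / 2) for k in range(1, d + 1))

# Schedule 2 (aggressive): alpha_k = 2^{d-k}/M  ->  K <= M^{p/2} / (1 - 2^{-p/2}),
# a geometric series, so the factor of d disappears.
K2 = sum((2.0 ** (d - k) / M) ** (-p / 2) for k in range(1, d + 1))

assert abs(K1 - d * M ** (p / 2)) < 1e-6 * K1
assert M ** (p / 2) <= K2 < M ** (p / 2) / (1 - 2 ** (-p / 2))
print("K1 / K2 =", K1 / K2)  # schedule 2 is smaller by a factor of order d
```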
Remark 4. Incoherence. Note that for any of the above choices of $\alpha_k$, the proposed method requires a number of operations that is polynomial in $n$ and $d = \log_2 M$, rather than in $M$. One intuitive explanation for why such fast computation is possible is that the Walsh basis is "incoherent" (to use a term common in the compressed sensing and group testing literature) with the canonical basis of $L_2(\mu_d)$. Similar sublinear-in-$M$ computation results were achieved by Gilbert et al. in the context of fast sparse Fourier approximation [14,15] and group testing [16]. Their strategies also had connections to the Goldreich-Levin algorithm [17], as well as to the work of Kushilevitz and Mansour on sparse Boolean function estimation [22,25].

Proofs of the theorems
In this section we prove our three main results. However, before proceeding to the proofs, we collect all the technical tools that we will be using: moment bounds, concentration inequalities, and an approximation-theoretic lemma pertaining to the class $\mathcal{F}^{+,1}_d(p)$.

Moment bound
We will need the following result of Rosenthal [30]. Let $U_1, \ldots, U_n$ be i.i.d. random variables with $\mathbb{E}U_i = 0$ and $\mathbb{E}U_i^2 \le \sigma^2$. Then for any $m \ge 2$ there exists some $c_m$ such that
$$\mathbb{E}\left|\sum_{i=1}^n U_i\right|^m \le c_m \left( (n\sigma^2)^{m/2} + n\, \mathbb{E}|U_1|^m \right). \tag{26}$$

Concentration bounds
We will need the well-known Hoeffding inequality: if $U_1, \ldots, U_n$ are i.i.d. bounded random variables such that $\mathbb{E}U_i = 0$ and $|U_i| \le b < \infty$ for all $1 \le i \le n$, then for every $t > 0$,
$$\mathbb{P}\left( \left|\frac{1}{n}\sum_{i=1}^n U_i\right| \ge t \right) \le 2\exp\left(-\frac{nt^2}{2b^2}\right). \tag{27}$$
The following result is due to Talagrand [35]. Let $U_1, \ldots, U_n$ be i.i.d. random variables, let $\varepsilon_1, \ldots, \varepsilon_n$ be independent Rademacher random variables [i.e., $\mathbb{P}(\varepsilon_i = -1) = \mathbb{P}(\varepsilon_i = 1) = 1/2$] also independent of $U_1, \ldots, U_n$, and let $\mathcal{F}$ be a class of functions uniformly bounded by $L > 0$. Then if there exist some $v, H > 0$ such that $\sup_{g \in \mathcal{F}} \operatorname{Var} g(U) \le v$ and
$$\mathbb{E} \sup_{g \in \mathcal{F}} \left|\sum_{i=1}^n \varepsilon_i g(U_i)\right| \le nH \tag{28}$$
for all $n$, then there are universal constants $C_1$ and $C_2$ such that, for every $\tau > 0$,
$$\mathbb{P}\left( \sup_{g \in \mathcal{F}} \nu_n(g) \ge C_1\left[ H + \sqrt{\frac{v\tau}{n}} + \frac{L\tau}{n} \right] \right) \le C_2 e^{-\tau}, \tag{29}$$
where
$$\nu_n(g) \triangleq \frac{1}{n}\sum_{i=1}^n \big[ g(U_i) - \mathbb{E}g(U_i) \big] \tag{30}$$
is the empirical process indexed by $\mathcal{F}$.

Remark 5.
Typically, some additional regularity conditions on $\mathcal{F}$ are needed to ensure measurability of the supremum $\sup_{g \in \mathcal{F}} \nu_n(g)$ of the empirical process (30). However, when $U$ takes values in a finite set, as is the case in this paper, there is no need for such conditions, because any uniformly bounded class of real-valued functions on a finite set is separable: it contains a countable subset $\mathcal{F}_0$, such that for any $g \in \mathcal{F}$ there exists a sequence $g_1, g_2, \ldots \in \mathcal{F}_0$ converging to $g$ pointwise. Such a separability property ensures measurability of suprema over $\mathcal{F}$ [37, p. 110].

Large separated subsets of $\mathcal{F}^{+,1}_d(p)$
In the sequel, we will be interested in large subsets of the class of densities $\mathcal{F}^{+,1}_d(p) \subset \mathcal{F}_d(p)$, whose elements are separated from one another by a given fixed amount, as measured by the norm $\|\cdot\|_{L_2(\mu_d)}$. The following lemma, whose proof is given in Appendix A, will be useful:

Lemma 3. For $1 \le k \le M - 1$ and $a > 0$, let $\mathcal{F}(k, a)$ denote the family of functions of the form
$$f_\omega = \frac{1}{\sqrt{M}}\,\varphi_{0\cdots0} + a \sum_{j=1}^k \omega_j \varphi_{s_j}, \qquad \omega \in \Omega,$$
where $s_1, \ldots, s_k$ are distinct nonzero elements of $B^d$ and $\Omega \subseteq \{0,1\}^k$ is a maximal subset whose elements are pairwise separated by Hamming distance at least $k/4$. Suppose that $k$ and $a$ are such that
$$a \le \frac{1}{\sqrt{M}}\, k^{-1/p}. \tag{31}$$
Then the following statements hold:
1. The set $\mathcal{F}(k, a)$ is contained in $\mathcal{F}^{+,1}_d(p)$.
2. $\log|\mathcal{F}(k, a)| \ge k/8$, and any two distinct elements $f_\omega, f_{\omega'} \in \mathcal{F}(k, a)$ satisfy $\|f_\omega - f_{\omega'}\|^2_{L_2(\mu_d)} \ge k a^2/4$.

Proof of Theorem 1
Let us decompose the squared $L_2$ error as
$$\big\|\hat f_{\mathrm{RWT}} - f\big\|^2_{L_2(\mu_d)} = \underbrace{\sum_{s \in A(\lambda)} (\hat\theta_s - \theta_s)^2}_{T_1} + \underbrace{\sum_{s \in A(\lambda)^c} \theta_s^2}_{T_2}.$$
We start by observing that any $s \in A(\lambda)$ necessarily satisfies $\hat W_s \equiv \hat\theta_s^2 \ge \lambda_{d,n}$, while for any $s \in A(\lambda)^c$ there exists some $1 \le k \le d$ such that $\hat W_{\pi_k(s)} < \lambda_{k,n}$, which implies, in turn, that $\hat\theta^2_{\pi_k(s)t} < \lambda_{k,n}$ for all $t \in B^{d-k}$ and, in particular, that $\hat\theta_s^2 < \lambda_{k,n} \le \lambda_{1,n}$. Therefore, defining the sets
$$A_1 \triangleq \{s \in B^d : \hat\theta_s^2 \ge \lambda_{d,n}\}, \qquad A_2 \triangleq \{s \in B^d : \hat\theta_s^2 < \lambda_{1,n}\},$$
we can bound $T_1$ and $T_2$ as
$$T_1 \le \sum_{s \in A_1} (\hat\theta_s - \theta_s)^2, \qquad T_2 \le \sum_{s \in A_2} \theta_s^2.$$
Further, defining $B \triangleq \{s \in B^d : \theta_s^2 < \lambda_{d,n}/2\}$ and $S \triangleq \{s \in B^d : \theta_s^2 \ge 3\lambda_{1,n}/2\}$, we can write
$$T_1 \le \underbrace{\sum_{s \in A_1 \cap B} (\hat\theta_s - \theta_s)^2}_{T_{11}} + \underbrace{\sum_{s \in B^c} (\hat\theta_s - \theta_s)^2}_{T_{12}}$$
and
$$T_2 \le \underbrace{\sum_{s \in A_2 \cap S} \theta_s^2}_{T_{21}} + \underbrace{\sum_{s \in S^c} \theta_s^2}_{T_{22}}.$$
Applying (6) and (11), we get
$$\mathbb{E}T_{12} \le \frac{|B^c|}{Mn} \le \frac{M^{-p/2}(\lambda_{d,n}/2)^{-p/2}}{Mn}. \tag{33}$$
To bound $T_{22}$ we apply (8):
$$T_{22} = \sum_{s : \theta_s^2 < 3\lambda_{1,n}/2} \theta_s^2 \le C M^{-1/(2r+1)} \lambda_{1,n}^{2r/(2r+1)}. \tag{34}$$
In order to deal with the large-deviation terms $T_{11}$ and $T_{21}$, we will need the moment and concentration bounds listed in Section 4.1. First, using Cauchy-Schwarz, we get
$$\mathbb{E}T_{11} = \sum_{s \in B} \mathbb{E}\Big[(\hat\theta_s - \theta_s)^2\, I_{\{\hat\theta_s^2 \ge \lambda_{d,n}\}}\Big] \le \sum_{s \in B} \sqrt{\mathbb{E}(\hat\theta_s - \theta_s)^4}\, \sqrt{\mathbb{P}\big(\hat\theta_s^2 \ge \lambda_{d,n}\big)}. \tag{35}$$
To estimate the fourth moment in (35), we apply the bound (26) to $U_i = \varphi_s(X_i) - \theta_s$, $1 \le i \le n$, and $m = 4$. Then
$$\mathbb{E}(\hat\theta_s - \theta_s)^4 \le \frac{c_4}{n^4}\left( (n\sigma^2)^2 + n\,\mathbb{E}|U_1|^4 \right) \le \frac{C}{M^2 n^2},$$
since $\sigma^2 \le 1/M$ and $|U_i| \le 2/\sqrt{M}$. To handle the probability of $A_1 \cap B$, we first estimate, for $s \in B$ with $\hat\theta_s^2 \ge \lambda_{d,n}$,
$$|\hat\theta_s| - |\theta_s| \ge \sqrt{\lambda_{d,n}} - \sqrt{\lambda_{d,n}/2} = \big(1 - 1/\sqrt{2}\big)\sqrt{\lambda_{d,n}}.$$
From this we conclude that $\hat\theta_s^2 \ge \lambda_{d,n}$ and $\theta_s^2 < \lambda_{d,n}/2$ together imply $|\hat\theta_s - \theta_s| \ge \frac{1}{5}\sqrt{\lambda_{d,n}} = \frac{1}{5}\sqrt{\alpha_d/n}$ (the factor of $1/5$ is simply a lower bound on $1 - 1/\sqrt{2}$). Therefore,
$$\mathbb{P}\big(\hat\theta_s^2 \ge \lambda_{d,n}\big) \le \mathbb{P}\Big(|\hat\theta_s - \theta_s| \ge \tfrac{1}{5}\sqrt{\alpha_d/n}\Big), \qquad s \in B.$$
Applying Hoeffding's inequality (27) to $U_i = \varphi_s(X_i) - \theta_s$, $1 \le i \le n$, with $b = 2/\sqrt{M}$ and using the fact that $\alpha_d \ge C_1 d/M$, we get
$$\mathbb{P}\Big(|\hat\theta_s - \theta_s| \ge \tfrac{1}{5}\sqrt{\alpha_d/n}\Big) \le 2\exp\left(-\frac{\alpha_d M}{200}\right) \le 2e^{-Cd}$$
for some absolute constant $C > 0$. If $C_1$ is chosen so that $C \ge 2$, then we will have
$$\mathbb{E}T_{11} \le \frac{C}{Mn}. \tag{36}$$
Finally,
$$\mathbb{E}T_{21} = \sum_{s \in S} \theta_s^2\, \mathbb{P}(s \in A_2) \le \sum_{s \in S} \theta_s^2\, \mathbb{P}\Big(|\hat\theta_s - \theta_s| \ge \tfrac{1}{5}\sqrt{\alpha_1/n}\Big).$$
Using the same argument as above, we can write
$$\mathbb{P}\Big(|\hat\theta_s - \theta_s| \ge \tfrac{1}{5}\sqrt{\alpha_1/n}\Big) \le 2\exp\left(-\frac{\alpha_1 M}{200}\right) \le 2e^{-Cd}$$
(with the same choice of the constant $C_1$ as before). Then
$$\mathbb{E}T_{21} \le 2e^{-Cd} \sum_{s \in B^d} \theta_s^2 \le C' e^{-Cd} M^{-1},$$
where the second inequality uses the fact that $f \in \mathcal{F}_d(p)$, which gives $\sum_s \theta_s^2 \le C' M^{-1}$. Since $n \le 2rM^{2r+1}$, we have $M^{-(2r+1)}/2r \le n^{-1}$. Consequently, we will have
$$\mathbb{E}T_{21} \le \frac{C}{Mn}. \tag{37}$$
Putting together Eqs. (33), (34), (36), and (37), we get (20). The second bound (21) is proved along the same lines, except that the extra $\log n$ factor in the thresholds will give $\mathbb{E}T_{11} \le C/(Mn)$ and $\mathbb{E}T_{21} \le C/(Mn)$.

Proof of Theorem 2
The proof of the first part uses a popular information-theoretic technique due to Yang and Barron [39,40]; we only outline the main steps. The first step is to lower-bound the minimax risk by the minimum probability of error in a multiple hypothesis test. Let $\mathcal{F}_0$ be an arbitrary subset of $\mathcal{F}^{+,1}_d(p)$. Then
$$\inf_{\hat f_n} \sup_{f \in \mathcal{F}^{+,1}_d(p)} \mathbb{E}\big\|\hat f_n - f\big\|^2 \ge \inf_{\hat f_n} \sup_{f \in \mathcal{F}_0} \mathbb{E}\big\|\hat f_n - f\big\|^2.$$
In particular, suppose that the set $\mathcal{F}_0$ is finite, $\mathcal{F}_0 = \{f^{(1)}, \ldots, f^{(N)}\}$, and $\delta$-separated in $L_2(\mu_d)$, i.e.,
$$\big\|f^{(i)} - f^{(j)}\big\|_{L_2(\mu_d)} \ge \delta \quad \text{for all } i \ne j.$$
Then a standard argument [40] gives
$$\inf_{\hat f_n} \sup_{f \in \mathcal{F}_0} \mathbb{E}\big\|\hat f_n - f\big\|^2 \ge \frac{\delta^2}{4}\, \min_{\tilde f} \mathbb{P}\big(\tilde f(X^n) \ne f^{(Z)}\big), \tag{39}$$
where the random variable $Z$ is uniformly distributed over the set $\{1, \ldots, N\}$, and the minimum is over all estimators $\tilde f$ based on $X^n$ that take values in the packing set $\{f^{(1)}, \ldots, f^{(N)}\}$. Applying Fano's inequality [8], we can write
$$\min_{\tilde f} \mathbb{P}\big(\tilde f(X^n) \ne f^{(Z)}\big) \ge 1 - \frac{I(Z; X^n) + \log 2}{\log N}, \tag{40}$$
where $I(Z; X^n)$ is the Shannon mutual information [8] between the random index $Z \in \{1, \ldots, N\}$ and the observations $X_1, \ldots, X_n$ i.i.d. $\sim f^{(Z)}$. In this particular case, we have
$$I(Z; X^n) \le \frac{1}{N} \sum_{\ell=1}^N D\big( P^n_{f^{(\ell)}} \,\big\|\, P^n_{\bar f} \big), \tag{41}$$
where $\bar f$ denotes the mixture density $N^{-1}\sum_{\ell=1}^N f^{(\ell)}$. The next step consists in upper-bounding this mutual information. To that end, suppose that there exists some $\Delta > 0$, such that
$$D\big( f^{(\ell)} \,\big\|\, \bar f \big) \le \Delta \quad \text{for all } \ell. \tag{42}$$
Using convexity of the relative entropy and (42), for every $k \in \{1, \ldots, N\}$ we have
$$D\big( P^n_{f^{(k)}} \,\big\|\, P^n_{\bar f} \big) = n\, D\big( f^{(k)} \,\big\|\, \bar f \big) \le n\Delta.$$
Substituting this into (41), we see that $I(Z; X^n) \le n\Delta$. Combining this bound with (39) and (40), and taking $\mathcal{F}_0 = \mathcal{F}(k, a)$, we get
$$\inf_{\hat f_n} \sup_{f \in \mathcal{F}^{+,1}_d(p)} \mathbb{E}\big\|\hat f_n - f\big\|^2 \ge \frac{\delta^2(k,a)}{4}\left( 1 - \frac{n\Delta + \log 2}{\log|\mathcal{F}(k,a)|} \right), \tag{43}$$
where $\delta(k, a)$ is the minimal $L_2(\mu_d)$-separation between any two distinct elements of $\mathcal{F}(k, a)$. In particular, if
$$n\Delta + \log 2 \le \frac{1}{2}\log|\mathcal{F}(k,a)|, \tag{44}$$
then the right-hand side of (43) is at least $\delta^2(k,a)/8$.
We can now consider the following cases:

1. Suppose that $M^{2/p} \le n$. Then we take $k = M - 1$ and $a^2 = C/(Mn)$. Because in this case
$$a^2 = \frac{C}{Mn} \le \frac{1}{M}(M-1)^{-2/p}$$
and $\log|\mathcal{F}(M-1, a)| \ge (M-1)/8$, the conditions (31) and (44) will be satisfied for a suitable choice of $C$. Moreover, by Lemma 3, we have
$$\delta^2(M-1, a) \ge \frac{(M-1)a^2}{4} \ge \frac{C'}{n}.$$
Substituting this into (43), we obtain (23).

2. Suppose that $C_0 \log M \le n \le M^{2/p}$. Then we take
$$k = \left\lfloor C''\left(\frac{n}{\log M}\right)^{p/2} \right\rfloor \quad \text{and} \quad a^2 = C'\,\frac{\log M}{Mn}.$$
If we choose $C'$ and $C''$ in such a way that $(2C')^{1/p} C'' \le 1/2$, then (31) will be satisfied. Moreover, we must have
$$k \ge 1. \tag{46}$$
With our assumptions on $n$ and $M$, this will hold for all sufficiently large $M$. Next, we check that (44) is satisfied. Assuming that (46) holds, Lemma 3 implies that
$$\log|\mathcal{F}(k, a)| \ge \frac{k}{8}.$$
Again, using our assumption on $n$ and $M$, as well as the fact that $a^2 = C' \log M/(Mn)$, we can guarantee that (44) holds, with an appropriate choice of $C' = C'(p)$. By Lemma 3, we will have
$$\delta^2(k, a) \ge \frac{k a^2}{4} \ge C M^{-1}\left(\frac{\log M}{n}\right)^{2r/(2r+1)},$$
and we obtain (22).

Proof of Theorem 3
The time complexity of the algorithm is determined by the number of recursive calls made to RecursiveWalsh. Recall that, for each $1 \le k \le d$, a recursive call to RecursiveWalsh is made for every $u \in B^k$ for which $\hat W_u \ge \lambda_{k,n}$. Let us say that a recursive call to RecursiveWalsh($u, \lambda$) is correct if $W_u \ge \lambda_{k,n}/2$. We will show that, with high probability, only the correct recursive calls are made at every $1 \le k \le d$. The probability of making at least one incorrect recursive call is bounded by
$$\sum_{k=1}^d \sum_{u \in B^k} \mathbb{P}\big( \hat W_u \ge \lambda_{k,n},\ W_u < \lambda_{k,n}/2 \big).$$
Now, $\hat W_u \ge \lambda_{k,n}$ and $W_u < \lambda_{k,n}/2$ together imply that $\hat W_u - W_u \ge \lambda_{k,n}/2$; this deviation can be expressed in terms of a supremum of an empirical process over a suitable function class, to which we can then apply Talagrand's bound (29) with $L = 1/\sqrt{2^k}$, $v = 1/2^k$, and $H = 1/\sqrt{2^k n}$. Hence,
$$\mathbb{P}\big( \hat W_u \ge \lambda_{k,n},\ W_u < \lambda_{k,n}/2 \big) \le 2\exp\left( -\frac{2^k n\, a_{k,n}^2}{C_1} \right),$$
where for each $1 \le k \le d$,
$$a_{k,n} \triangleq \frac{1}{5}\sqrt{\frac{\alpha_k}{n}} - C_2\sqrt{\frac{2}{2^k n}}.$$
Here, $C_1$ and $C_2$ are the absolute constants in Talagrand's bound (29). Given $\delta > 0$, if we choose $\alpha_k$ according to (24), then
$$\sum_{u \in B^k} \mathbb{P}\big( \hat W_u \ge \lambda_{k,n},\ W_u < \lambda_{k,n}/2 \big) \le \frac{\delta}{d}.$$
Summing over $1 \le k \le d$, we see that, with probability at least $1 - \delta$, only the correct recursive calls will be made. Next, we give an upper bound on the number of the correct recursive calls. For each $1 \le k \le d$, $W_u \ge \lambda_{k,n}/2$ implies that there exists at least one $v \in B^{d-k}$ such that $\theta_{uv}^2 \ge \lambda_{k,n}/2$. Since for every $1 \le k \le d$ each $\theta_s$ contributes to exactly one $W_u$, we have by the pigeonhole principle that
$$\big|\{ u \in B^k : W_u \ge \lambda_{k,n}/2 \}\big| \le \big|\{ s \in B^d : \theta_s^2 \ge \lambda_{k,n}/2 \}\big| \le C M^{-p/2} \lambda_{k,n}^{-p/2},$$
where in the second inequality we used (6). Hence, the number of the correct recursive calls in Algorithm 1 is bounded by
$$\sum_{k=1}^d C M^{-p/2} \left(\frac{\alpha_k}{n}\right)^{-p/2} = C M^{-p/2} n^{p/2} K(\alpha, p).$$
At each recursive call, we compute an estimate of the corresponding $\hat W_{u0}$ and $\hat W_{u1}$, which requires $O(n^2 d)$ operations. Therefore, with probability at least $1 - \delta$, the time complexity of the algorithm is given by (25).

Simulations
Although an extensive empirical evaluation is outside the scope of this paper, we have implemented the proposed estimator, and now present some simulation results to demonstrate its small-sample performance on synthetic data in the low- and high-dimensional regimes. In the low-dimensional regime, it is feasible to obtain the "ground truth" by exhaustively computing all the 2^d Walsh coefficients and to compare it with our estimate. In the high-dimensional regime, our comparison is based on the density values at randomly generated samples. Additionally, we present a number of computational strategies that greatly enhance computational efficiency in the high-dimensional regime.

Low-dimensional simulations
We generated synthetic observations from a mixture density f on a 15-dimensional binary hypercube. The mixture has 10 components, where each component is a product density with 12 randomly chosen covariates having Bernoulli(1/2) distributions and the other three having Bernoulli(0.9) distributions. For d = 15, it is still feasible to quickly compute the ground truth, consisting of 32768 values of f and its Walsh coefficients. These values are shown in Fig. 1 (left). As can be seen from the coefficient profile in the bottom of the figure, this density is clearly sparse. Fig. 1 also shows the estimated probabilities and the Walsh coefficients for sample sizes n = 5000 (middle) and n = 10000 (right).
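A mixture of this kind can be generated, for instance, as follows; equal component weights, the seed handling, and the function name are assumptions of this sketch.

```python
import numpy as np

def sample_bernoulli_mixture(n, d=15, n_components=10, n_biased=3,
                             p_biased=0.9, seed=0):
    """Draw n samples from an equally weighted mixture of product
    densities: in each component, n_biased randomly chosen covariates are
    Bernoulli(p_biased) and the remaining d - n_biased are Bernoulli(1/2).
    (Equal mixture weights are an assumption of this sketch.)"""
    rng = np.random.default_rng(seed)
    probs = np.full((n_components, d), 0.5)
    for c in range(n_components):
        biased = rng.choice(d, size=n_biased, replace=False)
        probs[c, biased] = p_biased
    labels = rng.integers(n_components, size=n)   # one component per sample
    return (rng.random((n, d)) < probs[labels]).astype(int)
```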
To study the trade-off between MSE and complexity, we implemented three different thresholding schemes: (1) constant, λ_{k,n} = 2/(2^d n); (2) logarithmic, λ_{k,n} = 2 log(d − k + 2)/(2^d n); and (3) linear, λ_{k,n} = 2(d − k + 1)/(2^d n). The thresholds at k = d are set to twice the variance of the empirical estimate of any coefficient whose value is zero; this forces the estimator to reject empirical coefficients whose values cannot be reliably distinguished from zero. Occasionally, spurious coefficients get retained, as can be seen in Fig. 1 (middle) for the estimate at n = 5000. Fig. 2 shows the performance of f̂_RWT. Fig. 2(a) is a plot of MSE vs. sample size. In agreement with the theory, the MSE is smallest for the constant thresholding scheme [which is simply an efficient recursive implementation of a term-by-term thresholding estimator with λ_n ∼ 1/(Mn)], and it increases for the logarithmic and the linear schemes. Fig. 2(b,c) shows the running time (in seconds) and the number of recursive calls made to RecursiveWalsh vs. sample size. The number of recursive calls is a platform-independent way of gauging the computational complexity of the algorithm, although it should be kept in mind that each recursive call has O(n²d) overhead. Also, the number of recursive calls depends on whether a binary or an N-ary tree is utilized. The N-ary tree scheme is explained in detail below, in Section 5.2. We have used N = 256 in the simulations, as this setting leads to much reduced computation times vs. a binary tree.
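The three threshold schedules can be written down directly. The sketch below implements the formulas as stated, without the variance-based adjustment at k = d described above; the function name and signature are assumptions.

```python
import numpy as np

def thresholds(d, n, scheme="constant"):
    """Level-dependent thresholds lambda_{k,n} for levels k = 1, ..., d,
    with M = 2^d, following the three schedules in the text."""
    M = 2 ** d
    k = np.arange(1, d + 1)
    if scheme == "constant":
        return np.full(d, 2.0 / (M * n))
    if scheme == "logarithmic":
        return 2.0 * np.log(d - k + 2) / (M * n)
    if scheme == "linear":
        return 2.0 * (d - k + 1) / (M * n)
    raise ValueError(f"unknown scheme: {scheme}")
```

Note that the logarithmic and linear schedules decrease toward the leaves and coincide with the constant schedule in order of magnitude at k = d, which is what drives the complexity savings at a modest cost in MSE.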
The running time increases polynomially with n, and is the largest for the constant scheme, followed by the logarithmic and the linear schemes. We see that, while the MSE of the logarithmic scheme is fairly close to that of the constant scheme, its complexity is considerably lower, in terms of both the number of recursive calls and the running time. In all three cases, the number of recursive calls decreases with n due to the fact that weight estimates become increasingly accurate with n, which causes the expected number of false discoveries (i.e., making a recursive call at an internal node of the tree only to reject its descendants later) to decrease. Finally, Fig. 2(d) shows the number of coefficients retained in the estimate. This number grows with n as a consequence of the fact that the threshold decreases with n, while the number of accurately estimated coefficients increases.
Additionally, we have performed comparisons between our proposed method and two alternatives: the Ott and Kronmal thresholding estimator [28]; and an exhaustive search over all possible thresholds for the best MSE. As seen in Fig. 3, our thresholding estimator provides close to the best possible MSE with far lower computational cost than the alternatives.

High-dimensional simulations
Although our algorithm has been implemented in MATLAB, and could therefore be optimized much further for speed, we have devised several strategies for traversing the coefficient tree efficiently and for circumventing computer-architecture-related challenges in high-dimensional problems. We describe those strategies and demonstrate them using the same experimental set-up as above, but with much larger dimensionality.
• Direct computation of the coefficients near the leaves of the tree. As discussed in Section 2.3, the direct estimator (15) for W_u becomes computationally more efficient than the indirect estimator (14) for k ≥ d − log(nd). Hence, near the bottom of the tree, instead of continuing to traverse the remaining levels based on the weights W_u, we simply compute all the coefficients θ_s at the leaves of the corresponding subtree. This is always the most sensible course of action, given that (15) requires these coefficients anyway.
• N-ary tree. The number of levels to be traversed can be reduced by a factor of log N by considering an N-ary, rather than a binary, tree, at the cost of an increased number (N) of branches per level. We have found this trade-off to be worthwhile in many cases, owing to the possibility of vectorizing the computation of W_u for all the branches in each level and taking advantage of optimized routines for matrix algebra.
• Open-node queue. While our algorithm lends itself to recursive implementation, many computer/operating-system architectures impose a hard limit on the recursion depth due to stack-size restrictions (recursive function calls typically use stack memory). This becomes a problem when the dimension d is high and the tree is accordingly very deep. We have circumvented the issue by implementing a queue system in which the so-called open nodes (the tree branches awaiting processing) are sorted according to some criterion. This amounts to transferring the "stack" to user memory, where the only limit on the number of nodes is the free memory size. Possible criteria for sorting the nodes include depth-first, breadth-first, and highest weight W_u. We have used the latter criterion in our high-dimensional simulations.
• Pruning high-frequency Walsh coefficients. For a given Walsh function ϕ_s, the Hamming weight (i.e., the number of "on" bits) of s is a measure of frequency; higher-frequency coefficients have a higher proportion of ones in s. Because in many problems it is appropriate to assume that the signals of interest have low frequency, we have included the ability to impose a limit on the order of the Walsh coefficients by ignoring any branches with more than m "on" bits. The choice of m depends on the problem context and on computational resources.
• Weight-adaptive thresholding. For some datasets, significant gains can be achieved by varying the thresholds λ_{k,n} in a data-driven manner at each level of the tree: as an alternative to the preset schedules α_k in (16), it is possible to take the weights W_u for each branch at level k and then expand only the top q branches with the highest W_u. This is equivalent to making α_k not only level-dependent, but also dependent on the sequence of weights at level k. The value of q controls the trade-off between computation speed and accuracy.
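The open-node queue with the highest-weight criterion, combined with the high-frequency pruning limit m, can be sketched as follows. The `children` callback, the function name, and the flat threshold are hypothetical; they stand in for the level-dependent machinery of the actual algorithm.

```python
import heapq

def best_first_traversal(root_weight, children, threshold, max_on_bits=None):
    """Process tree nodes from a max-heap keyed by the weight W_u instead
    of recursing, so depth is limited only by memory, not by the call
    stack.  `children(u)` is a hypothetical callback returning
    (child, weight) pairs; nodes u are tuples of bits.  Branches below
    the threshold, or with more than max_on_bits "on" bits, are pruned."""
    kept = []
    heap = [(-root_weight, ())]          # negate weights: heapq is a min-heap
    while heap:
        neg_w, u = heapq.heappop(heap)
        if -neg_w < threshold:
            continue                     # weight too small: reject branch
        if max_on_bits is not None and sum(u) > max_on_bits:
            continue                     # high-frequency pruning
        kids = children(u)
        if not kids:
            kept.append(u)               # leaf: retain this coefficient
        for child, w in kids:
            heapq.heappush(heap, (-w, child))
    return kept
```

Because the heap always pops the open node with the largest weight, this realizes the "highest weight W_u" ordering used in our high-dimensional simulations.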
The first three strategies are important modifications to a naive implementation of our algorithm, but in no way impact the MSE. The latter two techniques, however, provide an approximation to the estimator proposed and analyzed in this paper; for appropriate values of m and q, they yield significant computational savings for a modest increase in MSE.
In Figure 4, we present plots of the MSE and computation time for simulated data with d = 50 and multiple sample sizes (n), using the aforementioned optimizations. As before, the data were generated from a Bernoulli mixture density, similar to the one used for Figure 2 but using ten 50-dimensional mixture components, where each component has 47 covariates with a Bernoulli(1/2) distribution and three covariates with a Bernoulli(0.9) distribution. The results are averaged over ten independent runs. We have limited the number of "on" bits in the Walsh binary strings to three and eight (i.e., m = 3 and m = 8 respectively), expanded only the 16 subtrees with highest W u at each level (i.e., q = 16) and used an N -ary tree with N = 256. The subtrees in the open-node queue were sorted by decreasing W u . Even in this high-dimensional regime, we achieve steadily decreasing MSE as a function of n, as well as approximately linear scaling in computation time. It is also apparent that setting m = 3 achieves essentially the same MSE but with an order-of-magnitude reduction in the computational effort.

Summary and conclusion
We have presented a computationally efficient adaptive procedure for estimating a multivariate binary density in the "big d, small n" regime, which essentially forces a "nonparametric" approach. Many problems of current practical interest that involve multivariate binary data seem to pertain to populations with certain "constellation" effects among the d covariates. We have formalized this observation by focusing on a class of densities whose Walsh representations exhibit a certain power-law behavior. For moderate sample sizes, our estimator attains nearly minimax rates of MSE convergence over this class and runs in polynomial time with high probability. Moreover, the complexity improves for sparser densities. We have also reported the results of simulations, which show that our implemented estimator behaves in accordance with the theory even in the small-sample regime. In the future, we plan to test our method on real high-dimensional data sets. Another promising future direction is to investigate the relationship between various smoothness classes of densities on the binary hypercube, defined in terms of their Fourier-Walsh representations, and probability densities of binary Markov random fields [23]. Such a density has the form

f(x) = (1/Z) exp( Σ_{i=1}^d h_i(x^{(i)}, x_{N_i}) ),

where for each i ∈ {1, . . . , d} we have a neighborhood N_i ⊆ {1, . . . , d}\{i}, the corresponding "local energy" function h_i(·) depends only on x^{(i)} and on x_{N_i} = (x^{(j)} : j ∈ N_i), and Z is the normalization constant known as the partition function. It is reasonable to assume that if most of the neighborhoods N_i are small, then most of the Fourier-Walsh coefficients of f will be small as well. Assuming specific bounds on the decay of the Fourier-Walsh coefficients of f amounts to assuming something about the decay of correlations in the Markov random field governed by f.
It remains to be seen whether sparsity classes of the type investigated in this paper can serve as a good model of binary Markov random fields with polynomial decay of correlations, or whether one would need to introduce a binary analog of something like (weak) Besov bodies [3] in order to account for localization both in space (small number of i's with large neighborhoods) and in frequency (small number of large Fourier-Walsh coefficients). In either case, it would be worthwhile to investigate the use of recursive thresholding estimators of the type introduced in this paper to estimate Markov graphical models on the binary hypercube.
From the factorization properties of the Walsh functions, we obtain two expressions for the weights of the two children of u; adding these two expressions proves (A.1). By induction, we obtain the corresponding identity over the leaves, where L(u) denotes the set of all leaves descending from u, and the lemma is proved.
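The two facts underlying the lemma, namely the orthonormality of the Walsh functions and the additive splitting of a node's weight across its two children, can be sanity-checked numerically. The normalization ϕ_s(x) = (−1)^{⟨s,x⟩}/√(2^d) and the definition W_u = Σ θ_s² over the leaves s below u are our assumptions for this sketch.

```python
import itertools
import numpy as np

d = 3
M = 2 ** d
points = np.array(list(itertools.product((0, 1), repeat=d)))

# Walsh functions phi_s(x) = (-1)^{<s,x>} / sqrt(2^d) (assumed
# normalization); rows are indexed by s, columns by x.
Phi = (-1.0) ** (points @ points.T) / np.sqrt(M)

# Orthonormality with respect to the counting measure on {0,1}^d.
assert np.allclose(Phi @ Phi.T, np.eye(M))

# Additivity of the weights: with W_u the sum of theta_s^2 over all
# leaves s descending from the prefix u, we have W_u = W_u0 + W_u1.
rng = np.random.default_rng(1)
theta = {tuple(s): rng.normal() for s in points}

def W(u):
    return sum(t ** 2 for s, t in theta.items() if s[:len(u)] == u)

assert np.isclose(W((0,)), W((0, 0)) + W((0, 1)))
assert np.isclose(W(()), W((0,)) + W((1,)))
```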
• for any u, v ∈ K_m with u ≠ v, the stated separation holds.
There is a one-to-one correspondence between F(M − 1, a) and the binary hypercube. Given the Varshamov-Gilbert set, for any two f, f′ ∈ F(M − 1, a) we have the required lower bound on their distance. Finally, we consider Item 4. To that end, we will need a refinement of the Varshamov-Gilbert bound due to Reynaud-Bouret [29], which for our purposes can be stated as follows (see Lemma 4.10 in [26]): for any m, k ∈ ℕ with m ≥ k, let B^m_k denote the subset of B^m that consists of all u ∈ B^m with Σ_{i=1}^m I{u^{(i)} = 1} = k. If m ≥ 4k, then there exists a set K_{m,k} ⊂ B^m_k with the following properties:
• for any u, v ∈ K_{m,k} with u ≠ v, the stated separation holds;
• log |K_{m,k}| ≥ 0.233 k log(m/k).
Now assume that M − 1 ≥ 4k and consider the corresponding set K_{M−1,k}. To each u ∈ K_{M−1,k} we can associate 2^k elements of F(M − 1, a), say {f_{u,v} : v ∈ B_k}, such that u determines the locations of the k nonzero Walsh-Fourier coefficients taking values in {−a, a}, while the choice of v ∈ B_k determines the signs of these coefficients. Thus, let F(k, a) := {f_{u,v} : u ∈ K_{M−1,k}, v ∈ B_k}.

Appendix B: Empirical process representation
In this appendix, we show that, for each u ∈ B_k, the norm ‖f̂_u − f_u‖_{L²(µ^{d−k})} can be expressed as a supremum of an empirical process over a suitable function class; this was a key element of the proof of Theorem 3. First, we show that ‖f̂_u − f_u‖_{L²(µ^{d−k})} can be expressed as an empirical process of the form (30) indexed by a suitable function class F. To this end, let X_1, . . . , X_n be n i.i.d. copies of X ∼ f and consider

ν_n(g) = (1/n) Σ_{i=1}^n g(X_i) − E g(X),   g ∈ F,

where in the last step of the derivation we used Cauchy-Schwarz. This proves (B.1). Next, we determine the constants L, v and H that are needed to apply Talagrand's bound (29). For any ξ ∈ L²(µ^{d−k}) with unit norm, the corresponding g ∈ F satisfies |g| ≤ L ≡ 2^{−k/2}. From this, we also get the bound Var g ≤ v with v = L² = 2^{−k}. Finally, to bound the Rademacher average (28), we note that F is the unit ball in the RKHS induced by the corresponding kernel. Standard arguments (see, e.g., Section 2.4.2 in [27]) then lead to the bound H = 1/√(2^k n).