Adaptive estimation in the nonparametric random coefficients binary choice model by needlet thresholding

In the random coefficients binary choice model, a binary variable equals 1 iff an index $X^\top\beta$ is positive. The vectors $X$ and $\beta$ are independent and belong to the sphere $\mathbb{S}^{d-1}$ in $\mathbb{R}^{d}$. We prove lower bounds on the minimax risk for estimation of the density $f_{\beta}$ over Besov bodies, where the loss is a power of the $L^p(\mathbb{S}^{d-1})$ norm for $1\le p\le \infty$. We show that a hard thresholding estimator based on a needlet expansion with data-driven thresholds achieves these lower bounds up to logarithmic factors.


Introduction
Discrete choice models (see, e.g., [18]) have applications in many areas, ranging from the planning of public transportation and the economics of industrial organizations to the evaluation of public policies. We consider the binary choice model where individuals choose between two exclusive alternatives that we label $\{1, -1\}$, for example buying a good or not. In a random utility framework, an individual chooses the alternative that yields the highest utility. Assume that the utility that individual $i$ gets from alternative $-1$ has the form $u_{-1,i} = z_{-1,i}^\top \gamma_i + \varepsilon_{-1,i}$, while the utility from alternative $1$ has the form $u_{1,i} = z_{1,i}^\top \gamma_i + \varepsilon_{1,i}$, where $z_{-1,i}$ (resp. $z_{1,i}$) is a vector of $d-1$ characteristics of alternative $-1$ (resp. $1$) for individual $i$, with $d \ge 2$, $\gamma_i$ collects the preferences of individual $i$ for the characteristics, and $\varepsilon_{-1,i}$ and $\varepsilon_{1,i}$ absorb both the usual error terms and constants. In this model, the preferences are allowed to vary across individuals, namely they are heterogeneous. The characteristics of the alternatives are indexed by the individuals; for example, they can be characteristics of two goods that a consumer has to choose between, interacted with individual characteristics such as age or distance. Alternatively, it is common to assume that one observes variation of the product characteristics through multiple markets (e.g., cell phone plans in different countries), in which case a third index for the market is introduced. Individual $i$ strictly prefers $1$ ($y_i = 1$) if and only if the net utility for $1$ is positive, i.e.,
$$u_{1,i} - u_{-1,i} > 0, \tag{1}$$
he is indifferent when $u_{1,i} - u_{-1,i} = 0$, and he strictly prefers $-1$ ($y_i = -1$) when $u_{1,i} - u_{-1,i} < 0$. Denote now
$$x_i = \frac{\left(1, (z_{1,i} - z_{-1,i})^\top\right)^\top}{\left\|\left(1, (z_{1,i} - z_{-1,i})^\top\right)^\top\right\|_{\ell^2}}
\quad\text{and}\quad
\beta_i = \frac{\left(\varepsilon_{1,i} - \varepsilon_{-1,i}, \gamma_i^\top\right)^\top}{\left\|\left(\varepsilon_{1,i} - \varepsilon_{-1,i}, \gamma_i^\top\right)^\top\right\|_{\ell^2}}.$$
The denominator in the definition of $\beta_i$ is nonzero because it is assumed that $u_{1,i} - u_{-1,i} = 0$ never occurs, since there are only two alternatives. We assume that the researcher observes $(y_i, x_i)$ for individuals $i = 1, \dots, n$ and that $(\beta_i^\top, x_i^\top)^\top$ are i.i.d. draws from a continuous distribution. Denoting the population quantities by $Y$, $Z_1$, $Z_{-1}$, $X$, $\varepsilon_1$, $\varepsilon_{-1}$, $\gamma$, and $\beta$, with $X$ and $\beta$ random vectors in the unit sphere $\mathbb{S}^{d-1}$ of $\mathbb{R}^d$, we obtain the model
$$Y = \mathrm{sign}(X^\top \beta), \tag{2}$$
where, for a real number $a$, $\mathrm{sign}(a)$ is $1$ if $a$ is positive, $-1$ if $a$ is negative, and $0$ if $a = 0$. This model extends the Logit, Probit, and Mixed-Logit models. As in [3,4,8,11], among others, we consider a nonparametric specification of the joint distribution of $\beta$.
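In display form, the key reduction is the following one-line identity (a direct consequence of the definitions of $x_i$ and $\beta_i$ above, since dividing by the two positive normalizing norms does not change signs):
$$x_i^\top \beta_i = \frac{(z_{1,i} - z_{-1,i})^\top \gamma_i + \varepsilon_{1,i} - \varepsilon_{-1,i}}{\left\|\left(1, (z_{1,i} - z_{-1,i})^\top\right)^\top\right\|_{\ell^2} \left\|\left(\varepsilon_{1,i} - \varepsilon_{-1,i}, \gamma_i^\top\right)^\top\right\|_{\ell^2}},$$
whose numerator is exactly $u_{1,i} - u_{-1,i}$; hence $\mathrm{sign}(x_i^\top \beta_i) = \mathrm{sign}(u_{1,i} - u_{-1,i}) = y_i$.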
We maintain the following restrictions on the joint distribution of $(\beta^\top, X^\top)^\top$ and relax (A2.3) at the end of Section 5. Under Assumption 1, $f_\beta$ is a solution of the ill-posed inverse problem: for a.e. $x \in H^+$,
$$\mathbb{E}[Y \mid X = x] = \int_{\mathbb{S}^{d-1}} \mathrm{sign}(x^\top y)\, f_\beta(y)\, d\sigma(y) \overset{\mathrm{def}}{=} (K f_\beta)(x). \tag{3}$$
The operator $K$ in (3) is a convolution on $\mathbb{S}^{d-1}$.
The identification problem in this model stems from two facts: (1) the distribution of the observed data only characterizes $Kf_\beta$ on $\mathrm{supp}(f_X)$, which is a proper subset of $\mathbb{S}^{d-1}$, and (2) $K$ has an infinite-dimensional null space. The support of $X$ can only be as large as $H^+$ because the first coordinate of $X$ is positive (i.e., we allow the term $\varepsilon_{1,i} - \varepsilon_{-1,i}$ in (1)). Condition (A2.2) requires that the support of $Z_1 - Z_{-1}$ is $\mathbb{R}^{d-1}$. This is demanding, but we do not restrict the dependence in the vector $(\varepsilon_1 - \varepsilon_{-1}, \gamma^\top)^\top$ or its tails. It is important to avoid restricting the dependence between the coordinates of $(\varepsilon_1 - \varepsilon_{-1}, \gamma^\top)^\top$ since they can be functions of a deep heterogeneity parameter (e.g., the type of a consumer). Tail assumptions on unobservables (see, e.g., [4]) are problematic since they are impossible to test. In [8,11], it is assumed that there exists $n_0$ (unknown) in $\mathbb{S}^{d-1}$ such that $\mathbb{P}(n_0^\top \beta > 0) = 1$. This implies that, for some difference of the characteristics, or taking a limit of these, everyone chooses the same alternative. In contrast, (A2.1) does not restrict the support of $\beta$ and does not imply "unselected samples".
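To see concretely why the null space in point (2) is infinite dimensional, note that $K$ annihilates every even function: the change of variables $y \mapsto -y$, under which $\sigma$ is invariant and $\mathrm{sign}(x^\top y)$ changes sign, gives
$$(Kf^+)(x) = \int_{\mathbb{S}^{d-1}} \mathrm{sign}(x^\top y)\, f^+(y)\, d\sigma(y) = -\int_{\mathbb{S}^{d-1}} \mathrm{sign}(x^\top y)\, f^+(y)\, d\sigma(y) = 0.$$
Hence $Kf_\beta = Kf_\beta^-$, and the even part of $f_\beta$ cannot be recovered from the regression function alone.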
Estimation of $f_\beta$ in (3) is related to statistical deconvolution on $\mathbb{S}^{d-1}$ (see, e.g., [10,13,16]). However, the left-hand side of (3) is not a density but a regression function, where the regressors are random and supported on a subset of $\mathbb{S}^{d-1}$, and the convolution is not injective. [8] provides an estimator that is straightforward to compute, rates of convergence for the $L^p$-losses for $1 \le p \le \infty$, and confidence sets. This paper extends [8] by providing lower bounds on the minimax risk where the degree of integrability in the loss, specified by the statistician, can differ from the degree of integrability of the smoothness ellipsoid containing the unknown $f_\beta$, giving rise to sparse and dense regimes. We show that the estimator in [8] is a plug-in of a linear needlet estimator. Needlets have been successfully used in statistics in [2,14,15]. We show that working in a high-dimensional space and replacing the linear estimator by a nonlinear estimator based on hard thresholding achieves the lower bounds, up to a logarithmic factor. We also contribute to the literature on estimation using needlets by providing data-driven thresholds (similar in spirit to [5] for density estimation using the Dantzig selector). Proofs are given in the appendix.

Preliminaries
We use the notation $x \wedge y$ and $x \vee y$ for the minimum and the maximum of $x$ and $y$. We write $x \lesssim y$ when there exists $c$ such that $x \le cy$, $x \gtrsim y$ when there exists $c$ such that $x \ge cy$, and $x \asymp y$ when $x \lesssim y$ and $x \gtrsim y$. We denote by $|A|$ and $\mathbb{1}_A$ the cardinality and the indicator of the set $A$, by $\mathbb{N}$ the nonnegative integers, by $\mathbb{N}^*$ the positive integers, by a.e. almost every, and by a.s. almost surely. For $1 \le p \le \infty$, we denote by $\|\cdot\|_{\ell^p}$ the $\ell^p$-norm of a vector and by $\|\cdot\|_p$ the usual norm on the space $L^p(\mathbb{S}^{d-1})$ of $p$-integrable real-valued functions with respect to the spherical measure $\sigma$. We write $L^p_{\mathrm{odd}}(\mathbb{S}^{d-1})$ (resp. $L^p_{\mathrm{even}}(\mathbb{S}^{d-1})$) for the closure in $L^p(\mathbb{S}^{d-1})$ of the continuous functions on $\mathbb{S}^{d-1}$ which are odd (i.e., for every $x \in \mathbb{S}^{d-1}$, $f(-x) = -f(x)$) (resp. even). Every $f \in L^p(\mathbb{S}^{d-1})$ can be uniquely decomposed as the sum of an odd and an even function, $f^-$ and $f^+$, in $L^p(\mathbb{S}^{d-1})$. The space $L^2(\mathbb{S}^{d-1})$ is a Hilbert space with the scalar product $\langle \cdot, \cdot \rangle$ derived from the norm, and there $f^-$ and $f^+$ are orthogonal. $\mathcal{D}$ is the set of densities and $\nu(d) \overset{\mathrm{def}}{=} d/2$ is the degree of ill-posedness of the inverse problem.
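Concretely, the odd/even decomposition is given by
$$f^+(x) = \frac{f(x) + f(-x)}{2}, \qquad f^-(x) = \frac{f(x) - f(-x)}{2}, \qquad f = f^+ + f^-,$$
and $\langle f^+, f^- \rangle = 0$ in $L^2(\mathbb{S}^{d-1})$ because the product $f^+ f^-$ is odd while $\sigma$ is invariant under $y \mapsto -y$.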
The error of best approximation in $L^r(\mathbb{S}^{d-1})$ of a function $f$ by harmonics of degree less than or equal to $m$ is
$$E_m(f)_r \overset{\mathrm{def}}{=} \inf\left\{ \|f - h\|_r : h \in \bigoplus_{k=0}^m H_{k,d} \right\},$$
where $H_{k,d}$ is the space of spherical harmonics of degree $k$. We denote by $W^s_{r,\mathrm{odd}}(\mathbb{S}^{d-1})$ the restriction of $W^s_r(\mathbb{S}^{d-1})$ to odd functions.

Definition 4. For $s > 0$, $1 \le r \le \infty$, and $0 < q \le \infty$, $f$ belongs to the Besov space $B^s_{r,q}(\mathbb{S}^{d-1})$ if $\left(2^{js} E_{2^j}(f)_r\right)_{j \ge 0}$ belongs to $\ell^q$; the corresponding norm is $\|f\|_{B^s_{r,q}} = \|f\|_r + \left\|\left(2^{js} E_{2^j}(f)_r\right)_{j \ge 0}\right\|_{\ell^q}$.

The operator.
Proposition 5. The operator $K$ satisfies the properties (P1.1)-(P1.4), where the exponents $\nu(d) \pm |1/r - 1/2|(d-2)$ cannot be improved. Moreover, $K$ is a self-adjoint and compact operator on $L^2(\mathbb{S}^{d-1})$, and its restriction to $L^2_{\mathrm{odd}}(\mathbb{S}^{d-1})$ has a unique inverse.

Needlets.
Smoothed projection operators (see [8]) have good approximation properties in all $L^p(\mathbb{S}^{d-1})$ spaces and are uniformly bounded from $L^p(\mathbb{S}^{d-1})$ to $L^p(\mathbb{S}^{d-1})$. One such operator, the delayed means, is the integral operator with kernel
$$K_J^a(x, y) = \sum_{k \ge 0} a\!\left(\frac{k}{2^J}\right) L_k(x^\top y),$$
where $L_k$ is the reproducing kernel of $H_{k,d}$, $J$ is an integer, and $a$ is a $C^\infty$ decreasing function on $[0, \infty)$ supported on $[0, 2]$ such that $0 \le a(t) \le 1$ for every $0 \le t \le 2$ and $a(t) = 1$ for every $0 \le t \le 1$. The delayed means operator exhibits nearly exponential localization (see theorem 2.2 in [21]) and is a building block for the construction of needlets. Define $b$ such that $b^2(t) = a(t) - a(2t)$ for $t \ge 0$. It is nonzero only when $1/2 \le t \le 2$, satisfies $b^2(t) + b^2(2t) = 1$ for $1/2 \le t \le 1$, and thus $\sum_{j=0}^{\infty} b^2(t/2^j) = 1$ for every $t \ge 1$; also, $b^2(t) = a(t)$ for $1 \le t \le 2$. Take $a$ such that $b$ is bounded away from 0 on $3/5 \le t \le 5/3$. The second ingredient in the construction of needlets is a quadrature formula (corollary 2.9 of [21]) with positive weights $(\omega(j,\xi)^2)_{\xi \in \Xi_j}$ and nodes $\xi \in \Xi_j$, which integrates exactly the functions in $\bigoplus_{k=0}^{2^j} H_{k,d}$ and satisfies, for a constant $C_\Xi$ which depends on $d$, $|\Xi_j| \le C_\Xi 2^{j(d-1)}$. Needlets are defined as
$$\psi_{j,\xi}(x) = \omega(j,\xi) \sum_{k \ge 0} b\!\left(\frac{k}{2^{j-1}}\right) L_k(x^\top \xi).$$
For $j = 0$, $\psi_{0,\xi}(x)$ is constant and $\Xi_0$ is a singleton.
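As a sanity check on this construction, here is a small numerical experiment (a minimal sketch; the particular $C^\infty$ cutoff $a$ below is an illustrative choice, not the paper's) verifying the telescoping partition of unity $\sum_{j \ge 0} b^2(t/2^j) = 1$ for $t \ge 1$:

```python
import numpy as np

def smooth_step(u):
    """C^infinity transition from 0 (u <= 0) to 1 (u >= 1)."""
    u = np.clip(np.asarray(u, dtype=float), 0.0, 1.0)
    with np.errstate(divide="ignore", over="ignore", invalid="ignore"):
        e1 = np.where(u > 0, np.exp(-1.0 / np.maximum(u, 1e-300)), 0.0)
        e2 = np.where(u < 1, np.exp(-1.0 / np.maximum(1.0 - u, 1e-300)), 0.0)
    return e1 / (e1 + e2)

def a(t):
    """Illustrative smooth cutoff: a = 1 on [0, 1], decreasing, supported on [0, 2]."""
    return 1.0 - smooth_step(np.asarray(t) - 1.0)

def b_squared(t):
    """b^2(t) = a(t) - a(2t); nonzero only for 1/2 <= t <= 2."""
    return a(t) - a(2.0 * np.asarray(t))

# Partition of unity on t >= 1: the sum telescopes to a(t/2^N) - a(2t) = 1 - 0.
t = np.linspace(1.0, 50.0, 200)
total = sum(b_squared(t / 2**j) for j in range(30))
assert np.allclose(total, 1.0)
print("max deviation from 1:", np.abs(total - 1.0).max())
```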
The $L^p$-norms of the needlets satisfy, for a constant $C_p$ that can depend on $d$,
$$\|\psi_{j,\xi}\|_p \le C_p\, 2^{j(d-1)(1/2 - 1/p)}.$$
The needlets form a tight frame; this means that, for every $f \in L^2(\mathbb{S}^{d-1})$,
$$f = \sum_{j \ge 0} \sum_{\xi \in \Xi_j} \langle f, \psi_{j,\xi} \rangle\, \psi_{j,\xi} \quad \text{in } L^2(\mathbb{S}^{d-1}), \qquad \|f\|_2^2 = \sum_{j \ge 0} \sum_{\xi \in \Xi_j} \langle f, \psi_{j,\xi} \rangle^2.$$
Needlets do not form a basis and there is redundancy. Lemma 6 (see [2]) relates $L^p(\mathbb{S}^{d-1})$ norms at level $j$ to $\ell^p$ norms of needlet coefficients. Constants may depend on $d$.
Needlets are such that (see [21]), for every function $a$ as in the definition of the smoothed projection operators, the norm of definition 4 is equivalent to
$$\|f\|^a_{B^s_{r,q}} \overset{\mathrm{def}}{=} \left\| \left( 2^{j\left(s + (d-1)\left(\frac{1}{2} - \frac{1}{r}\right)\right)} \left\| \left( \langle f, \psi_{j,\xi} \rangle \right)_{\xi \in \Xi_j} \right\|_{\ell^r} \right)_{j \ge 0} \right\|_{\ell^q}.$$
The ball of radius $M$ for this norm is denoted by $B^s_{r,q}(M)$.
Recall the following consequence of the proof of the continuous embeddings in [2].
Finally, recall that if $f \in B^s_{r,q}$ with $s > (d-1)/r$, then $f$ is continuous.
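This continuity claim can be read off a standard chain of continuous embeddings (recalled here for context; a classical statement, not a quotation from [2]): lowering the integrability index from $r$ to $\infty$ costs $(d-1)/r$ in smoothness on the $(d-1)$-dimensional sphere,
$$B^s_{r,q}(\mathbb{S}^{d-1}) \hookrightarrow B^{\,s - (d-1)/r}_{\infty,q}(\mathbb{S}^{d-1}) \hookrightarrow C^0(\mathbb{S}^{d-1}) \quad \text{when } s > (d-1)/r.$$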

Identification of f β
Let us present the arguments for the identification of $f_\beta$. Proposition 5 (P1.1) implies that $Kf_\beta = Kf_\beta^-$ is odd. Thus, under (A2.2), we can define the odd function $R$. In this paper we normalize the vectors of random coefficients and covariates to have unit norm. Indeed, since only the sign of the net utility (1) matters for choosing between 1 and $-1$ and the index is linear, a scale normalization of $(\varepsilon_1 - \varepsilon_{-1}, \gamma^\top)^\top$ is in order. Let us compare with the normalization in [7]. It is based on the following assumption, which is stronger than the condition in [11] that the support of $\beta$ is a subset of some (unknown) hemisphere, which itself is stronger than (A2.1).
(H): a.s. there exists $j \in \{1, \dots, d-1\}$ such that the coordinate $\gamma_j$ of $\gamma$ has a constant sign (excluding 0). Assumption (H) is likely to hold when $Z_{1j}$ and $Z_{-1j}$ are cost factors, since consumers dislike an increase in cost. If (H) holds, we can identify for which index $j$ the coordinate $\gamma_j$ has a sign, since it amounts to finding in which coordinate the conditional mean function is monotone. We can identify the sign of the coefficient by assessing whether this function is increasing (positive sign) or decreasing (negative sign). If $\gamma_j > 0$, then we normalize the vector of coefficients by dividing by $\gamma_j$. If $\gamma_j < 0$, we change the sign of $Z_{1j} - Z_{-1j}$ to make it positive. A potential issue with this normalization is that, if $\beta_j$ can take small values, then estimators could differ in finite samples depending on which coefficient is used for normalization. Also, monotonicity in one regressor of the conditional mean function implies a type of weak monotonicity (in the sense used to identify treatment effects; see, e.g., [7]) at the individual level, as we now explain. Assume that $\gamma_j > 0$, that $z_{1i} - z_{-1i} = z$ for all $i = 1, \dots, n$, and that we change $z_j$ to $z_j' > z_j$ while leaving $(\varepsilon_{1i} - \varepsilon_{-1i}, \gamma_i^\top)^\top$ (the characteristics of the individuals) and the other components of $z$ unchanged. Then some people do not change their decision and some choose alternative 1 while originally they had chosen alternative $-1$, but no one changes from alternative 1 to alternative $-1$. Monotonicity of the conditional mean function thus implies monotonicity for every individual. This is sometimes not a realistic model of individuals making choices. Clearly, (A2.1) allows both individuals switching from 1 to $-1$ and individuals switching from $-1$ to 1 after similar changes in $z$ (or $x$). On the other hand, if (H) holds, then (A2.2) can be relaxed and we can consider an index which is nonlinear in $X$ (cf. [7]).

Lower bounds
We take $1 \le p, r \le \infty$, $0 < q \le \infty$, $z \ge 1$, and $s > 0$, and consider the minimax risk
$$\inf_{\hat{f}} \sup_{f_\beta \in B^s_{r,q}(M) \cap \mathcal{D}} \mathbb{E}\left[ \|\hat{f} - f_\beta\|_p^z \right],$$
where the infimum is over all estimators $\hat{f}$ based on the i.i.d. sample of size $n$. The degree of integrability $r$ in the smoothness class $B^s_{r,q}(M)$ is allowed to differ from the degree of integrability $p$ in the loss function. We distinguish two zones, dense and sparse, depending on $s$, $r$, $q$, $d$, and $p$.

Theorem 8. (i) In the dense zone, the minimax risk is bounded from below by a quantity of order $c_{\mathrm{dense}}\, n^{-\mu_{\mathrm{dense}}}$. (ii) In the sparse zone, it is bounded from below by a quantity of order $c_{\mathrm{sparse}}\, n^{-\mu_{\mathrm{sparse}}}$. The constants $c_{\mathrm{dense}}$ and $c_{\mathrm{sparse}}$ depend on $d$, $M$, $p$, $r$, $s$, and $z$.
The values of $\mu_{\mathrm{dense}}$ and $\mu_{\mathrm{sparse}}$ depend on $d$ through the dimension of $\mathbb{S}^{d-1}$. This is the usual curse of dimensionality in nonparametric regression or density estimation. They also depend on $d$ through the degree of ill-posedness $\nu(d) = d/2$ of the inverse problem.

Adaptive estimation by needlet thresholding
Consider the following estimators of $f_\beta$.

Smoothed projections and linear needlet estimators.
A smoothed projection estimator of $f_\beta^-$ is built from the kernel (10) with window $a$ and $J \in \mathbb{N}$. Alternatively, we can estimate $f_\beta^-$ using the needlet frame with smoothing window $a$.
Using that $a(k/2^J) = 1$ for $k \le 2^J$, the spectral window leaves the harmonics appearing in $\psi_{j,\xi}$ unchanged for $1 \le j \le J$, so that $\beta^a_{j,\xi} = \langle f^{-,a,J}_\beta, \psi_{j,\xi} \rangle$, which can be estimated without bias. Thus, the smoothed projection and needlet estimators coincide.

Nonlinear estimator with data-driven thresholds. Consider instead a hard thresholding needlet estimator.
It is classical that the optimal choice of $J$ for linear estimators depends on the parameters of the smoothness ellipsoid. In contrast, using a thresholded estimator allows one to take $J$ large and independent of these parameters. Thresholding induces additional bias compared to linear estimators, which allows one to reduce the variance incurred by taking $J$ large. The level of thresholding should depend on the size of the coefficients relative to their variance. The standard deviation of the empirical coefficients is proportional to $1/\sqrt{n}$, so the level of the threshold does not have to depend on the smoothness of the unknown function. Instead of using a conservative upper bound on the variance, as is usually the case in estimation using wavelets, we use data-driven levels of thresholding. These provide better estimators in small samples. Lemma 14 gives a theoretical guarantee that the performance is almost as good as that of an oracle which would know the variance of the estimators of the coefficients. The data-driven thresholding rule uses an estimator $\hat{\sigma}_{j,\xi}$ of the standard deviation of the empirical coefficients, the scale $t_n = \sqrt{\log n / n}$, and data-driven thresholds $T_{j,\xi,\gamma}$ involving quantities such as $\|G_{j,\xi}\|_\infty$. The second term in $T_{j,\xi,\gamma}$ controls the error in estimating the threshold. In theorem 9, $c(d, p, r, s, \gamma)$ is a constant which depends on $d$, $p$, $r$, $s$, and $\gamma$.
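To make the thresholding rule concrete, here is a schematic sketch in Python. It is not the paper's exact procedure: the arrays standing for $(K^{-1}\psi_{j,\xi})(X_i)/f_X(X_i)$ are hypothetical placeholders, and the constants `gamma` and `tau` as well as the exact form of the threshold (an empirical-standard-deviation term plus a sup-norm correction term, mirroring the two terms of $T_{j,\xi,\gamma}$ described above) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
gamma, tau = 2.0, 1.5          # illustrative tuning constants (assumptions)
t_n = np.sqrt(np.log(n) / n)   # threshold scale

def threshold_coefficients(y, weights_by_level):
    """Hard threshold empirical needlet coefficients level by level.

    y:                sample of signs Y_i in {-1, 1}, shape (n,)
    weights_by_level: list over levels j of arrays of shape (m_j, n),
                      placeholders for (K^{-1} psi_{j,xi})(X_i) / f_X(X_i).
    Returns the kept (thresholded) coefficients per level.
    """
    kept = []
    for W in weights_by_level:
        terms = W * y                    # summands of each empirical coefficient
        beta_hat = terms.mean(axis=1)    # unbiased coefficient estimates
        sigma_hat = terms.std(axis=1)    # data-driven scale estimates
        # Data-driven threshold: a deviation term plus a correction term
        # controlling the error in estimating the threshold itself.
        T = tau * (sigma_hat * np.sqrt(gamma) * t_n
                   + np.abs(W).max(axis=1) * gamma * t_n**2)
        kept.append(np.where(np.abs(beta_hat) >= T, beta_hat, 0.0))
    return kept

# Toy usage with synthetic stand-ins for the needlet evaluations.
fake_weights = [rng.normal(size=(4 * 2**j, n)) for j in range(4)]
y = rng.choice([-1.0, 1.0], size=n)
coeffs = threshold_coefficients(y, fake_weights)
print([int((c != 0).sum()) for c in coeffs])  # kept coefficients per level
```

The point of the data-driven rule is visible in the sketch: each coefficient is compared with a threshold scaling with its own estimated standard deviation rather than with a conservative uniform bound.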
Consider now the case where $f_X$ is unknown but we have at our disposal an estimator $\hat{f}_X$ obtained from an auxiliary data set. Theorem 9 holds on the event defined by (22) and (23), where $\pi \ge 1$ and $c \in \mathbb{R}$ are arbitrary, provided the expectations are conditional on the auxiliary data. The probability of this event is close to 1 for a large class of densities $f_X$ that are small only on a small subset of the sphere (possibly not bounded away from zero) and are smooth enough. A more general result is proved in the first version of this paper on arXiv (see also [8]).
Proof. The result is based on the following.

Proof of proposition 5. The operator $K$ is related to the hemispherical transform (see [8,22]), defined for $f \in L^1(\mathbb{S}^{d-1})$ and a.e. $x \in \mathbb{S}^{d-1}$ by
$$(Hf)(x) \overset{\mathrm{def}}{=} \int_{\mathbb{S}^{d-1}} \mathbb{1}\{x^\top y \ge 0\}\, f(y)\, d\sigma(y).$$
(P1.1) is a consequence of the fact that $y \mapsto \mathrm{sign}(x^\top y)$ belongs to $L^\infty_{\mathrm{odd}}(\mathbb{S}^{d-1})$. (P1.2) follows from theorem 2 (ii), and (P1.3) follows from theorem C in [22]. The second part of the proposition, together with (P1.4), is a consequence of the properties of $H$ detailed in [8], from which the inequalities (8) follow.

7.3.1. Proof of the lower bound in the dense zone. For $j \ge 1$, the needlets $\psi_{j,\xi}$ have no harmonic component of degree 0 and hence integrate to 0; this implies that the functions $f_m$ that we introduce below integrate to 1. Lemma 11 requires, for $0 < \alpha < 1/8$, conditions (i)-(iii). Start by checking (i) in lemma 11. It is enough to show that $f_m \in B^s_{r,q}(M)$. Indeed, for $r \ge 1$ and $\omega \in \Omega$, we have $\|(\omega_\xi)_{\xi \in A_j}\|_{\ell^r} \le |A_j|^{1/r} \|(\omega_\xi)_{\xi \in A_j}\|_{\ell^\infty}$. Lemma 6 (ii) now yields a separation bound for every $1 \le p \le \infty$ and $0 \le m < l \le M$. By independence, the Kullback-Leibler divergence between $P_m$ and $P_0$ is $n$ times the divergence between the distributions of a single observation. Using that $\ln(x) \le x - 1$ for $x > 0$, and then lemma 6 (i), we bound this divergence. Condition (iii) of lemma 11 is satisfied once this bound is small enough. For $\alpha < 1/8$, the lower bound (27) yields the conclusion, where the inequality leading to the second display holds when $\ln(M) \ge 4$, for example for $j(d-1) \ge \ln(5/(c_A \ln 2))/\ln 2$. Now (28) is satisfied for a suitable choice of $j$ growing polynomially with $n$.

7.3.2. Proof of the lower bound in the sparse zone. In this proof, we consider asymptotic orders for simplicity. The various constants can be obtained as in Section 7.3.1.
Consider hypotheses built from needlets located at $\xi_m \in A_j$, with $|\gamma| \lesssim 2^{-j(d-1)/2}$ to ensure that the functions are positive. The constant is adjusted appropriately; one of the $f_m$ is denoted by $f_0$. The functions $f_m$ also integrate to 1. We denote by $M$ the cardinality of $A_j$ (so that $M \asymp 2^{j(d-1)}$), by $P_m$ the distribution of an i.i.d. sample of $(Y, X)$ of size $n$ when $f_\beta = f_m$, for a given $f_X$, and by $\Lambda(P_m, P_0)$ the likelihood ratio. Recall that $K(P_m, P_0) = \mathbb{E}_{P_m}[\ln \Lambda(P_m, P_0)]$. We make use of the following lemma from [17].
Let us now consider item (iii); condition (iii) is satisfied when the Kullback-Leibler divergences are small enough, for $\alpha \in (0, 1)$. The same computations as at the beginning of Section 5.1 yield that we need to impose $n 2^{-2j\nu(d)} \gamma^2 \lesssim j$. The desired rate is obtained by optimizing the choice of $j$ and $\gamma$ under this constraint.

Comparison between Besov ellipsoids of a function and its odd part.
Lemma 13. For $0 < s$, $0 < q \le \infty$, and $1 \le r \le \infty$, there exists a constant $c_{\mathrm{eq}}$ that can depend on $d$ such that, for every $f \in B^s_{r,q}(\mathbb{S}^{d-1})$, $\|f^-\|_{B^s_{r,q}} \le c_{\mathrm{eq}} \|f\|_{B^s_{r,q}}$.

Proof. In definition 4, every $f \in B^s_{r,q}(\mathbb{S}^{d-1})$ has the same norm as $x \mapsto f(-x)$; thus, by the triangle inequality, $\|f^-\|^a_{B^s_{r,q}} \le \|f\|^a_{B^s_{r,q}}$. We conclude by equivalence of the norms.

7.5. A general inequality. We make use of constants $c_{1,z}$ and $c_{2,z}$.

Lemma 14. For every $\tau, \gamma, z > 1$, the two following inequalities hold: one for $p = \infty$, with a factor $a_{n,\infty,z,J} = 1 + 2\sqrt{\gamma \log n}$, and one for $p < \infty$, with a factor $a_{n,p,z,J}$.

The inequalities of lemma 14 are similar to oracle inequalities, for a well-chosen $J$ depending on $n$ (see theorem 9), where the oracle estimates $\beta^a_{j,\xi}$ if and only if the error made by estimating this coefficient is smaller than the one made by discarding it; this oracle strategy would lead to a quantity of the form recalled just after this paragraph. Proving such an oracle inequality would require lower bounding $\mathbb{E}\left[|\hat{\beta}^a_{j,\xi} - \beta^a_{j,\xi}|^z\right]^{1/z}$. In the inequalities of lemma 14, this ideal quantity is replaced by $T^{s,++}_{j,\xi,\gamma}$; we call this term a quasi-oracle term. The remaining terms can be made as small as we want by taking $\gamma$ large enough. The last term corresponds to the approximation error. Upper bounds of these types, uniform over Besov ellipsoids, yield an approximation error which can be expressed in terms of the regularity of the Besov class and is uniformly small for $J$ large enough, and they allow us to treat the bias/variance trade-off in the quasi-oracle term uniformly over the ellipsoid.
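For orientation, the quantity alluded to above can be written in the classical oracle form (a standard generic form, not a verbatim quotation from the paper):
$$\sum_{j \le J} \sum_{\xi \in \Xi_j} \min\left( \mathbb{E}\left[ \left| \hat{\beta}^a_{j,\xi} - \beta^a_{j,\xi} \right|^z \right]^{1/z},\ \left| \beta^a_{j,\xi} \right| \right),$$
which keeps a coefficient exactly when its estimation error is smaller than its size.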
Recall that, for $1 \le z < \infty$, the risk splits into two terms: the first corresponds to the error in the high-dimensional space, while the second corresponds to the approximation error. Let us start by studying the first term. Lemma 6 (i) yields a first bound; the last inequality in it is obtained by comparing $\ell$-norms when $p \ge z$ and by the Hölder inequality when $p \le z$.

7.6.2. Coefficientwise analysis. For simplicity of notation, we sometimes drop the dependence on $\gamma$ in the sets of indices. We first consider the term involving the thresholded coefficients. We introduce two "phantom" random thresholds $T^b_{j,\xi,\gamma} = T_{j,\xi,\gamma} - \Delta_{j,\xi,\gamma}$ and $T^s_{j,\xi,\gamma} = T_{j,\xi,\gamma} + \Delta_{j,\xi,\gamma}$, for some $\Delta_{j,\xi,\gamma}$ to be defined later. They are used to define "big" and "small" original needlet coefficients. We also use $T^{b,-}_{j,\xi,\gamma}$ for a deterministic lower bound on $T^b_{j,\xi,\gamma}$, and $T^{s,+}_{j,\xi,\gamma}$ and $\Delta^+_{j,\xi,\gamma}$ for deterministic upper bounds on $T^s_{j,\xi,\gamma}$ and $\Delta_{j,\xi,\gamma}$; these bounds will hold with high probability. We obtain an almost sure decomposition and, sorting the terms according to the number of random terms, we bound the expectations of the random terms. The constant $\tau > 1$ in the Hölder inequality will be specified later.
The following lemma is useful for handling the case $p = \infty$.

Proof. A uniform union bound yields
and thus, for any $\tau_1 \ge 0$ and $\tau_2 \ge 0$, we get the corresponding deviation bound.

Proof. Using the results of [20], we bound the probability that $\hat{\sigma}_{j,\xi}$ exceeds $\sigma_{j,\xi}$ by more than the deviation term.

The terms $R_{1,\infty,z}$ and $R_{2,\infty,z}$. Using the definition of the Besov norm, we obtain a bound on these terms. With $\gamma > z/2 + 1$, which holds if $2(\gamma - 1)(1 - 1/\tau) > z$, $R_{1,\infty,z}$ is of lower order than $t_n^z$. Due to the choice of $J$, the term in brackets in the expression of $R_{2,\infty,z}$ in lemma 14 is also of lower order.