Asymptotic normality of a linear threshold estimator in fixed dimension with near-optimal rate

Linear thresholding models postulate that the conditional distribution of a response variable in terms of covariates differs on the two sides of a (typically unknown) hyperplane in the covariate space. A key goal in such models is to learn about this separating hyperplane. Exact likelihood or least squares methods for estimating the thresholding parameter involve an indicator function, which makes them difficult to optimize; they are therefore often replaced by a surrogate loss that uses a smooth approximation to the indicator. In this paper, we demonstrate that the resulting estimator is asymptotically normal with a near-optimal rate of convergence, $n^{-1}$ up to a log factor, in both classification and regression thresholding models. This is substantially faster than the currently established convergence rates of smoothed estimators for similar models in the statistics and econometrics literatures. We also present a real-data application of our approach to an environmental data set where CO$_2$ emission is explained in terms of a separating hyperplane defined through per-capita GDP and urban agglomeration.


Introduction
The simple linear regression model assumes a uniform linear relationship between the covariate and the response, in the sense that the regression parameter β is the same over the entire covariate domain. In practice, the situation can be more complicated: for instance, the regression parameter may differ from sub-population to sub-population within a large (super-) population. Some common techniques to account for such heterogeneity include mixed linear models, introducing an interaction effect, or fitting different models among each sub-population which corresponds to a supervised classification setting where the true groups (sub-populations) are a priori known.
A more difficult scenario arises when the sub-populations are unknown, in which case regression and classification must happen simultaneously. Consider the scenario where the conditional mean of $Y_i$ given $X_i$ is different for different unknown sub-groups. A well-studied treatment of this problem, the so-called change point problem, considers a simple thresholding model where membership in a sub-group is determined by whether a real-valued observable $X$ falls to the left or right of an unknown parameter $\gamma$. More recently, there has been work on multi-dimensional covariates, namely when the membership is determined by the side on which a random vector $X$ falls with respect to a hyperplane with unknown normal vector $\theta_0$. A concrete example appears in [26], who extend the linear thresholding model due to [12] to general dimensions: and studied computational algorithms and consistency of the same. This model and others with similar structure, called change plane models, are useful in various fields of research, e.g.
Other aspects of this model have also been investigated. [8] examined the change plane model from the statistical testing point of view, with the null hypothesis being the absence of a separating hyperplane. They proposed a test statistic, studied its asymptotic distribution and provided sample size recommendations for achieving target values of power. [15] extended the change point detection problem to the multi-dimensional setup by considering the case where $X^{\top}\theta_0$ forms a multiple change point data sequence. The key difficulty with change plane type models is the inherent discontinuity in the optimization criteria involved: the parameter of interest appears as an argument to an indicator function, rendering the optimization extremely hard. To alleviate this, one option is to kernel smooth the indicator function, an approach that was adopted by Seo and Linton [21] in a version of the change-plane problem, motivated by earlier results of Horowitz [10] that dealt with a smoothed version of the maximum score estimator. Their model has an additive structure of the form: where $\psi$ is the (fixed) change-plane parameter, and $t$ can be viewed as a time index. Under a set of assumptions on the model (Assumptions 1 and 2 of their paper), they showed asymptotic normality of their estimator of $\psi$ obtained by minimizing a smoothed least squares criterion that uses a differentiable distribution function $K$. The rate of convergence of $\hat{\psi}$ to the truth was shown to be $\sqrt{n/\sigma_n}$, where $\sigma_n$ is the bandwidth parameter used to smooth the least squares function. As noted in their Remark 3, in the special case of i.i.d. observations, their requirement that $\log n/(n\sigma_n^2) \to 0$ translates to a maximal convergence rate of $n^{3/4}$ up to a logarithmic factor.
The work of [15], who considered multiple parallel change planes (determined by a fixed dimensional normal vector) and high dimensional linear models in the regions between consecutive hyperplanes, also builds partly upon the methods of [21] and obtains the same (almost) $n^{3/4}$ rate for the normal vector (as can be seen by putting Condition 6 in their paper in conjunction with the conclusion of Theorem 3).
While it is established that the condition $n\sigma_n^2 \to \infty$ is sufficient (up to a log factor) for achieving asymptotic normality of the smoothed estimator, there is no result in the existing literature that ascertains its necessity. Intuitively speaking, the necessary condition for asymptotic normality ought to be $n\sigma_n \to \infty$, as this ensures a growing number of observations in a $\sigma_n$-neighborhood of the true hyperplane, allowing the central limit theorem to kick in.
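The rate arithmetic behind this discussion can be sketched numerically. The snippet below is only an illustration under stated assumptions: the bandwidth is taken to be polynomial, $\sigma_n = n^{-a}$ (a parametrization introduced here, not in the paper), constants and log factors are ignored, and the function names are hypothetical. It shows that $a = 1/2$ gives the $n^{3/4}$ rate, while $a$ close to 1 pushes the rate $\sqrt{n/\sigma_n}$ towards $n$ at the cost of fewer observations (of order $n\sigma_n$) near the hyperplane.

```python
def rate_exponent(a: float) -> float:
    """Exponent r with sqrt(n / sigma_n) = n**r when sigma_n = n**(-a)."""
    return (1.0 + a) / 2.0


def local_count_order(n: int, a: float) -> float:
    """Order of the number of points within sigma_n of the hyperplane,
    n * sigma_n, assuming a bounded density near the hyperplane."""
    return n * n ** (-a)


if __name__ == "__main__":
    n = 10_000
    for a in (0.5, 0.75, 0.9, 0.99):
        # Larger a: faster rate, but fewer local observations for the CLT.
        print(f"a={a}: rate n^{rate_exponent(a):.3f}, "
              f"n*sigma_n ~ {local_count_order(n, a):.1f}")
```

Any $a < 1$ keeps $n\sigma_n \to \infty$, which is the regime studied in this paper.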
In this paper we bridge this gap by proving that asymptotic normality of the smoothed change plane estimator is, in fact, achievable with $n\sigma_n \to \infty$. This implies that the best possible rate of convergence of the smoothed estimator can be arbitrarily close to $n^{-1}$, the minimax optimal rate of estimation for this problem. To demonstrate this, we focus on two change plane estimation problems, one with a continuous and another with a binary response. The continuous response model we analyze here is the following: $Y_i = \beta_0^{\top}X_i + \delta_0^{\top}X_i\,\mathbb{1}(Q_i^{\top}\psi_0 > 0) + \epsilon_i$ (1.2), where the zero-mean transitory shocks satisfy $\epsilon_i \perp\!\!\!\perp (X_i, Q_i)$. Our calculation can be easily extended, with more tedious bookkeeping, to the case where the covariates on either side of the change hyperplane are different and $E[\epsilon \mid X, Q] = 0$. As this generalization adds little of conceptual interest to our proof, we posit the simpler model for ease of understanding. As the parameter $\psi_0$ is only identifiable up to its norm, we assume that its first co-ordinate is 1 (along the lines of [21]), which removes one degree of freedom and makes the parameter identifiable.
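As a rough illustration of the continuous response model, the following sketch generates data in which the regression coefficient is $\beta_0$ on the side $Q^{\top}\psi_0 \le 0$ and $\beta_0 + \delta_0$ on the side $Q^{\top}\psi_0 > 0$. The Gaussian covariates and noise, the function name, and the parameter values are assumptions made for the example only; the paper does not prescribe them.

```python
import random


def simulate_continuous(n, beta0, delta0, psi0, noise_sd=1.0, seed=0):
    """Draw n triples (x, q, y) from a change plane regression.

    Assumed form: y = x'beta0 + x'delta0 * 1(q'psi0 > 0) + eps,
    with standard normal covariates (an illustrative choice).
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in beta0]
        q = [rng.gauss(0.0, 1.0) for _ in psi0]
        # side = 1 on the positive side of the hyperplane, else 0
        side = 1.0 if sum(a * b for a, b in zip(q, psi0)) > 0 else 0.0
        mean = sum(xj * (bj + side * dj)
                   for xj, bj, dj in zip(x, beta0, delta0))
        data.append((x, q, mean + rng.gauss(0.0, noise_sd)))
    return data


if __name__ == "__main__":
    # psi0 has first co-ordinate 1 for identifiability.
    sample = simulate_continuous(5, [1.0, 2.0], [0.5, -1.0], [1.0, 0.3])
    for x, q, y in sample:
        print(x, q, y)
```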
To illustrate that a similar phenomenon transpires with a binary response, we also study a canonical version of such a model, which can be briefly described as follows: the covariate $Q \sim P$, where $P$ is a distribution on $\mathbb{R}^d$, and the conditional distribution of $Y$ given $Q$ is modeled as $P(Y = 1 \mid Q) = \alpha_0 \mathbb{1}(Q^{\top}\psi_0 \le 0) + \beta_0 \mathbb{1}(Q^{\top}\psi_0 > 0)$ (1.3) for some parameters $\alpha_0, \beta_0 \in (0, 1)$ and $\psi_0 \in \mathbb{R}^d$ (with first co-ordinate equal to one for identifiability, as in the continuous response model), the latter being of primary interest for estimation. This model is identifiable up to a permutation of $(\alpha_0, \beta_0)$, so we further assume $\alpha_0 < \beta_0$. For both models, we show that $\sqrt{n/\sigma_n}\,(\hat{\psi} - \psi_0)$ converges to a zero-mean normal distribution as long as $n\sigma_n \to \infty$, but the calculations for the binary model are completely relegated to Appendix C.
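A minimal sketch of the binary model (1.3) follows. It assumes standard normal $Q$ purely for illustration, and the helper names are hypothetical. The empirical frequency of $Y = 1$ on each side of the true hyperplane should be close to $\alpha_0$ and $\beta_0$ respectively.

```python
import random


def simulate_binary(n, alpha0, beta0, psi0, seed=0):
    """Draw (q, y) with P(y=1|q) = alpha0 if q'psi0 <= 0, else beta0."""
    rng = random.Random(seed)
    sample = []
    for _ in range(n):
        q = [rng.gauss(0.0, 1.0) for _ in psi0]  # assumed covariate law
        right = sum(a * b for a, b in zip(q, psi0)) > 0
        prob = beta0 if right else alpha0
        sample.append((q, 1 if rng.random() < prob else 0))
    return sample


def side_frequencies(sample, psi):
    """Empirical P(y=1) on the sides q'psi <= 0 and q'psi > 0."""
    counts = {False: [0, 0], True: [0, 0]}  # side -> [sum of y, count]
    for q, y in sample:
        right = sum(a * b for a, b in zip(q, psi)) > 0
        counts[right][0] += y
        counts[right][1] += 1
    left = counts[False][0] / max(counts[False][1], 1)
    right = counts[True][0] / max(counts[True][1], 1)
    return left, right


if __name__ == "__main__":
    psi0 = [1.0, -0.5, 0.25]
    data = simulate_binary(20_000, 0.25, 0.75, psi0)
    print(side_frequencies(data, psi0))
```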
Organization of the paper: The rest of the paper is organized as follows: In Section 2 we present the methodology, the statement of the asymptotic distributions and a sketch of the proof for the continuous response model (1.2). In Section 3 we briefly describe the binary response model (1.3) and related assumptions, whilst the details can be found in the supplementary document. In Section 5 we present some simulation results, both for the binary and the continuous response models to study the effect of the bandwidth on the quality of the normal approximation in finite samples. In Section 6, we present a real data analysis where we analyze the effect of income and urbanization on the CO 2 emission in different countries.
Notations: Before delving into the technical details, we first set up some notation.
We assume from now on that $X \in \mathbb{R}^p$ and $Q \in \mathbb{R}^d$. For any vector $v$, we denote by $\tilde{v}$ the vector consisting of all co-ordinates of $v$ except the first one. We denote by $K$ the kernel function used to smooth the indicator function. For any matrix $A$, we denote by $\|A\|_2$ (or $\|A\|_F$) its Frobenius norm and by $\|A\|_{op}$ its operator norm. For any vector, $\|\cdot\|_2$ denotes its $\ell_2$ norm.
where $P_n$ is the empirical measure based on i.i.d. observations and $\Theta$ is the parameter space. Henceforth, we assume $\Theta$ is a compact subset of $\mathbb{R}^{2p+d}$. We also write $\theta = (\beta, \delta, \psi)$ for all the parameters together as a vector, and $\theta_0 = (\beta_0, \delta_0, \psi_0)$ for the true parameter vector. Some modification of equation (2.1) leads to the following: Typical empirical process calculations yield under mild conditions: but inference is difficult as the limit distribution is unknown and, in any case, would be a highly non-standard distribution. Recall that even in the one-dimensional change point model with fixed jump size, the least squares change point estimator converges at rate $n$ to the truth with a non-standard limit distribution, namely a minimizer of a two-sided compound Poisson process (see [13] for more details). To obtain a computable estimator with a tractable limiting distribution, we resort to a smooth approximation of the indicator function in (2.1) using a distribution kernel with suitable bandwidth, i.e. we replace $\mathbb{1}_{Q_i^{\top}\psi > 0}$ by $K(Q_i^{\top}\psi/\sigma_n)$ for some appropriate distribution function $K$ and bandwidth $\sigma_n$, i.e.
Define $M$ (resp. $M^s$) to be the population counterpart of $M_n$ (resp. $M_n^s$), which are defined as: As noted in the proof of Seo and Linton, the assumption $\log n/(n\sigma_n^2) \to 0$ was only used to show: In this paper, we show that one can achieve the same conclusion as long as $n\sigma_n \to \infty$. The rest of the proof of normality is similar to that of [21]; we present it briefly for the ease of the reader. The proof is quite long and technical, so we break it into several lemmas. We first list our assumptions: 1. Define $f_\psi(\cdot \mid \tilde{Q})$ to be the conditional density of $Q^{\top}\psi$ given $\tilde{Q}$.
In particular, we denote by $f_0(\cdot \mid \tilde{q})$ the conditional density of $Q^{\top}\psi_0$ given $\tilde{Q}$ and by $f_s(\cdot \mid \tilde{q})$ the conditional density of $Q^{\top}\psi_0^s$ given $\tilde{Q}$. Assume that there exists $F_+$ such that $\sup_t f_\psi(t \mid \tilde{Q}) \le F_+$ almost surely in $\tilde{Q}$ for all $\psi$ in a neighborhood of $\psi_0$ (in particular for $\psi_0^s$). Further assume that $f_\psi$ is differentiable and that its derivative is bounded by $F_+$ for all $\psi$ in a neighborhood of $\psi_0$ (again, in particular for $\psi_0^s$).
2. Define $g(Q) = \mathrm{var}(X \mid Q)$. There exist $c_-$ and $c_+$ such that $c_- \le \lambda_{\min}(g(Q)) \le \lambda_{\max}(g(Q)) \le c_+$ almost surely. Also assume that $g$ is Lipschitz in $Q$ with constant $G_+$.

Define
Lipschitz function of Q.

Sufficient conditions for above assumptions
We now give some sufficient conditions for the above assumptions to hold. The first assumption is essentially a condition on the conditional density of the first co-ordinate of $Q$ given all other co-ordinates. If this conditional density is bounded and has a bounded derivative, then the first assumption is satisfied, which holds in fair generality. The second assumption requires that the conditional distribution of $X$ given $Q$ has non-degenerate variance in every direction, for all $Q$. This is also a very weak condition: it is satisfied, for example, if $X$ and $Q$ are independent (with $X$ having a non-degenerate covariance matrix) or if $(X, Q)$ are jointly normally distributed, to name a few cases. This condition can be further weakened by assuming that the maximum and minimum eigenvalues of $E[g(Q)]$ are bounded away from $\infty$ and 0 respectively, but this requires more tedious book-keeping. The third assumption is satisfied as long as $Q^{\top}\psi$ has non-zero density near the origin, while the fourth assumption merely states that the support of $Q$ is not confined to one side of any hyperplane; a simple sufficient condition for this is that $Q$ has a continuous density with non-zero value at the origin. The last assumption is analogous to the second assumption for the conditional fourth moment and is also satisfied in fair generality.
Kernel function and bandwidth: We take $K(x) = \Phi(x)$ (the distribution function of a standard normal random variable) for our analysis. For the bandwidth we assume $n\sigma_n^2 \to 0$ and $n\sigma_n \to \infty$, as the other case (i.e. $n\sigma_n^2 \to \infty$) is already established in [21].
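To make the smoothing step concrete, here is a minimal sketch of a smoothed least squares criterion with $K = \Phi$. It rests on one algebraic fact and one assumption: since an indicator $\mathbb{1}$ satisfies $\mathbb{1}^2 = \mathbb{1}$, the squared residual $(Y - X^{\top}\beta - X^{\top}\delta\,\mathbb{1})^2$ equals $(Y - X^{\top}\beta)^2 + \{-2(Y - X^{\top}\beta)X^{\top}\delta + (X^{\top}\delta)^2\}\mathbb{1}$; the smoothed version below is then assumed to replace $\mathbb{1}(Q^{\top}\psi > 0)$ by $\Phi(Q^{\top}\psi/\sigma_n)$ in that expression. The function names are hypothetical, and the exact criterion used in the paper should be taken from its displayed equations.

```python
import math


def std_normal_cdf(x: float) -> float:
    """Phi(x), the standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def ls_loss(data, beta, delta, psi):
    """Unsmoothed least squares criterion with the hard indicator."""
    total = 0.0
    for x, q, y in data:
        ind = 1.0 if dot(q, psi) > 0 else 0.0
        total += (y - dot(x, beta) - dot(x, delta) * ind) ** 2
    return total / len(data)


def smoothed_loss(data, beta, delta, psi, sigma):
    """Smoothed criterion: indicator replaced by Phi(q'psi / sigma)."""
    total = 0.0
    for x, q, y in data:
        r = y - dot(x, beta)        # residual ignoring the regime shift
        xd = dot(x, delta)
        k = std_normal_cdf(dot(q, psi) / sigma)
        total += r * r + (-2.0 * r * xd + xd * xd) * k
    return total / len(data)


if __name__ == "__main__":
    toy = [([1.0, 2.0], [1.0, 0.5], 3.0), ([0.5, -1.0], [-2.0, 1.0], 0.0)]
    beta, delta, psi = [1.0, 0.5], [0.2, -0.3], [1.0, 0.2]
    for sigma in (1.0, 0.1, 0.01):
        print(sigma, smoothed_loss(toy, beta, delta, psi, sigma),
              ls_loss(toy, beta, delta, psi))
```

As the bandwidth shrinks, the smoothed criterion approaches the unsmoothed one at points away from the hyperplane, while remaining differentiable in $\psi$ for every fixed bandwidth.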
Based on Assumption 2.1 and our choice of kernel and bandwidth we establish the following theorem: Theorem 2.2. Under Assumption 2.1 and the above choice of kernel and bandwidth we have: and for matrices Σ β,δ and Σ ψ mentioned explicitly in the proof. Moreover they are asymptotically independent.
The proof of the theorem is relatively long, so we break it into several lemmas. We provide a roadmap of the proof in this section, while the detailed technical derivations of the supporting lemmas can be found in the Appendix. Let $\nabla M_n^s(\theta)$ and $\nabla^2 M_n^s(\theta)$ be the gradient and Hessian of $M_n^s(\theta)$ with respect to $\theta$. As $\hat{\theta}^s$ minimizes $M_n^s(\theta)$, the first order condition gives $\nabla M_n^s(\hat{\theta}^s) = 0$. Using a one step Taylor expansion we have: for some intermediate point $\theta^*$ between $\hat{\theta}^s$ and $\theta_0$. Following the notation of [21], define a diagonal matrix $D_n$ of dimension $2p + d$ with the first $2p$ elements being 1 and the last $d$ elements being $\sqrt{\sigma_n}$. We can write: where $\gamma = (\beta, \delta) \in \mathbb{R}^{2p}$. The following lemma establishes the asymptotic properties of $\nabla M_n^s(\theta_0)$: Lemma 2.3 (Asymptotic Normality of $\nabla M_n^s(\theta_0)$). Under Assumption 2.1 we have: for some n.n.d. matrices $V^{\gamma}$ and $V^{\psi}$ which are given explicitly in the proof. Furthermore, $\sqrt{n}\,\nabla M_n^{s,\gamma}(\theta_0)$ and $\sqrt{n\sigma_n}\,\nabla M_n^{s,\psi}(\theta_0)$ are asymptotically independent.
It will be shown later that the condition $\|\psi_n - \psi_0\|/\sigma_n \overset{P}{\to} 0$ needed in Lemma C.9 holds for the (random) sequence $\psi^*$, the intermediate point in the Taylor expansion. Then, combining Lemma C.8 and Lemma C.9 we conclude the proof of Theorem 2.2. Observe that, to show for some specific constant $K$. (This constant will be stated precisely in the proof.) Hence, as $n\sigma_n \to \infty$, we have $n^{2/3}\sigma_n^{-1/3} \gg \sigma_n^{-1}$ (their ratio is $(n\sigma_n)^{2/3} \to \infty$), which implies $\|\hat{\psi}^s - \psi_0^s\|/\sigma_n \overset{P}{\to} 0$.
Hence the final roadmap is the following: using Lemma C.12 and Lemma 2.5 we establish that $\|\hat{\psi}^s - \psi_0\|/\sigma_n = o_p(1)$ if $n\sigma_n \to \infty$. This, in turn, enables us to prove Lemma C.9, i.e.

Binary response model
Recall our binary response model in equation (1.3). To estimate $\psi_0$, we resort to the following loss (without smoothing): with $\gamma \in (\alpha_0, \beta_0)$, which can be viewed as a variant of the square error loss function: We establish the connection between these losses in sub-section C.2. It is easy to prove that under fairly mild conditions (discussed later) $\psi_0 = \arg\min_{\psi \in \Theta} M(\psi)$, uniquely. Under the standard classification paradigm, when we know a priori that $\alpha_0 < 1/2 < \beta_0$, we can take $\gamma = 1/2$; in the absence of this constraint, $\bar{Y}$, which converges to some $\gamma$ between $\alpha_0$ and $\beta_0$, may be substituted in the loss function. In the rest of the paper, we confine ourselves to a known $\gamma$ and, for technical simplicity, take $\gamma = (\alpha_0 + \beta_0)/2$, but this assumption can be removed with more mathematical book-keeping. Thus, $\psi_0$ is estimated by: We resort to a smooth approximation of the indicator function in (3.2) using a distribution kernel with suitable bandwidth. The smoothed version of the population score function then becomes: where, as in the continuous response model, we use $K(x) = \Phi(x)$, and the corresponding empirical version is: for some non-stochastic matrix $\Gamma$, which will be defined explicitly in the proof.
We have therefore established that in the regime $n\sigma_n \to \infty$ and $n\sigma_n^2 \to 0$, it is possible to attain asymptotic normality using a smoothed estimator for the binary response model.

Inferential methods
We draw inferences on $(\beta_0, \delta_0, \psi_0)$ by resorting to techniques similar to those in [21]. For the continuous response model, we need consistent estimators of $V^{\gamma}$, $Q^{\gamma}$, $V^{\psi}$, $Q^{\psi}$ (see Lemma C.9 for the definitions) for hypothesis testing. By virtue of the aforementioned Lemma, we can estimate $Q^{\gamma}$ and $Q^{\psi}$ as follows: The consistency of the above estimators is established in the proof of Lemma C.9. For the other two parameters $V^{\gamma}$, $V^{\psi}$ we use the following estimators: The consistency of $\hat{V}^{\gamma}$ is immediate from the law of large numbers. The consistency of $\hat{V}^{\psi}$ follows via arguments similar to those employed in proving Lemma C.9 but under somewhat more stringent moment conditions: in particular, we need $E[\|X\|^8] < \infty$ and $E[(X^{\top}\delta_0)^k \mid Q]$ to be Lipschitz functions of $Q$ for $1 \le k \le 8$. The inferential techniques for the classification model are similar and hence skipped, to avoid repetition.

Simulation studies
In this section, we present some simulation results to analyse the effect of the choice of $\sigma_n$ on the finite sample approximation of asymptotic normality, i.e. Berry-Esseen type bounds. If we choose a smaller $\sigma_n$, the rate of convergence is accelerated but the normal approximation error at smaller sample sizes will be higher, as we do not have enough observations in the vicinity of the change hyperplane for the CLT to kick in. This problem is alleviated by choosing $\sigma_n$ larger, but this, on the other hand, compromises the convergence rate. Ideally, a Berry-Esseen type of bound would quantify this trade-off, but this requires a different set of techniques and is left as an open problem. In our simulations, we generate data from the following setup: 1. Set $N = 50000$, $p = 3$, $\alpha_0 = 0.25$, $\beta_0 = 0.75$ and some $\theta_0 \in \mathbb{R}^p$ with first co-ordinate equal to 1.

Generate
We repeat Step 2 to Step 4 a hundred times to obtain $\hat{\theta}_1, \dots, \hat{\theta}_{100}$. Define $s_n$ to be the standard that smaller value of $\sigma_n$ yield a poor normal approximation. Although our theory shows that asymptotic normality holds as long as $n\sigma_n \to \infty$, in practice we recommend choosing $\sigma_n$ such that $n\sigma_n \ge 30$ for the central limit theorem to take effect.
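The practical rule $n\sigma_n \ge 30$ is easy to encode. The helpers below are a small sketch with hypothetical names; the threshold 30 is the recommendation stated above, exposed as a parameter. The rule as written does not account for the scale of $Q^{\top}\psi$, so it is best read as a rough guide rather than a calibrated choice.

```python
def min_bandwidth(n: int, min_local: float = 30.0) -> float:
    """Smallest sigma_n satisfying n * sigma_n >= min_local."""
    if n <= 0:
        raise ValueError("sample size must be positive")
    return min_local / n


def satisfies_rule(n: int, sigma: float, min_local: float = 30.0) -> bool:
    """Check the rule of thumb n * sigma_n >= min_local."""
    return n * sigma >= min_local


if __name__ == "__main__":
    n = 50_000  # sample size used in the simulation setup
    print(min_bandwidth(n))
    print(satisfies_rule(n, 1e-3), satisfies_rule(n, 1e-4))
```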

Real data analysis
We illustrate our method using cross-country data on pollution (carbon-dioxide), income and   cross-country data (e.g. [22], [19], [1], [14], [4], [18], [9], [3], [2], [24] to name a few). While some of these papers have found evidence in favor of the EKC hypothesis (inverted U-shaped income-pollution relation), others have found evidence against it (monotonically increasing or other shapes for the relation). The results often depend on countries/regions in the sample, period of analysis, as well as the pollutant studied.
While income-pollution remains the focal point of most EKC studies, several of them have also included urban agglomeration (UA) or some other measures of urbanization as an important control variable especially while investigating carbon emissions. 1 (see for example, [22], [4]and [16]). The theory of ecological economics posits potentially varying effects of increased urbanization on pollution-(i) urbanization leading to more pollution (due to its close links with sanitations, dense transportations, and proximities to polluting manufacturing industries), (ii) urbanization potentially leading to less pollution based on 'compact city theory' (see [5], [6], [20]) that explains the potential benefits of increased urbanization in terms of economies of scale (for example, replacing dependence on automobiles with large scale subway systems, using multi-storied buildings instead of single unit houses, keeping more open green space). [17], using 17 developed countries, find a positive and significant effect of urbanization on pollution. On the contrary, using a set of 69 countries [23] find a negative and significant effect of urbanization on pollution while [7] find an insignificant effect of urbanization on carbon emission. Using various empirical strategies [20] conclude that the positive and negative effects of urbanization on carbon pollution may cancel out depending on the countries involved often leaving insignificant effects on pollution. They also note that many countries are yet to achieve a sizeable level of urbanization which presumably explains why many empirical works using less developed countries find insignificant effect of urbanization. In summary, based on the existing literature, both the relationship between urbanization and pollution as well as the relationship between income and pollution appear to depend largely on the set of countries considered in the sample. 
This motivates us to use UA along with income in our change plane model for analyzing carbon-dioxide emission to plausibly separate the countries into two regimes.
Following the broad literature we use pollution emission per capita (carbon-dioxide measured in metric tons per capita) as the dependent variable and real GDP per capita (measured in 2010 US dollars), its square (as is done commonly in the EKC literature) and a popular measure of urbanization, namely urban agglomeration (UA), as covariates (in our notation $X$) in our regression. In light of the preceding discussions we fit a change plane model comprising real GDP per capita and UA (in our notation $Q$). To summarize the setup, we use the continuous response model with the per capita CO$_2$ emission in metric tons as $Y$; per capita GDP, square of per capita GDP and UA as $X$ (hence $X \in \mathbb{R}^3$); and finally, per capita GDP and UA as $Q$ (hence $Q \in \mathbb{R}^2$). Observe that $\beta_0$ represents the regression coefficients corresponding to the countries with $Q_i^{\top}\psi_0 \le 0$ (henceforth denoted by Group 1) and $(\beta_0 + \delta_0)$ represents the regression coefficients corresponding to the countries with $Q_i^{\top}\psi_0 > 0$ (henceforth denoted by Group 2). As per our convention, in the interests of identifiability we assume $\psi_{0,1} = 1$, where $\psi_{0,1}$ is the change plane parameter corresponding to per capita GDP. Therefore the only change plane coefficient to be estimated is $\psi_{0,2}$, the change plane coefficient for UA. For numerical stability, we scale per capita GDP by $10^{-4}$ (consequently the square of per capita GDP is scaled by $10^{-8}$). After some pre-processing (i.e. removing rows containing NA and countries with 100% UA) we estimate the coefficients $(\beta_0, \delta_0, \psi_0)$ of our model based on data from 115 countries with $\sigma_n = 0.05$ and test the significance of the various coefficients using the methodologies described in Section 4. We present our findings in Table 1.
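The variable construction described above can be sketched as follows. The function names, the numerical inputs and the value of $\psi_{0,2}$ used in the example are hypothetical placeholders chosen for illustration (they are not estimates from Table 1); only the structure ($X$ = scaled GDP, its square, UA; $Q$ = scaled GDP, UA; first change plane coefficient fixed at 1) follows the text.

```python
def build_covariates(gdp_per_capita: float, ua: float,
                     gdp_scale: float = 1e-4):
    """Return (x, q) for one country.

    x = (scaled GDP, scaled GDP squared, UA), q = (scaled GDP, UA).
    Scaling GDP by 1e-4 scales its square by 1e-8.
    """
    g = gdp_per_capita * gdp_scale
    return [g, g * g, ua], [g, ua]


def assign_group(q, psi2: float) -> int:
    """Group 1 if q'psi <= 0, Group 2 if q'psi > 0, with psi = (1, psi2)."""
    score = q[0] + psi2 * q[1]
    return 2 if score > 0 else 1


if __name__ == "__main__":
    psi2 = -0.1  # hypothetical value, for illustration only
    for gdp, ua in [(20_000.0, 30.0), (50_000.0, 30.0)]:
        x, q = build_covariates(gdp, ua)
        print(gdp, ua, x, q, assign_group(q, psi2))
```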
From the above analysis, we find that GDP has a significantly positive effect on pollution for both groups of countries. The effect of its squared term is negative for both groups, but it is significant for Group-2, consisting of mostly high income countries, whereas it is insignificant (at the 5% level) for the Group-1 countries (consisting of mostly low or middle income and a few high income countries). Thus, not surprisingly, we find evidence in favor of EKC for the developed countries, but not for the mixed group. Notably, Group-1 consists of a mixed set of countries like Angola, Sudan, Senegal, India, China, Israel, UAE etc., whereas Group-2 consists of rich and developed countries like Canada, USA, UK, France, Germany etc.
The urban variable, on the other hand, is seen to have an insignificant effect on Group-1, which is in keeping with [7], [20]. Many of these countries are yet to achieve substantial urbanization, and this is even more true for our sample period. In contrast, UA has a positive and significant effect on Group-2 (developed) countries, which is consistent with the findings of [17], for example. Note that UA plays a crucial role in dividing the countries into different regimes, as the estimated value of $\psi_{0,2}$ is significant. Thus, we are able to partition countries into two regimes: a mostly rich and a mixed group.
Note that many underdeveloped countries and poorer regions of emerging countries are still swamped with greenhouse gas emissions from burning coal, cow dung etc., and usage of poor exhaust systems in houses and for transport. This is more true for rural and semi-urban areas of developing countries. So even while being less urbanized compared to developed nations, their overall pollution load is high (due to inefficient energy usage and higher dependence on fossil fuels as pointed out above) and rising with income and they are yet to reach the de-

Conclusion
In this paper we have established that, under some mild assumptions, the kernel-smoothed change plane estimator is asymptotically normal with near optimal rate $n^{-1}$. To the best of our knowledge, the state of the art result in this genre of problems is due to [21], who demonstrate a best possible rate of about $n^{-3/4}$ for i.i.d. data. The main difference between their approach and ours lies in the proof of Lemma C.12. Our techniques are based upon modern empirical process theory, which allows us to consider much smaller bandwidths $\sigma_n$ than those in [21], who appear to require larger values to achieve the result, possibly owing to their reliance on the techniques developed in [10]. Although we have established that it is possible to have asymptotic normality with really small bandwidths, we believe that the finite sample approximation to normality (e.g. a Berry-Esseen bound) could be poor, which is also evident from our simulation.

A Appendix
In this section, we present the proof of Lemma 2.5, which lies at the heart of our refined analysis of the smoothed change plane estimator. Proofs of the other lemmas and our results for the binary response model are available in Appendix B.
A.1 Proof of Lemma 2.5 Proof. The proof of Lemma 2.5 is quite long, hence we further break it into a few more lemmas.
Lemma A.1. Under Assumption (2.1), there exist $u_-, u_+ > 0$ such that: for $\theta$ in a (non-shrinking) neighborhood of $\theta_0$, where: Under certain assumptions: for some constant $K$ and for all $\theta$ in a neighborhood of $\theta_0$, which does not change with $n$.
The proofs of the three lemmas above can be found in Appendix B. We next move to the proof of Lemma 2.5. In Lemma A.3 we have established the curvature of the smooth loss function $M^s(\theta)$ around $\theta_0^s$. To determine the rate of convergence of $\hat{\theta}^s$ to $\theta_0^s$, we further need an upper bound on the modulus of continuity of our loss function. Towards that end, first recall that our loss function is: The centered loss function can be written as: For the rest of the analysis, fix $\zeta > 0$ and consider the collection of functions $\mathcal{F}_\zeta$ which is defined as: First note that $\mathcal{F}_\zeta$ has bounded uniform entropy integral (henceforth BUEI) over $\zeta$. To establish this, it is enough to argue that the collection $\mathcal{F} = \{f_\theta : \theta \in \Theta\}$ is BUEI. Note that the class of functions $X \mapsto X^{\top}\beta$ has VC dimension $p$, and so does the class of maps $X \mapsto X^{\top}(\beta + \delta)$. Therefore is also BUEI, as composition with a monotone function (here $x^2$) and taking differences preserve this property. Further, the class of hyperplanes $Q \mapsto Q^{\top}\psi$ also has finite VC dimension (depending only on the dimension of $Q$), and the VC dimension does not change on scaling by $\sigma_n$. Therefore the functions $Q \mapsto Q^{\top}\psi/\sigma_n$ have the same VC dimension as $Q \mapsto Q^{\top}\psi$, which is independent of $n$. Again, as composition with a monotone function preserves the BUEI property, the class of functions $Q \mapsto K(Q^{\top}\psi/\sigma_n)$ is also BUEI. As the product of two BUEI classes is BUEI, we conclude that $\mathcal{F}$ (and hence $\mathcal{F}_\zeta$) is BUEI.
Now to bound the modulus of continuity we use Lemma 2.14.1 of [25]: where F ζ is some envelope function of F ζ . As the function class F ζ has bounded entropy integral, J (1, F ζ ) can be bounded above by some constant independent of n. We next calculate the order of the envelope function F ζ . Recall that, by definition of envelope function is: and we can write f θ − f θ s 0 = M 1 + M 2 + M 3 which follows from equation (A.1). Therefore, to find the order of the envelope function, it is enough to find the order of bounds of M 1 , M 2 , M 3 over the set d * (θ, θ s 0 ) ≤ ζ. We start with M 1 : and the second term: For the third term, note that: Henceforth, we define the envelope function to be F ζ = F ζ,1 + F ζ,2 + F ζ,3 . Hence we have by triangle inequality: From equation (A.2) and (A.3) we have: For F 3,ζ , first note that: where m 4 (Q) is defined in Assumption 2.1. In this part, we have to tackle the dichotomous behavior of ψ around ψ s 0 carefully. Henceforth define d 2 * (ψ, ψ s 0 ) as: This is a slight abuse of notation, but the reader should think of it as the part of ψ in d 2 We can decompose B ζ (ψ s 0 ) as a disjoint union of two sets: Assume K > 1. The case where K < 1 follows from similar calculations and hence skipped for brevity. Consider the following two cases: Hence we have: This implies: Therefore we have: Now: where as before we split R into three parts R = R 1 + R 2 + R 3 .
We next calculate the inner integral (involving $(s, t)$) of equation (A.5): Putting this bound in equation (A.5) we obtain: The third residual $R_3$ is an even higher order term and hence skipped. It is immediate that the order of the remainders is equal to or smaller than $\zeta\sqrt{\sigma_n}$, which implies: The calculation for $T_2$ is similar and hence skipped for brevity. Combining the conclusions for $T_1$ and $T_2$ we conclude, when $\zeta \le \sqrt{K\sigma_n}$: Case 2: Now consider $\zeta > \sqrt{K\sigma_n}$. Then it is immediate that: Using this we have: The analysis of the remainder term is similar and it is of higher order. This yields, when $\zeta > \sqrt{K\sigma_n}$: Combining (A.6), (A.7) with equation (A.4) we have: Hence to obtain rate we have to solve $r_n^2 \phi_n(1/r_n) \le$ Note that we need consistency of $\theta_0^s$ here, as the lower bound in Lemma A.1 is only valid in a neighborhood around $\theta_0$. As $\theta_0^s$ is the minimizer of $M^s(\theta)$, from the first order condition we have: We first show that $(\psi_0^s - \psi_0)/\sigma_n \to 0$ by a subsequence argument. From equation (B.1), we know . Hence it has a convergent subsequence $\psi_{0,n_k}^s$, where $(\psi_{0,n_k}^s - \psi_0)/\sigma_n \to h$.
If we can prove that h = 0, then we establish every subsequence of ψ s 0 − ψ 0 /σ n has a further subsequence which converges to 0 which further implies ψ s 0 − ψ 0 /σ n converges to 0. To save some notations, we prove that if (ψ s 0 − ψ 0 )/σ n → h then h = 0. We start with equation (B.4).
As mentioned earlier, there is a bijection between (Q 1 ,Q) and (Q ψ 0 ,Q). The map of one side is obvious. The other side is also trivial as the first coordinate of ψ 0 is 1, which makes We first show that T 1 , T 2 and T 4 are o(1). Towards that end first note that: From the above bounds, it is immediate that to show that above terms are o(1) all we need to show is: Towards that direction, define η = (ψ s 0 −ψ 0 )/σ n : Therefore, all it remains to show is R 1 is also O(1) (or of smaller order): This completes the proof. For T 3 , the limit is non-degenerate which can be calculated as follows: That the remainder R is o(1) again follows by similar calculation as before and hence skipped.
Therefore we have when η = (ψ s 0 − ψ 0 )/σ n → h: which along with equation (B.5) implies: Taking inner product with respect to h on both side of the above equation we obtain: From equation (B.3) we have: Subtracting equation (B.7) from (B.6) we obtain: where: It is immediate via DCT that as n → ∞: From equation (B.8) and (B.9) it is immediate that: Next observe that: (B.10) Similar calculation yields: Combining equation (B.10) and (B.11) we conclude: which further implies, and by similar calculations: This completes the proof.

B.2 Proof of Lemma A.1
Proof. From the definition of $M(\theta)$ it is immediate that $M(\theta_0) = E[\epsilon^2] = \sigma^2$. For any general $\theta$: This immediately implies: For notational simplicity, define $p_\psi = P(Q^{\top}\psi > 0)$. Expanding the RHS we have: Using the fact that $2ab \le (a^2/c) + cb^2$ for any constant $c > 0$ we have: for any $c$. To make the RHS non-negative we pick $p_{\psi_0} < c < 1$ and conclude that: For the last 3 summands of the RHS of equation (B.12): (B.14) Combining equations (B.13) and (B.14) we complete the proof of the lower bound. The upper bound is relatively easier: note that by our previous calculation: This completes the entire proof.

B.3 Proof of Lemma A.2
Proof. The difference of the two losses: This function can be bounded as follows: as our parameter space is compact. For the rest of the calculation, define $\eta = (\psi - \psi_0)/\sigma_n$.
The definition of $\eta$ may change from proof to proof, but it will be clear from the context. Therefore we have: where the finiteness of the integral over $t$ follows from the definition of the kernel. This completes the proof.

B.4 Proof of Lemma A.3
Proof. First note that we can write: where $\xi$ can be taken as close to 0 as desired. Henceforth we set $K = 4(2K_1 + \xi)/u_-$. For the other part of the curvature (i.e. when $\|\psi - \psi_0^s\| \le K\sigma_n$) we start with a two-step Taylor expansion of the smoothed loss function: Recall the definition of $M^s(\theta)$: The partial derivatives of $M^s(\theta)$ with respect to $(\beta, \delta, \psi)$ were derived in equations (B.2)-(B.4).
From there, we calculate the Hessian of $M^s(\theta)$: where we use $\tilde{\eta}$ as generic notation for $(\psi - \psi_0)/\sigma_n$. For notational simplicity, we define $\gamma = (\beta, \delta)$ and let $\nabla^2 M^{s,\gamma}(\theta)$, $\nabla^2 M^{s,\gamma\psi}(\theta)$, $\nabla^2 M^{s,\psi\psi}(\theta)$ be the corresponding blocks of the Hessian matrix. We have: Note that we can write: The operator norm of the difference of the two Hessians can be bounded as: for any $\theta^*$ in a neighborhood of $\theta_0^s$ with $\|\psi - \psi_0^s\| \le K\sigma_n$. To prove this, note that for any $\theta$: where: Therefore it is enough to show that $\|A\|_{op} = O(\sigma_n)$. Towards that end: which proves the claim. From the above claim we conclude: for all large $n$.
We next deal with the cross term $T_2$ in equation (B.16). Towards that end, first note that: where the remainder term $R_1$ can be further decomposed as $R_1 = R_{11} + R_{12} + R_{13}$ with: where the last bound follows from our assumptions using the fact that: The other remainder term $R_{13}$ is the higher-order term and can be shown to be $O(\sigma_n^2)$ using the same techniques. This implies for all large $n$: and a similar calculation yields $\|\nabla_{\delta\psi} M^s(\theta)\|_{op} = O(1)$. Using this we have: Now for $T_3$ note that: We next show that $M_1$ and $M_4$ are $O(\sigma_n)$. Towards that end, note that for any two vectors $v_1, v_2$: where the fact that the remainder term $R$ is $O_p(\sigma_n)$ can be established as follows: For $R_1$: and similarly for $R_2$: Therefore from (B.19) we conclude: A similar calculation for $M_3$ yields: i.e.
This, along with equation (B.15), concludes the proof.

B.5 Proof of Lemma C.8
We start by proving analogues of Lemma 2 of [21]: we show that $\lim_{n\to\infty} E[\sqrt{n\sigma_n}\,\nabla M_n^{s,\psi}(\theta_0)] = 0$ and $\lim_{n\to\infty} \mathrm{var}(\sqrt{n\sigma_n}\,\nabla M_n^{s,\psi}(\theta_0)) = V^{\psi}$ for some matrix $V^{\psi}$ which will be specified later in the proof. To prove the limit of the expectation: For the variance part: $\mathrm{var}(\sqrt{n\sigma_n}\,\nabla M_n^{s,\psi}(\theta_0))$ The outer product of the expectation (the second term of the above summand) is $o(1)$, which follows from our previous analysis of the expectation term. For the second moment: Finally, using Lemma 6 of [10] we conclude that $\sqrt{n\sigma_n}\,\nabla M_n^{s,\psi}(\theta_0) \Rightarrow N(0, V^{\psi})$.
We next prove that $\sqrt{n}\,\nabla M_n^{s,\gamma}(\theta_0)$ converges to a normal distribution. This is a simple application of the central limit theorem (CLT), along with bounds on some remainder terms that are asymptotically negligible. The gradients are: That $(1/\sqrt{n})\sum_i X_i\epsilon_i$ converges to a normal distribution follows from a simple application of the CLT. Therefore, once we prove that $R_1$ and $R_2$ are $o_p(1)$ we have: where: To complete the proof we now show that $R_1$ and $R_2$ are $o_p(1)$. For $R_1$, we show that $E[R_1] \to 0$ and $\mathrm{var}(R_1) \to 0$. For the expectation part: For the variance part: This shows that $\mathrm{var}(R_1) = o(1)$, which establishes $R_1 = o_p(1)$. The proof for $R_2$ is similar and hence skipped for brevity.
Our next step is to prove that $\sqrt{n\sigma_n}\,\nabla_\psi M_n^s(\theta_0^s)$ and $\sqrt{n}\,\nabla M_n^{s,\gamma}(\theta_0^s)$ are asymptotically uncorrelated. Towards that end, first note that: Also, it follows from the proof of $E[\sqrt{n\sigma_n}\,\nabla_\psi M_n^s(\theta_0)] \to 0$ that: Finally note that: Now getting back to the covariance: The proof for $E[\sqrt{n\sigma_n}\,\nabla_\psi M_n^s(\theta_0)(\sqrt{n}\,\nabla_\delta M_n^s(\theta_0))^\top]$ is similar and hence skipped. This completes the proof.

B.6 Proof of Lemma C.9
To prove the lemma, first note that by a simple application of the law of large numbers (and using the fact that $\|\psi^* - \psi_0\|/\sigma_n = o_p(1)$) we have: The proof of the fact that $\sqrt{\sigma_n}\,\nabla^2_{\psi\gamma} M_n^s(\theta^*) = o_p(1)$ is the same as the proof of Lemma 5 of [21] and hence skipped. Finally, consider the fact that: for some non-negative definite matrix $Q$. The proof is similar to that of Lemma 6 of [21], using which we conclude the proof with: This completes the proof. So we have established: and they are asymptotically uncorrelated.

C Proof of Theorem 3.1
In this section, we present the details of the binary response model, the assumptions, a roadmap of the proof and then finally prove Theorem 3.1.
Assumption C.1. The assumptions below pertain to the parameter space and the distribution of $Q$:
1. The parameter space $\Theta$ is a compact subset of $\mathbb{R}^p$.
2. The support of the distribution of $Q$ contains an open subset around the origin of $\mathbb{R}^p$, and the distribution of $Q_1$ conditional on $\tilde{Q} = (Q_2, \dots, Q_p)$ has, almost surely, everywhere positive density with respect to Lebesgue measure.
For notational convenience, define the following: 1. Define $f_\psi(\cdot|\tilde{Q})$ to be the conditional density of $Q^\top\psi$ given $\tilde{Q}$ for $\psi \in \Theta$. Note that the following relation holds: where $f_{Q_1}(\cdot|\tilde{Q})$ is the conditional density of $Q_1$ given $\tilde{Q}$.
The rest of the assumptions are as follows: Assumption C.2. $f_0(y|\tilde{Q})$ is at least once continuously differentiable almost surely for all $\tilde{Q}$.
Also assume that there exist $\delta$ and $t$ such that for all $\tilde{Q}$ almost surely.
This assumption can be relaxed in the sense that one can allow the lower bound $t$ to depend on $\tilde{Q}$, provided that some further assumptions are imposed on $E(t(\tilde{Q}))$. As this does not add anything of significance to the import of this paper, we use Assumption C.2 to simplify certain calculations.

C.1 Sufficient conditions for above assumptions
We now demonstrate some sufficient conditions for the above assumptions to hold. If the support of $Q$ is compact and both $f_1(\cdot|\tilde{Q})$ and $f_1'(\cdot|\tilde{Q})$ are uniformly bounded in $\tilde{Q}$, then Assumptions (C.1, C.2, C.3, C.4) follow immediately. The first part of Assumption C.5, i.e. the assumption $f_{\tilde{Q}}(0) > 0$, is also fairly general and satisfied by many standard probability distributions. The second part of Assumption C.5 is satisfied when $f_0(0|\tilde{Q})$ has some lower bound independent of $\tilde{Q}$ and $\tilde{Q}$ has a non-singular dispersion matrix.
Below we state our main theorem. In the next section, we first provide a roadmap of our proof and then fill in the corresponding details. For the rest of the paper, we choose our bandwidth $\sigma_n$ to satisfy $\frac{\log n}{n\sigma_n} \to 0$.
Remark C.6. As our procedure requires only the weaker condition $(\log n)/(n\sigma_n) \to 0$, it is easy to see from the above theorem that the rate of convergence can be almost as fast as $n/\sqrt{\log n}$.
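The claim in Remark C.6 can be checked numerically. The sketch below assumes that the estimator is scaled by $\sqrt{n/\sigma_n}$, which is what the roadmap suggests ($\sqrt{n\sigma_n}\,T_n$ is asymptotically normal and $\sigma_n Q_n$ converges to $Q$), and it uses the illustrative bandwidth family $\sigma_n = (\log n)^{1+\epsilon}/n$. The function names and the values of $\epsilon$ are our own choices and do not come from the paper.

```python
import math

def rate(n, eps):
    """Scaling factor sqrt(n / sigma_n) for sigma_n = (log n)^(1 + eps) / n.
    The bandwidth family is an illustrative assumption."""
    sigma_n = math.log(n) ** (1.0 + eps) / n
    return math.sqrt(n / sigma_n)

def benchmark(n):
    """The benchmark n / sqrt(log n) named in Remark C.6."""
    return n / math.sqrt(math.log(n))

for n in (10**3, 10**5, 10**7):
    for eps in (1.0, 0.1):
        sigma_n = math.log(n) ** (1.0 + eps) / n
        cond = math.log(n) / (n * sigma_n)    # equals (log n)^(-eps), tends to 0
        ratio = rate(n, eps) / benchmark(n)   # equals (log n)^(-eps / 2)
        print(f"n={n:>8d} eps={eps:.1f} cond={cond:.3f} ratio={ratio:.3f}")
```

For this family the ratio to the benchmark is $(\log n)^{-\epsilon/2}$, so a smaller $\epsilon$ brings the rate closer to $n/\sqrt{\log n}$, while the bandwidth condition $(\log n)/(n\sigma_n) = (\log n)^{-\epsilon} \to 0$ is met more slowly.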
Remark C.7. Our analysis remains valid in the presence of an intercept term. Assume, without loss of generality, that the second coordinate of $Q$ is 1 and let $\tilde{Q} = (Q_3, \dots, Q_p)$. It is not difficult to check that all our calculations go through under this new definition of $\tilde{Q}$. We, however, avoid this scenario for simplicity of exposition.
In the case that $n\sigma_n^3 \to 0$, which holds when $n\sigma_n \to 0$ as assumed prior to the statement of the theorem, $\lambda = 0$ and we have: Next, we analyze the convergence of $Q_n(\psi_n^*)^{-1}$, which is stated in the following lemma: Lemma C.9 (Convergence in Probability of $Q_n$). Under Assumptions (C.1-C.5), for any random sequence $\psi_n$ such that $\|\psi_n - \psi_0\|/\sigma_n \overset{P}{\to} 0$, It will be shown later that the condition $\|\psi_n - \psi_0\|/\sigma_n \overset{P}{\to} 0$ needed in Lemma C.9 holds for the (random) sequence $\psi_n^*$. Then, combining Lemma C.8 and Lemma C.9, we conclude from equation (C.1) that: This concludes the proof of Theorem 3.1 with $\Gamma = Q^{-1}\Sigma Q^{-1}$.
Towards that end, we have the following lemma: Lemma C.10 (Rate of convergence). Under Assumptions (C.1-C.5), for some specific constant $K$. (This constant will be specified precisely in the proof.)
The lemma immediately leads to the following corollary: Finally, to establish $\|\hat{\psi}^s - \psi_0\|/\sigma_n \overset{P}{\to} 0$, all we need is that $\|\psi_0^s - \psi_0\|/\sigma_n \to 0$, as demonstrated in the following lemma: Lemma C.12 (Convergence of population minimizer). For any sequence of $\sigma_n \to 0$, we have: Hence the final roadmap is the following: Using Lemma C.12 and Corollary C.11 we establish that $\|\hat{\psi}^s - \psi_0\|/\sigma_n \to 0$ if $n\sigma_n \to \infty$. This, in turn, enables us to prove that $\sigma_n Q_n(\psi_n^*) \overset{P}{\to} Q$, which, along with Lemma C.8, establishes the main theorem.
Remark C.13. In the above analysis, we have assumed knowledge of a value $\gamma$ in the interval $(\alpha_0, \beta_0)$. However, all our calculations go through, with more tedious book-keeping, if we replace $\gamma$ by an estimate (say $\bar{Y}$). One way to simplify the calculations is to split the data into two halves, estimate $\gamma$ (via $\bar{Y}$) from the first half and then use it as a proxy for $\gamma$ in the second half of the data to estimate $\psi_0$. As this procedure does not add anything of interest to the core idea of our proof, we refrain from doing so here.
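The sample-splitting idea in Remark C.13 can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the procedure analyzed in the paper: the covariate is two-dimensional with $\psi = (1, \psi_2)$, the response is binary with $P(Y = 1 \mid Q)$ equal to $\alpha_0$ when $Q^\top\psi_0 \le 0$ and $\beta_0$ otherwise, the smooth surrogate is an integrated Epanechnikov kernel, the bandwidth is $\sigma_n = (\log n)^2/n$, and the minimization is a plain grid search. All function names and numerical values are hypothetical.

```python
import numpy as np

def smooth_indicator(t):
    """Integrated Epanechnikov kernel: a smooth surrogate for 1{t > 0}.
    Assumed choice for illustration; equals 0 for t <= -1 and 1 for t >= 1."""
    t = np.clip(t, -1.0, 1.0)
    return 0.5 + 0.75 * t - 0.25 * t**3

def smoothed_loss(psi2_grid, Q, Y, gamma, sigma_n):
    """Empirical version of E[(Y - gamma)(1 - K(Q'psi / sigma_n))]
    for psi = (1, psi2), evaluated at every psi2 on the grid."""
    index = Q[:, [0]] + Q[:, [1]] * psi2_grid[None, :]   # shape (n, grid size)
    weights = 1.0 - smooth_indicator(index / sigma_n)
    return ((Y - gamma)[:, None] * weights).mean(axis=0)

def split_sample_estimate(Q, Y, psi2_grid):
    """Estimate gamma by the mean of Y on the first half, then minimise
    the smoothed loss over the grid using only the second half."""
    half = len(Y) // 2
    gamma_hat = Y[:half].mean()
    Q2, Y2 = Q[half:], Y[half:]
    n = len(Y2)
    sigma_n = np.log(n) ** 2 / n   # assumed bandwidth; log n / (n sigma_n) -> 0
    losses = smoothed_loss(psi2_grid, Q2, Y2, gamma_hat, sigma_n)
    return gamma_hat, psi2_grid[np.argmin(losses)]

# Simulated data with hypothetical parameter values.
rng = np.random.default_rng(0)
n_total, alpha0, beta0, psi2_true = 8000, 0.2, 0.8, -0.5
Q = rng.standard_normal((n_total, 2))
p = np.where(Q[:, 0] + psi2_true * Q[:, 1] <= 0, alpha0, beta0)
Y = (rng.random(n_total) < p).astype(float)

grid = np.linspace(-2.0, 2.0, 801)
gamma_hat, psi2_hat = split_sample_estimate(Q, Y, grid)
print(gamma_hat, psi2_hat)
```

Because the two halves are independent, the second-half loss can be analyzed conditionally on the estimate of $\gamma$ as if it were a fixed number, which is the simplification the remark refers to. The grid search is only a convenience for this two-dimensional sketch.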

C.2 Variant of quadratic loss function
In this sub-section we argue why the loss function in (3.1) is a variant of the quadratic loss function for any $\gamma \in (\alpha_0, \beta_0)$. Assume that we know $\alpha_0, \beta_0$ and seek to estimate $\psi_0$. We start with an expansion of the quadratic loss function: Since the first summand is just $E[Y]$, it is irrelevant to the minimization. A cursory inspection shows that it suffices to minimize On the other hand, the loss we are considering is $E[(Y - \gamma)\mathbb{1}_{Q^\top\psi \le 0}]$: which can be rewritten as: By Assumption C.1, for $\psi \ne \psi_0$, $P(\mathrm{sign}(Q^\top\psi) \ne \mathrm{sign}(Q^\top\psi_0)) > 0$. As an easy consequence, equation (C.2) is uniquely minimized at $\psi = \psi_0$. To see that the same is true for (C.3) when $\gamma \in (\alpha_0, \beta_0)$, note that the first summand in the equation does not depend on $\psi$, that the second and third summands are both non-negative, and that at least one of these must be positive under Assumption C.1.

C.3 Linear curvature of the population score function
Before going into the proofs of the lemmas and the theorem, we argue that the population score function $M(\psi)$ has linear curvature near $\psi_0$, which is useful in proving Lemma C.10. We begin with the following observation: Lemma C.14 (Curvature of population risk). Under Assumption C.2 we have: for some constants $0 < u_- < u_+ < \infty$, for all $\psi \in \Psi$.
Proof. First, we show that which follows from the calculation below: We now analyze the probability of the wedge-shaped region, i.e. the region between the two hyperplanes $Q^\top\psi = 0$ and $Q^\top\psi_0 = 0$. Note that, A similar calculation yields Adding both sides of equations (C.4) and (C.5) we get: Define $\psi_{\max} = \sup_{\psi \in \Psi} \|\psi\|$, which is finite by Assumption C.1. Below, we establish the lower bound: At the very end, we have used the fact that To prove this, assume that the infimum is 0. Then, there exists $\gamma_0 \in S^{p-1}$ such that as the above function is continuous in $\gamma$ and any continuous function on a compact set attains its infimum. Hence, $\tilde{Q}^\top\gamma_0 = 0$ for all $\|\tilde{Q}\| \le \delta/\psi_{\max}$, which implies that $\tilde{Q}$ does not have full support, violating Assumption C.1 (2). This gives a contradiction.
Establishing the upper bound is relatively easier. Going back to equation (C.6), we have: as $E[m(\tilde{Q})\|\tilde{Q}\|] < \infty$ by Assumption C.3 and the sub-Gaussianity of $\tilde{X}$.
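The two-sided linear bound of Lemma C.14 can be seen in closed form in a simple special case. The sketch below is an illustration under assumptions that are ours, not the paper's: $Q \sim N(0, I_2)$, $\psi = (1, \psi_2)$ with $|\psi_2| \le 2$, and a binary response with $P(Y = 1 \mid Q)$ equal to $\alpha_0$ when $Q^\top\psi_0 \le 0$ and $\beta_0$ otherwise. By rotational symmetry, each of the two wedges between the hyperplanes has probability equal to the angle between $\psi$ and $\psi_0$ divided by $2\pi$, so the excess risk is $(\beta_0 - \alpha_0)$ times that probability and $\gamma$ cancels.

```python
import math

def excess_risk(psi2, psi2_0, alpha0, beta0):
    """M(psi) - M(psi_0) for psi = (1, psi2) and Q ~ N(0, I_2).
    The two wedges weigh (gamma - alpha0) and (beta0 - gamma) and each has
    probability angle / (2 pi), so gamma drops out of the sum."""
    angle = abs(math.atan(psi2) - math.atan(psi2_0))
    return (beta0 - alpha0) * angle / (2.0 * math.pi)

def curvature_ratios(psi2_0, alpha0, beta0, radius=2.0, steps=401):
    """Ratios (M(psi) - M(psi_0)) / ||psi - psi_0|| over a grid of psi2."""
    ratios = []
    for k in range(steps):
        psi2 = -radius + 2.0 * radius * k / (steps - 1)
        dist = abs(psi2 - psi2_0)   # first coordinate is fixed at 1
        if dist < 1e-9:
            continue                # skip psi_0 itself
        ratios.append(excess_risk(psi2, psi2_0, alpha0, beta0) / dist)
    return ratios

# Hypothetical parameter values.
alpha0, beta0, psi2_0 = 0.2, 0.8, -0.5
ratios = curvature_ratios(psi2_0, alpha0, beta0)

# Mean value theorem: |atan(a) - atan(b)| / |a - b| = 1 / (1 + xi^2) with
# |xi| <= 2, which gives explicit constants for this special case.
u_minus = (beta0 - alpha0) / (2.0 * math.pi * 5.0)
u_plus = (beta0 - alpha0) / (2.0 * math.pi)
print(min(ratios), max(ratios), u_minus, u_plus)
```

In this special case the constants $u_-$ and $u_+$ are explicit, and the ratio stays bounded away from zero as $\psi \to \psi_0$. This is the sense in which the curvature is linear rather than quadratic.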

C.4 Proof of Lemma C.8
Proof. We first prove that under our assumptions $\sigma_n^{-1}E(T_n(\psi_0))$ The proof is based on a Taylor expansion of the conditional density: Next, we prove that $\mathrm{Var}(\sqrt{n\sigma_n}\,T_n(\psi_0)) \to \Sigma$ as $n \to \infty$, where $\Sigma$ is as defined in Lemma C.8.

C.5 Proof of Lemma C.9
Proof. Let $\epsilon_n \downarrow 0$ be a sequence such that $P(\|\psi_n - \psi_0\| \le \epsilon_n\sigma_n) \to 1$. Define $\Psi_n = \{\psi :$ where $\|\cdot\|_F$ denotes the Frobenius norm of a matrix. Sometimes, we omit the subscript $F$ when there is no ambiguity. Define $G_n$ to be the collection of functions: That the function class $G_n$ has bounded uniform entropy integral (BUEI) is immediate from the fact that the function class $Q \mapsto Q^\top\psi$ has finite VC dimension (as hyperplanes have finite VC dimension) and this does not change upon constant scaling. Therefore $Q \mapsto Q^\top\psi/\sigma_n$ also has finite VC dimension, which does not depend on $n$, and hence is BUEI. As composition with a monotone function, multiplication by constant (parameter-free) functions, and multiplication of two BUEI classes of functions all preserve the BUEI property, we conclude that $G_n$ is BUEI. We first expand the expression in two terms: That $T_{n,1} \overset{P}{\to} 0$ follows from the uniform law of large numbers for a BUEI class (e.g. combining Theorem 2.4.1 and Theorem 2.6.7 of [25]). For uniform convergence of the second summand $T_{n,2}$, define $\chi_n = \{\tilde{Q} : \|\tilde{Q}\| \le 1/\sqrt{\epsilon_n}\}$. Then $\chi_n \uparrow \mathbb{R}^{p-1}$. Also for any $\psi \in \Psi_n$, if we define $\gamma_n \equiv \gamma_n(\psi) = (\psi - \psi_0)/\sigma_n$, then $|\gamma_n^\top\tilde{Q}| \le \sqrt{\epsilon_n}$ for all $n$ and for all $\psi \in \Psi_n$, $\tilde{Q} \in \chi_n$. Now, Note that by DCT and Assumptions C.1 and C.4. For the second part:

C.6 Proof of Lemma C.12
Here we prove that $\|\psi_0^s - \psi_0\|/\sigma_n \to 0$ where $\psi_0^s$ is the minimizer of $M^s(\psi)$ and $\psi_0$ is the minimizer of $M(\psi)$.
Proof. Define $\eta = (\psi_0^s - \psi_0)/\sigma_n$. First we show that $\|\eta\|_2$ is $O(1)$, i.e. there exists some constant $\Omega_1$ such that $\|\eta\|_2 \le \Omega_1$ for all $n$: Hence there exists a subsequence $\eta_{n_k}$ and a point $c \in \mathbb{R}^{p-1}$ such that $\eta_{n_k} \to c$. Along that subsequence we have: Taking limits on both sides and applying DCT, we conclude: By our assumption that $K$ is a symmetric kernel and that $K(t) > 0$ for all $t \in (-1, 1)$, we easily conclude that $c^\top\tilde{Q}\,(2K(c^\top\tilde{Q}) - 1) \ge 0$ almost surely in $\tilde{Q}$, with equality iff $c^\top\tilde{Q} = 0$, which is not possible unless $c = 0$. Hence we conclude that $c = 0$. This shows that any convergent subsequence of $\eta_n$ converges to 0, which completes the proof.
For the linear part, we first establish that $|M(\psi) - M^s(\psi)| = O(\sigma_n)$ uniformly for all $\psi$. Define $\eta = (\psi - \psi_0)/\sigma_n$: Combining, we have where the last inequality holds for all large $n$ as proved in Lemma C.12. Using Lemma C.12 again, we conclude that for any pair of positive constants $(\epsilon_1, \epsilon_2)$: for all large $n$, which implies: σ n E QQ f 0 (0|Q)K (Q η) + R . (C.12) As we want a lower bound on the set $\|\psi - \psi_0^s\| \le K\sigma_n$, we have $\|\eta\| \le K$. For the rest of the analysis, define Clearly $\Lambda \ge 0$ and is continuous on a compact set, hence its infimum is attained. Suppose $\Lambda(v_1, v_2) = 0$ for some $v_1, v_2$. Then we have: which further implies $|\tilde{v}_1^\top\tilde{X}| = 0$ almost surely and violates Assumption C.5. Hence, our claim is demonstrated. On the other hand, for the remainder term of equation (C.12): fix $\nu \in S^{p-1}$. Then: $\nu^\top R\nu$ by Assumption C.1 and Assumption C.4. By a two-step Taylor expansion, we have: The proof is similar to that of Lemma 2.5 and therefore we sketch the main steps briefly. Define the estimating function $f_\psi$ as $f_\psi(Y, Q) = (Y - \gamma)\left(1 - K\left(\frac{Q^\top\psi}{\sigma_n}\right)\right)$ and the collection of functions $F_\zeta = \{f_\psi - f_{\psi_0^n} : d_n(\psi, \psi_0^s) \le \delta\}$. That $F_\zeta$ has finite VC dimension follows from the same argument used to show that $G_n$ has finite VC dimension in the proof of Lemma C.9. Now, to bound the modulus of continuity, we use Lemma 2.14.1 of [25], which implies: This completes the proof.