Generalized M-estimators for high-dimensional Tobit I models

This paper develops robust confidence intervals for high-dimensional, left-censored regression. Type-I censored regression models, where a competing event makes the variable of interest unobservable, are extremely common in practice. We develop smoothed estimating equations that adapt to the censoring level and are more robust to misspecification of the error distribution. We propose a unified class of robust estimators, including one-step Mallow's, Schweppe's, and Hill-Ryan's estimators, that adapt to left-censored observations. In the ultrahigh-dimensional setting, where the dimensionality can grow exponentially with the sample size, we show that as long as the preliminary estimator converges faster than n^{-1/4}, the one-step estimators inherit the asymptotic distribution of the fully iterated version. Moreover, we show that the size of the residuals in the Bahadur representation matches that of pure linear models; that is, the effects of censoring disappear asymptotically. Simulation studies demonstrate that our method adapts to the censoring level and to asymmetry in the error distribution, and loses no efficiency when the errors come from symmetric distributions.


Introduction
Left-censored data arise in many datasets. In physical science applications, observations can be censored due to limits of the measurement process. For example, if a measurement device has a lower detection limit, observations below that limit are recorded at the minimum value, even though the actual result lies below the measurement range. Indeed, many HIV studies must deal with difficulties caused by the lower quantification and detection limits of viral load assays (Swenson et al., 2014). In social science studies, censoring may be implicit in the nonnegative nature of a quantity or induced by human actions. Economic policies such as a minimum wage or a minimum transaction fee result in left-censored data, as quantities below the thresholds are never observed. At the same time, with advances in modern data collection, high-dimensional data, where the number of variables p exceeds the number of observations n, are becoming more and more commonplace. HIV studies are usually complemented with observations of the genetic signature of each patient, making the problem of finding associations between viral load and gene expression values extremely high dimensional.
In general, we cannot develop p-values from high-dimensional observations without further restrictions on the data-generating distribution. A standard way to make progress is to assume that the model is selected consistently (Zhao and Yu, 2006; Fan and Li, 2001), i.e., that the regularized estimator accurately selects the correct set of features. The motivation behind model selection consistency is that, given sparsity of the model at hand, it effectively implies that we can disregard all of the features whose coefficients equal zero. An immediate consequence is that p-values are then well defined for the small selected set of variables; see, for example, Bradic et al. (2011). Such results rely heavily on assumptions such as the "irrepresentable condition" and variants thereof, including but not limited to minimal signal strength (Van De Geer et al., 2009). Thus, if we knew that such conditions held, p-value construction would follow the standard literature on what are essentially low-dimensional problems. Many early applications of regularized methods effectively impose conditions similar to the irrepresentable condition and then rely solely on the output of the regularized estimator. However, such restrictions can make it challenging to discover strong but unexpected significant signals. In this paper, we seek to address this challenge: we show that valid p-values can be well defined for all of the features in the model through the development of a robust, bias-corrected estimator that yields valid asymptotic inference regardless of whether irrepresentable-type conditions hold.
Classical approaches to inference in left-censored models (Tobin, 1958) include maximum likelihood (Amemiya, 1973), consistent estimators of the asymptotic covariance matrix (Powell, 1984), Bayesian methods (Chib, 1992), and maximum entropy principles (Golan et al., 1997). These methods perform well in applications with a small number of covariates (smaller than the sample size), but quickly break down as the number of covariates increases. In this paper, we explore the use of ideas from the high-dimensional literature to improve the performance of these classical methods with many covariates.
We focus on the family of de-biased estimators introduced by Zhang and Zhang (2014), which allow for optimal inference in high dimensions by building an estimator that corrects for the regularization bias. Bias-corrected estimators are related to one-step M-estimators (Bickel, 1975) in that they improve on an initial estimator by following a Newton-Raphson updating rule; however, they differ from classical one-step M-estimators in that their initial estimator is not consistent and a direct estimator of the asymptotic variance does not exist.
Even in low dimensions, however, one-step M-estimators have not been defined for censored regression models, where measurements are censored by fixed constants. Moreover, knowledge of the underlying data-generating mechanism is seldom available, and thus models with fixed censoring are more prone to distributional misspecification. To overcome this, we aim to develop a semi-parametric one-step estimator that makes no distributional assumptions.
Despite their widespread success in estimation problems, there are important hurdles that need to be cleared before one-step M-estimators are directly useful. Ideally, an estimator should be consistent with a well-understood asymptotic sampling distribution regardless of the error distribution, so that a researcher can use it to test hypotheses and establish confidence intervals. Yet, the asymptotics of censored one-step estimators have been largely left open, even in the standard regression contexts. This paper addresses these limitations, developing a regularization-based method for the high-dimensional Tobit model that allows for a tractable asymptotic theory and valid statistical inference.
We begin our theoretical analysis by developing consistency and asymptotic normality results in the context of least absolute deviation regression. We prove these results for a carefully constructed estimator that uses one-step corrections to remove regularization bias, while relying on a new technique, named smoothed estimating equations, which allows for efficient semi-parametric inference.
We also show that generalized M-estimators that are robust to outliers in the feature distribution can be effectively constructed for the Tobit model. Our methodology builds upon classical ideas from Hampel (1974), as well as Huber (1973). Given these general constructions, we show that our consistency and asymptotic normality results hold when the number of features is larger than the sample size.

Related work
From a technical point of view, the main contribution of this paper is an asymptotic normality theory enabling statistical inference in high-dimensional Tobit I models. Powell (1986a), Powell (1986b) and Newey and Powell (1990) established asymptotic properties in the low-dimensional setting where the number of features is fixed, while Song (2011) and Zhao et al. (2014b) developed distribution-free and rank-based tests. Müller and van de Geer (2016) offered a penalized version of Powell's estimator (penalized CLAD). Robustness properties of sample-selection models in low dimensions were recently studied in Zhelonkin et al. (2016). To the best of our knowledge, however, we provide the first set of conditions under which semi-parametric estimators are both asymptotically unbiased and Gaussian in high-dimensional settings, thus allowing for classical statistical inference. The extension to high-dimensional estimates robust to both feature and model outliers is also new.
A growing literature, including Van de Geer et al. (2014), Zhang and Zhang (2014), Ren et al. (2015) and Rinaldo et al. (2016), has considered the use of regularized algorithms for performing inference in high-dimensional regression models. These papers use the bias correction method, and report confidence intervals and p-values for testing feature significance. Meanwhile, Belloni et al. (2013, 2014), Zhao et al. (2014a) and Javanmard and Montanari (2014) use robust approaches to estimate the asymptotic variance, and then use a related bias correction step to remove the effect of regularization. A limitation of this line of work is that, until now, it has lacked formal statistical inference results in the presence of measurements with fixed censoring.
We view our contribution as complementary to this literature, showing that bias correction methods can be applied to partially observed data. We believe that the new methodological tools developed here will be useful beyond the specific class of models studied in this paper. In particular, the tools that utilize the unknown error distribution as a kernel smoother allow for direct analysis of many estimators with non-smooth loss functions.
Several papers use one-step methods for eliminating the bias of regularized estimates. In removing this bias, we follow most closely the approach of Van de Geer et al. (2014), which proposes a bias-corrected estimator for least squares losses and obtains valid confidence intervals. Other related approaches include those of Javanmard and Montanari (2014) and Ning and Liu (2017), which build different variance estimates to determine a more robust bias correction step; however, these papers focus only on least squares losses (more importantly, they do not extend naively to non-smooth or non-differentiable loss functions). Belloni et al. (2014) and Zhao et al. (2014a) discuss one-step approaches for quantile inference; however, their tools and techniques depend heavily on the convexity of the quantile loss (observe that our loss is non-convex due to the fixed, constant left-censoring mechanism; random censoring typically does not suffer from this problem). It is worth mentioning that the approach of Belloni et al. (2017), which proposes a powerful inference method for quantile regression, is based on leveraging principles of doubly robust scores and their estimating equations. Intriguingly, even in low dimensions, doubly robust methods are not necessary for achieving semiparametric efficiency. We show that the newly proposed method achieves efficiency on its own.

Organization of the paper
In Section 2, we propose the smoothed estimating equations (SEE) for left-censored linear models. In Section 3, we present our main result on confidence regions. In Section 4, we develop robust left-censored Mallow's, Schweppe's and Hill-Ryan's estimators, and present their theoretical analysis. Section 5 provides numerical results on simulated data sets. In Section 6, we include discussion and conclusions. We defer more general results for confidence regions, as well as the Bahadur representation of the SEE estimator, to Section 7 in the Appendix. Sections 8 and 9 in the Appendix consist of technical details and proofs.

Inference in left-censored regression
We begin by introducing a general modeling framework and highlighting the difficulty of directly applying existing inferential methods (such as debiasing, score, and Wald constructions) to models with left-censored observations. Finally, we propose a new mechanism, named smoothed estimating equations, to construct semi-parametric confidence regions in high dimensions.

Left-censored linear model
We consider the problem of confidence interval construction where we observe a vector of responses Y = (y_1, ..., y_n)^⊤ and censoring levels c = (c_1, ..., c_n)^⊤ together with covariates X_1, ..., X_p. The type of statistical inference under consideration is regular in the sense that it does not require model selection consistency. A characterization of such inference is that it does not require uniform signal strength in the model. Since ultrahigh-dimensional data often display heterogeneity, we advocate a robust confidence interval framework. We begin with the following latent regression model:

y_i = max{c_i, x_i^⊤ β* + ε_i},  i = 1, ..., n,

where the response Y and the censoring level c are observed, and the vector β* ∈ R^p is unknown. Observe that the censoring mechanism considered here is fixed and non-random. This model is often called the semi-parametric censored regression model whenever the distribution of the error ε is not specified. We assume that {ε_i}_{i=1}^n are independent across i, and are independent of x_i. The matrix X = [X_1, ..., X_p] is the n × p design matrix. We also denote by S_β := {j : β_j ≠ 0} the active set of variables in β and by s_β := |S_β| its cardinality. We restrict our study to the constant-censored model, also called the Type-I Tobit model, where all entries of the censoring vector c are the same. Without loss of generality, we focus on the zero-censored model,

y_i = max{0, x_i^⊤ β* + ε_i},  i = 1, ..., n.  (2.1)
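The zero-censored data-generating process above can be sketched as follows; the design, the sparse signal, and the heavy-tailed error distribution are illustrative choices, not specifications from the paper.

```python
import numpy as np

def simulate_tobit(n, p, beta_star, rng):
    """Simulate from the zero-censored (Type-I Tobit) model (2.1):
    y_i = max(0, x_i' beta* + eps_i), with eps independent of x."""
    X = rng.standard_normal((n, p))
    eps = rng.standard_t(df=3, size=n)   # median-zero, heavy-tailed errors (illustrative)
    y_latent = X @ beta_star + eps
    y = np.maximum(0.0, y_latent)        # fixed left-censoring at 0
    censored = y_latent <= 0.0           # indicator of censored observations
    return X, y, censored

rng = np.random.default_rng(0)
beta_star = np.zeros(200)
beta_star[:3] = [2.0, -1.5, 1.0]         # sparse signal, s_beta = 3
X, y, censored = simulate_tobit(n=100, p=200, beta_star=beta_star, rng=rng)
```

Note that p > n here, so neither the latent coefficients nor the censored responses can be recovered by classical Tobit likelihood methods.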

Smoothed estimating equations (SEE)
In this paper, we take a general approach to the problem of designing robust, semi-parametric inference for left-censored linear models. Our estimator is motivated by the principles of estimating equations. Although estimating equations have been studied in many previous works, the smoothed estimating equations (SEE) framework presented in the following is tailored to the high-dimensional and censored scenario. In addition, the method is simple enough to apply more generally to non-smooth loss functions. We begin by observing that the true parameter vector β* satisfies the population system of equations

E Ψ(β*) = 0,  (2.2)

for some function Ψ(β), often taking the form Ψ(β) = n^{-1} Σ_{i=1}^n ψ_i(β) for a class of suitable functions ψ_i. Observe that for left-censored models ε rarely, if ever, follows a specific distribution. A particular example of interest, which allows error misspecification, is

ψ_i(β) = x_i 1I{x_i^⊤ β > 0} sign(y_i − max{0, x_i^⊤ β}).  (2.3)

The motivation comes from the famous least absolute deviation (l_1) loss. The advantage of the function ψ_i above is that it naturally bounds the effects of outliers; large values of the residuals y_i − max{0, x_i^⊤ β} are down-weighted through the l_1 distance. In fact, we work with the Ψ resulting from this specific choice of ψ_i later in the analysis. Nevertheless, our SEE framework has a much broader spectrum; see Remark 1 below. Other functions Ψ can be applied as well. Another example of a function Ψ with a semiparametric advantage is a variant of the trimmed least squares loss, where the vanilla quadratic loss is multiplied by an indicator function of the residuals. However, with an appropriate choice of Ψ, solving the estimating equations Ψ(β) = 0, although practically desirable, still has several drawbacks, even in the low-dimensional setting. In particular, for semi-parametric estimation and inference in model (2.1), the function Ψ is non-monotone, as the loss is non-differentiable and non-convex.
Hence, the system above has multiple roots, resulting in an estimator that is ill-posed and presents significant theoretical challenges. Instead of solving the system (2.2) directly, we augment it by observing that, for a suitable choice of the matrix Υ ∈ R^{p×p}, β* also satisfies the augmented system of equations (2.4). For certain choices of the matrix Υ we aim to avoid both the non-convexity and the huge dimensionality of the system of equations (2.2). To avoid difficulties with non-smooth functions Ψ, we propose to consider a matrix Υ = Υ(β*), defined in (2.6) below through a smoothed vector S(β) obtained by integrating Ψ over the error distribution. The unknown error distribution smooths the function Ψ and acts as a kernel smoother. In the above, Ψ(β*) = Φ(β*, ε) for a suitable function Φ = n^{-1} Σ_{i=1}^n φ_i with φ_i : R^p × R → R, whereas f_ε denotes the density of the model error in (2.1). Additionally, E_X denotes expectation with respect to the random measure generated by the vectors x_1, ..., x_n.
Following Ψ as in (2.3), the smoothed score function that we work with is

S(β) = E_X [ n^{-1} Σ_{i=1}^n x_i 1I{x_i^⊤ β > 0} (1 − 2 P(ε_i ≤ x_i^⊤ (β − β*))) ],  (2.5)

where P denotes the probability measure generated by the errors ε of (2.1).
The smoothed score typically depends on the unknown density of the error terms and on the unknown parameter of interest. For practical purposes, we propose a suitable estimate of the function (2.5): for homoscedastic errors ε_i, the unknown cdf above can easily be estimated using the empirical distribution function. With this choice of the smoothed score, we obtain the information matrix

Σ(β*) = E_X [ n^{-1} Σ_{i=1}^n x_i x_i^⊤ 1I{x_i^⊤ β* > 0} ].

We then proceed to define the matrix Υ as

Υ(β*) = 2 f_ε(0) Σ(β*).  (2.6)

We note that the matrix above is inspired by the linearization of non-differentiable losses, and is in particular very different from the Hessian or Jacobian matrix typically employed for inference. Throughout the text, we denote the inverse of Σ(β*) by Σ^{-1}(β*), which is assumed to exist. In addition, we write

Σ̂(β) := n^{-1} Σ_{i=1}^n x_i x_i^⊤ 1I{x_i^⊤ β > 0}  (2.7)

for its sample counterpart. To infer the parameter β*, we need to efficiently solve the SEE equation (2.4). Observe that solving the SEE equations (2.4) requires inverting the matrix Υ(β*). For low-dimensional problems, with p ≪ n, this can be done efficiently by considering an initial estimate β̃, a sample plug-in estimate Υ(β̃) of Υ(β*), a sample estimate Ψ(β̃) of E Ψ(β*), and a suitable density estimate f̂_ε(0). However, when p ≫ n, this is highly inefficient. Instead, it is more efficient to directly estimate Υ^{-1}(β*) = Σ^{-1}(β*)/(2 f_ε(0)). Let Ω̂(β̃) be an estimate of Σ^{-1}(β*) (see Section 2.3 for discussion). Then, we proceed to solve the SEE equations approximately, by defining the SEE estimator as β̂ = β̃ + Ω̂(β̃) Ψ(β̃)/(2 f̂_ε(0)).
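The one-step construction above can be sketched numerically. The sketch below assumes the LAD-type score (2.3); the precision estimate `Omega` and the density estimate `f0_hat` are taken as precomputed inputs (their construction is described in Sections 2.3 and 2.4), so this is an illustration of the update, not the paper's implementation.

```python
import numpy as np

def clad_score(beta, X, y):
    """Sample estimating function Psi(beta) built from psi_i in (2.3):
    residuals enter only through their sign, restricted to the region
    x_i' beta > 0 where the censoring is not binding."""
    xb = X @ beta
    active = (xb > 0).astype(float)                  # 1{x_i' beta > 0}
    signs = np.sign(y - np.maximum(0.0, xb))         # sign(y_i - max{0, x_i' beta})
    return (X * (active * signs)[:, None]).mean(axis=0)

def see_one_step(beta_init, X, y, Omega, f0_hat):
    """One-step SEE update: beta_init + Omega Psi(beta_init) / (2 f0_hat)."""
    return beta_init + Omega @ clad_score(beta_init, X, y) / (2.0 * f0_hat)
```

The update is a single matrix-vector product, so its cost is dominated by forming the score and the precision estimate.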

Remark 1.
The proposed SEE can be viewed as a high-dimensional extension of inference from estimating equations. Although we consider a left-censored linear model, the proposed SEE methodology applies more broadly. For example, our framework includes loss functions based on ranks, as well as non-convex loss functions for fully observed data. In contrast, the method of Van de Geer et al. (2014), which is based on inverting the KKT conditions, might not directly apply to non-convex loss functions (e.g., the Cauchy loss) or rank loss functions (e.g., the log-rank loss). The recent methods of Neykov et al. (2015) do not apply to non-differentiable estimating equations (see Section 2.1 therein, where a twice-differentiability assumption is imposed).

Estimation of the scale in left-censored models
We now introduce the methodology for estimating each row of the matrix Σ^{-1}(β*). For further analysis, it is useful to define W(β) as the matrix composed of the row vectors w_i(β) := x_i^⊤ 1I{x_i^⊤ β > 0}, so that Σ̂(β) = n^{-1} W(β)^⊤ W(β). The methodology is motivated by the observation that the rows of Σ^{-1}(β*) can be characterized through regressions of one column of W(β*) on all of the others. This motivates the following estimator of the inverse Σ^{-1}(β*). Let γ̂_(j)(β̃) and τ̂_j^2 denote the estimators of γ*_(j)(β*) and τ_j^2, respectively. We will show that a simple plug-in Lasso-type estimator is sufficiently good for the construction of confidence intervals. We propose to estimate γ*_(j)(β*) with the following l_1 penalized plug-in least squares regression,

γ̂_(j)(β̃) ∈ argmin_{γ ∈ R^{p−1}} { n^{-1} ‖W_j(β̃) − W_{−j}(β̃) γ‖_2^2 + 2 λ_j ‖γ‖_1 }.  (2.8)

Notice that this regression does not trivially share all the nice properties of penalized least squares, as here the rows of the design matrix are not independent and identically distributed. An estimate of τ_j^2 can then be defined through an estimate of the residuals ζ*_j := W_j(β*) − W_{−j}(β*) γ*_(j)(β*). Throughout this paper we assume that ζ*_j has a sub-exponential distribution, and we denote ‖γ*_(j)(β*)‖_0 = s_j for j = 1, ..., p, where ‖·‖_0 denotes the number of nonzero entries of a vector. We propose the plug-in estimate ζ̂_j = W_j(β̃) − W_{−j}(β̃) γ̂_(j)(β̃), and a bias-corrected estimate of τ_j^2 defined as

τ̂_j^2 = n^{-1} ζ̂_j^⊤ ζ̂_j + λ_j ‖γ̂_(j)(β̃)‖_1.  (2.9)

Observe that the naive estimate n^{-1} ζ̂_j^⊤ ζ̂_j does not suffice, due to the bias carried over by the penalized estimate γ̂_(j)(β̃). Lastly, much in the same spirit as Zhang and Zhang (2014), the matrix estimate of Σ^{-1}(β*) is defined, row by row, as

Ω̂_{j,j}(β̃) = τ̂_j^{-2},   Ω̂_{j,k}(β̃) = −τ̂_j^{-2} γ̂_(j),k(β̃)  for k ≠ j.  (2.10)

The proposed scale estimate can be considered a censoring-adaptive extension of the graphical lasso estimate of Van de Geer et al. (2014).
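One row of the scale estimate can be sketched as follows. The `lasso_cd` routine is a generic coordinate-descent lasso solver written here only to keep the sketch self-contained (the paper does not prescribe a solver); `precision_row` then follows the plug-in regression (2.8) and the bias-corrected scale (2.9).

```python
import numpy as np

def lasso_cd(A, b, lam, n_iter=500):
    """Coordinate descent for argmin_g (2n)^{-1}||b - A g||_2^2 + lam ||g||_1."""
    n, d = A.shape
    g = np.zeros(d)
    col_sq = (A ** 2).mean(axis=0)              # n^{-1} ||A_k||_2^2
    r = b - A @ g                               # running residual
    for _ in range(n_iter):
        for k in range(d):
            if col_sq[k] == 0:
                continue
            rho = g[k] + (A[:, k] @ r) / (n * col_sq[k])
            g_new = np.sign(rho) * max(abs(rho) - lam / col_sq[k], 0.0)
            r += A[:, k] * (g[k] - g_new)       # update residual in place
            g[k] = g_new
    return g

def precision_row(W, j, lam):
    """Row j of the precision estimate: nodewise regression of W_j on W_{-j},
    residuals zeta_j, and bias-corrected tau_j^2 as in (2.9)-(2.10)."""
    Wj, Wmj = W[:, j], np.delete(W, j, axis=1)
    gamma = lasso_cd(Wmj, Wj, lam)
    zeta = Wj - Wmj @ gamma                                   # plug-in residuals
    tau2 = zeta @ zeta / len(Wj) + lam * np.abs(gamma).sum()  # bias-corrected scale
    return np.insert(-gamma, j, 1.0) / tau2                   # (1, -gamma') / tau_j^2
```

Running `precision_row` for j = 1, ..., p assembles the full matrix Ω̂(β̃).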

Density estimation
Whenever the model considered is homoscedastic, i.e., the ε_i are identically distributed with density function f_ε (denoted, whenever possible, by f), we propose a novel density estimator designed to be adaptive to the left-censoring in the observations. For a positive bandwidth sequence ĥ_n, we define the density estimator f̂(0) of f(0) in (2.11). Of course, more elaborate smoothing schemes for the estimation of f(0) could be devised for this problem, but there seems to be no a priori reason to prefer an alternative estimator.
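A simple histogram-type estimate of f(0) can be sketched from the residuals of an initial fit. The restriction to observations with x_i^⊤β̃ > 0, where the residual y_i − x_i^⊤β̃ approximates ε_i, is an illustrative censoring-adaptive heuristic and not necessarily the paper's exact formula (2.11).

```python
import numpy as np

def f0_hat(beta_tilde, X, y, h):
    """Estimate the error density at zero from residuals of the initial fit.
    Only observations predicted uncensored (x_i' beta_tilde > 0) are used,
    since there y_i - x_i' beta_tilde approximates eps_i (a heuristic here)."""
    xb = X @ beta_tilde
    keep = xb > 0
    if keep.sum() == 0:
        return 0.0
    resid = y[keep] - xb[keep]
    # fraction of residuals within a bandwidth-h window around zero
    return np.mean(np.abs(resid) <= h) / (2.0 * h)
```

For uniform errors on [-0.9, 0.9] the true value is f(0) = 1/1.8, which the estimator recovers for a large sample and small bandwidth.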

Remark 2.
We will show that a choice of the bandwidth sequence satisfying ĥ_n^{-1} = O(√(n/(s log p))) suffices. However, we also propose an adaptive choice of the bandwidth sequence: we consider ĥ_n = o(1) chosen, for a constant c > 0, in terms of s_β̃, where s_β̃ denotes the size of the estimated support of the initial estimator β̃, i.e., s_β̃ = ‖β̃‖_0.

Confidence intervals
Following the SEE principles, the solution to the equations is defined as the estimator

β̂ = β̃ + Ω̂(β̃) Ψ(β̃)/(2 f̂(0)).  (2.12)

In preparation for the coverage results for the confidence intervals (2.15) and (2.16), we start with the Bahadur representation. Lemmas 1-6 (presented in the Appendix) enable us to establish, for the introduced one-step estimator β̂, a decomposition of the form

β̂ − β* = Σ^{-1}(β*) Ψ(β*)/(2 f(0)) + Δ,

where the vector Δ represents the residual component. We show that the residual vector is uniformly small and that the leading term is asymptotically normal. The theoretical guarantees required of an initial estimator β̃ are presented below.

Condition (I).
An initial estimate β̃ is such that the following properties hold: there exists a sequence of positive numbers r_n with r_n → 0 as n → ∞ such that ‖β̃ − β*‖_2 = O_P(r_n), together with ‖β̃‖_0 = t = O_P(s_{β*}).
One particular choice of such an estimator is the l_1 penalized CLAD estimator studied in Müller and van de Geer (2016),

β̃ := argmin_{β ∈ B} { n^{-1} ‖Y − max{0, Xβ}‖_1 + λ ‖β‖_1 },  (2.14)

which, under suitable conditions, satisfies Condition (I) with r_n^2 = s_{β*} log p/n and ‖β̃‖_0 = O_P(s_{β*} λ_max(X^⊤X)/n). However, other choices are also allowed. It is worth noting that the condition above does not assume model selection consistency of the initial estimator, and the methodology does not rely on having a unique solution to the problem (2.14); any local minimum suffices as long as the prediction error is bounded accordingly.
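A local solution of the non-convex problem (2.14) can be sketched with a proximal subgradient scheme. The solver below is an illustration under assumed choices (least-squares warm start, fixed step size), not the algorithm of Müller and van de Geer (2016); as noted above, any local minimum with small prediction error suffices.

```python
import numpy as np

def penalized_clad(X, y, lam, step=0.05, n_iter=400):
    """Proximal subgradient sketch for the penalized CLAD objective (2.14):
    n^{-1} ||y - max(0, X beta)||_1 + lam ||beta||_1 (non-convex)."""
    # Warm start away from beta = 0, where the CLAD subgradient vanishes
    # (an assumed, hypothetical choice of initialization).
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    for _ in range(n_iter):
        xb = X @ beta
        active = (xb > 0).astype(float)
        # subgradient of the CLAD loss at the current beta
        grad = -(X * (active * np.sign(y - np.maximum(0.0, xb)))[:, None]).mean(axis=0)
        beta = beta - step * grad
        # proximal step for the l_1 penalty: soft-thresholding
        beta = np.sign(beta) * np.maximum(np.abs(beta) - step * lam, 0.0)
    return beta
```

With a fixed step size the iterates oscillate in a small neighborhood of a local solution, which is all that Condition (I) requires of the initial estimate.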
With the normality result for the proposed estimator β̂ (as shown in Theorem 8, Section 7 in the Appendix), we are now ready to present the confidence intervals. Fix α in the interval (0, 1), and let z_α denote the (1 − α)th standard normal percentile. Let c be a fixed vector in R^p. Based on the results of Section 7 in the Appendix, the standard studentized approach leads to a (1 − 2α)100% confidence interval for c^⊤ β* of the form

I_n = [ c^⊤ β̂ − a_n, c^⊤ β̂ + a_n ],  (2.15)

where β̂ is defined in (2.12) and

a_n = z_α √( c^⊤ Ω̂(β̃) Σ̂(β̃) Ω̂(β̃)^⊤ c ) / (2 √n f̂(0)),  (2.16)

with Ω̂(β̃) defined in (2.10), Σ̂(β̃) defined in (2.7), and f̂(0) as defined in (2.11).
For c = e_j, the above provides a coordinatewise confidence interval for each β_j, 1 ≤ j ≤ p. Notice that this confidence interval is robust in the sense that it is asymptotically valid irrespective of the distribution of the error term ε.
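Assembling the interval (2.15)-(2.16) from its ingredients is a one-liner; the sketch below takes the estimates Ω̂, Σ̂, f̂(0) and the de-biased β̂ as given inputs and specializes to c = e_j.

```python
import numpy as np

def see_interval(j, beta_hat, Omega, Sigma, f0, n, z_alpha):
    """(1 - 2 alpha) confidence interval (2.15)-(2.16) for beta_j (c = e_j).
    Omega and Sigma are the plug-in precision and information estimates;
    f0 is the density estimate at zero."""
    var_j = (Omega @ Sigma @ Omega.T)[j, j]            # c' Omega Sigma Omega' c
    a_n = z_alpha * np.sqrt(var_j) / (2.0 * np.sqrt(n) * f0)
    return beta_hat[j] - a_n, beta_hat[j] + a_n
```

The same function with a general contrast vector c in place of e_j yields intervals for arbitrary linear functionals c^⊤β*.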

High-dimensional asymptotics
Within this section, we present theoretical results for a specific initial estimator. However, our methodology has a much broader spectrum of applications. More details on preliminary theoretical results, as well as results more general than those presented below, can be found in Section 7 in the Appendix. We begin with a set of very mild assumptions on the model error.

Theoretical background
There has been considerable work on understanding the theoretical properties of high-dimensional one-step bias-corrected estimators. The convergence and consistency properties of least squares based methods have been studied by, among others, Bickel et al. (2009), Meinshausen and Yu (2009), and Negahban et al. (2009). Meanwhile, their sampling variability has been analyzed by Van de Geer et al. (2014). However, to the best of our knowledge, our Theorem 1 is the first result establishing conditions under which one-step estimators are asymptotically unbiased and normal in high-dimensional Tobit I models.
Probably the closest existing results are those of Belloni et al. (2014) and Zhao et al. (2014a), which showed that high-dimensional quantile models can be successfully de-biased for the purpose of confidence interval construction. However, it is worth noting that their procedures do not adapt to censoring, and their de-biasing methods cannot be applied to fixed, left-censored models. Observe that the optimal Hessian matrix we develop depends on the level of censoring and on an initial estimate, whereas the procedures in the above-mentioned works do not: the post-lasso estimation in Belloni et al. (2014) relies on the score vector being a convex function of the unknown parameters, and the Hessian matrix in Zhao et al. (2014a) depends only on the features. However, under a convexity condition, left-censored models cannot be solved non-parametrically (without knowing the density function of the model error). Of course, a surrogate score vector may be developed, but it then remains unclear whether efficient attainment of the optimal bias-variance decomposition can be achieved. Although the methods of Belloni et al. (2014) and Zhao et al. (2014a) may appear qualitatively similar to the current work in the common choice of the LAD loss, they cannot be used for valid inference in left-censored models.
Non-smooth losses have been studied extensively by Belloni et al. (2013) as well as Belloni et al. (2017), who showed that rates slower than those of smooth counterparts should be expected for many inferential problems; in particular, the rates are slower than those needed for estimation alone. However, it is important to note that in all of these approaches the de-biasing step consists of a non-smooth score and a smooth variance estimate. In our setting, in contrast, we have a non-smooth score as well as a non-smooth Hessian matrix (both treated as functions of the unknown parameters). We identify that this departure in the structure of the problem requires new concentration-of-measure arguments as well as contraction principles for indicator functions: a step not needed in the literature mentioned above. Even in low dimensions, such results are of independent interest, as they provide a novel Bahadur representation for a left-censored semi-parametric method. Instead of using projections for Hessian estimation, inference for Tobit models is usually performed via bootstrap sampling. High-dimensional inference with the bootstrap, however, has proven to be unreliable and inconsistent (unless done after a bias correction step). As observed by Karoui and Purdom (2016), estimators resulting from direct bootstrapping in high dimensions can exhibit surprising properties even in simple situations.
Finally, an interesting question for further theoretical study is to understand the optimal scaling of the sparsity for Tobit models. Size of the model sparsity can be treated as a robustness parameter. It would be of considerable interest to develop methods that adapt to the size of the model sparsity and achieve uniform rates of testing.

Main results
Condition (E). The error distribution F has median 0 and is everywhere continuously differentiable, with density f bounded above by f_max < ∞. We also require the error density to have a bounded first derivative. This excludes densities with unbounded first moment, but includes a class of distributions much larger than the Gaussian.
Moreover, this assumption implies that x_i^⊤ β is distributed much like the error ε_i, for β close to β* and x_i^⊤ β close to the censoring level 0. The last condition in particular implies that P(|x_i^⊤ β| ≤ z) = o(z) for all β close to β*. This condition excludes neither deterministic components of the vector x_i nor components with discrete distributions; only the linear combination x_i^⊤ β must have a Lipschitz-continuous distribution function near zero. Apart from the condition on the error distribution, we need conditions on the censoring level as well as on the design matrix of model (2.1) for further analysis.

Condition (C).
There exist constants C_2 > 0 and φ_0 > 0 such that the stated bounds hold for all β in a neighborhood of β*. The censoring level c_i has a direct influence on the constant C_2: in general, higher values of c_i increase the amount of censored data. The bounds for the coverage probability (see Theorems 1 and 5) do not depend on the censoring level c_i. The fact that the censoring level does not appear directly in the results should be understood in the sense that it is the percentage of censored data that matters, not the censoring level itself. Note that the compatibility factor φ_0 does not impose any restrictions on the censoring of the model, i.e., it is the same as the one introduced for linear models (Bickel et al., 2009). Observe that this condition does not require the distribution of W to be Gaussian or continuous. However, it requires that Σ(β*), the population covariance matrix, is at least invertible, a condition unavoidable even in linear models.
In order to establish theoretical results on the improved one-step estimator, we also need to control the scale estimator in the precision matrix estimation, which requires the following condition.
With the conditions above, we present our main result. More general results, for initial estimators satisfying Condition (I), are presented in Theorems 8 and 9 in the Appendix.
Theorem 1. Let β̃ be defined as in (2.14) with a choice of the tuning parameter λ = A_2 K √(2 log(2p)/n) + √(log p/n) for a constant A_2 > 16 independent of n and p, and let Ω̂(β̃) be defined as in (2.10) with τ̂_j^2 as in (2.9). Let I_n and a_n be defined in (2.15) and (2.16). Then, for all vectors c = e_j and any j ∈ {1, ..., p}, as s̄, n, p → ∞, we have P(c^⊤ β* ∈ I_n) → 1 − 2α. A few comments are in order. Part (i) of Theorem 1 implies that the proposed estimator and confidence intervals have distinct limiting behaviors as the magnitude of the censoring level varies. In particular, (i) implies that ‖γ̂_(j)(β̃) − γ*_(j)(β*)‖_1 inherits the rates available for fully observed linear models whenever C_2 is bounded away from zero. Additionally, if all data are censored, i.e., whenever C_2 converges to zero at a rate faster than λ_j, the estimation error will explode. These results agree with the asymptotic consistency results for left-censored, low-dimensional models; however, they provide additional detail through the exact rates of censoring that can be allowed. For example, ‖β̃ − β*‖_2 < n^{-1/4} is sufficient for optimal inferential rates, and the asymptotic result above matches those for fully observed linear models. In this sense, our results are also efficient.
Part (ii) provides easy-to-verify sufficient conditions for the consistency of a class of semiparametric estimators of the precision matrix for censored regression models. Even in the low-dimensional setting this result appears to be new, and it highlights a specific rate of convergence (see Theorem 1 for more details). Part (iii) establishes properties of the graphical lasso estimate with a data matrix that depends on β̃. In comparison to linear models, the established rate is slower by a factor of √s_j, whereas in comparison to the results of Section 3 of Van de Geer et al. (2014) (see Theorem 3.2 therein) we avoid a strict condition of bounded parameter spaces.
Observe that Part (iv) is a special case of the general theory presented in the Appendix, where we show that a large class of initial estimators suffices.
For the case of low-dimensional problems with s = O(1) and p = O(1), we observe that whenever the initial estimator has rate r_n of order n^{-ϵ}, for a small constant ϵ > 0, then ‖Δ‖_∞ = O_P(n^{-ϵ/2}). In particular, for a consistent initial estimator, i.e., r_n = O(n^{-1/2}), we obtain ‖Δ‖_∞ = O_P(n^{-1/4}). For high-dimensional problems with s and p growing with n, analogous bounds hold for all initial estimators of a suitable order r_n. Classical results on inference for left-censored data, with p ≪ n, only imply that the error rate of the confidence interval is o_P(1); instead, we obtain a precise characterization of the size of the residual term.
Remark 3. For the special case where the initial estimate is the penalized CLAD estimate, we obtain that the confidence interval I_n is asymptotically valid and that the coverage errors are of the order O(s(log p)^{3/4}/n^{1/4}), whenever s(log p)^{1/4}/n^{1/4} = O(1). Moreover, with p ≪ n the rates above match the optimal rates of inference for the absolute deviation loss (see, e.g., Zhou and Portnoy (1996)), indicating that our estimator is asymptotically efficient in the sense that the censoring asymptotically disappears, even for p ≥ n.
The condition s^4 log^3 p ≪ n is also similar to the results in Belloni et al. (2013) obtained for p ≫ n. While it is unclear whether the orthogonal moments approach therein is applicable to the fixed-censoring model, the rate condition required for the quantile procedure is s^3 log^3(p) ≪ n for known density and s^4 log^4(p) ≪ n for unknown density (see Comment 3.3 and equation (ii) therein).
Lastly, observe that the result above is robust in the sense that it holds regardless of the particular distribution of the model error (2.1), and holds in a uniform sense. Thus, the confidence intervals are honest. In particular, the confidence interval I n does not suffer from the problems arising from the nonuniqueness of β * (see Theorem 9 in the Appendix).

Left-Censored Mallow's, Schweppe's and Hill-Ryan's one-step estimators
Statistical models are seldom believed to be complete descriptions of how real data are generated; rather, a model is an approximation that is useful if it captures essential features of the data. Good robust methods perform well even if the data deviate from the theoretical distributional assumptions. The best-known example of this behavior is the outlier resistance and transformation invariance of the median. Several authors have proposed one-step and k-step estimators to combine local and global stability, as well as a degree of efficiency under the target linear model (Bickel, 1975). There have been considerable challenges in developing good robust methods for more general problems. To the best of our knowledge, there is no prior work that discusses robust one-step estimators for the case of left-censored models (in either high or low dimensions). We propose here a family of robust generalized M-estimators (GM estimators) that stabilize estimation in the presence of "unusual" design points or model error distributions. Observe that the error in (2.1) rarely follows a light-tailed distribution. Namely, model (2.1) can be reparametrized so that the effective error ξ_i will often have a skewed distribution with heavier tails, and it is in this regard very important to design estimators that are robust. We introduce Mallow's, Schweppe's and Hill-Ryan's estimators for left-censored models.
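As a small numerical illustration of this reparametrization (a sketch with assumed parameter values, not the paper's exact model (2.1)), left-censoring a response with symmetric Gaussian errors makes the effective error ξ_i visibly right-skewed:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x = rng.normal(size=n)
eps = rng.normal(size=n)           # symmetric, light-tailed model error
y_latent = x + eps                 # latent response with beta* = 1
y = np.maximum(y_latent, 0.0)      # left-censoring at c = 0 (Tobit I)

xi = y - x                         # effective error after censoring

def skewness(z):
    return np.mean((z - z.mean()) ** 3) / np.std(z) ** 3

print(f"skewness of eps: {skewness(eps):+.2f}")
print(f"skewness of xi:  {skewness(xi):+.2f}")  # positive: skewed right
```

The censored residuals ξ_i inherit a pronounced right skew even though ε_i is symmetric, which is exactly why symmetry-based estimators can be misleading here.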

Smoothed robust estimating equations (SREE)
In this section, we propose robust generalized population estimating equations in which ψ is an odd, nondecreasing and bounded function. Throughout, we assume that the function ψ either has finitely many jumps or is differentiable with a bounded first derivative. Notice that when q_i = 1 and v_i = 1, with ψ being the sign function, we have ψ^r_i = ψ_i of the previous section. Moreover, observe that for the weight functions q_i = q(x_i) and v_i = v(x_i), both mappings from R^p to R_+, the true parameter vector β* satisfies the robust population system of equations above. Appropriate weight functions q and v are chosen for particular efficiency considerations. Points with high leverage are considered "dangerous" and should be downweighted by an appropriate choice of the weights v_i. Additionally, if the design has "unusual" points, the weights q_i serve to downweight their effects in the final estimator, hence making generalized M-estimators robust to outliers in both the model error and the model design.
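The population equations (4.1) are not reproduced here, but their empirical form can be sketched. The snippet below assumes a generic weighted score of the form n^{-1} Σ_i q(x_i) ψ(v(x_i) r_i) x_i with a bounded odd ψ (Huber's ψ here); the function names and the leverage-based choice of q_i are illustrative, not the paper's definitions:

```python
import numpy as np

def huber_psi(r, k=1.345):
    """An odd, nondecreasing, bounded score function (Huber's psi)."""
    return np.clip(r, -k, k)

def gm_score(beta, X, y, q, v, psi=huber_psi):
    """Empirical generalized M-estimating equation:
    (1/n) * sum_i q_i * psi(v_i * r_i) * x_i."""
    r = y - X @ beta
    return X.T @ (q * psi(v * r)) / len(y)

rng = np.random.default_rng(1)
n, p = 200, 5
X = rng.normal(size=(n, p))
beta_true = np.zeros(p); beta_true[:2] = 1.0
y = X @ beta_true + rng.standard_t(df=3, size=n)   # heavy-tailed errors

lev = np.sum(X**2, axis=1)                 # crude leverage proxy
q = np.minimum(1.0, np.sqrt(p / lev))      # downweight high-leverage rows
v = np.ones(n)                             # Mallows-style choice: v_i = 1

score = gm_score(beta_true, X, y, q, v)    # close to zero at beta*
```

Because ψ is odd and bounded, the score stays near zero at β* even under heavy-tailed t_3 errors; an estimator solves for the root of this map.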
We augment the system (4.1) similarly to before and consider the resulting system of equations for a suitable choice of the robust matrix Υ^r ∈ R^{p×p}. Ideally, the most efficient estimation is achieved when the matrix Υ^r is close to the matrix that linearizes the smoothed score function of the robust equations (4.1).
To avoid difficulties with the non-smoothness of ψ, we propose to work with a matrix Υ^r that is simultaneously smooth enough and robust. To that end, we consider a smoothed version of the Hessian matrix and work with Υ^r = Υ^r(β*), where f_ε denotes the density of the model error (2.1). To infer the parameter β*, we adapt a one-step approach to solving the empirical counterpart of the population equations above. We name the empirical equations Smoothed Robust Estimating Equations, or SREE in short. Given a preliminary estimate, we solve an approximation of the robust system of equations above and search for the β that solves it. The particular form of the matrix Υ^r(β*) depends on the choice of the weight functions q and v and the function ψ; in particular, for the left-censored model, (4.4) leads to the following form.

Left-censored Mallow's, Hill-Ryan's and Schweppe's estimator
Here we provide specific definitions of the new robust one-step estimates. We begin by defining a robust estimate of the precision matrix, i.e., of {Υ^r}^{-1}(β*). We design a robust estimator that preserves the "downweighting" functions q and v so as to stabilize estimation in the presence of contaminated observations. For further analysis, it is useful to define the matrix W̃(β) = Q^{1/2}W(β), which, with ∘ denoting entry-wise multiplication (the Hadamard product), takes the form of a weighted covariance matrix. Hence, to estimate the inverse {Υ^r}^{-1}(β*), we project columns onto the space spanned by the remaining columns: for j = 1, . . . , p, we define the vector θ̃^{(j)}(β) accordingly. We then propose a robust estimate of the scale together with the corresponding normalizing factor. Lastly, we arrive at a class of robust one-step generalized M-estimators. We propose a one-step left-censored Mallow's estimator for left-censored high-dimensional regression by setting the weights to be v_i = 1 and q_i as given in (4.8). Extending the work of Coakley and Hettmansperger (1993), it is easy to see that Mallow's one-step estimator with α = 1 and b equal to the 0.95 quantile of the chi-squared distribution with ŝ = |Ŝ| degrees of freedom improves the breakdown point of the initial estimator to nearly 0.5, by providing local stability of the precision matrix estimate.
Similarly, the one-step left-censored Hill-Ryan estimator is defined with (4.8), and the one-step left-censored Schweppe's estimator with the same q_i as the left-hand side of (4.8) but v_i = 1/q_i. Note that these are not the only possible choices of Hill-Ryan and Schweppe's type estimators.
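The three weighting schemes can be contrasted side by side. The leverage-based form of q_i below, with exponent α and chi-squared quantile cutoff b, follows the classical Coakley-Hettmansperger/Hampel taxonomy and is only a plausible reading of (4.8); likewise, v_i = q_i for Hill-Ryan is the standard convention, assumed here rather than quoted from the paper:

```python
import numpy as np

def chi2_quantile(df, level, rng):
    # Monte Carlo chi-squared quantile (avoids a SciPy dependency).
    return np.quantile(rng.chisquare(df, size=200_000), level)

def gm_weights(X, scheme="mallows", alpha=1.0, level=0.95, seed=0):
    """Illustrative Mallow's / Hill-Ryan / Schweppe weight choices."""
    n, s_hat = X.shape
    rng = np.random.default_rng(seed)
    b = chi2_quantile(s_hat, level, rng)      # cutoff b ~ chi^2_{s_hat, 0.95}
    d = np.sum(X**2, axis=1)                  # squared-norm leverage proxy
    q = np.minimum(1.0, (b / d) ** (alpha / 2.0))
    if scheme == "mallows":
        v = np.ones(n)                        # v_i = 1
    elif scheme == "hill-ryan":
        v = q.copy()                          # v_i = q_i (assumed convention)
    elif scheme == "schweppe":
        v = 1.0 / q                           # v_i = 1 / q_i, as in the text
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    return q, v

X = np.random.default_rng(2).normal(size=(100, 4))
for s in ("mallows", "hill-ryan", "schweppe"):
    q, v = gm_weights(X, s)
    print(s, q.min(), v.max())
```

The schemes differ only in where the downweighting acts: Mallow's weights the score from outside, Schweppe's rescales the residual inside ψ, and Hill-Ryan does both symmetrically.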
Another family of one-step estimators defined for Tobit-I models, for which we can use the framework above, is the class of adaptive Huber's one-step estimators, where v_i = 1 and q_i = 1, and the function ψ takes the form of the first-order derivative of the Huber loss. However, it is unclear what the benefit of such a loss would be for left-censored data, as the nice convexity property of traditional least squares is no longer available regardless.
The purpose of this paper is to explore the behavior of different types of one-step estimators for left-censored regression models by studying their higher-order asymptotic properties. This provides a unified synthesis of results as well as new results and insights. We will show that the effect of the initial estimate persists asymptotically only if it is of least-squares type. We also show that the one-step robust estimate has fast convergence rates and leads to a class of robust confidence intervals and tests.

Theoretical results
Similarly to the concise version of the Bahadur representation presented in (2.13) for the standard one-step estimator with q_i = 1 and v_i = 1, we also have an analogous expression for the robust generalized M-estimator, but now with a leading term of a different form. Next, we show that the leading component has an asymptotically normal distribution and that the residual term is of smaller order. To facilitate the presentation, we state the results below with the initial estimator taken to be the penalized CLAD estimator (2.14), with the choice of tuning parameter as presented in Theorem 1. We introduce the following condition.

Condition (rΓ). The parameter θ* is Lipschitz continuous for all β satisfying Condition (C). In addition, let q_i and v_i be functions such that max_i |q_i| ≤ M_1 and max_i |v_i| ≤ M_2 for positive constants M_1 and M_2.

We will show that, for the proposed set of weight functions, the above condition holds. Boundedness of the function ψ allows for error distributions with unbounded moments and provides the necessary robustness to possible outliers in the model error. For the leading term of the Bahadur representation (4.9), we obtain the following result.
Theorem 3. Let Conditions (C), (rΓ) and (E) hold, and let λ_j = C(log p/n)^{1/2} for a constant C > 1. Assume that s̄ log^{1/2}(p)/n^{1/4} = o(1) for s̄ = s_{β*} ∨ s̄_Ω with s̄_Ω = max_j s̄_j.

Remark 5. The estimation procedure described above is based on the initial estimator β̂ taken to be the penalized CLAD. However, it is possible to show that a large family of sparsity-encouraging estimators suffices. In particular, suppose that the initial estimator β̃ is such that ‖β̃ − β*‖_2 ≤ γ_n, and let, for simplicity, s_{β*} = s. Then the results of Theorem 3 extend to the confidence interval defined as Ī_n = (c'β̌ − a_n, c'β̌ + a_n) with a_n as in (4.11); in particular, the error rates are of the order (γ_n^{1/2}t^{1/4} ∨ γ_n t^{1/2})t^{1/2}(log p)^{1/2}, up to an additional term involving √n and the sparsity levels.

With the results above, we can now construct a (1 − 2α)100% confidence interval for c'β of the form I^r_n = (c'β̌ − ǎ_n, c'β̌ + ǎ_n), (4.10) where β̌ is defined in (4.7), c = e_j for some j ∈ {1, 2, . . . , p}, with a suitably defined robust covariance estimate.

Remark 6. The constants M_1 and M_2 change with the choice of the robust estimator. For Mallow's and Hill-Ryan's estimators, by Lemma 5 in the Appendix, the coverage probability is the same as that of the M-estimator. However, the coverage of Schweppe's estimator converges slightly more slowly: the results of Lemma 1 and Lemma 5 in the Appendix, together with Theorem 5 in the Appendix, yield a rate that is slower by a factor of s_{β*}, i.e., the leading term is of the order O(s²(log(p ∨ n))^{3/4}n^{-1/4}).

Theorem 4. Under the conditions of Theorems 2 and 3, the stated bound on the residual term holds for Mallow's and Hill-Ryan's estimators, whereas for Schweppe's estimator a slower bound holds.

Remark 7. This result implies that the sizes of the residual terms depend on the type of weight functions chosen. Due to the particular left-censoring, the ideal weights measuring concentration in the error or design depend on the unknown censoring. Hence, we approximate the ideal weights with plug-in estimators and therefore obtain rates of convergence that are slightly slower than those of non-robust estimators. This implies that the robust confidence intervals require a larger sample size to achieve the nominal level.

Numerical results
In this section, we present a number of numerical experiments in both high-dimensional, p > n, and low-dimensional, p < n, simulated settings.
We implemented the proposed estimator in a number of different model settings. Specifically, we vary the following parameters of the model. The number of observations, n, is taken to be 300, while p, the number of parameters, is taken to be 40 or 400. The error of the model, ε, is generated from a number of distributions: standard normal, Student's t with 4 degrees of freedom, Beta with parameters (2, 3) and Weibull with parameters (1/2, 1/5). In the case of non-zero-mean distributions, we center the observations before generating the model data. The sparsity of β*, s_{β*} = #{j : β*_j ≠ 0}, is taken to be 3, with all signal parameters equal to 1 and located in the first three coordinates. The n × p design matrix, X, is generated from a multivariate normal distribution N(μ, Σ). The mean μ is chosen to be the zero vector, and the censoring level c is chosen to fix the censoring proportion at 25%. The covariance matrix Σ is taken to be either the identity matrix or the Toeplitz matrix with Σ_ij = ρ^{|i−j|} for ρ = 0.4. In each case, we generated 100 samples from one of the settings described above, and for each sample we calculated the 95% confidence interval. The complete algorithm is described in Steps 1-4 below. We note that the optimization problem required to obtain the penalized CLAD estimator is not convex. Nevertheless, it is possible to write (2.14) as a linear program within the compact set B and solve it accordingly (Powell, 1984). In addition, as our theory indicates, we allow for any initial estimator with the desired convergence rate; penalized CLAD is one example thereof.
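The data-generating mechanism just described can be sketched as follows; setting the censoring threshold via the empirical quantile of the latent response is one plausible way to fix the censoring proportion at 25%:

```python
import numpy as np

def make_tobit_data(n=300, p=400, s=3, rho=0.4, censor_prop=0.25, seed=0):
    rng = np.random.default_rng(seed)
    # Toeplitz covariance Sigma_ij = rho^{|i-j|}; identity when rho = 0.
    idx = np.arange(p)
    Sigma = rho ** np.abs(idx[:, None] - idx[None, :])
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    beta = np.zeros(p)
    beta[:s] = 1.0                          # signal in the first s coordinates
    y_latent = X @ beta + rng.normal(size=n)
    c = np.quantile(y_latent, censor_prop)  # threshold giving ~25% censoring
    y = np.maximum(y_latent, c)
    return X, y, beta, c

X, y, beta, c = make_tobit_data()
print(X.shape, float(np.mean(y == c)))      # (300, 400) 0.25
```

Changing `rho` to 0 recovers the identity-covariance design, and `p=40` gives the low-dimensional regime.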
1. The penalization factor λ is chosen by the one-standard-deviation rule of cross-validation: λ̃ = arg min_{λ∈{λ_1,...,λ_m}} CV(λ), and we then move λ in the direction of decreasing regularization until it ceases to be true that CV(λ) ≤ CV(λ̃) + SE(λ̃). The standard error of the cross-validation curve, SE(λ), is defined as the sample standard error of the K-fold cross-validation statistics CV_1(λ), . . . , CV_K(λ). These are computed using the censored LAD loss, with β̂^{−k}(λ) denoting the CLAD estimator computed on all but the k-th fold of the data.
2. The tuning parameter λ_j in each penalized l_2 regression is chosen by the one-standard-deviation rule described above. In more detail, λ_j is moved in the direction of decreasing regularization until it ceases to be true that CV_j(λ_j) ≤ CV_j(λ̃_j) + SE_j(λ̃_j), with λ̃_j the cross-validation minimizer. The cross-validation statistic is here defined with γ̂^{−k}_j(λ_j) denoting the estimator (2.8) computed on all but the k-th fold of the data. This choice leads to conservative confidence intervals, wider than the optimal length. A theoretically guided optimal choice is highly complicated, as it depends on both the design distribution and the censoring level concurrently; nevertheless, we show that the one-standard-deviation choice is very reasonable.
3. Whenever the density of the error term is unknown, we estimate f(0) using the proposed estimator (2.11) with a constant c = 10. We compute this estimator by splitting the sample into two parts: the first sample is used for computing the regression estimators and the other to compute the estimate f̂(0). The optimal value of h is of independent interest; however, it is not the main objective of this work.
4. Obtain β̂ by plugging Ω̂(β̂) and f̂(0) into (2.12), with λ and λ_j as specified in the steps above.
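Steps 1 and 2 both use the one-standard-deviation rule; given a grid of candidate λ values and per-fold CV losses, the selection logic can be sketched generically (the grid is assumed sorted from most to least regularized):

```python
import numpy as np

def one_sd_rule(lambdas, cv_fold_losses):
    """One-standard-deviation rule: among the candidates at least as
    regularized as the CV minimizer, pick the most regularized lambda
    whose CV loss is within one standard error of the minimum.

    lambdas:        (m,) grid, sorted from most to least regularized.
    cv_fold_losses: (K, m) per-fold cross-validation statistics.
    """
    K = cv_fold_losses.shape[0]
    cv = cv_fold_losses.mean(axis=0)                       # CV(lambda)
    se = cv_fold_losses.std(axis=0, ddof=1) / np.sqrt(K)   # SE(lambda)
    i_min = int(np.argmin(cv))
    threshold = cv[i_min] + se[i_min]
    i_sel = min(i for i in range(i_min + 1) if cv[i] <= threshold)
    return lambdas[i_sel]

lambdas = np.array([1.0, 0.5, 0.25, 0.1])
base = np.array([1.0, 0.405, 0.40, 0.60])
offsets = np.array([-0.02, -0.01, 0.0, 0.01, 0.02])
losses = base[None, :] + offsets[:, None]      # 5 folds, common shift
print(one_sd_rule(lambdas, losses))            # 0.5
```

In the toy grid, λ = 0.25 minimizes CV but λ = 0.5 lies within one standard error of the minimum, so the more regularized value is returned; the same logic applies with the censored LAD CV statistics of Steps 1-2.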
The summary of the results is presented by the dimensionality of the parameter vector. The low-dimensional regime with the SEE estimator is summarized in Table 1 and Figures 1 and 2; the high-dimensional regime is summarized in Table 2 and Figures 3 and 4. We report average coverage probabilities for the signal and noise variables separately, as the signal variables are more difficult to cover than the noise variables.
We consider a number of challenging settings. Specifically, the censoring proportion is kept relatively high at 25%, and the parameter space is large, with p = 400 and n = 300. In addition, we consider the case of the error distribution being Student's t with 4 degrees of freedom, which is notoriously difficult to deal with in left-censored problems. For the four error distributions, the observed coverage probabilities are approximately the same.
We also note that symmetric distributions are very difficult to handle in left-censored models. Nevertheless, when the errors were symmetric (normal), the coverage probabilities were extremely close to the nominal ones. The simulations show that our method is robust to asymmetric distributions and does not lose efficiency when the errors are symmetric.
Lastly, to investigate the smoothed robust estimating equations (SREE) empirically, we preserve the previous high-dimensional settings with standard normal and Student's t_4 error distributions, respectively. However, to illustrate the robustness of the estimator, we artificially create outliers in the design matrix X and perform Mallow's-type SREE estimation with the perturbed X. Within each iteration, after generating X from N(μ, Σ) accordingly, we randomly select 10% of the columns and then randomly perturb 10% of the entries in X by adding twice the maximum entry of X, i.e., X_ij = X_ij + 2 × max_ij X_ij. Such perturbations create a considerable proportion of outliers in the design. The results are summarized in Table 3 and Figures 5 and 6. As the coverages under the various scenarios are close to the nominal level, the results show that the SREE estimator is robust to high-leverage points.
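The perturbation scheme for the robustness experiment can be sketched directly from the description above; which entries within a selected column are shifted is our reading of "10% of the entries":

```python
import numpy as np

def perturb_design(X, col_frac=0.10, entry_frac=0.10, seed=0):
    """Randomly select ~10% of columns, then shift ~10% of the entries
    in each selected column by twice the maximum entry of X."""
    rng = np.random.default_rng(seed)
    Xp = X.copy()
    n, p = X.shape
    shift = 2.0 * X.max()                   # 2 x max_ij X_ij, as in the text
    n_cols = max(1, int(col_frac * p))
    n_rows = max(1, int(entry_frac * n))
    for j in rng.choice(p, size=n_cols, replace=False):
        rows = rng.choice(n, size=n_rows, replace=False)
        Xp[rows, j] += shift
    return Xp

X = np.random.default_rng(3).normal(size=(300, 400))
Xp = perturb_design(X)
print(int(np.sum(Xp != X)))                 # 1200 = 40 columns x 30 entries
```

The shifted entries sit far outside the bulk of the Gaussian design, producing exactly the high-leverage contamination the SREE experiment is meant to stress-test.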

Discussion and conclusion
In this article, we enrich regular high-dimensional inferential methods with censoring and robustness options. While the censoring option adds to the capacity of existing inferential methods by extending them to non-convex problems in general, the robustness option has the potential to open a new direction. Usually, inferential methods have aimed to be efficient, with asymptotically exact or pivotal properties in a class of specific models. However, sometimes the nature of the data collection process dictates that significant noise is inevitable for some observations, or that portions of the observations have been corrupted by an adversary. In big and high-dimensional data settings, such cases may occur naturally. When the cost of error is too large to bear, it may be wise to consider an alternative that can improve inferential accuracy in a stepwise manner. With one-step robust estimators, one can often successfully iterate the estimate and identify misleading observations, thereby limiting the effect of poor data quality. The aim of this article is to establish a new framework for high-dimensional robust inference. Many different loss functions and penalty functions, including non-convex ones, may be incorporated into this framework for the purpose of achieving correct inferential tools. We provide a novel theory, with emphasis on diverging dimensions and left-censoring. Future work will be devoted to better utilizing longitudinal and heterogeneous observations. There are many one-step estimators based on a suitable choice of loss function or estimating equations, some of which have proved to work well, especially when the dimension is reasonably high. Our proposed method allows for left-censoring, non-smooth, non-convex losses and/or non-monotone equations, and complements the existing methods in these domains.
Our method achieves rates comparable to those of efficient methods (with full observations), and our analysis provides tight control over both Type I and Type II error rates, which makes it a practically useful and efficient alternative.

Appendix
General results along with theoretical considerations are presented. In addition, statements and proofs of Lemmas 1-6 and Theorems 5-9, as well as proofs of main Theorems 1-4, are included.

General results
We begin the theoretical analysis with the decomposition of (2.12) given in (7.1). We can further decompose the last factor of the last term in (7.1). To characterize the behavior of the individual terms in the decomposition above, we develop a sequence of results, presented below, that rely on the conditions listed in Section 3.

Lemma 1. Suppose that Condition (E) holds. Consider the class of parameter spaces modeling sparse vectors with at most t non-zero elements, C(r_n, t) = {w ∈ R^p : ‖w‖_2 ≤ r_n, Σ_{j=1}^p 1I{w_j ≠ 0} ≤ t}, where r_n is a sequence of positive numbers. Then, there exists a fixed constant C (independent of p and n) bounding the corresponding process.

The preceding lemma immediately implies a strong approximation of the empirical process by its expected process, as long as r_n, the estimation error, and t, the size of the estimated set of the initial estimator, are sufficiently small. The power of Lemma 1 is that it holds uniformly over a class of parameter vectors, enabling a wide range of choices for the initial estimator.
Next, we present a linearization result useful for further decomposition of the Bahadur representation (7.1).
Once the properties of the initial estimator are available, as in Condition (I), Lemma 2 can be used to linearize the population-level difference of the functions ψ_i(β) and ψ_i(β*). Together with Lemma 1, Lemma 2 allows us to bypass the original highly discontinuous and non-convex loss function. Utilizing Lemma 2, Conditions (I) and (C) and the representation (7.1), we obtain the Bahadur representation (7.3) of β̂. We show that each of the last four terms on the right-hand side of (7.3) converges to 0 asymptotically at a faster rate than the first term.
The following two lemmas help to establish an l_1 column bound for the corresponding precision matrix estimator. The first provides properties of the estimator γ̂^{(j)}(β̂) as defined in (2.8). Although this estimator is obtained via a Lasso-type procedure, significant challenges arise in its analysis due to dependencies in the plug-in loss function: the design matrix of this problem does not have independent and identically distributed rows. We overcome these challenges by approximating the solution by the oracle one, without imposing any new conditions on the design matrix.

Lemma 4. Let λ_j = C((log p/n)^{1/2} ∨ r_n^{1/2}t^{1/4}(log p/n)^{1/2} ∨ r_n t^{3/4}(log p/n)^{1/2}) for a constant C > 1, and let Conditions (I), (E), (C) and (Γ) hold. Then,
Remark 8. The choice of the tuning parameter λ_j depends on the l_2 convergence rate r_n of the initial estimator and on the size of its estimated non-zero set. However, we observe that whenever r_n ≤ t^{−3/4} and the sparsity of the initial estimator is such that t s_j log p/n < 1, the optimal choice of the tuning parameter is of the order (log p/n)^{1/2}. In particular, any initial estimator that satisfies r_n < n^{−1/4} is sufficient for optimal rates of inference in a model where t ≤ n^{1/4} and s_j ≤ n^{1/4}. The next result gives a bound on the variance of the estimator γ̂^{(j)}(β̂).

Lemma 5. Let the setup of Lemma 4 hold, and let Ω̂(β̂) be the estimator in (2.10). Then, for τ̂²_j as in (2.9), we have τ̂_j^{−2} = O_P(1).

The one-step estimator β̂ relies crucially on the bias-correction step, which carefully projects the residual vector in a direction close to the most efficient score. The next result measures the uniform distance of such a projection.

Lemma 6. Let the setup of Lemma 4 hold. There exists a fixed constant C (independent of p and n) such that the process V_n(δ) is bounded with probability 1 − δ, with a constant K_1 defined in Condition (E).
Lemma 6 establishes a uniform tail probability bound for a growing supremum of the empirical process V_n(δ). The bound is uniform in δ, and the supremum is taken over p (possibly growing, p = p(n)) coordinates of the process. The proof of Lemma 6 is further challenged by the non-smooth components of the process V_n(δ) itself and by the multiplicative nature of the factors within it. It proceeds in two steps. First, we show that for a fixed δ the term ‖V_n(δ)‖_∞ is small. In the second step, we devise a new epsilon-net argument to control the non-smooth and multiplicative terms uniformly over all δ simultaneously. This is established by devising new representations of the process that allow for small covering numbers. In conclusion, Lemma 6 establishes the uniform bound ‖I_4‖_∞ = O_P(r_n^{1/2}t^{3/4}(log p/n)^{1/2} ∨ r_n t(log p/n)^{1/2} ∨ t log p/n) in (7.3). The size of the remainder term in (2.13) is controlled by the results of Lemmas 1-6, and we provide details below.
Theorem 5. Let λ_j = C((log p/n)^{1/2} ∨ r_n^{1/2}t^{1/4}(log p/n)^{1/2} ∨ r_n t^{3/4}(log p/n)^{1/2}) for a constant C > 1, and let Conditions (I), (E), (C) and (Γ) hold, with s_Ω = max_j s_j.

We first notice that the expression above requires t = O(n^{1/2}/log(p ∨ n)), a condition frequently imposed in high-dimensional inference (see Zhang and Zhang (2014), for example). In the case of low-dimensional problems with s = O(1) and p = O(1), we observe that whenever the initial estimator has rate r_n of order n^{−ϵ}, for a small constant ϵ > 0, then ‖Δ‖_∞ = O_P(n^{−ϵ/2}). In particular, for a consistent initial estimator, i.e., r_n = O(n^{−1/2}), we obtain ‖Δ‖_∞ = O_P(n^{−1/4}). For high-dimensional problems with s and p growing with n, a corresponding bound is obtained for all initial estimators of order r_n such that r_n = O(s_{β*}^a(log p)^b/n^c) and t = O(s_{β*}). Next, we present the result on the asymptotic normality of the leading term of the Bahadur representation (2.13).
Theorem 6 is stated under the assumption that the density of ε at 0 is known.

Remark 9. A few remarks are in order. Theorem 6 implies that the effects of censoring asymptotically disappear. Namely, the limiting distribution becomes degenerate only when the censoring rate asymptotically explodes, implying that no data are fully observed. In all other cases, the limiting distribution is fixed and does not depend on the censoring level.
Density estimation is a necessary step in semiparametric inference for left-censored models. Below we present a result guaranteeing good properties of the density estimator proposed in (2.11).

Theorem 7. Let Conditions (I) and (E) hold. Then,
Together with Theorem 6 we can provide the next result.

Corollary 2. With the choice of density estimator as in (2.11), under the conditions of Theorems 6 and 7, the results of Theorem 6 continue to hold unchanged, i.e.,

Remark 10. Observe that the result above is robust in the sense that it holds regardless of the particular distribution of the model error (2.1). Condition (E) assumes only minimal regularity conditions on the existence and smoothness of the density of the model errors. In the presence of censoring, our result is unique in that it allows p ≫ n and yet successfully estimates the variance of the estimation error.
Combining all the results obtained in the previous sections, we arrive at the main conclusions.

Proofs of main theorems
Proof of Theorem 1. The proof for the result with the initial estimator chosen as the penalized CLAD estimator of Müller and van de Geer (2016) follows directly from Lemmas 1-6 and Theorems 5-8 with r_n = s_{β*}^{1/2}(log p/n)^{1/2} and t = s_{β*}.

Proof of Theorems 2, 3 and 4. Due to space constraints, we follow the line of the proof of Theorem 6 and only give details where the proof differs. First, we observe, with a slight abuse of notation, that it suffices to provide the asymptotics of the leading term. Moreover, observe that the R^r_i are necessarily bounded random variables (see Condition (rΓ)). Following similar steps as in Theorem 6, we obtain the corresponding bound, where in the last step we utilized Hoeffding's inequality for bounded random variables. Next, we focus on establishing an equivalent of Lemma 2, but now for the robust generalized M-estimator. Observe (8.1), and that whenever ψ′ exists we have the corresponding expansion for some α ∈ (0, 1). When ψ′ does not exist, we can decompose ψ into a finite sum of step functions and then apply exactly the same technique to each of the step functions as in Lemma 2; hence, it suffices to discuss the differentiable case only. Let us denote the right-hand side of (8.1) by Λ^r_n(β)(β* − β).
Next, we observe that, by Condition (rΓ), the relevant quantity is bounded by a constant C_1 < ∞. With that, the remaining steps of Lemma 2 can be completed with Σ replaced by Σ^r. Next, by inspecting the proofs of Lemmas 3, 4 and 5, we see that they continue to hold under Condition (rΓ) and with W replaced by W̃. The constants K appearing in the simpler case now become KM_1M_2; however, the rates remain the same up to these constant changes.
Next, we discuss Lemma 6. For the robust generalized M-estimator, the process ν_n(δ) of Lemma 6 takes an analogous form. We consider the same covering sequence as in Lemma 6, and we observe that a bound equivalent to that on T_1 of Lemma 6 is also achievable here.
The term T_2 can be handled similarly as in Lemma 6. We illustrate the particular differences only for T_21, as the others follow similarly. Furthermore, we observe that the same techniques developed in Lemma 6 apply to T^r_{211}(δ); hence we only discuss the case of T^r_{212}(δ). We begin by considering the decomposition T^r_{212}(δ) = T^r_{2121}(δ) + T^r_{2122}(δ), and we focus on the last expression, as it is the most difficult one to analyze. Observe that we are interested in the difference T^r_{2122}(δ) − T^r_{2122}(δ_k). We decompose this difference into four terms, two related to the random variables and two related to the expectations. We handle them separately and observe that, because of the symmetry and monotonicity of the indicator functions, once we can bound the difference of the random variables, we can repeat the arguments for the expectations. First, due to the monotonicity of the indicators and (9.16), I_11 can be handled in the same manner as T_21 of the proof of Lemma 6 (as sup ψ < ∞), whereas I_12 = O_P(L̄_n). For I_13, it suffices to discuss the difference at the end of the right-hand side of its expression. It is not difficult to see that |E_ε(ψ′(ξ_δ)) − E_ε(ψ′(ξ_{δ_k}))| ≤ 4C v_i L̄_n ≤ 4C M_1 L̄_n, with C = sup_x |ψ″(x)| for the case of twice-differentiable ψ, C = sup_y |∂/∂y ∫_{−∞}^{y} ψ′(x)dx| for the case of once-differentiable ψ, and C = f_max for the case of non-differentiable ψ. Combining everything, we observe that the rate of Lemma 6 for the case of robust generalized M-estimators holds with M_3 = sup_x |ψ′(x)| for once-differentiable ψ and M_3 = f_max for non-differentiable ψ. Now that the equivalents of Lemmas 1-6 are established, we can use them to bound the successive terms in the Bahadur representation much like those of Theorem 1. Details are omitted due to space considerations.
For Theorem 4 in the Main Material, the same line of proof as for Theorem 9 applies, with the matrix Σ replaced by the matrix Σ^r. The result then follows from the arguments in Remark 2 in the Main Material. The uniformity of the obtained results is not compromised, as the weight functions q_i and v_i depend only on the design matrix.
Proof of Theorem 5. The proof follows from bounding the residual terms in the Bahadur representation (7.3) with the help of Lemmas 3-6.
For the term I_3, the bound follows by applying Hölder's inequality and Hoeffding's inequality along with Lemma 5. For the term I_2, the bound follows by Hölder's inequality and Lemma 5, where ‖A‖_∞ denotes the maximum row sum of a matrix A and ‖A‖_1 denotes the maximum column sum of A.
Lastly, for the remaining term in (7.3), I_1, we apply Hölder's inequality and Lemma 5.

Proof of Theorem 6. Recall that by Condition (E), P(ε_i ≥ 0) = 1/2. We observe that, in distribution, the term of interest equals w_i(β*)R_i, with {R_i}_{i=1}^n denoting an i.i.d. Rademacher sequence defined as R_i = sign(−ε_i). Hence, it suffices to analyze the distributional properties of w_i(β*)R_i. Moreover, the Rademacher random variables are independent in distribution of w_i(β*). We define V_i accordingly and set T_n := Σ_{i=1}^n V_i. Notice that the V_i are independent of each other, since each observation is independent in our design; in addition, also due to this fact, V_i follows a symmetric distribution about 0. With a slight abuse of notation, we denote the density and distribution of T_n by f(t_n) and F(t_n). Combining (8.2) and (8.3), we verify the Lyapunov condition lim_{n→∞} Σ_{i=1}^n E|V_i|^{2+δ}/(Var T_n)^{1+δ/2} = 0, and thereby arrive at the asymptotic normality of T_n; this gives the conclusion for each j = 1, · · · , p. Notice that for two nonnegative real numbers a and b, |√a − √b| ≤ √|a − b|. We first note a result from the proof of Theorem 8, namely that the relevant quantity is bounded away from zero; then √a is also bounded away from zero by (8.5). The rate above follows from (8.9) in the proof of Theorem 8 and is of order smaller than the rate assumed in Theorem 5. Thus, we can deduce the required bound for some finite constant C. Applying Slutsky's theorem to (8.4) together with the inequality above, the desired result is obtained.
Proof of Theorem 7. We can rewrite the expression $\hat f(0)$ in (2.11) using a similar argument and the fact that $\lim_{n\to\infty} \hat h_n / h_n = 1$. Now we work on the numerator of the right-hand side. Specifically, letting $\eta_i = y_i - x_i^\top \beta^*$ and $\hat\eta_i = y_i - x_i^\top \hat\beta$, we examine the difference of the two corresponding quantities. We begin with term $T_1$. By Condition (E), we have $ET_1 = O(h_n^{-1} \|\hat\beta - \beta^*\|_1)$, and by Corollary 1 it follows that $T_1$ is of order $O_P(1)$. For term $T_2$, we work out the expression. Next, we notice that for real numbers $a$ and $b$, we have $\mathbb{1}(a > 0) - \mathbb{1}(b > 0) \le \mathbb{1}(|b| \le |a - b|)$. To bound $T_{21}$, we use similar techniques as for $T_1$. It is easy to see that $|h_n - \eta_i|$ shares the nice properties of the density of $\varepsilon_i$. Thus, $ET_{21}$ is bounded by $O_P(1)$. Then, by Hoeffding's inequality, with probability approaching 1, $T_{21}$ is $O_P(1)$. $T_{22}$ can be bounded by exactly the same steps. Finally, we put everything together and, applying Slutsky's theorem, the result follows directly.

Proof of Corollary 2. By multiplying and dividing by the term $\hat f(0)$, we can rewrite the term on the left-hand side accordingly. As a result of Theorem 7, and since Condition (E) guarantees that $f(0)$ is bounded away from 0, we have $\hat f(0)/f(0) \xrightarrow{d} 1$. Finally, applying Slutsky's theorem and Theorem 6, we obtain the result with asymptotic covariance

\[
\Omega(\beta)\,\Sigma(\beta)\,\Omega(\beta).
\]
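The elementary inequality $\mathbb{1}(a>0) - \mathbb{1}(b>0) \le \mathbb{1}(|b| \le |a-b|)$ used in the proof of Theorem 7 can be checked directly: the left side equals 1 only when $a > 0 \ge b$, in which case $|a-b| = a - b \ge -b = |b|$. A quick numerical confirmation (our own sketch):

```python
import numpy as np

rng = np.random.default_rng(1)
a = rng.uniform(-1.0, 1.0, size=100_000)
b = rng.uniform(-1.0, 1.0, size=100_000)

lhs = (a > 0).astype(int) - (b > 0).astype(int)   # 1{a>0} - 1{b>0}
rhs = (np.abs(b) <= np.abs(a - b)).astype(int)    # 1{|b| <= |a-b|}

# The inequality holds for every sampled pair.
violations = int(np.sum(lhs > rhs))
print(violations)  # 0
```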
Proof of Theorem 8. The result of Theorem 8 is a simple consequence of Wald's device and the results of Corollary 2. The only missing link is an upper bound on (8.6). First, observe the following.

We notice that (8.8) holds. For (8.7), we have the following bound for some positive constant $C$, where $\|A\|_\infty$ denotes the maximum absolute row sum of a matrix $A$ and $\|A\|_{\max}$ denotes its maximum absolute entry. By Lemma 1, we can easily bound the term above by $O_P\big(K^2 s_\Omega^{1/2} (r_n^{1/2} t^{3/4} (\log p/n)^{1/2} \vee t \log p/n)\big)$. For (8.8), we start with the following term. Applying Hoeffding's inequality to this term, we have that, with probability approaching 1, it is bounded by $O_P(n^{-1/2})$. We then bound term (8.8), for some constant $C$. Term $T_{22}$ can be bounded using Lemma 5 and the results for term $T_{21}$, and turns out to be of order $O_P\big(K^4 s_\Omega^{3/2} \lambda_j (r_n^{1/2} t^{3/4} (\log p/n)^{1/2} \vee t \log p/n)\big)$.
Lastly, by Lemma 5, term $T_{23}$ is of order $O_P(K^2 s \cdots)$. Putting the terms together, we obtain the displayed bound on $\|\hat\Sigma(\hat\beta)\hat\Omega(\hat\beta) - I\|_{\max}$. Thus, $\|\hat\Sigma(\hat\beta)\hat\Omega(\hat\beta)\|_{\max}$ is $O_P(1)$, and $T_2$ can be shown to be so similarly. The expression (8.6) is then bounded accordingly, which completes the proof.
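The max-norm manipulations in this proof rest on the elementary bound $\|AB\|_{\max} \le \|A\|_\infty \|B\|_{\max}$, since $|(AB)_{ij}| \le \sum_k |A_{ik}| \cdot \max_{k,j} |B_{kj}|$. A small random check (our illustration only, with arbitrary dimensions):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(30, 40))
B = rng.normal(size=(40, 20))

a_inf = np.abs(A).sum(axis=1).max()   # ||A||_inf: max absolute row sum
b_max = np.abs(B).max()               # ||B||_max: largest absolute entry
ab_max = np.abs(A @ B).max()          # ||AB||_max

print(bool(ab_max <= a_inf * b_max))  # True
```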
Proof of Theorem 9. The result of Theorem 9 holds by observing that the Bahadur representation (7.3) remains accurate uniformly over the sparse vectors $\beta \in B$; hence, all the steps of Theorem 5 apply in this case as well.

Proofs of lemmas
Proof of Lemma 1. Let $\{\delta_k\}_{k\in[N_\delta]}$ be the centers of the balls of radius $r_n \xi_n$ that cover the set $C(r_n, t)$. Such a cover can be constructed with $N_\delta \le p^t (3/\xi_n)^t$ (see, for example, Van der Vaart, 2000). Furthermore, let $B(\delta_k, r)$ be a ball of radius $r$ centered at $\delta_k$ with elements that have the same support as $\delta_k$. In what follows, we will bound $\sup_{\delta \in C(r_n,t)} |D_n(\delta)|$ using an $\epsilon$-net argument. In particular, using the notation introduced above, we have the decomposition
\[
\sup_{\delta \in C(r_n,t)} |D_n(\delta)| \le \underbrace{\max_{k \in [N_\delta]} |D_n(\delta_k)|}_{T_1} + \underbrace{\max_{k \in [N_\delta]} \sup_{\delta \in B(\delta_k,\, r_n\xi_n)} |D_n(\delta) - D_n(\delta_k)|}_{T_2}. \tag{9.1}
\]
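The covering bound $N_\delta \le p^t (3/\xi_n)^t$ comes from a standard two-step count, which can be sketched as:

```latex
\[
N_\delta
\;\le\;
\underbrace{\binom{p}{t}}_{\substack{\text{choices of a}\\ t\text{-element support}}}
\times
\underbrace{\left(\frac{3 r_n}{r_n \xi_n}\right)^{t}}_{\substack{r_n\xi_n\text{-net of a radius-}r_n\\ \text{ball in } \mathbb{R}^{t}}}
\;\le\; p^{t}\,(3/\xi_n)^{t},
\]
```

using $\binom{p}{t} \le p^t$ and the volumetric bound that a Euclidean ball of radius $r$ in $\mathbb{R}^t$ admits an $\eta$-net of cardinality at most $(3r/\eta)^t$ for $\eta \le r$.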

We first bound the term $T_1$ in (9.1). With a slight abuse of notation, we use $l$ to denote the density of $x_i^\top \beta^*$, where (i) follows from dropping a negative term, and (ii) follows from taking the absolute value within the second expectation. We then apply linearization techniques to the difference $w_i(\delta_k) - w_i(0)$,
where (iii) follows by the mean value theorem and (iv) from Condition (E). Hence, we have that, almost surely, $|Z_{ik}| \le C \max_i |x_i^\top \delta_k|$ for a constant $C < \infty$. For a fixed $k$, Bernstein's inequality (see, for example, Section 2.2.2 of Van der Vaart and Wellner, 1996) gives a bound involving $\sum_{i\in[n]} |x_i^\top \delta_k|$ and $\log(2/\delta)/n$, with probability $1 - \delta$. Observe that for $\sum_{i\in[n]} |x_i^\top \delta_k|$ we have
\[
\sum_{i\in[n]} |x_i^\top \delta_k| \le C^2 \sqrt{n\, \delta_k^\top X^\top X\, \delta_k} \le C^2 n r_n t^{1/2}, \tag{9.3}
\]
where the first step follows using the Cauchy-Schwarz inequality.
Hence, with probability $1 - 2\delta$, we have the stated bound for all $\lambda_j \ge A \sqrt{\log p/n}$. Using the union bound over $k \in [N_\delta]$, with probability $1 - 2\delta$, the bound holds uniformly. Let us now focus on bounding the term $T_2$.
We further simplify the expression, with a slight abuse of notation. Then it is clear that $EZ_{ik} = 0$, with $\mathrm{Var}(Z_{ik})$ as computed earlier. The term $T_{21}$ can be bounded in a similar way to $T_1$ by applying Bernstein's inequality, and hence the details are omitted; the resulting bound holds with probability $1 - 2\delta$. A bound on $T_2$ now follows using a union bound over $k \in [N_\delta]$. We can choose $\xi_n = n^{-1}$, which gives $N_\delta \le p^t (3n)^t$. With these choices, we obtain the stated rate, which completes the proof.
Proof of Lemma 2. We begin by rewriting the term $n^{-1} \sum_{i=1}^n \psi_i(\beta)$, aiming to represent it through indicator functions; observe (9.4). Using the fundamental theorem of calculus, we obtain an expansion in which $F$ is the univariate distribution function of $\varepsilon_i$. Therefore, taking the expectation over $\varepsilon$, we can obtain an expression free of the $y_i$,
for some $u^*$ between 0 and $x_i^\top(\beta^* - \beta)$, and where we have defined $\Delta$ as above. We then show a bound for $\Delta$. Moreover, the original expression is also no greater than $\mathbb{1}(|x_i^\top \beta^*| \le |x_i^\top(\beta - \beta^*)|)$. The term (9.6) can be bounded by the design matrix setup and Condition (E). With the help of Hölder's inequality, we bound $|(9.5)|$, and by the triangle inequality and Condition (E) we can further upper bound the right-hand side. We are then ready to put the terms together and obtain a bound for $\Delta$. Additionally, by the design matrix setup, the bound holds for $\|\beta - \beta^*\|_1 < \xi$ and a constant $C$. Essentially, this proves that $\Delta$ is not greater than a constant multiple of the difference between $\beta$ and $\beta^*$; the conclusion follows as $n \to \infty$.

Proof of Lemma 3. For simplicity of notation we fix $j = 1$ and denote $\hat\gamma^{(1)}(\beta)$ by $\hat\gamma(\beta)$. The proof is composed of two steps: the first establishes a cone set and an event set of interest, whereas the second proves the rate of the estimation error by certain approximation results.
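The step via the fundamental theorem of calculus in the proof of Lemma 2 can be sketched as follows (our rendering of the omitted display, using the notation above):

```latex
\[
\mathbb{E}_{\varepsilon}\,\mathbb{1}\{\varepsilon_i \le u\} = F(u),
\qquad
F\!\left(x_i^{\top}(\beta^{*}-\beta)\right)
= F(0) + f(u^{*})\,x_i^{\top}(\beta^{*}-\beta),
\]
```

for some $u^{*}$ between $0$ and $x_i^{\top}(\beta^{*}-\beta)$, by the mean value form of the expansion; integrating out $\varepsilon_i$ in this way removes the dependence on $y_i$.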
Next, we focus on the term $H_1$. A simple computation shows the corresponding bound for all $k = 2, \cdots, p$. Observe that the sequence $\{u_i\}$, $i = 1, \cdots, n$, is a sequence of independent random variables. As $\varepsilon_i$ and $x_i$ are independent, we have by the tower property $E[r_i] = E_X\big[X_{ik}\, \mathbb{1}\{x_i^\top \beta^* > 0\}\, E_\varepsilon[\zeta^*_{1,i}]\big] = 0$. Moreover, as $\zeta^*_1$ is a sub-exponential random vector, Bernstein's inequality and the union bound yield a tail bound. We pick $c$ to be $(\log p/n)^{1/2}$; then, with probability converging to 1,
\[
h^* \le \|H_1\|_\infty + \|H_2\|_\infty \le (\log p/n)^{1/2} + C_1 r_n^{1/2} t^{3/4} (\log p/n)^{1/2} + C_2 t \log p/n \le \frac{a-1}{a+1}\lambda_1,
\]
for some constants $C_1$ and $C_2$. Thus, with $\lambda_1$ chosen as $\lambda_1 = C\big((\log p/n)^{1/2} \vee r_n^{1/2} t^{3/4} (\log p/n)^{1/2} \vee t \log p/n\big)$ for some constant $C > 1$, we have $h^* \le (a-1)/(a+1)\lambda_1$ with probability converging to 1. More directly, with this condition on the penalty parameter $\lambda_1$, the event defining the cone set (9.8) holds with high probability.
Proof of Lemma 4. Recall the definitions of $\hat\zeta_j$ and $\zeta^*_j$. Observe that the following inequality holds, by the triangle inequality and Hölder's inequality. We proceed to upper bound each of the three terms on the right-hand side of the previous inequality. First, we observe (9.11). Moreover, the conditions imply that $\|W_j(\beta)\|_\infty \le K$ (by the design matrix condition), and by Lemma 3, for $\lambda_j$ as defined, the right-hand side is $O_P(K s_j \lambda_j \vee K)$. Thus, we conclude $\|\hat\zeta_j + \zeta^*_j\|_\infty = O_P(K \vee K s_j \lambda_j)$.
Its multiplying term can be decomposed as follows, where $\circ$ denotes entrywise multiplication between two vectors. The reason we must take such care in separating the terms is that we are bounding a 1-norm here, rather than an infinity-norm, which is bounded easily. We start with term (i). Notice the bound that follows by Hölder's inequality and the design matrix condition. Moreover, by Lemma 1 we can easily bound the term above by $O_P\big(K r_n^{1/2} t^{3/4} (\log p/n)^{1/2} \vee K t \log p/n\big)$, with $r_n$ and $t$ as defined in Condition (I).
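The entrywise-product terms here are controlled by the Hölder-type bound $\|u \circ v\|_1 \le \|u\|_\infty \|v\|_1$; a minimal numerical check (our own illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
u = rng.normal(size=1_000)
v = rng.normal(size=1_000)

lhs = np.abs(u * v).sum()                 # ||u o v||_1
rhs = np.abs(u).max() * np.abs(v).sum()   # ||u||_inf * ||v||_1

print(bool(lhs <= rhs))  # True
```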
Proof of Lemma 5. We begin by establishing that $\hat\tau_j^{-2} = O_P(1)$. In the case when the penalty part $\lambda_j \|\hat\gamma^{(j)}(\beta)\|_1$ happens to be 0, which means $\hat\gamma^{(j)}(\beta) = 0$, the worst-case scenario is that the regression part also equals 0, i.e.,
\[
0 = W_j(\beta) - W_{-j}(\beta)\hat\gamma^{(j)}(\beta). \tag{9.13}
\]
We show that these terms cannot equal zero simultaneously, since this would force $W_j(\beta) = 0$, which is not true. Thus, $\hat\tau_j^{-2}$ is bounded away from 0. In order to establish the results about the matrices $\hat\Omega(\beta)$ and $\Omega(\beta^*)$, we first provide a bound on $\hat\tau$ and $\tau$. This is critical, since the magnitude of $\Omega(\cdot)$ is determined by $\tau$. To derive the bound on the $\tau$'s, we decompose the terms carefully and bound each one of them.
By definition we have $\tau_j^2 = n^{-1} E \zeta_j^{*\top} \zeta_j^*$, for which $\hat\tau_j^2$ is an estimate. The quantities $\tau_j^2$ and $\hat\tau_j^2$ carry information about the magnitudes of the entries of $\Sigma^{-1}(\beta^*)$ and $\hat\Omega(\beta)$, respectively. We next break $\tau_j^2$ and $\hat\tau_j^2$ into parts related to the difference between $\hat\gamma^{(j)}(\beta)$ and $\gamma^{*(j)}(\beta^*)$, which we know how to control, yielding a decomposition into terms I and II; the task now boils down to bounding each of them independently. Term I is bounded by Lemma 4 and is of order $O_P(K^2 s_j \lambda_j)$. Regarding term II, we first point out one consequence of the Karush-Kuhn-Tucker conditions of (6), which, combined with Lemma 3, yields the bound for term II. Putting all the pieces together, we have shown the stated rate.

Proof of Lemma 6. For simplicity of the proof we introduce some additional notation. Let $\delta = \hat\beta - \beta^*$ and $\nu_n(\delta) = n^{-1} \sum_{i=1}^n \Omega(\beta)\big(\psi_i(\beta) - \psi_i(\beta^*)\big)$.
Observe that $\mathbb{1}\{y_i - x_i^\top \hat\beta \le 0\} = \mathbb{1}\{x_i^\top \delta \ge \varepsilon_i\}$ and hence $1 - 2\,\mathbb{1}\{y_i - x_i^\top \hat\beta > 0\} = 2\,\mathbb{1}\{y_i - x_i^\top \hat\beta \le 0\} - 1$. The term we wish to bound can then be expressed as $V_n(\delta) = \nu_n(\delta) - E\nu_n(\delta)$, for $\nu_n(\delta)$ denoting the corresponding normalized sum. Let $\{\delta_k\}_{k\in[N_\delta]}$ be centers of the balls of radius $r_n \xi_n$ that cover the set $C(r_n, t)$. Such a cover can be constructed with $N_\delta \le p^t (3/\xi_n)^t$ (see, for example, Van der Vaart, 2000). Furthermore, let $B(\delta_k, r) = \{\delta \in \mathbb{R}^p : \|\delta_k - \delta\|_2 \le r,\ \mathrm{supp}(\delta) \subseteq \mathrm{supp}(\delta_k)\}$ be a ball of radius $r$ centered at $\delta_k$ with elements that have the same support as $\delta_k$. In what follows, we will bound $\sup_{\delta \in C(r_n,t)} \|V_n(\delta)\|_\infty$ using an $\epsilon$-net argument. In particular, using the notation introduced above, we have the decomposition
\[
\sup_{\delta \in C(r_n,t)} \|V_n(\delta)\|_\infty \le \underbrace{\max_{k\in[N_\delta]} \|V_n(\delta_k)\|_\infty}_{T_1} + \underbrace{\max_{k\in[N_\delta]} \sup_{\delta \in B(\delta_k,\, r_n\xi_n)} \|V_n(\delta) - V_n(\delta_k)\|_\infty}_{T_2}. \tag{9.14}
\]
Observe that the term $T_1$ arises from the discretization of the set $C(r_n, t)$. To control it, we will apply tail bounds for each fixed $k$. The term $T_2$ captures the deviation of the process in a small neighborhood around the fixed center $\delta_k$. For those deviations we will provide covering number arguments. In the remainder of the proof, we provide the details for bounding $T_1$ and $T_2$.
We first bound the term $T_1$ in (9.14). Let $a_{ij}(\beta) = e_j^\top \Omega(\beta) x_i$. We are going to decouple the dependence on $x_i$ and $\varepsilon_i$. With a slight abuse of notation, we use $f$ to denote the density of $\varepsilon_i$ for all $i$.
Observe that $E[f_i(\delta) g_i(\delta) \mid x_i] = f_i(\delta)\, P(\varepsilon_i \le x_i^\top \delta)$. We use $w_i(\delta)$ to denote the right-hand side of the previous equation.
Note that $E[Z_{ijk} \mid \{x_i\}_{i\in[n]}] = 0$. With a slight abuse of notation, we use $l$ to denote the density of $x_i^\top \beta^*$ for all $i$.