On the Finite-Sample Analysis of $\Theta$-estimators

In large-scale modern data analysis, first-order optimization methods are usually favored for obtaining sparse estimators in high dimensions. This paper performs a theoretical analysis of a class of iterative thresholding based estimators defined in this way. Oracle inequalities are established to show the nearly minimax rate optimality of such estimators under a new type of regularity condition. Moreover, the sequence of iterates is shown to approach the statistical truth geometrically fast, up to the best statistical accuracy. Our results also reveal the different benefits brought by convex and nonconvex types of shrinkage.


Introduction
Big data naturally arising in machine learning, biology, signal processing, and many other areas call for scalable optimization in computation. Although for low-dimensional problems Newton or quasi-Newton methods converge fast and have efficient implementations, they typically do not scale well to high-dimensional data. In contrast, first-order optimization methods have recently attracted a great deal of attention from researchers in statistics, computer science, and engineering. They iterate based on the gradient (or a subgradient) of the objective function, and each iteration step is cost-effective. In high-dimensional statistics, a first-order algorithm typically proceeds as

β^(t+1) = P(β^(t) − α∇l(β^(t)); λ),   (1)

where P is an operator that is easy to compute, ∇l denotes the gradient of the loss function l, and α gives the stepsize. Such a simple iterative procedure is suitable for large-scale optimization, and converges in arbitrarily high dimensions provided α is properly small. The operator P can be motivated from the perspective of statistical shrinkage or regularization, and is necessary to achieve good accuracy when the dimensionality is moderate or high. For example, a proximity operator (Parikh and Boyd, 2014) is associated with a convex penalty function. But the problems of interest are not always convex. Quite often, P is taken to be a certain thresholding rule Θ in statistical learning, such as SCAD (Fan and Li, 2001). The resulting computation-driven estimators, which we call Θ-estimators, are fixed points of β = Θ(β − ∇l(β); λ).
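As a concrete illustration of the scheme (1), the following minimal sketch (our own toy construction; the helper names are ours, and soft-thresholding stands in for a generic operator P) runs the iteration on a small least-squares problem:

```python
import numpy as np

def soft_threshold(t, lam):
    # Soft-thresholding: the proximity operator of the l1 penalty.
    return np.sign(t) * np.maximum(np.abs(t) - lam, 0.0)

def first_order_step(beta, grad_l, alpha, lam, P=soft_threshold):
    # One step of the generic scheme: beta <- P(beta - alpha * grad l(beta); lam).
    return P(beta - alpha * grad_l(beta), lam)

# Toy least-squares loss l(beta) = ||y - X beta||^2 / 2 on a sparse model.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 10))
beta_true = np.zeros(10)
beta_true[:3] = [3.0, -2.0, 1.5]
y = X @ beta_true

grad_l = lambda b: X.T @ (X @ b - y)
alpha = 1.0 / np.linalg.norm(X, 2) ** 2   # a "properly small" stepsize
beta = np.zeros(10)
for _ in range(500):
    beta = first_order_step(beta, grad_l, alpha, lam=0.1)
```

After enough iterations, `beta` is (numerically) a fixed point of the iteration map, i.e., a Θ-estimator in the sense above.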
To study the non-asymptotic behavior of Θ-estimators (regardless of the sample size and dimensionality), we will establish some oracle inequalities.
During the last decade, people have performed rigorous finite-sample analyses of many high-dimensional estimators defined as globally optimal solutions to convex or nonconvex problems; see Bunea et al. (2007), Zhang and Huang (2008), Bickel et al. (2009), Lounici et al. (2011), Zhang and Zhang (2012), She (2014), among many others. Θ-estimators pose some new questions. First, although an associated optimization criterion can always be constructed for a given Θ-estimator, the objective may not be convex, and the estimator may not correspond to any local (or global) minimum of it. Second, there are various types of Θ-estimators due to the abundant choices of Θ, but a comparative study of their statistical performance in high dimensions is lacking in the literature. Third, Θ-estimators are usually computed inexactly on big datasets; indeed, most practitioners (have to) terminate (1) before full computational convergence. These disconnects between theory and practice in the use of iterative thresholding motivate our work.
The rest of the paper is organized as follows. Section 2 introduces Θ-estimators, the associated iterative algorithm (TISP), and some necessary notation. Section 3 presents the main results, including oracle inequalities and a sequential analysis of the iterates generated by TISP. Section 4 provides the proof details.

Thresholding functions
Definition 1 (Thresholding function). A thresholding function is a real-valued function Θ(·; λ): R → R satisfying (i) Θ(−t; λ) = −Θ(t; λ); (ii) Θ(t; λ) ≤ Θ(t′; λ) for t ≤ t′; (iii) Θ(t; λ) → ∞ as t → ∞; and (iv) 0 ≤ Θ(t; λ) ≤ t for 0 ≤ t < ∞. A vector version of Θ (still denoted by Θ) is defined componentwise if either t or λ is replaced by a vector.

From the definition, the generalized inverse

Θ^{-1}(u; λ) := sup{t : Θ(t; λ) ≤ u}, u ≥ 0,   (2)

must be monotonically non-decreasing, and so its derivative is defined almost everywhere on (0, ∞). Given Θ, a critical number L_Θ ≤ 1 can be introduced such that dΘ^{-1}(u; λ)/du ≥ 1 − L_Θ for almost every u ≥ 0, or

L_Θ := 1 − ess inf_{u ≥ 0} dΘ^{-1}(u; λ)/du,

where ess inf is the essential infimum. For the perhaps most popular soft-thresholding and hard-thresholding functions, L_Θ equals 0 and 1, respectively. For any arbitrarily given Θ, we construct a penalty function

P_Θ(t; λ) := ∫_0^{|t|} (Θ^{-1}(u; λ) − u) du   (4)

for any t ∈ R. This penalty will be used to make a proper objective function for Θ-estimators. The threshold τ(λ) := Θ^{-1}(0; λ) may not equal λ in general. For ease of notation, in writing Θ(·; λ) we always assume that λ is the threshold parameter, i.e., λ = τ(λ), unless otherwise specified. Then an important fact is that given λ, any thresholding rule Θ satisfies Θ(t; λ) ≤ Θ_H(t; λ) for all t ≥ 0, due to property (iv), from which it follows that

P_Θ(t; λ) ≥ P_H(t; λ), where P_H(t; λ) := (−t²/2 + λ|t|) 1_{|t|<λ} + (λ²/2) 1_{|t|≥λ}.

In particular, P_H(t; λ) ≤ P_0(t; λ) := (λ²/2) 1_{t≠0} and P_H(t; λ) ≤ P_1(t; λ) := λ|t|. When Θ has discontinuities, such as at t = ±λ for Θ_H(·; λ), ambiguity may arise in the definition. To avoid the issue, we assume the quantity to be thresholded never corresponds to a discontinuity of Θ. This assumption is mild, because practically used thresholding rules have few discontinuity points and such discontinuities rarely occur in real applications.
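To make the construction of P_Θ from Θ concrete, the sketch below (our own illustration; the grid-based numerical inversion and helper names are ours) computes the penalty by numerically integrating Θ^{-1}(u; λ) − u, and recovers the known closed forms λ|t| for soft-thresholding and λ²/2 for hard-thresholding when |t| > λ:

```python
import numpy as np

def theta_soft(t, lam):
    # Soft-thresholding (L_Theta = 0).
    return np.sign(t) * np.maximum(np.abs(t) - lam, 0.0)

def theta_hard(t, lam):
    # Hard-thresholding (L_Theta = 1).
    return np.where(np.abs(t) > lam, t, 0.0)

def theta_inverse(theta, u, lam, grid):
    # Generalized inverse: Theta^{-1}(u; lam) = sup{ t : Theta(t; lam) <= u },
    # approximated on a finite grid of t values.
    vals = theta(theta_arg := grid, lam)
    return theta_arg[vals <= u].max()

def penalty(theta, t, lam, n_u=2000, n_grid=4000):
    # P_Theta(t; lam) = integral over u in (0, |t|) of (Theta^{-1}(u) - u) du,
    # computed with a simple Riemann sum.
    us = np.linspace(1e-9, abs(t), n_u)
    grid = np.linspace(0.0, abs(t) + 2.0 * lam, n_grid)
    integrand = np.array([theta_inverse(theta, u, lam, grid) - u for u in us])
    return float(np.sum(integrand) * (us[1] - us[0]))

p_soft = penalty(theta_soft, 2.0, 0.5)   # close to lam * |t| = 1.0
p_hard = penalty(theta_hard, 2.0, 0.5)   # close to lam^2 / 2 = 0.125
```

The numerical values match P_1(t; λ) = λ|t| and the plateau λ²/2 of P_H, illustrating the formulas above.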

Θ-estimators
We assume the model y = Xβ* + ε, where X is an n × p design matrix, y is a response vector in R^n, β* is the unknown coefficient vector, and ε is a sub-Gaussian random vector with mean zero and scale bounded by σ; cf. Definition 2 in Section 4 for more detail. Then a Θ-estimator β̂, driven by the computational procedure (1), is defined as a solution to the Θ-equation

ρβ̂ = Θ(ρβ̂ + X^T(y − Xβ̂)/ρ; λ),   (8)

where ρ, the scaling parameter, does not depend on β. Equivalently, letting β̃ = ρβ̂ and X̃ = X/ρ, (8) takes the scaled form

β̃ = Θ(β̃ + X̃^T(y − X̃β̃); λ).   (9)

Having ρ appropriately large is crucial to guarantee the convergence of the computational procedure. All popularly used penalty functions are associated with thresholdings, such as ℓ_r (0 < r ≤ 1), ℓ_2, SCAD (Fan and Li, 2001), MCP (Zhang, 2010a), capped ℓ_1 (Zhang, 2010b), ℓ_0, the elastic net (Zou and Hastie, 2005), Berhu (Owen, 2007; He et al., 2013), and ℓ_0 + ℓ_2 (She, 2009), to name a few. Table 1 lists some examples. From a shrinkage perspective, thresholding rules usually suffice in statistical learning.
We will show that the λ in the scaled form does not have to adjust for the sample size, which is advantageous in regularization parameter tuning. A simple iterative procedure can be defined based on (8) or (9):

β̃^(t+1) = Θ(β̃^(t) + X̃^T(y − X̃β̃^(t)); λ),   (10)

which is called the Thresholding-based Iterative Selection Procedure (TISP) (She, 2009). From Theorem 2.1 of She (2012), given an arbitrary Θ, TISP ensures the function-value descent property f(β̃^(t+1)) ≤ f(β̃^(t)) when ρ ≥ ‖X‖₂/√(2 − L_Θ). Here, the energy function (objective function) is constructed as

f(β̃) = ‖y − X̃β̃‖₂²/2 + Σ_j P(|β̃_j|; λ),   (12)

where the penalty P can be P_Θ as defined in (4), or more generally P_Θ + q, with q an arbitrary function satisfying q(t; λ) ≥ 0 for all t ∈ R and q(t; λ) = 0 if t = Θ(s; λ) for some s ∈ R. Furthermore, we can show that when ρ > ‖X‖₂/√(2 − L_Θ), any limit point of β̃^(t) is necessarily a fixed point of (8), and thus a Θ-estimator. See She (2012) for more detail. Note that f is not necessarily unique when Θ has discontinuities: for example, the capped ℓ_1 penalty, P_0(t; λ) = (λ²/2) 1_{t≠0}, and P_H are all associated with the same Θ_H. Because of this many-to-one mapping from penalty functions to thresholding functions, iterating (1) with a well-designed thresholding rule is often more convenient than solving a nonconvex penalized optimization problem. Indeed, some penalties (like SCAD) are designed from the thresholding viewpoint.
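A minimal TISP sketch (our own toy data and names; soft-thresholding plays the role of Θ, for which P_Θ(t; λ) = λ|t|) exhibits both the descent property of the energy f and the fixed-point characterization of the resulting Θ-estimator:

```python
import numpy as np

def theta_soft(t, lam):
    return np.sign(t) * np.maximum(np.abs(t) - lam, 0.0)

rng = np.random.default_rng(1)
n, p = 100, 20
X = rng.standard_normal((n, p))
beta_star = np.zeros(p)
beta_star[:4] = [2.0, -3.0, 1.5, 2.5]
y = X @ beta_star + 0.1 * rng.standard_normal(n)

# Scale the design: rho = ||X||_2 exceeds ||X||_2 / sqrt(2 - L_Theta)
# for soft-thresholding (L_Theta = 0), so descent is guaranteed.
rho = np.linalg.norm(X, 2)
Xs = X / rho                 # scaled design, ||Xs||_2 = 1
lam = 0.3

def energy(b):
    # f(b) = ||y - Xs b||^2 / 2 + sum_j P_Theta(|b_j|; lam), with P_Theta = lam|t|.
    return 0.5 * np.sum((y - Xs @ b) ** 2) + lam * np.sum(np.abs(b))

b = np.zeros(p)
energies = []
for _ in range(300):
    b = theta_soft(b + Xs.T @ (y - Xs @ b), lam)   # one TISP step
    energies.append(energy(b))

beta_hat = b / rho   # map the fixed point back to the original scale
```

The energy values are non-increasing along the iterations, and the limit satisfies the Θ-equation numerically.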

Theorem 1. Let β̂ be a local minimum point (or a coordinate-wise minimum point) of f defined in (12). Then β̂ is a Θ-estimator, i.e., it satisfies the Θ-equation (8).
The converse is not necessarily true. Namely, Θ-estimators may not guarantee functional local optimality, let alone global optimality. This raises difficulties in statistical analysis. We will give a novel and unified treatment that yields a nearly optimal error rate for various thresholdings.

Main Results
To address the problems in arbitrary dimensions (with possibly large p and/or n), we aim to establish non-asymptotic oracle inequalities (Donoho and Johnstone, 1994).
Unless otherwise specified, we study scaled Θ-estimators satisfying equation (9), where β̃ = ρβ̂, X̃ = X/ρ, and ρ ≥ ‖X‖₂ (so that ‖X̃‖₂ ≤ 1). By abuse of notation, we still write β̂ for β̃ and X for X̃. As mentioned previously, we always assume that Θ is continuous at β̂ + X^T y − X^T Xβ̂ in Sections 3.1 and 3.2; similarly, Section 3.3 assumes that Θ is continuous at β^(t) + X^T y − X^T Xβ^(t) for every t. Past works on the lasso show that a certain incoherence requirement must be assumed to obtain sharp error rates. In most theorems, we also need to make similar assumptions to prevent the design matrix from being too collinear. We will state a new type of regularity conditions, called comparison regularity conditions, under which oracle inequalities and sequential statistical error bounds can be obtained for any Θ.

P_Θ-type oracle inequalities under R_0
In this subsection, we use P_Θ to bound the prediction error of Θ-estimators. Our regularity condition is stated as follows.
In the case of hard-thresholding or SCAD thresholding, P_Θ(β; λ) does not depend on the magnitude of β, and we can get a finite complexity rate in the oracle inequality. Also, R_0 can be slightly relaxed by replacing K P_Θ(β; λ) with K P_0(β; λ) in (15); we denote the modified version by R′_0(δ, ϑ, K, β, λ).

Corollary 2. Suppose that Θ corresponds to a bounded nonconvex penalty satisfying P_Θ(t; λ) ≤ Cλ² for all t ∈ R and some constant C > 0. Then, in the setting of Theorem 2, the error bound (17) holds.

The right-hand side of the oracle inequalities involves a bias term ‖Xβ − Xβ*‖₂² and a complexity term P_Θ(β; λ). Letting β = β* in, say, (16), the bias vanishes, and we obtain a prediction error bound of the order σ² J* log(ep) (omitting constant factors), where J* denotes the number of nonzero components of β*. On the other hand, the presence of the bias term makes our results applicable to approximately sparse signals. For example, when β* has many small but nonzero components, we can use a reference β with a much smaller support than J(β*) to get a smaller error bound, as a benefit of the bias-variance tradeoff.
Remark 2. When R_0 holds with δ > 1, the proof of Theorem 2 shows that the multiplicative constant of ‖Xβ − Xβ*‖₂² can be as small as 1. The corresponding oracle inequalities are called 'sharp' in some works (Koltchinskii et al., 2011). This also applies to Theorem 3. Our proof scheme can also deliver high-probability results, without requiring an upper bound on ‖X‖₂.
Remark 3. Corollary 2 applies to all "hard-thresholding like" Θ, because when Θ(t; λ) = t for |t| > cλ, we have P_Θ(t; λ) ≤ c²λ². It is worth mentioning that the error rate σ² J* log(ep) cannot be significantly improved in a minimax sense. In fact, under Gaussian noise contamination and some regularity conditions, there exist constants C, c > 0 such that

inf_{β̂} sup_{β*: |J(β*)| ≤ J*} E‖Xβ̂ − Xβ*‖₂² ≥ cσ² J* log(ep/J*),

where the infimum is taken over all estimators. Hence (17) achieves the minimax optimal rate, up to a mild logarithmic factor, for any n and p.

P_0-type oracle inequalities under R_1
This part uses P_0 instead of P_Θ to build the oracle bound. We will show that under another type of comparison regularity condition, all thresholdings can attain the essentially optimal error rate given in Corollary 2. We will also show that in the case of soft-thresholding, our condition is more relaxed than many other assumptions in the literature.
Remark 4. Some fusion thresholdings, like those associated with the elastic net, Berhu, and Hard-Ridge (cf. Table 1), involve an additional ℓ_2 shrinkage. In this situation, the complexity term in the oracle inequality should involve both J(β) and ‖β‖₂². We can modify our regularity conditions to obtain such ℓ_0 + ℓ_2 bounds using the same proof scheme; the details are not reported in this paper. In addition, our results can be extended to Θ-estimators with a stepsize parameter. Given λ > 0 and 0 < α ≤ 1, suppose λ_α is introduced such that α P_Θ(t; λ) = P_Θ(t; λ_α) for any t. Then, for any β̂ that is a fixed point of β = Θ(β − αX^T Xβ + αX^T y; λ_α), an analogous result can be obtained (the only change being that L_Θ is replaced by L_Θ/α).
To give some more intuitive regularity conditions, we suppose P_Θ is concave on [0, ∞); examples include ℓ_r (0 ≤ r ≤ 1), MCP, SCAD, and so on. The concavity implies the subadditivity P_Θ(t + s; λ) ≤ P_Θ(t; λ) + P_Θ(s; λ), and so P_Θ(β; λ) ≤ P_Θ(β_J; λ) + P_Θ(β_{J^c}; λ), where J^c is the complement of J and β_J is the subvector of β indexed by J. Then R_1 is implied by R′_1 below for given J = J(β).
ASSUMPTION R′_1(δ, ϑ, K, J, λ). Given X, Θ, J, λ, there exist δ > 0, ϑ > 0, K ≥ 0 such that (20) holds for any ∆ ∈ R^p. When Θ is the soft-thresholding, it is easy to verify that (21) is a sufficient condition for (20), for some ϑ > 0 and K ≥ 0; (21) has a simpler form than R_1. In the following, we recall the definitions of the RE and compatibility conditions (Bickel et al., 2009; van de Geer and Bühlmann, 2009) to make a comparison with (21).
ASSUMPTION RE(κ_RE, ϑ_RE, J). Given J ⊂ [p], we say that X ∈ R^{n×p} satisfies RE(κ_RE, ϑ_RE, J) if there exist positive numbers κ_RE, ϑ_RE > 0 such that ‖X∆‖₂² ≥ κ_RE ‖∆_J‖₂² for all ∆ ∈ R^p satisfying ‖∆_{J^c}‖₁ ≤ ϑ_RE ‖∆_J‖₁; the compatibility condition replaces ‖∆_J‖₂² by the smaller quantity ‖∆_J‖₁²/|J|. In particular, R_1 is less demanding than RE.
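For intuition, an RE-type constant can be probed numerically by sampling directions in the cone; the sketch below (our own construction and names) does this for a well-conditioned Gaussian design. Note that a sampled minimum only upper-bounds the true κ_RE, so this is a sanity probe rather than a certificate:

```python
import numpy as np

def probe_re_constant(X, J, theta_re=3.0, n_dirs=5000, seed=0):
    # Monte-Carlo probe of kappa_RE: minimize ||X d||_2^2 / ||d_J||_2^2 over
    # random directions d pushed into the cone ||d_{J^c}||_1 <= theta_re * ||d_J||_1.
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    Jc = np.setdiff1d(np.arange(p), J)
    best = np.inf
    for _ in range(n_dirs):
        d = rng.standard_normal(p)
        l1J = np.abs(d[J]).sum()
        l1Jc = np.abs(d[Jc]).sum()
        if l1Jc > theta_re * l1J:
            # rescale the off-support part so that d lies in the cone
            d[Jc] *= theta_re * l1J / l1Jc
        best = min(best, np.sum((X @ d) ** 2) / np.sum(d[J] ** 2))
    return best

rng = np.random.default_rng(7)
X = rng.standard_normal((200, 20))   # tall, well-conditioned Gaussian design
kappa = probe_re_constant(X, np.arange(3))
```

For such an overdetermined design, the probed constant stays bounded away from zero, consistent with RE holding; for highly collinear designs the probe collapses toward zero.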
Next, let us compare the regularity conditions required by Θ_S and Θ_H to achieve the nearly optimal error rate; recall R_1(δ, ϑ, K, β, λ) and R′_0(δ, ϑ, K, β, λ) in Theorem 3 and Corollary 2, respectively. On the other hand, Corollary 2 studies when all Θ_H-estimators have the optimal performance guarantee, while in practice one may initialize (10) with a carefully chosen starting point. Theorem 4 below shows that for any Θ there exists a Θ-estimator (minimizing the energy function in (12)) such that (16) holds without requiring any regularity condition; in particular, if Θ corresponds to a bounded nonconvex penalty as described in Corollary 2, then there exists a Θ-estimator such that (17) holds free of regularity conditions. Theorem 4 does not place any requirement on X, so applying Θ_H may have some further advantages in practice. (How to efficiently pick such a Θ_H-estimator so as to completely remove all regularity conditions is, however, beyond the scope of the current paper. For a possible idea of relaxing the conditions, see Remark 6.)

Theorem 4. Given any Θ, there exists a Θ-estimator (which minimizes the energy function f defined in (12)) such that (16) holds without requiring any regularity condition.
Finally, we discuss the scaling parameter ρ. Our results so far are obtained after performing X ← X/ρ with ρ ≥ ‖X‖₂. The prediction error is invariant to this transformation, but the transformation does affect the regularity conditions. Seen from (8), 1/ρ² is related to the stepsize α appearing in (1), also known as the learning rate in the machine learning literature. From the computational results in Section 2.2, ρ must be large enough to guarantee that TISP converges; the larger the value of ρ, the smaller the stepsize (and so the slower the convergence). In the machine learning literature, slow learning rates are commonly recommended when training a nonconvex learner (e.g., artificial neural networks). Perhaps interestingly, beyond computational efficiency reasons, all our statistical analyses caution against using an extremely large scaling when L_Θ > 0. For example, R′_0(δ, ϑ, K, β, λ) for an unscaled X reads

ϑ P_H(ρ(β′ − β); λ) + ρ²‖β′ − β‖₂²/2 ≤ (2 − δ)‖X(β′ − β)‖₂²/2 + P_H(ρβ′; λ) + Kλ²J,

which becomes difficult to satisfy when ρ is very large, so the statistical error bound breaks down easily. Therefore, a good idea is to take ρ just appropriately large (mildly greater than ‖X‖₂). The sequential analysis of the iterates in the next part supports this point as well.
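The computational side of this tradeoff can be seen in a small experiment (our own toy setup, with soft-thresholding as Θ): holding the effective penalty λρ fixed so that all runs target the same estimator, a larger ρ shrinks the stepsize 1/ρ² and inflates the number of iterations needed to reach a fixed point:

```python
import numpy as np

def theta_soft(t, lam):
    return np.sign(t) * np.maximum(np.abs(t) - lam, 0.0)

rng = np.random.default_rng(2)
n, p = 100, 20
X = rng.standard_normal((n, p))
beta_star = np.zeros(p)
beta_star[:3] = [2.0, -1.5, 1.0]
y = X @ beta_star + 0.1 * rng.standard_normal(n)

norm_X = np.linalg.norm(X, 2)
lam_base = 0.3

def iters_to_converge(factor, tol=1e-8, max_iter=100000):
    # TISP with rho = factor * ||X||_2; lam * rho is held constant across
    # factors so that every run targets the same underlying estimator.
    rho = factor * norm_X
    Xs = X / rho
    lam = lam_base / factor
    b_prev = np.zeros(p)
    for t in range(1, max_iter + 1):
        b = theta_soft(b_prev + Xs.T @ (y - Xs @ b_prev), lam)
        if np.max(np.abs(b - b_prev)) / rho < tol:   # compare on the beta scale
            return t
        b_prev = b
    return max_iter

counts = [iters_to_converge(f) for f in (1.0, 3.0, 10.0)]
```

The iteration counts grow roughly like ρ², echoing the advice to keep ρ only mildly greater than ‖X‖₂.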

Sequential Algorithmic Analysis
We perform statistical error analysis of the sequence of iterates defined by TISP:

β^(t+1) = Θ(β^(t) + X^T y − X^T Xβ^(t); λ),

where ‖X‖₂ ≤ 1 and β^(0) is the starting point. The study is motivated by the fact that in large-scale applications, Θ-estimators are seldom computed exactly. Indeed, why bother running TISP until full computational convergence? How does the statistical accuracy improve (or deteriorate) as t increases? Lately, there have been some key advances on the topic. For example, Agarwal et al. (2012) showed that for convex problems (not necessarily strongly convex), proximal gradient algorithms can approach a globally optimal solution β̂ geometrically fast, up to the desired statistical precision, under a set of conditions. We, however, care about the statistical error between β^(t) and the genuine β* in this work.
We will introduce two comparison regularity conditions (analogous to R_0 and R_1) to present both P_Θ-type and P_0-type error bounds. Hereinafter, we denote (β^T Aβ)^{1/2} by ‖β‖_A, where A is a positive semi-definite matrix.
Remark 6. Theorem 5 still applies when δ, ϑ, K, and λ depend on t. For example, if we use a varying threshold sequence, i.e., β^(t+1) = Θ(β^(t) + X^T y − X^T Xβ^(t); λ^(t)), then an analog of (30) holds with λ^(t) in place of λ. This allows much larger values of λ^(t) to be used in earlier iterations to attain the same accuracy, and it relaxes the regularity condition required when a fixed threshold level is applied.
At the end, we restate some results under ρ > ‖X‖₂ to gain more intuition and implications. Consider (30) for a general (unscaled) X, and set ρ to be a number slightly larger than ‖X‖₂, i.e., ρ = (1 + ε)‖X‖₂ with ε > 0. Then the prediction error ‖Xβ^(t) − Xβ*‖₂² decays geometrically fast to O(σ² J* log(ep)) with high probability, when ε, δ, ϑ, K are viewed as constants; a similar conclusion holds for the estimation error. Accordingly, there is no need to run TISP until convergence: one can terminate the algorithm earlier, at, say, t_max = log{ρ²‖β^(0) − β*‖²/(Kσ²λ²J*)}/log(1/κ) for the contraction factor κ < 1, without sacrificing much statistical accuracy. The formula also reflects that the quality of the initial point affects the required iteration number. There are some related results in the literature. (i) As mentioned previously, in a broad convex setting Agarwal et al. (2012) proved geometric decay of the optimization error ‖β^(t) − β̂‖ down to the desired statistical precision, where β̂ is the point of convergence. Loh and Wainwright (2015) extended the conclusion to a family of nonconvex optimization problems, and showed that when some regularity conditions hold, every local minimum point is close to the authentic β*. In comparison, our results are derived for the statistical error between β^(t) and β* directly, without requiring all local minimum points to be statistically accurate.
(ii) Zhang (2010b) showed a similar fast-converging statistical error bound for an elegant multi-stage capped-ℓ_1 regularization procedure. However, that procedure carries out an expensive ℓ_1 optimization at each step, whereas (10) involves only a simple and cheap thresholding, and our analysis covers any Θ.
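The geometric-decay-then-plateau behavior and the early-stopping message can be illustrated numerically. The sketch below (our own toy setup; soft-thresholding as Θ, with a heuristic threshold of order σ√(n log p) on the original scale) tracks the prediction error of the TISP iterates against the truth β*:

```python
import numpy as np

def theta_soft(t, lam):
    return np.sign(t) * np.maximum(np.abs(t) - lam, 0.0)

rng = np.random.default_rng(3)
n, p, s, sigma = 150, 300, 5, 0.3
X = rng.standard_normal((n, p))
beta_star = np.zeros(p)
beta_star[:s] = 2.0
y = X @ beta_star + sigma * rng.standard_normal(n)

rho = np.linalg.norm(X, 2)
Xs = X / rho
# Heuristic threshold: effective penalty of order sigma * sqrt(n log p).
lam = 2.0 * sigma * np.sqrt(n * np.log(p)) / rho

b = np.zeros(p)
pred_err = []
for t in range(400):
    b = theta_soft(b + Xs.T @ (y - Xs @ b), lam)
    beta_t = b / rho
    # statistical (prediction) error of the t-th iterate w.r.t. beta_star
    pred_err.append(np.sum((X @ (beta_t - beta_star)) ** 2))
```

The error drops sharply in the first tens of iterations and then flattens at the statistical precision, so stopping well before full computational convergence loses little accuracy.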

Acknowledgement
The author would like to thank the editor, the associate editor, and two anonymous referees for their careful comments and useful suggestions, which improved the quality of the paper. The author also thanks Florentina Bunea for her encouragement. This work was supported in part by NSF grant DMS-1352259.

Proofs
Throughout the proofs, we use C, c, L to denote universal non-negative constants, not necessarily the same at each occurrence. Given any matrix A, we use R(A) to denote its column space, and P_A the orthogonal projection matrix onto R(A), i.e., P_A = A(A^T A)^+ A^T, where ^+ stands for the Moore-Penrose pseudoinverse. Let [p] := {1, ..., p}. Given J ⊂ [p], X_J denotes the column submatrix of X indexed by J.
Examples include Gaussian random variables and bounded random variables such as Bernoulli. Note that the assumption that vec(ε) is sub-Gaussian does not imply that the components of ε are i.i.d.

Proofs of Theorem 2 and Theorem 3
Given Θ, let β̂ be any Θ-estimator, β be any (non-random) p-dimensional vector, and ∆ = β̂ − β. The first result constructs a useful criterion for β̂ on the basis of Lemma 1 and Lemma 2.
To handle ⟨ε, X∆⟩, we introduce another lemma.
Lemma 4. Suppose ‖X‖₂ ≤ 1 and let λ° = σ√(log(ep)). Then there exist universal constants A_1, C, c > 0 such that for any constants a ≥ 2b > 0, the following event occurs with probability at most C exp(−ct) p^{−cA_1²}, where t ≥ 0.
The lemma plays an important role in bounding the last stochastic term in (35). Its proof is based on the following results.
The proof of Theorem 3 follows the lines of the proof of Theorem 2, with (38) replaced by its analog carrying the factor 1/(2b). Combining the last two inequalities gives the desired deterministic bound. We then define an event E, whose complement is controlled via Lemma 4: there exists a universal constant L such that the complement has small probability for any t ≥ 0. Take b = 1/(2ϑ), a = 1/(δ ∧ ϑ), A_1 ≥ √L, and λ = A_1 √(ab) λ°. Then, on E, we get the desired statistical accuracy bound. The bound under S_1 can be proved similarly. Noticing that (41) holds for any t, Corollary 3 is immediately true.
To cancel the first-order term, we use two other inequalities based on second-order lower/upper bounds; adding the three inequalities together gives the triangle inequality.