Quantile universal threshold

Abstract: Efficient recovery of a low-dimensional structure from high-dimensional data has been pursued in various settings including wavelet denoising, generalized linear models and low-rank matrix estimation. By thresholding some parameters to zero, estimators such as lasso, elastic net and subset selection perform variable selection. One crucial step challenges all these estimators: the amount of thresholding, governed by a threshold parameter λ. If too large, important features are missing; if too small, incorrect features are included. Within a unified framework, we propose a selection of λ at the detection edge. To that aim, we introduce the concept of a zero-thresholding function and a null-thresholding statistic, that we explicitly derive for a large class of estimators. The new approach has the great advantage of transforming the selection of λ from an unknown scale to a probabilistic scale. Numerical results show the effectiveness of our approach in terms of model selection and prediction.


Introduction
Many real-world problems in which the number of features P of the model can be dramatically larger than the sample size N arise in various domains such as genomics, finance and image classification, to name a few. In those instances, the maximum likelihood estimation principle fails. Beyond existence and uniqueness issues, it tends to perform poorly when P is large relative to N due to its high variance. Motivated by the seminal papers of James and Stein [27] and Tikhonov [58], a considerable amount of literature has concentrated on parameter estimation using regularization techniques. In both parametric and nonparametric models, a reasonable prior or constraints are set on the parameters in order to reduce the variance of the estimator and the complexity of the fitted model, at the price of an increase in bias.
Selection of the threshold is crucial to perform effective model selection. It amounts to selecting basis coefficients in wavelet denoising, or genes responsible for a cancer type in microarray data analysis. In change-point detection, it is equivalent to detecting the locations of jumps. Too large a λ results in a simplistic model missing important features, whereas too small a λ leads to a model including many features outside the true model. A typical goal is variable screening, that is, having $\hat S_\lambda \supseteq S^*$ (1.3) hold with high probability, along with few false detections $\{q : \hat\xi_{\lambda,q} \neq 0,\ \xi^*_q = 0\}$. For a suitably chosen λ, certain estimators allow variable screening. The optimal threshold for model identification often differs from the threshold aimed at prediction optimality [61,31,34,64], and it turns out that models aimed at good prediction are typically more complex.
Classical methodologies to select λ consist in minimizing a criterion. Examples include cross-validation, AIC [1], BIC [51] and Stein unbiased risk estimation (SURE) [53]. In low-rank matrix estimation, Owen and Perry [39] and Josse and Husson [28] employ cross-validation whereas Candès et al. [11] and Josse and Sardy [29] apply SURE. The latter methodology is also used in regression [15,66,57], and reduced rank regression [36]. Because traditional information criteria do not adapt well to the high-dimensional setting, generalizations such as GIC [21] and EBIC [12] have been suggested.
In this paper, we propose a new threshold selection method that aims at a good identification of the support S * . Our approach has the advantage of transforming the selection of λ from an unknown scale to a probabilistic scale with the simple selection of a probability level. In Section 2, we first review thresholding estimators in linear regression, generalized linear models, low-rank matrix estimation and density estimation. We then introduce the key concept of a zero-thresholding function in Section 3 and derive explicit formulations. In Section 4, we define the null-thresholding statistic, which leads to our proposal: the quantile universal threshold. Some properties are derived. We illustrate the effectiveness of our methodology in Section 5 with four real data sets and simulations. The appendices contain proofs and technical details.

Review of thresholding estimators
Thresholding estimators are extensively used in the following domains.
Generalized linear models (GLMs). The canonical model assumes the log-likelihood is of the form
$$\ell(\beta_0, \beta; \mathbf{y}) = \sum_{n=1}^N \left\{ y_n (\mathbf{x}_{0n}^T \beta_0 + \mathbf{x}_n^T \beta) - b(\mathbf{x}_{0n}^T \beta_0 + \mathbf{x}_n^T \beta) \right\}, \qquad (2.4)$$
with $b$ a known function, and $\mathbf{x}_{0n}$ and $\mathbf{x}_n$ denoting the $n$th row of $X_0$ and $X$ respectively [37]. As an extension of lasso, Sardy et al. [50] and Park and Hastie [40] define the lasso GLM estimator
$$(\hat\beta_{0,\lambda}, \hat\beta_\lambda) \in \arg\min_{(\beta_0, \beta)} \; -\ell(\beta_0, \beta; \mathbf{y}) + \lambda \|\beta\|_1. \qquad (2.5)$$
Other penalties such as group lasso [33] have been proposed.
Low-rank matrix estimation. Consider the model $Y = X^* + \sigma Z$, where $X^*$ is a low-rank $N \times P$ matrix and the $Z_{ij}$ are i.i.d. standard Gaussian. Thresholding estimators of $X^*$ set to zero the singular values of $Y$ that fall below a threshold λ.
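One common such estimator (our choice for illustration) soft-thresholds the singular values of Y, which solves the nuclear-norm penalized least squares problem. A minimal sketch in base R, with a function name of our own:

```r
# Soft-thresholding of singular values: solves min_X (1/2)||Y - X||_F^2 + lambda ||X||_*
svt <- function(Y, lambda) {
  s <- svd(Y)
  d <- pmax(s$d - lambda, 0)                    # shrink singular values towards zero
  s$u %*% diag(d, nrow = length(d)) %*% t(s$v)  # reassemble the low-rank estimate
}

# Toy usage: rank-one signal plus noise
set.seed(1)
Y <- tcrossprod(rnorm(30), rnorm(20)) + 0.1 * matrix(rnorm(30 * 20), 30, 20)
Xhat <- svt(Y, lambda = 1)
```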
Motivated by the preceding examples in GLMs and low-rank matrix estimation, a definition of a thresholding estimator is the following.

The zero-thresholding function
A key property shared by a class of estimators is to set the estimated parameters to zero for a sufficiently large but finite threshold λ. This leads to the following definition.

Definition 2.
A thresholding estimator $\hat\xi_\lambda(\mathbf{Y})$ admits a zero-thresholding function $\lambda_0$ if, for almost every $\mathbf{y}$, $\hat\xi_\lambda(\mathbf{y}) = \mathbf{0}$ if and only if $\lambda \geq \lambda_0(\mathbf{y})$. The zero-thresholding function is hence determined uniquely up to sets of measure zero. The equivalence implies equiprobability between setting all coefficients to zero and selecting the threshold large enough. It turns out that such a function has a closed form expression in many instances. Below we derive a catalogue for the estimators reviewed in Section 2.
Linear regression. Explicit formulations are the following:
• Lasso, WaveShrink and the Dantzig selector: $\lambda_0(\mathbf{y}) = \|X^T \mathbf{y}\|_\infty$; SCAD and MCP share the same zero-thresholding function when $X$ is orthonormal. For adaptive lasso, $\lambda_0(\mathbf{y}) = \|W X^T \mathbf{y}\|_\infty$, where $W$ is a diagonal matrix of weights; for LAD-lasso, $\lambda_0(\mathbf{y}) = \|X^T \operatorname{sgn}(\mathbf{y})\|_\infty$, where $\operatorname{sgn}(\cdot)$ is the sign function applied componentwise; and for square root lasso, $\lambda_0(\mathbf{y}) = \|X^T \mathbf{y}\|_\infty / \|\mathbf{y}\|_2$.
• Group lasso and group square root lasso: if the parameters are partitioned into $G$ prescribed groups so that $p_\lambda(\beta) = \lambda \sum_{g=1}^{G} \|\beta_g\|_2$, the zero-thresholding function is $\lambda_0(\mathbf{y}) = \max_{g=1,\dots,G} \|X_g^T \mathbf{y}\|_2$ for group lasso and $\lambda_0(\mathbf{y}) = \max_{g=1,\dots,G} \|X_g^T \mathbf{y}\|_2 / \|\mathbf{y}\|_2$ for group square root lasso.
• Generalized lasso: assuming $B$ has full row rank, let $I$ denote a set of column indices such that $B_I$, the submatrix of $B$ with columns indexed by $I$, is invertible.
For a convex objective function, derivation of the zero-thresholding function can be inferred from the Karush-Kuhn-Tucker conditions [43]. As an example, we consider LAD-lasso: a given $\beta \in \mathbb{R}^P$ is a minimum of the (convex) objective function if and only if the zero vector belongs to its subdifferential at $\beta$; evaluating this condition at $\beta = \mathbf{0}$ yields $\lambda_0(\mathbf{y}) = \|X^T \operatorname{sgn}(\mathbf{y})\|_\infty$. Such a derivation can also be performed for estimators with a composite penalty involving a two-dimensional parameter $\lambda = (\lambda^{(1)}, \lambda^{(2)})$, for example elastic net [65], where $p_\lambda(\beta) = \lambda^{(1)} \|\beta\|_1 + \lambda^{(2)} \|\beta\|_2^2$ and thresholding to zero is governed by $\lambda^{(1)}$.
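These explicit formulations are straightforward to evaluate. Below is a minimal sketch in base R for a fixed regression matrix X and response y; the function names are ours and not part of any package:

```r
# Zero-thresholding functions from the catalogue above (fixed design X, response y)
lambda0_lasso     <- function(y, X) max(abs(crossprod(X, y)))              # ||X^T y||_inf
lambda0_sqrtlasso <- function(y, X) max(abs(crossprod(X, y))) / sqrt(sum(y^2))
lambda0_ladlasso  <- function(y, X) max(abs(crossprod(X, sign(y))))        # ||X^T sgn(y)||_inf
lambda0_grplasso  <- function(y, X, groups)                                # max_g ||X_g^T y||_2
  max(tapply(as.vector(crossprod(X, y)), groups, function(u) sqrt(sum(u^2))))

# Toy usage
set.seed(1)
X <- matrix(rnorm(50 * 10), 50, 10)
y <- rnorm(50)
lambda0_lasso(y, X)
lambda0_grplasso(y, X, groups = rep(1:5, each = 2))
```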

Generalized linear models.
The following Lemma shows that although the lasso GLM solution (2.5) might not be unique, its fit is unique (see proof in Appendix A.1).
The zero-thresholding function of $\hat\beta_\lambda$ is given in (3.2) below. Its derivation is based on Theorem 1, whose proof can be found in Appendix A.2.
Hence, for a strictly convex $b$ and setting $\hat\beta_\lambda = \mathbf{0}$ if $\lambda = +\infty$, the zero-thresholding function of lasso GLM is
$$\lambda_0(\mathbf{y}) = \left\| X^T \{\mathbf{y} - b'(X_0 \mathbf{v})\} \right\|_\infty, \qquad \mathbf{y} \in D, \qquad (3.2)$$
with $\mathbf{v}$ any vector such that
$$X_0^T \{\mathbf{y} - b'(X_0 \mathbf{v})\} = \mathbf{0}, \qquad (3.3)$$
where $b'$ is applied componentwise. For the group lasso GLM, one obtains $\lambda_0(\mathbf{y}) = \max_{g=1,\dots,G} \| X_g^T \{\mathbf{y} - b'(X_0 \mathbf{v})\} \|_2$. Lemma 1 implies $\lambda_0(\mathbf{y})$ does not depend on which solution $\mathbf{v}$ to (3.3) is chosen. The set $D$ is the set of values $\mathbf{y}$ based on which the maximum likelihood estimate (MLE) of $(\beta_0, \beta)$ with constraint $\beta = \mathbf{0}$ exists. If the response variable is Gaussian, then $D = \mathbb{R}^N$. An explicit formulation of $D$ when the intercept is unpenalized ($X_0 = \mathbf{1}$) is given in Table 1. For an arbitrary matrix $X_0$ and under certain assumptions, Giacobino [24] shows that $D$ coincides with the set of values $\mathbf{y}$ such that lasso GLM admits a solution. In particular, the following property holds.

Property 1. Consider a Poisson or (multinomial) logistic regression model.
Then, for any fixed $X_0$, $X$ and $0 < \lambda < \infty$, and any observed value $\mathbf{y}$, lasso GLM defined in (2.5) admits a solution if and only if an MLE of $(\beta_0, \beta)$ with constraint $\beta = \mathbf{0}$ exists.

Low-rank matrix estimation. The zero-thresholding function is $\lambda_0(Y) = \|\mathbf{d}\|_\infty$, the largest singular value of the noisy matrix $Y$.

Thresholding under the null
Inspired by Donoho and Johnstone [15], we now consider the idea of choosing a threshold based on the null model $\xi^* = \mathbf{0}$, that is, selecting a threshold λ such that $\{\hat\xi_\lambda = \mathbf{0} \mid \xi^* = \mathbf{0}\}$ holds with high probability. From Definition 2, the events $\{\hat\xi_\lambda = \mathbf{0}\}$ and $\{\lambda \geq \lambda_0(\mathbf{Y})\}$ are equiprobable. This leads us to the zero-thresholding function evaluated under the null model: we call $\Lambda_0 = \lambda_0(\mathbf{Y}_0)$ the null-thresholding statistic, where $\mathbf{Y}_0$ denotes the response under the null model $\xi^* = \mathbf{0}$.
Given a thresholding estimator and its null-thresholding statistic, selecting λ large enough such that $\xi^*$ is recovered with probability $1 - \alpha$ under the null model $\xi^* = \mathbf{0}$ leads to the following new selection rule (Definition 4): the quantile universal threshold $\lambda_{\mathrm{QUT}}$ is the upper α-quantile of the null-thresholding statistic $\Lambda_0$, that is, $\lambda_{\mathrm{QUT}} = F_{\Lambda_0}^{-1}(1 - \alpha)$.
We discuss the selection of α in Section 4.4. As we will see, it turns out such a choice results in good empirical and theoretical properties also in the case $\xi^* \neq \mathbf{0}$.
If the distribution of $\Lambda_0$ is unknown, $\lambda_{\mathrm{QUT}}$ can be computed numerically by Monte Carlo simulation. For instance, one can easily simulate realizations of square root lasso's null-thresholding statistic $\Lambda_0 = \|X^T \mathbf{Y}_0\|_\infty / \|\mathbf{Y}_0\|_2$, and compute $\lambda_{\mathrm{QUT}}$ by taking the appropriate upper quantile. Section 4.4 considers situations where a closed form expression of $\lambda_{\mathrm{QUT}}$ can be derived.
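As a concrete illustration of this Monte Carlo computation, here is a minimal sketch in base R for the square root lasso statistic, assuming a fixed design and no unpenalized covariates (the function name is ours):

```r
# Monte Carlo estimate of lambda_QUT for square root lasso under the null model;
# the statistic is pivotal in sigma, so simulating with sigma = 1 is enough.
qut_sqrtlasso <- function(X, alpha = 0.05, M = 1000) {
  N <- nrow(X)
  Lambda0 <- replicate(M, {
    y0 <- rnorm(N)                                # null-model response, sigma cancels
    max(abs(crossprod(X, y0))) / sqrt(sum(y0^2))  # ||X^T y0||_inf / ||y0||_2
  })
  quantile(Lambda0, 1 - alpha)                    # upper alpha-quantile
}
```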
With the quantile universal threshold, selection of the regularization parameter is now recast on a probabilistic scale through the probability level α. QUT is a selection rule designed for model selection, as it aims at a good identification of the support of the estimand $\xi^*$. If one is instead interested in good prediction, then the sparse model identified by QUT can be refitted by maximum likelihood. Such a two-step approach has been considered (see Bühlmann and van de Geer [5], Belloni and Chernozhukov [2]) to mimic the behavior of adaptive lasso [64] and results in a smaller bias for large coefficients.

Instances of QUT
A QUT-like selection rule supported by theoretical results has appeared in the following settings, where a null model for the selection of λ plays a key role.

Linear regression. Desirable properties of estimators such as the lasso, group lasso, square root lasso, group square root lasso or the Dantzig selector are satisfied if the tuning parameter is set to $\lambda = c\lambda^{(0)}$ for a certain $c \geq 1$, such that the event $\{\hat\beta_{\lambda^{(0)}} = \mathbf{0} \mid \beta^* = \mathbf{0}\}$ holds with high probability, for instance with $\lambda^{(0)} = \lambda_{\mathrm{QUT}}$ for a small α. More precisely, upper bounds on the estimation and prediction error, as well as the screening property (1.3), hold with high probability assuming certain conditions on the regression matrix, the support $S^*$ of the coefficients and their magnitude; see Bühlmann and van de Geer [5], Belloni et al. [3], Bunea et al. [7] and references therein.
Low-rank matrix estimation. Under the null model $X^* = \mathbf{0}_{N \times P}$, it can be shown that with a noise level of $1/\sqrt{N}$, the empirical distribution of the singular values of the response matrix converges to a compactly supported distribution. By setting to zero any singular value smaller than the upper edge of the support, Gavish and Donoho [23] derive optimal singular value thresholding operators.
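For the low-rank setting, the same Monte Carlo recipe applies with the zero-thresholding function $\lambda_0(Y)$ equal to the largest singular value. A sketch assuming i.i.d. Gaussian noise with known σ (function name and defaults are ours):

```r
# lambda_QUT for singular value thresholding: upper alpha-quantile of the largest
# singular value of sigma * Z under the null model X* = 0.
qut_svd <- function(N, P, sigma = 1, alpha = 0.05, M = 200) {
  Lambda0 <- replicate(M, {
    Z <- matrix(rnorm(N * P), N, P)
    sigma * svd(Z, nu = 0, nv = 0)$d[1]          # largest singular value of sigma * Z
  })
  quantile(Lambda0, 1 - alpha)
}
```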

Pivotal null-thresholding statistic
Assuming a Gaussian distribution, the zero-thresholding function of lasso (3.2) yields the null-thresholding statistic
$$\Lambda_0 = \left\| X^T (I - P_{X_0}) \mathbf{Y}_0 \right\|_\infty, \qquad (4.1)$$
where $P_{X_0}$ is the orthogonal projection onto the range of $X_0$ and the null model is $\mathbf{Y}_0 \sim \mathcal{N}(X_0 \beta_0^*, \sigma^2 I)$. Since $\Lambda_0$ is an ancillary statistic for $\beta_0^*$, the quantile universal threshold can equivalently be defined as $\lambda_{\mathrm{QUT}} = \sigma \lambda_Z$, $\lambda_Z$ being the upper α-quantile of $\Lambda_Z = \|X^T (I - P_{X_0}) \mathbf{Z}\|_\infty$, where $\mathbf{Z} \sim \mathcal{N}(\mathbf{0}, I_N)$. Like other criteria such as SURE, AIC, BIC and GIC, an estimate of σ is required; see Appendix B for a possible approach.
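A minimal sketch of this computation in base R, with an unpenalized intercept as default choice of $X_0$ and σ assumed known (function name ours):

```r
# lambda_QUT = sigma * lambda_Z, with lambda_Z the upper alpha-quantile of
# ||X^T (I - P_X0) Z||_inf for Z ~ N(0, I_N).
qut_lasso <- function(X, X0 = matrix(1, nrow(X), 1), sigma = 1,
                      alpha = 0.05, M = 1000) {
  N  <- nrow(X)
  P0 <- X0 %*% solve(crossprod(X0), t(X0))       # orthogonal projection onto range(X0)
  LambdaZ <- replicate(M, {
    z <- rnorm(N)
    max(abs(crossprod(X, z - P0 %*% z)))
  })
  sigma * quantile(LambdaZ, 1 - alpha)
}
```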
In contrast, square root lasso's null-thresholding statistic $\Lambda_0 = \|X^T (I - P_{X_0}) \mathbf{Y}_0\|_\infty / \|(I - P_{X_0}) \mathbf{Y}_0\|_2$ is pivotal with respect to both $\beta_0^*$ and σ, and LAD-lasso's is pivotal with respect to σ when $P_0 = 0$. This is a clear advantage over lasso, especially in a situation where the nuisance parameter σ is difficult to estimate.
In Poisson and logistic regression, the null-thresholding statistic depends on $\beta_0^*$, which we estimate with the following procedure. First, calculate the MLE of $\beta_0$ based on the observed value $\mathbf{y}$ with the constraint $\beta = \mathbf{0}$ (it is the solution to (3.3)). Then, solve (2.5) with the corresponding quantile universal threshold. Finally, the estimate is $\hat\beta_0^{\mathrm{MLE}}$, where $(\hat\beta_0^{\mathrm{MLE}}, \hat\beta^{\mathrm{MLE}})$ denotes the MLE based on $\mathbf{y}$ with the covariates selected by the previous procedure. In Appendix C, we conduct an empirical investigation of the sensitivity of our approach to the estimation of $\beta_0^*$.

The random design setting is the situation where not only the response vector but also the matrix of covariates is random, as with all four data sets in Section 5.1. To account for the variability due to random design, we define the quantile universal threshold as the upper α-quantile of $\Lambda_0 = \lambda_0(\mathbf{Y}_0, [X_0, X])$, with $[X_0, X]$ consisting of independent identically distributed rows. In practice, the rows of $[X_0, X]$ can be bootstrapped within the Monte Carlo simulation used to evaluate $\lambda_{\mathrm{QUT}}$. Both fixed and random design alternatives are implemented in our R package qut.
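A sketch of the plug-in procedure for logistic regression with an unpenalized intercept ($X_0 = \mathbf{1}$); the function name and defaults are ours, and the qut package may implement further refinements:

```r
# QUT for logistic lasso: (1) constrained MLE of beta_0 (intercept-only fit),
# (2) Monte Carlo of the null-thresholding statistic with that intercept.
qut_logistic <- function(y, X, alpha = 0.05, M = 1000) {
  N  <- nrow(X)
  p0 <- fitted(glm(y ~ 1, family = binomial))[1]  # step 1: MLE of beta_0 with beta = 0
  Lambda0 <- replicate(M, {
    y0  <- rbinom(N, 1, p0)                       # null-model response
    # for a random design, the rows of X could additionally be resampled here
    mu0 <- mean(y0)                               # intercept-only fit for y0, cf. (3.3)
    max(abs(crossprod(X, y0 - mu0)))              # lambda_0(y0), cf. (3.2)
  })
  quantile(Lambda0, 1 - alpha)                    # upper alpha-quantile
}
```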

Properties of QUT
Before considering the choice of α and deriving an explicit formulation of the quantile universal threshold in some settings, more theoretical properties are derived. Upper bounds on the estimation and prediction error of the lasso tuned with λ QUT as well as a sufficient condition for the screening property (1.3) follow from the next property.

Property 2. Assume the $(L, S^*)$-compatibility condition is satisfied. Then lasso (2.3) with $\lambda = \lambda_{\mathrm{QUT}}$ satisfies, with probability at least $1 - \alpha - \mathbb{P}(\lambda^{(0)} \leq \Lambda_0 \leq \lambda)$, (i) upper bounds on its prediction and estimation errors; if moreover the nonzero coefficients of $\beta^*$ are sufficiently large, then with the same probability (ii) $\hat S_\lambda \supseteq S^*$.

Remark that $\mathbb{P}(\lambda^{(0)} \leq \Lambda_0 \leq \lambda)$ can be made arbitrarily small for a well-chosen $\lambda^{(0)}$ as long as the $(L, S^*)$-compatibility condition is met. The proof of the property is omitted as it is essentially the same as for Theorem 6.1 in Bühlmann and van de Geer [5], using the fact that the key statistic they bound with high probability is the null-thresholding statistic $\Lambda_0 = \|X^T (I - P_{X_0}) \mathbf{Y}_0\|_\infty = \lambda_0(\mathbf{Y}_0)$ defined in (4.1). The screening property is a direct consequence of (ii). Similar results can be shown for the group lasso, square root lasso, group square root lasso and the Dantzig selector.
Another important property of our methodology concerns the familywise error rate. Recall that when performing multiple hypothesis tests, it is defined as the probability of incorrectly rejecting at least one null hypothesis. In the context of variable selection, it is the probability of erroneously selecting at least one variable. It can be shown that if the null model is true, the familywise error rate is equal to the false discovery rate defined in Section 5.2. Hence, Definition 4 implies the following property.

Property 3.
Any thresholding estimator tuned with λ QUT controls the familywise error rate as well as the false discovery rate at level α in the weak sense.
The probability level in the previous properties is determined by α; we recommend α = 0.05, as do Belloni et al. [3]. An alternative is to let $\alpha_P$ tend to zero as the number P of covariates goes to infinity. Donoho and Johnstone [15] implicitly select a rate of $\alpha_P = O(1/\sqrt{\log P})$ (Josse and Sardy [29] also select this rate).
Finally, an explicit formulation of the quantile universal threshold can be derived in the following settings:
(i) In orthonormal regression with best subset selection and threshold equal to $\sigma\sqrt{2\log P}$, discussed in Section 4.2, the equivalent penalty is $\lambda_{\mathrm{QUT}} = 2\lambda_{\mathrm{BIC}} = \sigma^2 \log P$, satisfying $\bar F_{\Lambda_0}(\lambda_{\mathrm{QUT}}) \sim 1/\sqrt{\pi \log P}$. This result can be inferred from the null-thresholding statistic, whose components $\mathbf{x}_p^T \mathbf{Y}_0$ are independent $\mathcal{N}(0, \sigma^2)$ under the null model. Generalizations such as GIC and EBIC also select a larger tuning parameter than BIC, which performs poorly in the high-dimensional setting.
(ii) In total variation, the null-thresholding statistic converges in distribution to the infinity norm of a Brownian bridge, and $\lambda_{\mathrm{QUT}} = \sigma\sqrt{P \log\log P}/2$ for $\alpha_P = O(1/\sqrt{\log P})$ [48]. For block total variation, the null-thresholding statistic tends to the maximum of a Bessel bridge, whose distribution is known [41].
(iii) In group lasso with orthonormal groups, each of size Q, extreme value theory leads to $\lambda_{\mathrm{QUT}} = \sigma\sqrt{2\log P + (Q-1)\log\log P - 2\log\Gamma(Q/2)}$ [47].
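A quick Monte Carlo sanity check of (i): in the orthonormal case the null-thresholding statistic reduces to the maximum of P absolute Gaussian variables, and the probability of exceeding the universal threshold $\sigma\sqrt{2\log P}$ is approximately $1/\sqrt{\pi\log P}$. The values of P, M and σ below are illustrative:

```r
# Exceedance probability of the universal threshold in the orthonormal case
set.seed(1)
P <- 10000; sigma <- 1; M <- 2000
Lambda0 <- replicate(M, max(abs(rnorm(P, sd = sigma))))   # max_p |x_p^T Y_0| under the null
c(empirical   = mean(Lambda0 > sigma * sqrt(2 * log(P))),
  theoretical = 1 / sqrt(pi * log(P)))
```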

Numerical results of lasso GLM
The QUT methodology for lasso and square root lasso is implemented in the qut package, available from the Comprehensive R Archive Network (CRAN). In the following, QUT$_{\mathrm{lasso}}$ and QUT$_{\sqrt{\mathrm{lasso}}}$ stand for QUT applied to lasso and square root lasso. CVmin refers to cross-validation, CV1se to a conservative variant of CVmin that takes into account the variability of the cross-validation error [4], SS to stability selection [35] and GIC to the generalized information criterion [21]. When applying GIC and QUT$_{\mathrm{lasso}}$, the variance is estimated with (B.2) and (B.3) respectively. The level α is set to 0.05.
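For readers who prefer to use glmnet directly rather than the qut package, the glue code below shows how a QUT-selected λ can be plugged in. It assumes the lasso objective is of the form $\frac{1}{2}\|\mathbf{y} - X\beta\|_2^2 + \lambda\|\beta\|_1$ with an unpenalized intercept; since glmnet divides the squared-error loss by N, the penalty is rescaled accordingly. The toy data, the assumption of known σ = 1 and this rescaling are ours:

```r
# Hypothetical glue code: QUT-selected lambda plugged into glmnet (toy data)
library(glmnet)
set.seed(1)
N <- 100; P <- 200; sigma <- 1; alpha <- 0.05; M <- 1000
X <- matrix(rnorm(N * P), N, P)
y <- as.vector(X[, 1:3] %*% c(3, -2, 2) + rnorm(N))
# null-thresholding statistic with an unpenalized intercept: (I - P_1) z = z - mean(z)
LambdaZ <- replicate(M, { z <- rnorm(N); max(abs(crossprod(X, z - mean(z)))) })
lambda_qut <- sigma * as.numeric(quantile(LambdaZ, 1 - alpha))
fit <- glmnet(X, y, family = "gaussian", standardize = FALSE,
              lambda = lambda_qut / N)            # glmnet scales the loss by 1/N
which(as.vector(coef(fit))[-1] != 0)              # selected covariates
```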

Based on data sets
We briefly describe the four data sets considered to illustrate our approach in Gaussian and logistic regression:
• riboflavin [6]: riboflavin production rate measurements from a population of Bacillus subtilis, with sample size N = 71 and expressions from P = 4088 genes.
We randomly split each data set one hundred times into a training and a test set of equal size. Five lasso selection rules are compared, including QUT. Except for CV1se, the final model is refitted by MLE with the previously selected covariates in order to improve prediction. In Figure 1, we report the number of nonzero coefficients selected on the training set, as well as the test set mean-squared prediction error and correct classification rate. Good predictive performance is achieved by QUT$_{\mathrm{lasso}}$ as well as GIC, with a median model complexity between SS and CV1se. QUT$_{\mathrm{lasso}}$ works remarkably well for chemometrics and leukemia. By selecting a large number of variables, CV1se results in efficient prediction, whereas SS and $\sqrt{\mathrm{lasso}}$ show poor predictive performance due to the low complexity of their estimated models. Moreover, GIC exhibits a larger variability than QUT$_{\mathrm{lasso}}$ and QUT$_{\sqrt{\mathrm{lasso}}}$ in terms of number of nonzero coefficients.

Based on simulations
We perform a simulation based on Reid et al. [42]. Responses are generated from the linear, logistic and Poisson regression models with N = 100 and P = 1000. The intercept is set to one and unit noise variance is assumed in linear regression. The true parameter $\beta^*$ and predictor matrix X are obtained as follows:
• Entries of X are standard Gaussian with correlation between columns set to ω.
• The support of $\beta^*$ is of cardinality $s^* = N^\theta$ and selected uniformly at random. Entries are generated from a Laplace(1) distribution and scaled according to a certain signal-to-noise ratio, $\mathrm{snr} = \beta^{*T} \Sigma_\omega \beta^*$, $\Sigma_\omega$ being the covariance matrix of a single row of X, for a noise variance $\sigma^2 = 1$ in the Gaussian case (see the sketch at the end of this subsection).
Table 2 contains estimated TPR and FDR based on one hundred replications. We also report the predictive root mean squared error defined by $\mathrm{RMSE}^2 = E\{(\mathbf{x}_{\mathrm{new}}^T \beta^* - \mathbf{x}_{\mathrm{new}}^T \hat\beta)^2\}/\mathrm{snr}$; here the expectation is taken over training sets and new predictive locations $\mathbf{x}_{\mathrm{new}}$. Looking at TPR and FDR, the high complexity of CV1se and the low complexity of SS and $\sqrt{\mathrm{lasso}}$ are again observed. Looking at RMSE, QUT$_{\mathrm{lasso}}$ often performs best thanks to a good sparse model before refitting by MLE. Finally, QUT$_{\mathrm{lasso}}$ and GIC are comparable in terms of RMSE, but QUT$_{\mathrm{lasso}}$ often achieves a better compromise between TPR and FDR.
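The data-generating mechanism described at the beginning of this subsection can be sketched as follows (linear case). We assume constant pairwise correlation ω between columns, i.e. $\Sigma_\omega = (1-\omega)I + \omega\mathbf{1}\mathbf{1}^T$; the default parameter values are illustrative:

```r
# Simulation sketch: Gaussian X with equicorrelated columns, Laplace(1) coefficients
# on a random support of size N^theta, scaled so that beta^T Sigma_omega beta = snr.
simulate_reid <- function(N = 100, P = 1000, omega = 0.5, theta = 0.5, snr = 2) {
  s_star <- ceiling(N^theta)
  X <- sqrt(1 - omega) * matrix(rnorm(N * P), N, P) +
       sqrt(omega) * rnorm(N)                     # common factor gives correlation omega
  S <- sample(P, s_star)
  b <- rexp(s_star) * sample(c(-1, 1), s_star, replace = TRUE)   # Laplace(1) entries
  quad <- (1 - omega) * sum(b^2) + omega * sum(b)^2              # b^T Sigma_omega b
  beta <- numeric(P)
  beta[S] <- b * sqrt(snr / quad)                 # rescale to the target snr
  y <- as.vector(1 + X %*% beta + rnorm(N))       # intercept 1, unit noise variance
  list(y = y, X = X, beta = beta)
}

dat <- simulate_reid()
sum(dat$beta != 0)                                # s* = N^theta = 10 nonzero coefficients
```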

Phase transition property
We investigate the variable screening property and observe a phase transition. Given a thresholding estimator, if several tuning parameter values yield $\hat S_\lambda$ containing the true support $S^*$, the smallest such estimated model is of particular interest since it minimizes the FDr; we call it the optimal inclusive model and denote its cardinality by $s_{\min}$. This leads to the definition of the oracle inclusive rate, which measures the cardinality of the optimal inclusive model relative to that of the estimated support: $\mathrm{OIr} = s_{\min}/|\hat S_\lambda|$ if $\hat S_\lambda \supseteq S^*$, and $\mathrm{OIr} = 0$ otherwise. Models with $\mathrm{OIr} > 0$ have TPr = 1, whereas those with $\mathrm{OIr} = 1$ have minimum FDr amongst all models with TPr = 1. Moreover, the average OIr over replications, denoted OIR, satisfies $\mathrm{OIR} \leq \mathbb{P}(\hat S_\lambda \supseteq S^*)$. A small OIr results from a complex model containing $S^*$, whereas a null OIr results from $\hat S_\lambda \not\supseteq S^*$. The latter could be due to a simplistic model or to the variable screening property being unachievable, in which case $s_{\min}$ does not exist.
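A small helper sketching the computation of OIr for a given estimated support, using the reconstruction of its definition given above; the supports along the regularization path are assumed to be available as a list:

```r
# Oracle inclusive rate: s_min / |S_hat| if S_hat contains S_star, 0 otherwise
oir <- function(S_hat, S_star, supports_path) {
  inclusive <- Filter(function(s) all(S_star %in% s), supports_path)
  if (length(inclusive) == 0) return(0)           # s_min does not exist
  s_min <- min(lengths(inclusive))                # size of the optimal inclusive model
  if (length(S_hat) > 0 && all(S_star %in% S_hat)) s_min / length(S_hat) else 0
}

# Toy usage along a hypothetical lasso path
supports_path <- list(integer(0), c(3), c(3, 7), c(3, 7, 12), c(3, 7, 12, 40))
oir(S_hat = c(3, 7, 12, 40), S_star = c(3, 7), supports_path)   # 2/4 = 0.5
```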
We extend the simulation of Donoho and Tanner [16] in compressed sensing to model (2.1) with unit noise variance assumed to be known. The entries of the $N \times P$ matrix X are i.i.d. standard Gaussian. We set P = 1600 and vary the number of rows $N \in \{160, 320, 480, 640, 800, 960, 1120, 1280, 1440\}$ as well as the cardinality of the support of $\beta^*$, $s^* \in \{1, \dots, N\}$. Nonzero entries are set to ten. One hundred predictor matrices X and responses y are generated for each pair $(N, s^*)$.
On the left panel of Figure 2, we report OIR for the oracle lasso selection rule, which retains the optimal inclusive model if it exists. Values are plotted as a function of $\delta = N/P$, the undersampling factor, and of $\rho = s^*/N$, the sparsity factor. On the middle and right panels, we report OIR for QUT$_{\mathrm{lasso}}$ along with other methodologies, as well as QUT$_{\sqrt{\mathrm{lasso}}}$. The following interesting behaviors are observed:
• Phase transition of Oracle and QUT. Two regions can be clearly distinguished: a high OIR region, due to a selected model containing few covariates outside the optimal model, and a zero OIR region in which $s_{\min}$ does not exist. The change between these regions is abrupt, as observed in compressed sensing.
• Near oracle performance of QUT. Comparing the left and middle panels, the performance of QUT is nearly as good as that of the oracle selection rule, with the phase transition occurring at similar values of $\rho$.
• Low complexity of QUT$_{\mathrm{lasso}}$. Comparing several rules on the right panel, QUT has a high OIR. Moreover, CVmin has lower OIR than CV1se and is comparable to SURE. The low OIR of these three latter selection rules is due to the high complexity of their selected models. This goes along with the fact that they are prediction-based methodologies, whereas QUT aims at a good identification of the parameters.
• Low OIR of QUT$_{\sqrt{\mathrm{lasso}}}$. This could be due to the fact that $\sqrt{\mathrm{lasso}}$ requires stronger conditions than lasso with known variance to achieve variable screening [3]. Considering its zero-thresholding function, not only its numerator increases as the model deviates from the null model (as for lasso), but also its denominator, making screening harder to reach with α = 0.05.

Conclusion
According to Ockham's razor, if two selected models yield comparable predictive performances, the sparsest should be preferred. Lasso with QUT tends to be in accordance with this principle by selecting low complexity models that achieve good predictive performance. A good compromise between high TPR and low FDR is obtained.

A.1. Proof of Lemma 1
It follows from the strict convexity of $b$ on $\Theta$ and the convexity of $f(\beta_0, \beta) = \|\beta\|_1$ on $F$ that the objective function in (2.5) is convex on $F$. The solution set is thus convex.