TRADE-OFF BETWEEN PREDICTIVE PERFORMANCE AND FDR CONTROL FOR HIGH-DIMENSIONAL GAUSSIAN MODEL SELECTION

In the context of high-dimensional Gaussian linear regression for ordered variables, we study variable selection via the minimization of a penalized least-squares criterion. We focus on model selection where the penalty function depends on an unknown multiplicative constant commonly calibrated for prediction. We propose a new calibration of this hyperparameter that simultaneously controls the predictive risk and the False Discovery Rate. We obtain non-asymptotic bounds on the False Discovery Rate with respect to the hyperparameter, and we provide an algorithm to calibrate it. This algorithm is based on quantities that can typically be observed in real data applications. The algorithm is validated in an extensive simulation study and is compared with several existing variable selection procedures. Finally, we study an extension of our approach to the case in which an ordering of the variables is not available.


Problem statement.
We consider the following high-dimensional univariate Gaussian linear regression model:

(1.1)   Y = Xβ* + ε,

where the random response vector Y = (Y_1, • • • , Y_n)^T ∈ R^n is regressed on p deterministic vectors X_1, • • • , X_p ∈ R^n. The design matrix of size n × p is denoted by X = (X_1, • • • , X_p).
The noise ε = (ε_1, • • • , ε_n)^T is assumed to be Gaussian: ε ∼ N(0, σ²I_n) with σ² > 0. In the high-dimensional context, additional regularity assumptions are required, and we assume that β* is sparse, meaning that only a few of its coefficients are non-zero. In the following, a variable X_j corresponding to a non-zero coefficient β*_j is called an active variable. Otherwise the variable is said to be non-active.

arXiv:2302.01831v4 [math.ST] 28 Jun 2024
In this paper, we are interested in variable selection. We refer the reader to [1] and references therein. Some variable selection procedures focus on the prediction of the response variable Y through a control of the predictive risk. Others focus on limiting the number of selected non-active variables through a control of the False Discovery Rate. There also exist procedures in which several cost functions are considered simultaneously. In line with the latter, our goal is to identify a set of variables from a model selection procedure by limiting the selection of non-active variables while maintaining accurate prediction performance.
In a variable selection procedure, a cost function has to be defined. The predictive risk (PR) and the False Discovery Rate (FDR) are the most commonly used cost functions.
Penalized methods to control the predictive risk. The penalization procedure balances goodness of fit and sparsity: the smaller the penalty function, the better the fit to the data but the higher the number of selected variables.
In a high-dimensional setup, the most popular method is the Lasso criterion [2], where the estimator β̂_λ of β* is the solution of:

(1.2)   β̂_λ = arg min_{β ∈ R^p} { (1/2n) ||Y − Xβ||₂² + λ|β|₁ },

where | • |₁ and || • ||₂ denote the ℓ₁-norm and the Euclidean norm of a vector, respectively. The main challenge is to calibrate the hyperparameter λ > 0. If λ is chosen to be proportional to σ√(log(p)/n), then the predictive risk is bounded [3, 4]. However, since the noise level is usually unknown, the choice of λ remains tricky. An alternative is therefore to solve the Lasso criterion for λ within a reasonable interval by using subsamples [5] or resamples [6]; the selected variables are then defined as the variables with the highest selection frequencies. This alternative is no longer sensitive to the choice of λ, but the main challenge lies in the threshold on the frequency defining the selected variables. In this paper, we consider a model selection procedure composed of three steps. The first step consists of solving the Lasso criterion on a relevant grid Λ; each λ ∈ Λ defines a variable subset, yielding a collection M of relevant subsets of variables with a wide range of sizes. In the second step, the least-squares estimator onto each variable subset of M is calculated, leading to a collection of estimators (β̂_m)_{m∈M}. Lastly, the following penalized least-squares minimization is solved to select the best m of M:

(1.3)   m̂ = arg min_{m ∈ M} { ||Y − β̂_m||₂² + pen(D_m) },

where D_m is the dimension of the model m and pen is a penalty function increasing with D_m. Selecting m from M by minimizing (1.3) corresponds to selecting λ from Λ by minimizing (1.2). Hence, the main challenge is the definition of a penalty pen that achieves an optimal trade-off between goodness of fit and sparsity within M.
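The three-step procedure above can be sketched as follows. This is a minimal illustration on simulated data, assuming a known noise level σ and using scikit-learn's `lasso_path` to build the collection M; the grid size, sample sizes, signal strength and the penalty constant K = 2 are illustrative choices, not the paper's calibrated values.

```python
import numpy as np
from sklearn.linear_model import lasso_path

rng = np.random.default_rng(0)
n, p, sigma = 100, 30, 1.0
X = rng.standard_normal((n, p))
beta = np.zeros(p); beta[:5] = 3.0           # 5 active variables
Y = X @ beta + sigma * rng.standard_normal(n)

# Step 1: Lasso path -> collection M of candidate supports (one per lambda).
_, coefs, _ = lasso_path(X, Y, n_alphas=50)  # coefs has shape (p, n_alphas)
supports = {tuple(np.flatnonzero(coefs[:, k])) for k in range(coefs.shape[1])}
supports.discard(())

# Steps 2-3: least-squares refit on each support, then minimize the penalized
# criterion ||Y - X_m beta_m||^2 + K * sigma^2 * |m|  (here K = 2).
K = 2.0
def crit(m):
    Xm = X[:, list(m)]
    resid = Y - Xm @ np.linalg.lstsq(Xm, Y, rcond=None)[0]
    return resid @ resid + K * sigma ** 2 * len(m)

best = min(supports, key=crit)
print(sorted(best))
```

With a strong signal, the selected support contains the five active variables, possibly with a few spurious additions controlled by the penalty.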
Popular methods of model selection include V-fold cross-validation [7, 8], AIC [9], Mallows' Cp [10], BIC [11] and eBIC [12]. For these penalty functions, the predictive risk is bounded when σ² is known and the sample size n tends to infinity. When n is fixed, relatively small, and possibly smaller than the dimension p, a non-asymptotic point of view is preferable in order to obtain properties for all pairs (n, p). In this direction, [13] propose penalty functions depending on the collection complexity such that m̂ guarantees a non-asymptotic optimal control of the predictive risk. If the model collection is nested and the variance is known, pen(D_m) = 2σ²D_m achieves an optimal non-asymptotic control of the predictive risk [9]. If the model collection is fixed and large (for instance growing exponentially with D_m) and the variance is unknown, this optimal control is obtained with data-driven penalties [13, 14]. Lastly, if the model collection is data-dependent and the variance σ² is unknown, the LinSelect penalty [15, 16] guarantees an optimal control of the predictive risk.
Multiple testing methods to control the False Discovery Rate. In a multiple testing procedure, the p tests H_0 = {β*_j = 0} versus H_1 = {β*_j ≠ 0} are performed independently to obtain a list of p-values. Variables associated with a p-value smaller than a threshold are selected, and the challenge is to find this threshold so as to obtain an upper bound on a function of the number of selected non-active variables. Several methods control the Family-Wise Error Rate (FWER), which is the probability of selecting at least one non-active variable [17, 18]. However, these methods tend to be conservative, leading to a tiny set of selected variables. An alternative is to control the FDR, which is the expectation of the proportion of non-active variables among the selected ones. The authors of [19] first provide a threshold assuming independence of the p-values. This hypothesis is then relaxed in [20, 21, 22, 23]. Instead of considering the p-values, the knockoff filter method [24] introduces copies of the columns of X, constructed to behave like non-active variables, in order to calibrate a threshold on test statistics.
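The threshold of [19] is the Benjamini–Hochberg step-up rule: reject the k smallest p-values, where k is the largest index with p_(k) ≤ αk/p. A minimal sketch, with an illustrative list of p-values:

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.1):
    """Return the indices of rejected hypotheses at FDR level alpha (BH step-up)."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    thresholds = alpha * np.arange(1, m + 1) / m
    below = p[order] <= thresholds
    if not below.any():
        return np.array([], dtype=int)
    k = np.max(np.nonzero(below)[0])      # largest k with p_(k) <= alpha * k / m
    return np.sort(order[:k + 1])

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.3, 0.74, 0.9, 1.0]
print(benjamini_hochberg(pvals, alpha=0.05))   # rejects the two smallest p-values
```

Note the step-up structure: a p-value above its own threshold can still be rejected if a larger index passes, which is what makes BH less conservative than FWER corrections.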
The simultaneous control of several cost functions. Controlling PR or FDR is commonly performed independently in the literature, and the two controls yield different sets of selected variables. For a PR control, the selected variables aim at correctly predicting a new observation of Y, without any guarantee that the set of selected variables contains no non-active variables. Conversely, when the cost function is the FDR, the number of non-active variables is controlled, at the price that some active variables are not selected. Therefore, recent works have proposed to combine prediction and FDR approaches to select all active variables without selecting non-active ones. For instance, [25] propose a multi-step algorithm in which a thresholding procedure is applied to Lasso estimators computed for specific values of λ. In addition to prediction performances, a consistency property of the selected variable set is satisfied under some conditions on X. Another idea is post-selection inference [26, 27], whose principle is to test the relevance of each variable selected by a model selection procedure. Valid confidence intervals are provided from conditional hypothesis tests for each model of the collection, in addition to a PR control. This work has been generalized by [28, 29, 30], and a review can be found in [31]. In a completely different direction, [32, 33] propose to control the False Negative Rate (FNR) in addition to the FDR. A good FNR control ensures that most of the active variables are selected, so minimizing a weighted sum of FDR and FNR provides a set of variables close to the set of active variables. However, improving the FDR control deteriorates the FNR control and vice versa; hence, optimal controls of both criteria are impossible to achieve. Some other papers propose to combine the FDR with the PR. An additional motivation for considering the PR is its behavior between the learning phase and the over-fitting phase: in the learning phase, the addition of a variable to the selected set drastically
reduces the PR, whereas in the over-fitting phase, it increases proportionally to the noise level. First, in the standard multivariate normal means problem with known variance, [34] propose a penalty function for the model selection procedure built from a multiple testing procedure; they simultaneously obtain sharp asymptotic bounds on the FDR and the PR. Then, [35] propose the Sorted ℓ₁ penalized estimator (SLOPE), which is the minimizer of the Lasso criterion (1.2) where λ is replaced by a p-vector built from a multiple testing procedure. For an orthogonal design, their approach achieves a non-asymptotic control of the FDR and attains the minimax convergence rate for the total mean squared error [36]. This asymptotic control of the FDR has been generalized under a wide range of hypotheses, for instance for a random design in [37].
Ordered variable selection. The ordered variable framework has attracted much attention recently, especially to address high-dimensional problems. A large class of methods in the literature deals with variables having a natural ranking: [38] for the regression framework, [39] with the nested lasso penalty and [40] for covariance matrix estimation. This assumption drastically reduces the estimation complexity. We develop our theoretical framework under this assumption. It can be applied to data sets where an ordering of the variables is available a priori (e.g., from suitable forms of prior knowledge). However, in most applications no canonical ranking of the variables is available, and having a natural order on the variables becomes a strong assumption. In this case, an alternative consists of proposing a candidate order from random procedures and applying theoretical statistical methods on top of these random variable rankings. Several approaches have been proposed in the literature to provide such rankings. The most common ones are based either on a regularization path, built by solving Lasso-type problems [2], or on decision trees [41]. However, these approaches suffer from instability, in that a small modification of the initial sample can radically change the variable order [42]. To circumvent this instability problem, one solution is to add a sampling procedure such as the bootstrap [43]. We adopt this point of view in this article to generalize our theoretical results to non-ordered variable selection.

Main contributions.
The originality of this paper is to obtain a control of the FDR in addition to the PR control in model selection through a convenient calibration of the penalty. We assume that the variables are ranked according to their importance for the response variable Y: X_1 is the most important one, X_2 the second one, • • • , and X_p the least important one. In Gaussian linear regression, the order is given by the partial correlation between Y and each X_j. A natural model collection is the one containing the nested models respecting the variable order. This framework sounds restrictive but allows us to derive theoretical expressions of the FDR in the considered model selection procedure. According to [13], all the penalty functions defined by:

(1.4)   pen(D_m) = Kσ²D_m

provide a non-asymptotic control of the PR for K > 1 when the variables are ranked.
Theoretical bounds on the FDR in model selection: Although the model selection procedure is built for a PR control, we obtain non-asymptotic lower and upper bounds on the FDR with respect to K > 0 when σ² is known. We show that these bounds only involve evaluations of cumulative distribution functions of the standard Gaussian and of some chi-squared variables. Whatever the noise level, the FDR is always strictly positive. When K tends to infinity, the FDR converges to 0 at an exponential rate. Hence, a satisfactorily low value of the FDR is achieved as soon as the value of K is moderately large.

Calibration of the hyperparameter K :
The obtained theoretical bounds depend on the parameters β* and σ². We replace them with estimators to obtain completely data-dependent bounds on the FDR. Then, we propose a calibration of the hyperparameter K to control a trade-off between FDR and PR. Our algorithm is validated in an extensive simulation study and is compared with several existing variable selection procedures.
Towards non-ordered variable selection: From a practical point of view, a crucial assumption of this work is the knowledge of the variable ranking. We empirically investigate an extension in which the variable ordering is not given beforehand but estimated using a data-driven procedure to build random model collections.
1.4 Outline of the paper.
The rest of the paper is organized as follows. Section 2 introduces the Gaussian linear regression model and some notations. Section 3 contains the theoretical results. Since an increase of the hyperparameter K leads to a decrease of the FDR, we study the FDR in model selection as a function of K. As the FDR has an intractable expression, bounds are obtained when the variable order and the variance are known. We establish an exponential convergence rate of the FDR function when K tends to infinity. The special case of an orthogonal design matrix is studied to illustrate the main results. In Section 4, an algorithm is proposed to calibrate the hyperparameter K in the penalty function to achieve a suitable trade-off between FDR and PR controls. It is based on simultaneous evaluations of the prediction performance and the FDR of the models, which are calculated from properly chosen estimators of σ² and β*.
We then present a study generalizing our procedure to non-ordered variable selection and compare our algorithm with some existing variable selection procedures. Section 5 contains a conclusion and a discussion of prospective work.
In Section 6, proofs of all the theoretical results are provided. Lastly, the simulation protocol considered in this paper is described in Section 7.
2 Model and notations.
Let us consider the Gaussian linear regression model given in (1.1). We define q = min(n, p) and assume that (X_1, • • • , X_q) is a family of linearly independent vectors. We consider the deterministic and nested model collection of linear spaces:

(2.1)   M = { Span(X_1, • • • , X_d), 0 ≤ d ≤ q },

with the convention that the model of dimension d = 0 is reduced to {0}. By construction, the true model m* = Span{X_j, j s.t. β*_j ≠ 0} belongs to M. For each m ∈ M, D_m is the dimension of m and β̂_m is the least-squares estimator onto m:

β̂_m = arg min_{u ∈ m} ||Y − u||₂² = Π_m Y.

With the definition of q and the properties of the family (X_1, • • • , X_q), β̂_m is unique for each m ∈ M. For all K > 0, we define the function crit_K on M as crit_K(m) = ||Y − β̂_m||₂² + Kσ²D_m, and the selected model m̂(K) by:

(2.2)   m̂(K) = arg min_{m ∈ M} crit_K(m).

We define PR(m), the predictive risk associated with the model m ∈ M, by:

(2.3)   PR(m) = E[ ||Xβ* − β̂_m||₂² ],

where E denotes the expectation under the distribution of Y satisfying (1.1). We successively define FP(m), the number of variables contained in m but not in m*; the false discovery proportion FDP(m) = FP(m)/max(D_m, 1); and the False Discovery Rate FDR(m) = E[FDP(m)], where E still denotes the expectation under the distribution of Y satisfying (1.1), so that FDR(m) is deterministic even when m is random.
Finally, the notation ⟨•, •⟩ denotes the canonical scalar product in R^n, Π_X denotes the orthogonal projection onto the space X, Φ denotes the standard Gaussian cumulative distribution function and F_{χ²(k)} is the cumulative distribution function of a chi-squared variable with k degrees of freedom. By convention, an intersection or a union from index k to ℓ with k > ℓ is the intersection or the union over an empty set. In the same way, the set {k, • • • , ℓ} is empty if k > ℓ.
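The nested selection rule (2.2) and the FDP can be sketched on simulated data. This is a minimal illustration assuming a known σ², an ordered design (active variables first) and illustrative sizes; it shows the standard effect that a larger K selects a smaller model and a lower false discovery proportion.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, sigma = 80, 40, 1.0
q = min(n, p)
X = rng.standard_normal((n, p))
beta = np.zeros(p); beta[:4] = 2.5           # ordered: the 4 active variables come first
Y = X @ beta + sigma * rng.standard_normal(n)
d_star = 4                                   # dimension of the true model m*

def select(K):
    """m_hat(K): minimize ||Y - Pi_m Y||^2 + K * sigma^2 * d over nested models m_d."""
    crits = []
    for d in range(q + 1):
        if d == 0:
            rss = Y @ Y
        else:
            Q, _ = np.linalg.qr(X[:, :d])
            rss = np.sum((Y - Q @ (Q.T @ Y)) ** 2)
        crits.append(rss + K * sigma ** 2 * d)
    return int(np.argmin(crits))

def fdp(d):
    """FDP of the nested model of dimension d: false discoveries are indices > d_star."""
    return max(d - d_star, 0) / max(d, 1)

d2, d8 = select(2.0), select(8.0)
print(d2, fdp(d2), d8, fdp(d8))
```

Since the penalty is linear in the dimension, the selected dimension is non-increasing in K, which is the monotonicity driving Section 3.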
In this section, the variance σ² is supposed to be known. We first present the intuitions that lead us to study FDR(m̂(K)) in model selection. Non-asymptotic bounds on FDR(m̂(K)) are obtained in Theorem 3.2, as well as asymptotic behaviors when K tends to infinity in Corollary 3.4. Finally, the particular case where X is an orthogonal design matrix is studied to illustrate the main results.
According to [13], the penalty function (1.4) guarantees a non-asymptotic control of the PR.

Figure 1: Values of the empirical estimators of PR(m̂(K)) and FDR(m̂(K)) according to K for the toy data set described in Section 7.
We propose an example to illustrate our point and intuition. In Figure 1, we plot the empirical estimators of PR(m̂(K)) and FDR(m̂(K)) on a regular grid of positive values of K. Graphs are obtained from the toy data set described in Section 7, and values are reported in Table 1. We observe that the empirical values of PR(m̂(K)) remain low for K ∈ [2, 4], while the FDR(m̂(K)) function decreases with K until it reaches 0. Here, the choice K = 3 is more judicious than K = 2: it ensures a stronger yet positive control of the FDR while yielding similar prediction performances (an FDR of zero is not relevant, as it means that no variable is selected). We also observe that while the FDR decreases with K, the PR increases from a certain value of K ≥ 2. To control PR and FDR simultaneously, the constant K must remain close to 2. Increasing the constant K to limit the selection of non-active variables is well known from the asymptotic point of view. Indeed, the AIC and Mallows' Cp penalties [9, 10], where K equals 2, asymptotically give the best set of variables for prediction performances, while the BIC penalty [11], where K is fixed to log(n), asymptotically recovers exactly the set of active variables. Obtaining the asymptotic properties of AIC, Mallows' Cp and BIC penalties simultaneously is impossible [44], but this suggests that a value of K ∈ [2, log(n)] would yield reasonable (but not necessarily optimal) values for both PR and FDR in a non-asymptotic framework. Accordingly, we propose to study the function FDR(m̂(K)) in the model selection procedure (2.2), where the penalty function is (1.4), in the ordered variable setting.
3.2 Bounds on the FDR in model selection.

FDR expression in model selection for ordered variables.
Let us assume that K > 0 and that crit_K is injective on M. If D_{m*} = q, then FDR(m̂(K)) = 0. Otherwise, FDR(m̂(K)) admits an explicit expression (3.1) within the model selection procedure. A detailed proof of (3.1) can be found in Subsection 6.1.

By using a decomposition of the FDR, Proposition 3.1 rewrites FDR(m̂(K)) as the sum (3.2) of products of terms P_r(K), defined in (3.3), and Q_r(K, β*, σ²), for each r. A proof of Proposition 3.1 can be found in Subsection 6.1.

General bounds.
In (3.2), the P_r(K) terms do not depend on the data. Conversely, the Q_r(K, β*, σ²) terms depend on the data. Thus, to understand the behavior of the FDR of m̂(K), we bound the Q_r(K, β*, σ²) terms in the following theorem.

Theorem 3.2. Let us consider the ordered variable framework and the model collection (2.1), where q = min(n, p).
Let us suppose that m* ∈ M and D_{m*} < q. Then FDR(m̂(K)) satisfies the lower and upper bounds (3.4), where Φ stands for the standard Gaussian cumulative distribution function, P_r(K) is defined in (3.3) for all K > 0, and the bounding terms f̲_r(K, β*, σ²) and f̄_r(K, β*, σ²) are given in explicit form. A proof of Theorem 3.2 can be found in Subsection 6.2.
Hence, although the model selection procedure is built for prediction performances, bounds on the FDR of m̂(K) are derived. The terms f̲_r(K, β*, σ²) and f̄_r(K, β*, σ²) only involve evaluations of cumulative distribution functions of the standard Gaussian and of chi-squared variables. They therefore have a fully explicit form, which simplifies the understanding of the behavior of the FDR in model selection. However, they depend on the unknown parameters β* and σ², for which estimators are proposed in Subsection 4.1.2.

Strictly positive FDR.
The following corollary gives a lower bound on the FDR that does not depend on σ². Corollary 3.3. Under the assumptions and definitions of Theorem 3.2, for all K > 0, FDR(m̂(K)) is bounded from below by a strictly positive quantity. A proof of Corollary 3.3 can be found in Subsection 6.3. From Corollary 3.3, FDR(m̂(K)) is always strictly positive, whatever the values of K > 0 and σ².

Asymptotic analysis.
The following corollary gives the asymptotic behavior of the FDR in model selection when K tends to infinity. Corollary 3.4. Under the assumptions and the definitions of Theorem 3.2, the FDR(m̂(K)) function tends to 0 when K tends to infinity and satisfies the upper bound (3.6) for all η > 0. Furthermore, ∀η > 0, ∃C_η > 0, ∃L_η > 0, ∀K > L_η, the lower bound (3.7) holds. A proof of Corollary 3.4 can be found in Subsection 6.4. From Equation (3.6), FDR(m̂(K)) tends to 0 when K tends to +∞ at least at an exponential rate, and Equation (3.7) suggests that this exponential convergence rate is optimal. Remark 3.5. With no signal (β* = 0 and D_{m*} = 0), the asymptotic bounds in (3.8) are −1/2 − ε and −1/2 + ε. Remark 3.6. The asymptotic upper and lower bounds (3.6) and (3.7) are satisfied whatever the value of σ² > 0. It is possible to obtain a sharper asymptotic upper bound, for all η > 0, in the asymptotic regime where K → +∞ and σ → 0. The reader can find a proof in Subsection 6.4.

Illustrations of the main result in the orthogonal case.
We analyze the particular case where the design matrix X is orthogonal, since it leads to simplified forms of the FDR bounds that are easy to calculate. Corollary 3.7 (Application to the orthogonal case). Under the assumptions of Theorem 3.2 and assuming that (X_1, • • • , X_q) are orthonormal with respect to ⟨•, •⟩, then, ∀K > 0, FDR(m̂(K)) satisfies the same inequalities as (3.4), where the bounding terms take a simplified form and all other terms are the same as those defined in Theorem 3.2.
A proof of Corollary 3.7 can be found in Subsection 6.5.
In Figure 2, we plot the empirical estimation of FDR(m̂(K)) together with the functions b(K, β*, σ²) and B(K, β*, σ²) on a grid of positive K (left) and for K ≥ 2 (right). Graphs are obtained from the toy data set described in Section 7, where X is orthogonal. The left panel illustrates Corollary 3.7: the FDR values lie well below the upper bound values and above the lower bound ones. From the right panel, in accordance with Corollary 3.4, the empirical values of FDR(m̂(K)) tend to 0 when K increases and the convergence rate appears to be exponential. Moreover, the curves of b(K, β*, σ²) and B(K, β*, σ²) bracket the empirical FDR, and the difference between the three functions quickly becomes negligible for K larger than 2.
4 Trade-off between the PR and the FDR controls.
While the bounds b(K, β*, σ²) and B(K, β*, σ²) are easily interpretable and fully implementable, they depend on β*, σ² and D_{m*}, which are unknown in practice. For practical use, we propose to replace the theoretical bounds on the FDR, as well as the theoretical expression of the PR, with observable quantities (Subsection 4.1). Then, we propose an algorithm to calibrate the hyperparameter K from the data set such that both PR and FDR are controlled (Subsection 4.2).
As variables are usually not naturally ranked, we explore the robustness of our algorithm under perturbations of the correct variable ordering and present approaches for obtaining a variable ordering in a data-driven manner. Results are provided in Subsection 4.3. Lastly, our algorithm is compared with some existing variable selection procedures in Subsection 4.4, in terms of both PR and FDR.

Commonly, the predictive risk is evaluated with the mean squared error on a validation set independent from the training set used to estimate the parameters (see Formula (7.1) for the definition). However, this requires splitting the data set in two parts, which increases the estimation errors. Here, we propose to use the entire data set both to apply the model selection procedure and to evaluate the predictive risk. Intuitively, the response vector Y is replaced with Xβ̂_{m̂(2)}, which provides good prediction performances [13]. Moreover, by re-expressing the PR, it is straightforward to derive the decomposition (4.1), valid for all K > 0 and K′ > 0. According to [45], the constant 2 provides the optimal asymptotic control of (2.3), so Xβ̂_{m̂(2)} − Y almost belongs to the subspace Im(X)^⊥. In addition, Xβ̂_{m̂(K)} and Xβ̂_{m̂(K′)} belong to Im(X), so the last term in (4.1) is close to 0 and negligible compared to the other two. Hence, for all K > 0 and up to a negligible additive term, the constant minimizing the PR and the constant minimizing the distance to Xβ̂_{m̂(2)} are almost equal. Therefore, to evaluate the prediction performances of m̂(K), we propose to compare the estimates Xβ̂_{m̂(K)} and Xβ̂_{m̂(2)}, and we introduce the estimated difference in predictions (4.2). For the rest of the paper, the empirical version of (4.2) is calculated by averaging over 100 independent data sets and is denoted diff-PR. If this difference is significantly smaller than the noise level σ², the model m̂(K) has performances similar to those of m̂(2).
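The comparison of fitted values against the reference model m̂(2) can be sketched as follows. Since the exact normalization of (4.2) is the paper's, the sketch uses an illustrative proxy ||Xβ̂_{m̂(K)} − Xβ̂_{m̂(2)}||₂²/n on simulated data, with a known σ² and a nested ordered collection.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, sigma = 100, 30, 1.0
X = rng.standard_normal((n, p))
beta = np.zeros(p); beta[:5] = 2.0
Y = X @ beta + sigma * rng.standard_normal(n)

def fitted(d):
    """Least-squares fit of Y onto Span(X_1, ..., X_d)."""
    if d == 0:
        return np.zeros(n)
    Q, _ = np.linalg.qr(X[:, :d])
    return Q @ (Q.T @ Y)

def select(K):
    """m_hat(K) over the nested collection, with known sigma^2."""
    crits = [np.sum((Y - fitted(d)) ** 2) + K * sigma ** 2 * d for d in range(p + 1)]
    return int(np.argmin(crits))

# Proxy for the estimated difference in predictions: distance between the
# fitted values of m_hat(K) and those of the reference model m_hat(2).
ref = fitted(select(2.0))
diffs = {K: np.sum((fitted(select(K)) - ref) ** 2) / n for K in (2.0, 4.0, 8.0)}
print(diffs)
```

By construction the difference is exactly 0 at K = 2, and it stays small relative to σ² as long as m̂(K) and m̂(2) differ only by a few over-fitted coordinates.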
4.1.2 Estimation of the FDR.
The choice of these estimators is crucial since they are used as inputs to the algorithm. Justifications of the choices are provided in the Supplementary file [46], in which an extensive simulation study is presented. We propose a completely data-driven calibration of the hyperparameter K up to α and γ. These parameters α and γ are set by the user as the values of the FDR and PR that are still deemed acceptable. The algorithm relies on the functions K ↦ B(K, β̂_{m̂(4)}, σ̂²) and K ↦ diff-PR(m̂(K)) to control both the FDR and the PR. We propose the following algorithm:

Algorithm 1: Algorithm to calibrate K
1. Choose α, the threshold for the FDR control, and γ, the threshold for the estimated difference in predictions (4.2).

2. Compute the functions K ↦ B(K, β̂_{m̂(4)}, σ̂²) and K ↦ diff-PR(m̂(K)) on a grid of values of K, and let I₁ be the set of constants K for which both thresholds are satisfied.
3. If I₁ is empty, take a larger value of either α or γ; otherwise, return min{K, K ∈ I₁}.
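The calibration idea can be sketched with a Monte-Carlo stand-in. The sketch below assumes a known σ² and uses an empirical FDR estimate in place of the theoretical bound B(K, β̂_{m̂(4)}, σ̂²), which would require the closed-form bounds of Theorem 3.2; the grid, the thresholds and the sample sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, sigma, d_star = 100, 30, 1.0, 5
alpha, gamma = 0.05, 0.1                     # user thresholds for FDR / diff-PR
X = rng.standard_normal((n, p))
beta = np.zeros(p); beta[:d_star] = 2.0
mu = X @ beta

def fitted(d, Y):
    if d == 0:
        return np.zeros(n)
    Q, _ = np.linalg.qr(X[:, :d])
    return Q @ (Q.T @ Y)

def select(K, Y):
    return int(np.argmin([np.sum((Y - fitted(d, Y)) ** 2) + K * sigma ** 2 * d
                          for d in range(p + 1)]))

grid = [2.0, 3.0, 4.0, 6.0, 8.0]
fdr = {K: [] for K in grid}
diff = {K: [] for K in grid}
for _ in range(20):                          # Monte-Carlo replicates
    Y = mu + sigma * rng.standard_normal(n)
    ref = fitted(select(2.0, Y), Y)
    for K in grid:
        d = select(K, Y)
        fdr[K].append(max(d - d_star, 0) / max(d, 1))
        diff[K].append(np.sum((fitted(d, Y) - ref) ** 2) / n)

# Step 3 of the algorithm: smallest K whose estimated FDR and estimated
# difference in predictions both fall below the user thresholds.
admissible = [K for K in grid
              if np.mean(fdr[K]) <= alpha and np.mean(diff[K]) <= gamma * sigma ** 2]
K_hat = min(admissible) if admissible else None
print(K_hat)
```

In real applications the per-K FDR estimate is not observable, which is exactly why the paper replaces it with the data-dependent bound B(K, β̂_{m̂(4)}, σ̂²).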
Curves of Figure 3 are generated from the toy data set and from the simulation protocol described in Section 7.
The parameters α and γ are free and are defined by the user as maximum acceptable values for the FDR and the PR. In this example, we choose α = 0.05 and γ = 0.1. The graph at the bottom right shows that there exist constants K for which a trade-off between the theoretical FDR and PR and the empirical FDR and PR can be achieved. Here, the range of suitable values of K is the interval [2, 6]. These values correspond to a selected model dimension close to D_{m*} (Figure 3), for which the estimated upper bound on the FDR is equal to 0.05. Hence, our proposed algorithm maintains the prediction performances of m̂(2), reinforces the control of the FDR criterion, and thus achieves a convenient trade-off between PR and FDR.
In the supplementary file [46], Algorithm 1 is applied to several data sets generated from various sets of parameters and described in Table 4. Each time, the calibrated hyperparameter K is strictly larger than the commonly used constant 2 and provides a low value of the FDR while maintaining the prediction performances given by m̂(2).

Towards non-ordered variable selection
For most applications, no canonical order of the variables is available and our algorithm cannot be applied directly. We propose to generate candidate orders from random procedures so as to use our method when an ordering of the variables is not given a priori.
More precisely, we first study the robustness of our method to the variable ordering (Subsection 4.3.1) and provide some procedures to construct variable orders in practice (Subsection 4.3.2). Algorithm 1 is then applied to the generated rankings in Subsection 4.4.

Robustness to variable ordering
We propose numerical experiments where the assumption of ordered variables is not fulfilled. The goal is to test the robustness of our algorithm to the variable ordering by measuring how much this impacts its performances. We consider the toy data set where the size of the true model is D_{m*} = 10, and three collections obtained by a random permutation of the nested model collection (2.1) on, respectively, the first ten, the first twelve and the first fifteen variables. Hence, the active variables remain first in the first collection; perturbations may introduce non-active variables among the first ten variables in the second collection; while in the third collection, some active variables can be pushed far into the collection. Figure 4 shows how the empirical values of the FDR behave in relation to its estimated upper bound, as well as the empirical and estimated differences in predictions, for the three perturbed collections. We observe that when the permutation concerns only the active variables, the empirical values of the FDR are smaller than the values of B(K, β*, σ²) and B(K, β̂_{m̂(4)}, σ̂²), which are close to each other. For prediction, the diff-PR function behaves the same way as for the nested model collection (2.1).
When the permutation concerns the first twelve or the first fifteen variables, the empirical FDR is higher than B(K, β*, σ²) and B(K, β̂_{m̂(4)}, σ̂²) as soon as K ≥ 2, with a deviation that increases with the error on the estimated variable order. Moreover, we observe that the decay of the empirical FDR is much slower and its values are high whatever the value of K: above 0.13 and 0.28 for the second and third perturbed model collections, respectively. For prediction, the diff-PR function is stable for K ≥ 2.
Hence, permutations among only the active variables have no effect on the FDR and the PR. However, as soon as a non-active variable is ranked before an active variable, the theoretical guarantees of Theorem 3.2 no longer hold and the empirical FDR can be high whatever the value of K. To tackle this problem, one solution consists of combining our algorithm with a method that discriminates between active and non-active variables. We pursue this direction in the rest of this section.
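The robustness experiment can be mimicked on synthetic data. A sketch assuming a known σ² and illustrative sizes (D_{m*} = 10, a permutation of the active block and a permutation of the first 15 positions), with the empirical FDR averaged over a few replicates:

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, sigma, d_star = 100, 30, 1.0, 10
X = rng.standard_normal((n, p))
beta = np.zeros(p); beta[:d_star] = 2.0

def fdp_of_order(cols, Y, K=2.0):
    """Nested selection along the column order `cols`; return the FDP."""
    crits = []
    for d in range(p + 1):
        if d == 0:
            rss = Y @ Y
        else:
            Q, _ = np.linalg.qr(X[:, cols[:d]])
            rss = np.sum((Y - Q @ (Q.T @ Y)) ** 2)
        crits.append(rss + K * sigma ** 2 * d)
    d_hat = int(np.argmin(crits))
    n_false = sum(c >= d_star for c in cols[:d_hat])   # non-active = index >= d_star
    return n_false / max(d_hat, 1)

orders = {
    "correct": list(range(p)),
    "perm actives": list(rng.permutation(d_star)) + list(range(d_star, p)),
    "perm first 15": list(rng.permutation(15)) + list(range(15, p)),
}
fdrs = {name: np.mean([fdp_of_order(cols, X @ beta + sigma * rng.standard_normal(n))
                       for _ in range(30)])
        for name, cols in orders.items()}
print(fdrs)
```

Permuting within the active block leaves the FDR essentially unchanged, while pushing non-active variables before active ones inflates it, mirroring the behavior observed in Figure 4.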

Random variable order
We consider four strategies to estimate variable orders. The Bolasso procedure [6] consists of solving the Lasso problem (1.2), through the LARS algorithm [47], on several resamples and for different values of λ; variables are ranked according to their selection frequency in the models, averaged over the λ's and the resamples. Random forests [43] are aggregations of binary decision trees, where the tree predictors are built on bootstrap resamples and on randomly chosen subsets of variables. Here, we combine the random forest with the recursive feature elimination (RFE) algorithm [48], whose efficiency has been demonstrated especially for correlated variables [49]; variables are ordered according to their importance as defined by the random forest. The Sorted ℓ₁ penalized estimator (SLOPE) [35] is obtained by solving the Lasso problem (1.2) with λ a p-vector calculated from a multiple testing procedure. Lastly, the knockoff method [24] consists of building a non-active copy X̃_j of each X_j and solving the Lasso problem (1.2) for several values of λ on the augmented matrix composed of the X_j and X̃_j variables. Variables are then sorted according to the values of statistics W_j, for all j ∈ {1, • • • , p}, built from Z_j and Z̃_j, which correspond to the largest λ for which X_j and X̃_j are respectively selected.
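A minimal sketch of the Bolasso-style ranking, using scikit-learn's `lasso_path` over bootstrap resamples; the grid size, the number of resamples and the zero tolerance are illustrative choices, not the settings of [6].

```python
import numpy as np
from sklearn.linear_model import lasso_path

rng = np.random.default_rng(4)
n, p = 100, 20
X = rng.standard_normal((n, p))
beta = np.zeros(p); beta[:4] = 3.0           # 4 active variables
Y = X @ beta + rng.standard_normal(n)

# Selection frequency of each variable, averaged over the lambda grid of each
# Lasso path and over bootstrap resamples.
B = 30
freq = np.zeros(p)
for _ in range(B):
    idx = rng.integers(0, n, size=n)         # bootstrap resample
    _, coefs, _ = lasso_path(X[idx], Y[idx], n_alphas=30)
    freq += (np.abs(coefs) > 1e-10).mean(axis=1)
order = np.argsort(-freq / B)                # most frequently selected first
print(order[:6])
```

The resulting ranking places the active variables first with high probability, which is what makes the induced nested collection usable by Algorithm 1.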
For each strategy, a random model collection, on which model selection can be applied, is defined from the estimated variable order. Bolasso and the random forest both provide a variable order from a prediction point of view, whereas SLOPE and the knockoff method provide a variable order by considering both PR and FDR controls.
To quantify the ability to discriminate between active and non-active variables, we calculate the proportion of active variables in models of size 5, 10, 15 and 20.

Table 2: Proportion of active variables in models of size 5, 10, 15 and 20 for random collections built with Bolasso, SLOPE, random forest and the knockoff method. Each value is the average over 100 independent iterations.
The collection built by the knockoff method contains the fewest non-active variables in the models of size 5 and 10. For models of size 15 and 20, Bolasso and SLOPE are slightly better and the proportion of active variables is always larger than 0.9. We observe with the models of size 20 that some active variables are ranked far down the collections, which is undesirable. Since Bolasso is slightly better than SLOPE on the scenarios described in Table 4 (see the supplementary file [46]), we consider the random collections built with the knockoff method and Bolasso in the following.
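The discrimination metric used above (proportion of truly active variables among the first k ranked ones) can be sketched as follows; `prop_active_in_top`, `order` and `active` are hypothetical names introduced for illustration.

```python
def prop_active_in_top(order, active, k):
    """Proportion of truly active variables among the first k ranked ones.

    order  -- list of variable indices, best-ranked first
    active -- set of indices of the truly active variables
    """
    top = order[:k]
    return sum(1 for j in top if j in active) / k
```

A value close to 1 for every model size indicates that the estimated order ranks the active variables first, which is the favorable case for the nested model collection.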

Comparison with other variable selection methods
The performance of Algorithm 1 is compared with three variable selection procedures. The LinSelect penalty [16] is a model selection criterion introduced in a non-asymptotic setting to take into account the randomness of the model collection; the penalty function provides a sharp oracle inequality. V-fold cross-validation [50, 51, 52] is the most popular, adaptive and simple variable selection method: the final selected model is the one with the best prediction accuracy over the data sets obtained by splitting the initial data set into a training set and a validation set. The last method is the knockoff method, where the final variable subset is composed of the variables Xj such that Wj ≥ T, with T defined to satisfy a given control of the FDR. LinSelect and V-fold cross-validation aim at providing a control of PR, while the knockoff method aims at providing a control of FDR.
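The V-fold principle on a nested (ordered) collection can be illustrated as follows. This is a schematic sketch under simplifying assumptions (ordered variables, least-squares fits), not the exact procedure of [50, 51, 52]; `cv_select_dimension` is a hypothetical name.

```python
import numpy as np

def cv_select_dimension(X, Y, dims, V=5, seed=0):
    """V-fold CV over a nested collection: the model of dimension d uses
    the first d columns of X (ordered variables).  Returns the dimension
    with the smallest average validation squared error."""
    rng = np.random.default_rng(seed)
    n = len(Y)
    folds = np.array_split(rng.permutation(n), V)
    err = {d: 0.0 for d in dims}
    for val in folds:
        train = np.setdiff1d(np.arange(n), val)
        for d in dims:
            Xd = X[:, :d]
            # Least-squares fit on the training folds only.
            beta, *_ = np.linalg.lstsq(Xd[train], Y[train], rcond=None)
            resid = Y[val] - Xd[val] @ beta
            err[d] += float(resid @ resid) / len(val)
    return min(dims, key=lambda d: err[d])
```

As discussed below, such prediction-oriented selection tends to pick a model in the over-fitting area: all active variables are included, but often some non-active ones as well.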
We consider 50-fold cross-validation and evaluate the PR and FDR of our algorithm and of the three variable selection procedures on the nested model collection (2.1), where the active variables are properly ranked before the non-active variables, and on random collections built with Bolasso and with the knockoff method.
Table 3 shows the performances of the four variable selection procedures. As the knockoff method is a procedure for both collection generation and variable selection, the knockoff collection is only used with the knockoff variable selection method. On the nested model collection (2.1), our algorithm provides the smallest values of both FDR and PR, and the average of the selected model sizes is the closest to the true model size. LinSelect behaves in a similar way, while the 50-fold cross-validation selects a model located in the over-fitting area, providing high values of both PR and FDR. On random model collections, performances deteriorate for all methods, as expected, and are slightly better on model collections built with the knockoff method than with Bolasso.

Table 3: Results on the dimension, PR and FDR of the selected models obtained by LinSelect, 50-fold CV, the knockoff method and our algorithm, applied on the nested model collection (2.1) and on random collections built with Bolasso and the knockoff method. Each value is the average over 100 independent iterations. The PR and FDR of each selected model are the empirical quantities. Input parameters of our algorithm are fixed to γ = 0.1 and α = 0.05. Note that the knockoff variable selection method is adapted only to the knockoff random model collection.
LinSelect provides the smallest FDR. While LinSelect is designed to control the PR theoretically, we remark that it is apparently also a relevant candidate to control the FDR and to achieve a trade-off between PR and FDR. The 50-fold cross-validation method provides poor results, while the knockoff method selects the empty set of variables. The model selected by our algorithm is larger when the collections are random, yielding high values of FDR. These results show that a meticulous choice of γ and α is important to improve the performance of our algorithm.
The robustness of our algorithm to the variable order, the construction of random model collections and the performances of the four variable selection procedures are studied on several data sets generated from various sets of parameters described in Table 4. Results are presented in the supplementary file [46], and the conclusions remain the same.

Conclusions.
Variable selection in high-dimensional Gaussian linear regression under a sparsity assumption is commonly used to identify a set of variables with good prediction performances or with as few non-active variables as possible. For prediction performances, the PR is usually controlled via penalized least-squares minimization; to avoid the selection of non-active variables, the FDR is usually controlled via a multiple testing approach. Controlling the PR tends to select too many variables, including non-active ones, whereas controlling the FDR tends to select too few variables, leaving out some active ones.
This work shows that a convenient trade-off between PR and FDR can be achieved in ordered variable selection. The originality of this paper is to obtain this trade-off through a proper calibration of the hyperparameter K in the penalty of the model selection criterion (1.4). Firstly, theoretical results lead to non-asymptotic lower and upper bounds on the FDR(m̂(K)) function when σ2 is known; asymptotic behaviors suggest that these bounds are optimal. Secondly, the proposed methodology provides an algorithm to calibrate the hyperparameter K in the penalty function when σ2 is unknown. This algorithm is based on completely data-driven terms: the estimated difference in predictions and the estimated upper bound on the FDR, where the choices of the estimators σ̂2 and β̂m̂(4) are derived from an extensive simulation study. The hyperparameter K is calibrated from the data set to ensure diff-PR(m̂(K)) < γ × σ̂2 under the constraint B(K, β̂m̂(4), σ̂2) < α. Our algorithm is validated on an extensive simulation study and yields a selected model ensuring small values of both the theoretical PR and FDR. The calibrated hyperparameter K is strictly larger than the commonly used constant K = 2. Moreover, the PR and FDR values of the model selected by our algorithm are the smallest among the existing variable selection procedures considered in the paper. Lastly, we propose a preliminary answer on how to construct a random model collection in order to extend our work to non-ordered variable selection. The performances of our algorithm deteriorate as soon as a non-active variable is ranked before an active one, but, combined with procedures with a high ability to discriminate between active and non-active variables, our algorithm is competitive with some existing variable selection procedures.
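Schematically, the final calibration step reduces, in the toy example of Section 7, to intersecting two admissible intervals for K (one from the diff-PR constraint, one from the FDR bound) and returning the left endpoint. The following sketch rests on that assumption and uses hypothetical names; it is not the algorithm of the paper in full.

```python
def calibrate_K(I1, I2):
    """Schematic calibration rule: return the smallest K satisfying both
    constraints, i.e. the left endpoint of I1 ∩ I2 (None if empty).

    I1 -- interval of K values satisfying diff-PR(m̂(K)) < γ·σ̂²
    I2 -- interval of K values satisfying B(K, β̂, σ̂²) < α
    """
    lo, hi = max(I1[0], I2[0]), min(I1[1], I2[1])
    return lo if lo <= hi else None
```

On the toy example where I1 = [3.3, 10] and I2 = [2, 5.8], this rule returns K = 3.3, matching the value reported in Section 7.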
If the selected model is the largest one of the nested model collection (with dimension equal to q), the associated lower and upper bound values equal 0. In this case, a distinction between Dm* = q and Dm* < q is not possible without additional arguments. This is a limitation of our work.
The main perspective of our work is to generalize our theoretical results to non-ordered variable selection. The ordered variable assumption is the key ingredient of our proofs: it appears in the second line of the proof, where the ratio is fixed, allowing randomness only through m̂, which we control thanks to the ordered model selection theory. Hence, relaxing this assumption requires new technical arguments, and this is a real challenge for future work. Moreover, for non-ordered variable selection, the penalty function (1.4) has to include a logarithmic term to take into account that all possible models should be explored, but this is computationally infeasible given the combinatorial nature of the problem. In this case, two hyperparameters have to be calibrated. Another way to generalize our work to non-ordered variable selection is to detect the value of K from which the theoretical FDR becomes larger than the theoretical upper bound on the FDR, and to quantify the gap between the theoretical upper bound and the FDR.
One way to improve the performance of our algorithm could be a meticulous choice of the algorithm input parameters α and γ, which are arbitrarily fixed in our work. Achieving a trade-off between FDR and PR is not trivial, and investigating alternatives in this direction can be considered in future work. In particular, in this work, the model collection is constructed from a prediction-oriented method (minimization of the least-squares values). It could be judicious to introduce the FDR metric from the model collection generation step onwards.
A possible opening is to study, from a theoretical point of view, the characteristics of the hyperparameter K provided by our algorithm (its dependence on β* and σ2). Another possible extension is to study the false negative rate (FNR) function in the model selection procedure, similarly to and in addition to the FDR one. This could provide a more powerful method, in the spirit of [32, 33].
Finally, another generalization is to extend our theoretical results to unknown variance, to random model collections or to non-fixed designs, which are more general frameworks relevant from an application point of view. These extensions are much more intricate.
6 Proofs of theoretical results.
This section contains proofs of all the theoretical results of this paper.

FDR expression in model selection.
Proof of Formula 3.1.
If Dm* = q, then FP(m) = 0 and FDR(m) = 0 for all m ∈ M. Let us now suppose that Dm* < q. The FDP expression within the model selection procedure is:

Proof of Lemma 6.1.
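Although the displayed expression is not reproduced here, the empirical FDP of a selected model is the standard ratio FP(m) / max(|m|, 1). The sketch below uses this standard definition with hypothetical names (`selected` for the indices of the chosen model, `active` for the truly active ones); the FDR is then the expectation of this quantity over repetitions.

```python
def fdp(selected, active):
    """False discovery proportion: false positives over max(|m|, 1).

    The max(., 1) in the denominator makes the empty selection give
    FDP = 0, consistently with FP(m) = 0 implying FDP(m) = 0.
    """
    fp = sum(1 for j in selected if j not in active)
    return fp / max(len(selected), 1)
```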
The last line is due to the fact that Y − Xβ̂mr ∈ (mr)⊥ ⊂ (mℓ)⊥, since mℓ ⊂ mr and Xβ̂mr is the projection of Y onto mr. Then, (*) comes from the definition of (u1, · · · , un) and (**) is obtained by Parseval's identity.
Proof of Lemma 6.2.
Proof of Proposition 3.1.
Starting from (3.1), we decompose the event: By using the definition of the critK function, we have for r: So, by applying Lemma 6.1: and by applying Lemma 6.2 for ℓ ∈ {r, · · · }: In this way: Since the first event of the previous decomposition depends only on the random variables ⟨Y, ui⟩ for i ∈ {1, · · · , r − 1}, whereas the second one depends only on the random variables ⟨Y, ui⟩ for i ∈ {r + 1, · · · , q}, the two events are independent. Hence, from (3.1), we obtain for all K > 0: Moreover, since ⟨Xβ*, uk⟩ = 0 for all k > Dm* and since r ≥ Dm* + 1, we have: So, for all K > 0 and for each r ∈ {Dm* + 1, · · · , q}: where the Zk are i.i.d.
Hence, for all K > 0 and for each r ∈ {Dm* + 1, · · · , q}, the quantity does not depend on the data, and we deduce Formula (3.2) with: where the Zk are i.i.d.
Bounds on the Qr terms. For all K > 0 and for each r ∈ {Dm* + 1, · · · , q}, we recall that: and since ⟨Xβ*, uk⟩ = 0 for all k > Dm*, we have: where ∩ and ⊓ denote respectively an arbitrary intersection and a disjoint intersection of events, and ∪ and ⊔ denote respectively an arbitrary union and a disjoint union of events.
Proof. We prove Lemma 6.3 by induction on s ≥ 1.
For s = 1, both sets correspond to E1, so the inclusion is obvious. Let s ≥ 1 and suppose that the inclusion holds for s.
Thus, the property holds for s + 1, which proves the lemma.

Strictly positive FDR.
Proof of Corollary 3.3. From Theorem 3.2, we have for all K > 0: For the rest of the proof, we use the following lemma.

Lemma 6.4 (Frank R. Kschischang [53]). The complementary error function erfc(x) is defined, for x ≥ 0, as: where F_N(0,1/2) denotes the cumulative distribution function of the centered Gaussian with variance 1/2. Then,

We remark that for all x ≥ 0, 1 − Φ where (**) is provided by Lemma 6.4. So, from (6.9) and (6.10), we obtain: This lower bound is strictly positive and, since the Pr(K) terms are all strictly positive too, we deduce that the FDR function is strictly positive.
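For reference, the identity behind Lemma 6.4 can be written explicitly. This is a standard fact, using that a N(0, 1/2) variable equals Z/√2 with Z ∼ N(0, 1):

```latex
\operatorname{erfc}(x)
  \;=\; \frac{2}{\sqrt{\pi}} \int_x^{\infty} e^{-t^2}\,dt
  \;=\; 2\Bigl(1 - F_{\mathcal{N}(0,\frac12)}(x)\Bigr)
  \;=\; 2\bigl(1 - \Phi(x\sqrt{2})\bigr),
  \qquad x \ge 0,
```

where Φ denotes the standard Gaussian cumulative distribution function, so that lower bounds on 1 − Φ translate into lower bounds on the Pr(K) terms.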

Asymptotic analysis.
Proof of Corollary 3.4. For all r ∈ {Dm* + 1, · · · , q}, by using the definitions from Theorem 3.2: Upper bound on fr: for each r ∈ {Dm* + 1, · · · , q} and for all K > 0: So, for each r ∈ {Dm* + 1, · · · , q} and for all K > 0: By the exponential inequality of [54] for X ∼ χ2(ℓ) and ℓ ∈ N*: We apply (6.14) for each ℓ. We obtain for all K > 1: where (*) comes from the fact that a minimum over a set is smaller than any value in the set; we choose the value corresponding to ℓ = 0. So, from (6.12), (6.15) and (6.16), we obtain for each r ∈ {Dm* + 1, · · · , q} and for all K > 1 satisfying σ2(K − 1) >

The empirical estimator of PR(m̂(K)) is the average of the MSE(m̂d(K)).
To validate the quality of the empirical estimations, the central limit theorem is applied to obtain the 95% asymptotic confidence intervals.
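For reference, the chi-square exponential inequality of [54] invoked in the proof of Corollary 3.4 above is presumably the classical bound of Laurent and Massart: for X ∼ χ2(ℓ) and all x > 0,

```latex
\mathbb{P}\bigl(X \ge \ell + 2\sqrt{\ell x} + 2x\bigr) \le e^{-x},
\qquad
\mathbb{P}\bigl(X \le \ell - 2\sqrt{\ell x}\bigr) \le e^{-x}.
```

Applied with a suitable choice of x depending on K, the first bound yields the exponential decay used for the upper bound on the fr terms.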

Figure 1: Curves of the empirical values of FDR(m̂(K)) and PR(m̂(K)) for the toy data set described in Section 7. The vertical lines correspond to K = 3.

we obtain the following proposition:

Proposition 3.1. Let us consider the ordered variable framework and the model collection (2.1) where q = min(n, p), m* ∈ M and D

Figure 2: Left: curves of the empirical values of FDR(m̂(K)) (red) and of the terms b(K, β*, σ2) (green) and B(K, β*, σ2) (blue) for the orthogonal design matrix X for the toy data set described in Section 7. Right: the same curves plotted only for K ≥ 2.

4.2 A data-dependent calibration of K in the model selection procedure.
at the bottom left). By applying our algorithm on this example, we get I1 = [3.3, 10] and I2 = [2, 5.8], so our proposed algorithm returns K = 3.3. The evaluation of the prediction performances provided by the selected model m̂(3.3) is equal to 1.14 and we get B(3.3, β̂m̂(4), σ̂2) = 0.03. The constant K = 3.3 corresponds to a low value of both the empirical predictive risk and FDR functions. Indeed, the empirical predictive risk of m̂(3.3) is equal to 1.24 and the empirical FDR of m̂(3.3) is equal to 0.01. To compare with the usual choice K = 2, the empirical predictive risk of m̂(2) is equal to 1.25 and the empirical FDR of m̂(

Figure 3: Top: curves of the empirical functions FDR(m̂(K)) (red) and diff-PR(m̂(K)) (blue), and of the functions B(K, β̂m̂(4), σ̂2) (pink) and diff-PR(m̂(K)) (violet) for K ≥ 2 for the toy data set. Bottom: curves of Dm̂(K) as a function of K, averaged over the 1000 data sets (left), and values of the empirical FDR(m̂(K)) as functions of diff-PR(m̂(K)) (right) for all K > 0 and for the toy data set.

Table 4: Description of the four scenarios.