Prediction Weighted Maximum Frequency Selection

Shrinkage estimators that can produce sparse solutions have become increasingly important to the analysis of today's complex datasets. Examples include the LASSO, the Elastic-Net and their adaptive counterparts. Selecting the penalty parameters, however, still presents difficulties. While variable selection consistent procedures have been developed, their finite sample performance can often be less than satisfactory. We develop a new strategy for variable selection using the adaptive LASSO and adaptive Elastic-Net estimators with $p_n$ diverging. The basic idea is to use the trace paths of their LARS solutions to bootstrap estimates of maximum frequency (MF) models conditioned on dimension. Conditioning on dimension effectively mitigates overfitting; to deal with underfitting, the MFs are then prediction-weighted. We show that not only can consistent model selection be achieved, but attractive convergence rates can as well, leading to excellent finite sample performance. Detailed numerical studies are carried out on both simulated and real datasets. Extensions to the class of generalized linear models are also detailed.


Introduction
Consider the standard linear regression model
$$ y = X\beta + \varepsilon, \qquad (1.1) $$
where $y = (y_1, \ldots, y_n)^T$ is a vector of responses, $X = (X_1, \ldots, X_{p_n})$ is an $n \times p_n$ design matrix of predictors, $\beta = (\beta_1, \ldots, \beta_{p_n})^T$ is a vector of unknown regression parameters, and $\varepsilon = (\varepsilon_1, \ldots, \varepsilon_n)^T$ is a vector of independent and identically distributed (i.i.d.) random errors. We allow $p_n$ to increase with $n$. Because some elements of $\beta$ may be 0, a family of penalized least squares estimators has been developed for simultaneous variable selection and estimation,
$$ \hat\beta = \arg\min_{\beta} \|y - X\beta\|^2 + \sum_{j=1}^{p_n} \rho(|\beta_j|, \lambda), \qquad (1.2) $$
where $\|\cdot\|$ is the $L_2$-norm, $\lambda \ge 0$ is a regularization parameter, and $\rho(|\beta_j|, \lambda)$ is positive for $\beta_j \neq 0$. [1] pointed out that through variable selection one can focus on a small number of important predictors to enhance scientific discovery, and can potentially improve prediction by removing noise variables. Penalized $L_q$-regression is a special case of (1.2) with $\rho(|\beta_j|, \lambda) = \lambda|\beta_j|^q$, $q \ge 0$; it includes best subset selection for $q = 0$, the LASSO [2] for $q = 1$, and ridge regression [3] for $q = 2$. Best subset selection is known to be computationally infeasible for high dimensional data and inherently discrete in variable selection [4]. Ridge regression does not possess a variable selection property. The LASSO, however, can perform simultaneous estimation and variable selection because its $L_1$ penalty is singular at the origin and can shrink some coefficients to exactly 0 for a sufficiently large $\lambda$ [5]. Other penalized least squares estimators capable of simultaneous estimation and variable selection include the SCAD [5] and the adaptive LASSO [6], both enjoying the oracle properties [5]; the Elastic-Net [7], capable of detecting grouped effects; and the adaptive Elastic-Net [8], which combines advantages of the adaptive LASSO and the Elastic-Net.
Selection of $\lambda$ is essential in the above penalized least squares estimation procedures. Although methods such as the SCAD, adaptive LASSO and adaptive Elastic-Net enjoy the oracle properties asymptotically, their optimality relies on a particular specification of $\lambda$, whose magnitude controls the complexity of the selected model and the trade-off between bias and variance in the estimators [9]. Multi-fold cross-validation (CV) and generalized cross-validation (GCV) are frequently applied for tuning parameter selection [2,5,6,7], but they overfit the model asymptotically [10]. For consistent variable selection, [11] suggested using the BIC in the adaptive LASSO, and a modified BIC when $p_n$ is diverging [12]; [13] introduced an extended BIC (EBIC) for linear models and later generalized it to generalized linear models [14]; [15] put forward a generalized information criterion (GIC) with $p_n$ diverging; [16] provided a consistent cross-validation procedure (CCV) for the LASSO; and [17] proposed stability selection (SS) for their randomized LASSO. Although variable selection consistency has been established for these procedures, their finite sample performance can often be less than optimal (Section 6 ahead demonstrates this in simulation studies).
We propose a new method for tuning parameter selection, focusing in particular on the adaptive LASSO and adaptive Elastic-Net estimators. A simple example helps to illustrate the basic idea. Consider the adaptive LASSO in the following example. Figure 1 (top) shows the full adaptive LASSO solution path from the LARS algorithm [18]. In the figure, each step indicates a dimension change in the estimator. These steps are called transition points in [19]. They showed that when information criteria such as the AIC or BIC are used to identify the optimal $\lambda$ for the adaptive LASSO, the optimum lies at one of the transition points. This result helps to justify use of the LARS algorithm and our subsequent focus on the transition points. The question then becomes how to choose among these transition points. Figure 1 (middle and bottom) gives a first look at our proposed method. The middle panel shows the estimated maximum frequency (MF) of a candidate model given the dimension. The MF estimation is done by a bootstrapping algorithm using the transition points. The strategy of conditioning on dimension has two important consequences: i) for overfit dimensions, the MFs are dramatically smaller than the MF at the true dimension (other than the full model, of course), and ii) underfit dimensions can also produce large MF values. Point i) is important because, in variable selection, overfitting is usually much more difficult to deal with. One must then deal with the underfitting issue. We do this by introducing a prediction-based weight for the MFs (labeled WMF). The results are shown in the bottom panel of Figure 1. As is evident, the true dimension, which maps to the true model, now stands out clearly from all others.
The rest of the paper is organized as follows. In Section 2, we briefly review the adaptive LASSO and adaptive Elastic-Net estimators and introduce the bootstrap algorithm for each. In Section 3, the MF procedure itself is described and its underlying properties are carefully examined using a simple orthogonal design. In Section 4, asymptotic properties of the MF procedure are established in general settings. The WMF procedure and its variable selection consistency are presented in Section 5. Comprehensive simulation studies are shown in Section 6. Section 7 describes extensions of the WMF procedure to generalized linear models (GLMs). Applications of the WMF procedure to ultra-high dimensional data are discussed in Section 8.
The adaptive LASSO and adaptive Elastic-Net

The adaptive LASSO estimator [6] is
$$ \hat\beta_a = \arg\min_{\beta} \|y - X\beta\|^2 + \lambda_n \sum_{j=1}^{p_n} \hat\omega_j |\beta_j|, \qquad (2.1) $$
where $\hat\omega_j = |\hat\beta_j|^{-\gamma}$, $\gamma > 0$, and $\hat\beta = (\hat\beta_1, \ldots, \hat\beta_{p_n})^T$ is an initial (e.g., OLS or ridge) estimate of $\beta_0$. The Elastic-Net estimator [7] is
$$ \hat\beta_e = \arg\min_{\beta} \|y - X\beta\|^2 + \lambda_{n2}\|\beta\|^2 + \lambda_{n1}\sum_{j=1}^{p_n}|\beta_j|. \qquad (2.2) $$
It overcomes several limitations of the LASSO: 1) the added $L_2$ penalty makes the criterion strictly convex, allowing grouping effects; 2) when $p_n > n$, it can potentially select all $p_n$ predictors, whereas the LASSO can select at most $n$.
[8] proposed the adaptive Elastic-Net to combine the strengths of the Elastic-Net and the adaptive LASSO. The adaptive Elastic-Net estimator is
$$ \hat\beta_{ae} = \arg\min_{\beta} \|y - X\beta\|^2 + \lambda_{n2}\|\beta\|^2 + \lambda_{n1}^{+}\sum_{j=1}^{p_n}\hat\omega_j|\beta_j|, \qquad (2.3) $$
where $\hat\omega_j = |\hat\beta_{ej}|^{-\gamma}$, $\gamma \ge 0$, and $\hat\beta_e = (\hat\beta_{e1}, \ldots, \hat\beta_{ep_n})^T$ is the Elastic-Net estimator in (2.2). Note that $\lambda_{n2}$ takes the same value in the $L_2$ penalties of (2.2) and (2.3), because the $L_2$ penalty contributes the same kind of grouping effect. On the other hand, $\lambda_{n1}$ and $\lambda_{n1}^{+}$ are allowed to differ, as they control the sparsity of the estimators. Under some regularity conditions, $\hat\beta_{ae}$ was shown to enjoy the oracle properties.
We now detail bootstrapping for these two estimators. There are typically two ways of generating bootstrap observations for model (1.1) [20].
1. Bootstrapping pairs [21]. Draw $n$ observations $\{(x_i^*, y_i^*), i = 1, \ldots, n\}$ i.i.d. with replacement from the original pairs $\{(x_i, y_i), i = 1, \ldots, n\}$.
2. Bootstrapping residuals [22]. Calculate the $i$th residual $\hat\varepsilon_{0i} = y_i - x_i^T\tilde\beta$, where $\tilde\beta$ is a ridge estimate of $\beta_0$. Generate i.i.d. bootstrap residuals $\{\varepsilon_i^*, i = 1, \ldots, n\}$ from the empirical distribution that puts mass $n^{-1}$ on each centered residual $\tilde\varepsilon_i = \hat\varepsilon_{0i} - \bar\varepsilon_0$, where $\bar\varepsilon_0$ is the average of the $\hat\varepsilon_{0i}$, $i = 1, \ldots, n$. The i.i.d. residual bootstrap data are then $y_i^* = x_i^T\tilde\beta + \varepsilon_i^*$, $i = 1, \ldots, n$.
The bootstrap analog of $\hat\beta_a$, denoted $\hat\beta_a^*$, is obtained by substituting $y^*$ for $y$ (and, for paired bootstrapping, $X^*$ for $X$) in (2.1); the bootstrap analog of $\hat\beta_{ae}$, denoted $\hat\beta_{ae}^*$, is defined analogously from (2.3). In the next section, we introduce the MF procedure, which makes use of these bootstrap estimators.
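The residual bootstrap step above can be sketched as follows; this is only an illustration of the resampling scheme, with the ridge penalty `ridge_lambda` an arbitrary illustrative choice rather than the paper's tuned value.

```python
import numpy as np

def residual_bootstrap(X, y, ridge_lambda=1.0, rng=None):
    """One residual-bootstrap replicate for the linear model y = X beta + eps.

    A minimal sketch following the text: beta_0 is estimated by ridge
    regression, residuals are centered, then resampled i.i.d. with
    replacement to form y*.
    """
    rng = np.random.default_rng(rng)
    n, p = X.shape
    # Ridge estimate of beta_0 (the tilde-beta of the text).
    beta_ridge = np.linalg.solve(X.T @ X + ridge_lambda * np.eye(p), X.T @ y)
    resid = y - X @ beta_ridge               # hat-eps_{0i}
    resid_centered = resid - resid.mean()    # tilde-eps_i
    eps_star = rng.choice(resid_centered, size=n, replace=True)
    return X @ beta_ridge + eps_star         # y*
```

The paired bootstrap is even simpler: resample row indices of $(X, y)$ with replacement.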

The MF procedure
Denote by $M_j^i$ a $j$-dimensional candidate model from the $i$th bootstrap sample.
Remark 1. In the 4th step, the full model is excluded because it would otherwise break the maximum frequency rule by always having the highest frequency, $B$. If there is a tie at the maximum of $\mathrm{MF}_j$, $1 \le j \le p-1$, we select the one at the highest dimension. This strategy guarantees asymptotic variable selection consistency of the MF procedure, as discussed in Section 4.
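The tie-breaking rule of Remark 1 can be made concrete with a small sketch (the function name `select_dimension` is ours, for illustration only):

```python
def select_dimension(mf):
    """Pick the dimension (1-based) with the largest MF value; ties are
    broken in favour of the HIGHEST dimension, as in Remark 1 -- this
    tie-break is what underlies the consistency result of Section 4."""
    best = max(mf)
    # scan from the top dimension down and return the first maximiser
    for j in range(len(mf) - 1, -1, -1):
        if mf[j] == best:
            return j + 1
```

For example, with MF values `[0.9, 0.5, 0.9, 0.2]` over dimensions 1 to 4, the tie at 0.9 is resolved in favour of dimension 3.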
The MF procedure for the adaptive Elastic-Net is parallel, except that in the 2nd step the LARS-EN algorithm [7] is used to fit each bootstrap sample.
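The bootstrap-and-count structure of the MF procedure can be sketched as follows. This is a simplified stand-in, not Algorithm 1 itself: greedy forward selection by absolute correlation substitutes for the exact LARS / LARS-EN paths used in the paper, and all names are illustrative.

```python
import numpy as np
from collections import Counter

def forward_path(X, y, max_steps):
    """Stand-in for the LARS solution path: greedy forward selection by
    absolute correlation with the current residual. Returns the nested
    models (frozensets of column indices) visited at dimensions
    1..max_steps."""
    active, models = [], []
    resid = y.copy()
    for _ in range(max_steps):
        corr = np.abs(X.T @ resid)
        corr[active] = -np.inf               # never re-pick active columns
        active.append(int(np.argmax(corr)))
        Xa = X[:, active]
        coef = np.linalg.lstsq(Xa, y, rcond=None)[0]
        resid = y - Xa @ coef
        models.append(frozenset(active))
    return models

def mf_by_dimension(X, y, B=50, rng=0):
    """Bootstrap-pairs estimate of the maximum-frequency (MF) model at each
    dimension j = 1..p-1; the full model is excluded (cf. Remark 1)."""
    rng = np.random.default_rng(rng)
    n, p = X.shape
    counts = [Counter() for _ in range(p - 1)]
    for _ in range(B):
        idx = rng.integers(0, n, size=n)     # paired bootstrap sample
        for j, m in enumerate(forward_path(X[idx], y[idx], p - 1)):
            counts[j][m] += 1
    # (most frequent model, its frequency) at each dimension
    return [c.most_common(1)[0] for c in counts]
```

With a strong sparse signal, the MF at the true dimension concentrates on the true support, mirroring Figure 1 (middle).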
We discussed in the introduction the consequences of conditioning on dimension in the MF procedure. Here we use a simple orthogonal design with i.i.d. normal random errors to study the underlying properties driving that performance. In this case $X^TX = I$, and the adaptive Elastic-Net reduces to the adaptive LASSO [8]. Denote by $X_j$ the $j$th column of $X$. The adaptive LASSO estimator is then
$$ \hat\beta_{aj} = \mathrm{sgn}(X_j^Ty)\left(|X_j^Ty| - \frac{\lambda_n}{|\hat\beta_j|^{\gamma}}\right)_+, \qquad (3.1) $$
where $z_+$ equals $z$ if $z > 0$ and 0 otherwise. We can expand $X_j^Ty$ as
$$ X_j^Ty = \beta_{0j} + X_j^T\varepsilon, $$
where $X_j^T\varepsilon \sim N(0, \sigma^2)$. The following lemma gives an order relationship among the $X_j^Ty$'s.
Lemma 1. Suppose $X^TX = I$. Then, whenever $|\beta_{0i}| > |\beta_{0j}|$,
$$ P\left(|X_i^Ty| > |X_j^Ty|\right) > \tfrac{1}{2}. $$
Combined with the fact that $\lambda_n/|\hat\beta_i|^{\gamma} > \lambda_n/|\hat\beta_j|^{\gamma}$ asymptotically when $|\beta_{0i}| < |\beta_{0j}|$, it is easy to deduce from (3.1) that, given a $\lambda_n$, the adaptive LASSO tends to select, with the highest probability, the variables corresponding to the $k_{\lambda_n}$ largest $|\beta_{0j}|$'s.
Without loss of generality, suppose the entries of $|\beta_0|$ are ordered decreasingly. Denote by $S_r$ the $r$-dimensional model containing the first $r$ elements of $|\beta_0|$, and by $W_r$ any other $r$-dimensional model. Let $\hat A_r$ be an adaptive LASSO model estimate given that the model size is $r$, so that $P(\hat A_r = S_r \mid r)$ is the conditional probability of $\hat A_r = S_r$ given the model size. The preceding deductions from (3.1) can then be formalized as
$$ P(\hat A_r = S_r \mid r) > P(\hat A_r = W_r \mid r), \quad 1 \le r \le p_0, \qquad (3.2) $$
$$ P(\hat A_r = W_r^1 \mid r) = P(\hat A_r = W_r^2 \mid r), \quad p_0 < r < p_n, \qquad (3.3) $$
where $W_r^1$ and $W_r^2$ are two $r$-dimensional models such that $S_{p_0} \subset W_r^1, W_r^2$. The above properties of the adaptive LASSO coincide to some extent with the results of Theorem 2 in [23]. By (3.3), zero predictors are equally likely to be selected at an overfit dimension. As a result, $P(\hat A_r = M_r \mid r)$ (see Algorithm 1 for the definition of $M_r$), $p_0 < r < p_n$, drops dramatically, which is why we see a large gap between the true dimension and the overfit dimensions in Figure 1 (middle). On the other hand, $P(\hat A_r = S_r \mid r)$ at some underfit dimensions can be as competitive as $P(\hat A_{p_0} = S_{p_0} \mid p_0)$. We propose the WMF procedure to tackle this underfitting issue in Section 5.
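As an illustration of the ordering in Lemma 1 (a Monte-Carlo check, not part of the formal argument), recall that in the orthogonal design $Z_j = X_j^Ty \sim N(\beta_{0j}, \sigma^2)$, so a larger $|\beta_{0j}|$ makes $|Z_j|$ stochastically larger:

```python
import numpy as np

# Simulate Z_i ~ N(beta_i, sigma^2) and Z_j ~ N(beta_j, sigma^2) with
# |beta_i| > |beta_j| and estimate P(|Z_i| > |Z_j|); Lemma 1 asserts this
# probability exceeds 1/2, so strong coefficients are picked first.
rng = np.random.default_rng(0)
sigma, beta_i, beta_j = 1.0, 2.0, 0.5
Zi = rng.normal(beta_i, sigma, 100_000)
Zj = rng.normal(beta_j, sigma, 100_000)
p_hat = float(np.mean(np.abs(Zi) > np.abs(Zj)))
```

With these (arbitrary) values the estimated probability is well above one half.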
In the next section, we establish asymptotic variable selection properties of $\hat\beta_a^*$ and $\hat\beta_{ae}^*$ in general settings, from which variable selection consistency of the MF procedure can be deduced.

Asymptotic properties of the MF procedure
Let $A = \{j : \beta_{0j} \neq 0\}$ be the true model. We assume the following regularity conditions for the subsequent theoretical studies: (A1) Denote by $\zeta_{\min}(C)$ and $\zeta_{\max}(C)$ the minimum and maximum eigenvalues of a positive definite matrix $C$. (A4) In the adaptive Elastic-Net, (A5) the errors $\{\varepsilon_i, i = 1, \ldots, n\}$ are i.i.d. with mean 0 and variance $\sigma^2 < \infty$.
Denote by $A_n^* = \{j : \hat\beta_{aj}^* \neq 0\}$ an adaptive LASSO estimate of $A$ using paired bootstrap data. Let $P^* = P(\cdot \mid \mathcal{E})$ and $E^* = E(\cdot \mid \mathcal{E})$, where $\mathcal{E} = \sigma((x_i, y_i), i = 1, \ldots, n)$. Then $P^*(A_n^* = A \mid \lambda_n)$ denotes the conditional probability of $A_n^* = A$ given $\mathcal{E}$ and $\lambda_n$.

Theorem 1. Suppose conditions (A1)-(A3) and (A5) hold. Then
$$ \lim_{n\to\infty} P^*(A_n^* = A \mid \lambda_n) = 1. $$
Moreover, let $\tilde\lambda_n$ be another tuning parameter such that the adaptive LASSO estimator under $\tilde\lambda_n$ is of dimension $r$, $p_0 < r < p_n$; then the estimate is asymptotically equally likely to be any $r$-dimensional model $M_r$ containing $A$, where $M_r$ is any $r$-dimensional model.

The proof of Theorem 1 is included in Appendix A.
For the adaptive LASSO, specifying $\lambda_n$ determines the dimension, but the converse is not true: one dimension can map to numerous models, and consequently to numerous tuning parameters. Fortunately, the LARS algorithm enables us to map a dimension to an optimal $\lambda_n$. Recall the adaptive LASSO solution path from the LARS in the top panel of Figure 1. The transition points (i.e., steps) from 0 to 10 correspond to a decreasing sequence of $\lambda_n$'s:
$$ \lambda_n(0) > \lambda_n(1) > \cdots > \lambda_n(10). $$
Note that $\hat\beta_a(\lambda_n) = 0$ for $\lambda_n > \lambda_n(0)$, where $\hat\beta_a(\lambda_n)$ is the adaptive LASSO estimator under $\lambda_n$. By Theorem 5 in [19],
$$ \lambda_n(m+1) = \arg\min_{\lambda_n}\; \|y - X\hat\beta_a(\lambda_n)\|^2 + a_n \widehat{df}(\lambda_n), \quad \lambda_n(m+1) \le \lambda_n < \lambda_n(m), $$
where $\widehat{df}(\lambda_n)$ is the number of non-zero elements in $\hat\beta_a(\lambda_n)$ and $a_n$ is a positive sequence depending on $n$. It is worth mentioning that $\lambda_n(m+1)$ is optimal on $[\lambda_n(m+1), \lambda_n(m))$, producing the minimum sum of squared errors (SSE) and the smallest model size concurrently.
Also note that the number of steps can exceed the full model size: different steps may have the same model size. Denote by $m_k$ the last step having model size $k$, and by $m_k'$ another step with the same model size. The theorem also showed that the SSE at $\lambda_n(m_k)$ is no larger than that at $\lambda_n(m_k')$. Therefore, $\lambda_n(m_k)$ is the overall optimum in $\{\lambda_n : \widehat{df}(\lambda_n) = k, \lambda_n \in [0, \infty]\}$. The LARS algorithm thus enables us to create a one-to-one map between a dimension $k$ and the optimum $\lambda_n(m_k)$: $k \Longleftrightarrow \lambda_n(m_k)$.
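The "last step of each size" mapping can be sketched as follows (an illustration; `optimal_steps` is our name, and the input is simply the model size recorded at each step of the path):

```python
# Map each dimension k to its optimal step m_k: the LAST step on the path
# whose model size is k. Per Theorem 5 of [19], that step attains the
# smallest SSE among all steps of equal size, giving the one-to-one map
# k <=> lambda_n(m_k).
def optimal_steps(sizes):
    """sizes[m] = model size after step m (0-based). Returns {k: m_k}."""
    last = {}
    for m, k in enumerate(sizes):
        last[k] = m          # later steps with the same size overwrite earlier
    return last
```

For instance, a path visiting sizes `[1, 2, 3, 2, 3, 4]` maps dimension 2 to step 3 and dimension 3 to step 4, the last occurrences of those sizes.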
It is easy to see that $\lambda_n(m_{p_0})$ satisfies condition (A3). Hence, we have the following corollary of Theorem 1.

Corollary 1. Suppose the conditions of Theorem 1 hold. Then
$$ \lim_{n\to\infty} P^*(A_n^* = A \mid p_0) = 1, $$
and at an overfit dimension $r$, $p_0 < r < p_n$, the estimate is asymptotically equally likely to be any $r$-dimensional model $M_r$ containing $A$, where $M_r$ is any $r$-dimensional model.
This result can also be established for the adaptive Elastic-Net. Denote by $T_n^* = \{j : \hat\beta_{aej}^* \neq 0\}$ an adaptive Elastic-Net estimate of $A$ using paired bootstrap data.

Corollary 2. The conclusions of Corollary 1 hold with $A_n^*$ replaced by $T_n^*$, where $M_r$ is any $r$-dimensional model.
Proof. It can be proved using the techniques for deriving Theorem 1, Corollary 1 and Theorem 2. We omit the details here.
We now study the estimation properties when using residual bootstrap data. Denote by $\tilde T_n^* = \{j : \tilde\beta_{aej}^* \neq 0\}$ an adaptive Elastic-Net estimate of $A$ using residual bootstrap data.
Moreover, let $\tilde\lambda_{n1}$ be another tuning parameter such that the adaptive Elastic-Net estimator under $\tilde\lambda_{n1}$ is of dimension $r$, $p_0 < r < p_n$; then the second conclusion of Theorem 1 holds analogously, where $M_r$ is any $r$-dimensional model.
The proof of Theorem 2 is included in Appendix A. The LARS-EN algorithm for adaptive Elastic-Net estimation is an extension of the LARS algorithm and shares the properties of the LARS used to derive Corollaries 1-2. Hence we obtain the following corollary of Theorem 2.
Corollary 3. The conclusions of Corollary 1 hold for the adaptive Elastic-Net estimate of $A$ using residual bootstrap data, where $M_r$ is any $r$-dimensional model.
This result can also be established for the adaptive LASSO. Denote by $A_n^* = \{j : \hat\beta_{aj}^* \neq 0\}$ an adaptive LASSO estimate of $A$ using residual bootstrap data.
Corollary 4. The conclusions of Corollary 1 hold for the residual bootstrap estimate, where $M_r$ is any $r$-dimensional model.
Proof. Note that the adaptive LASSO estimator is a special case of the adaptive Elastic-Net estimator with $\lambda_{n2} = 0$. Theorem 2 thus holds automatically for $A_n^*$, from which Corollary 4 can be deduced.
Variable selection consistency of the MF procedure can then be deduced from Corollaries 1-4: the MF procedure selects the true model with probability tending to one, where $M_{r^*}$ denotes the model selected by the MF procedure.
Proof. By definition, $A_n^*$ is an adaptive LASSO estimate of $A$ using paired or residual bootstrap data. Combining this with Corollary 1 or 4, the MF procedure for the adaptive LASSO can consistently identify the true dimension and the true model by selecting the maximum of $\mathrm{MF}_j$, $j \in \{1, \ldots, p-1\}$, at the highest dimension in case of a tie. Similarly, Corollaries 2 and 3 imply variable selection consistency of the MF procedure for the adaptive Elastic-Net.
However, the MF procedure has potential issues in application. In Figure 1 (middle), excluding the full model, the maximum occurs at dimension 1 instead of 3, although both MFs are close to 1. In the next section, we propose the WMF procedure to tackle this underfitting issue.

The WMF procedure

Method and asymptotic properties
The underfitting issue of the MF procedure can be deduced from Corollaries 1-4. Take $A_n^*$ as an example. Although it was shown that $\lim_{n\to\infty} P^*(A_n^* = A \mid p_0) = 1$, the conditional probability at some underfit dimensions can also reach one, e.g., $\lim_{n\to\infty} P^*(A_n^* = M_r \mid r) = 1$ for some $M_r$ with $0 < r < p_0$. Note that the tuning parameter leading to an underfit $r$-dimensional estimator, denoted $\check\lambda_n$, satisfies $\check\lambda_n > \lambda_n$. Hence, the convergence rate of $P^*(A_n^* = M_r \mid r)$ at some underfit dimensions can exceed that at the true dimension. The MF procedure could therefore select an underfit model even with a sufficiently large $n$.
To fix this, we introduce a weight into the MF procedure. An effective weight should down-weight the underfitting MFs asymptotically, i.e., it should identify underfit dimensions through an effect that does not vanish as $n \to \infty$, without significantly up-weighting the overfitting MFs. [24] showed that the overall unconditional (on $y$) expected squared prediction error of the OLS estimator of $\beta_0$ under a model $\alpha$ is
$$ T_{\alpha,n} = \sigma^2 + \frac{p_\alpha\sigma^2}{n} + \Delta_{\alpha,n}, \qquad \Delta_{\alpha,n} = \frac{1}{n}\,\beta_0^TX^T(I - P_\alpha)X\beta_0, \qquad (5.1) $$
where $p_\alpha$ is the size of model $\alpha$, $P_\alpha = X_\alpha(X_\alpha^TX_\alpha)^{-1}X_\alpha^T$, $X_\alpha$ is a sub-matrix of $X$ whose columns are indexed by the components of $\alpha$, and $I$ is an identity matrix.
When $\alpha$ is a true or overfit model, $X\beta_0 = X_\alpha\beta_\alpha$ and thus
$$ \Delta_{\alpha,n} = 0. \qquad (5.2) $$
However, if $\alpha$ is an underfit model, then $\Delta_{\alpha,n} > 0$ for any fixed $n$. He further assumed that
$$ \liminf_{n\to\infty} \Delta_{\alpha,n} > 0, \qquad (5.3) $$
which is argued in the paper to be a minimal type of asymptotic model identifiability condition. Under assumption (5.3) and by (5.1)-(5.2),
$$ T_{\nu,n} > T_{\kappa,n} \quad \text{asymptotically}, \qquad (5.4) $$
where $\nu$ is an underfit model and $\kappa$ is a true or overfit model. By (5.4), a formula inversely proportional to $T_{\alpha,n}$ is an ideal choice for the weight. [25] proposed such a formula for estimating the posterior probability of the model size given the data,
$$ \hat P(j \mid y) \propto \exp\left\{-\frac{n\hat T_n(j)}{2c\hat\sigma^2}\right\}, \qquad (5.5) $$
where $\hat T_n(j)$ is an estimate of $T_{\alpha,n}$ using a $j$-dimensional model and $c$, $1 \le c \le 2$, is a constant. We use the multi-fold CV for $\hat T_n(j)$ and define
$$ \mathrm{WMF}_j = \hat P(j \mid y)\,\mathrm{MF}_j. \qquad (5.6) $$
Recall that $\mathrm{MF}_j/B$ is a bootstrap estimate of the posterior probability of model $M_j$ given the data and the dimension, i.e., $P(M_j \mid y, j)$; along with (5.6), $\mathrm{WMF}_j/B$ then estimates $P(M_j \mid y)$ up to a normalizing constant. Note that BIC is a Laplace approximation to $P(M_j \mid y)$ under a flat prior assumption and is variable selection consistent for the adaptive LASSO [11,12], but no convergence rate has been studied. Simulation studies in Section 6 show that BIC has a much slower empirical convergence rate than the WMF procedure.
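The weighting step can be sketched as follows, assuming an exponential prediction-error weight of the form $\exp\{-n\hat T_n(j)/(2c\hat\sigma^2)\}$ (our reading of the garbled formula from [25]; the function name `wmf` is illustrative):

```python
import numpy as np

def wmf(mf, T_hat, n, sigma2_hat, c=1.0):
    """Prediction-weighted MF scores: down-weights underfit dimensions,
    whose CV prediction error T_hat carries the non-vanishing bias term
    Delta, while barely touching true/overfit dimensions.

    mf[j-1]    : bootstrap MF count at dimension j
    T_hat[j-1] : multi-fold CV estimate of prediction error at dimension j
    """
    mf = np.asarray(mf, dtype=float)
    T_hat = np.asarray(T_hat, dtype=float)
    logw = -n * T_hat / (2.0 * c * sigma2_hat)
    w = np.exp(logw - logw.max())       # stabilised exponential weights
    return mf * w / w.sum()
```

In a toy call below, dimension 1 has a competitive MF but a visibly larger prediction error; the weight pushes the maximum to the true dimension.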
Next we show properties of the multi-fold CV using adaptive LASSO or adaptive Elastic-Net estimators; variable selection consistency of the WMF procedure then follows. Let $K$ be a fixed integer and suppose $n = Kt$. In multi-fold CV, one randomly divides the sample of $n$ observations into $K$ mutually exclusive subgroups $s_1, \ldots, s_K$, each containing $t$ observations, and selects the model by minimizing the sum of squared prediction errors
$$ \mathrm{CV}(M) = \sum_{i=1}^{K} \left\| y_{s_i} - X_{s_i}\hat\beta_{s_i^c, M} \right\|^2, $$
where $\hat\beta_{s_i^c, M}$ is an adaptive LASSO or adaptive Elastic-Net estimator under model $M$ using the samples not in $s_i$. Let $\alpha$ and $\alpha'$ be true or overfit models and $\nu$ be an underfit model. We assume the following condition for the asymptotic study of the multi-fold CV procedure.
Theorem 3. Suppose conditions (A1)-(A2) and (A5)-(A6) hold. Then the multi-fold CV for the adaptive LASSO or adaptive Elastic-Net asymptotically separates underfit models from true or overfit models. The proof of Theorem 3 is included in Appendix A. Denote by $r$ an underfit dimension. The ratio $\mathrm{WMF}_{p_0}/\mathrm{WMF}_r$ is exponentially proportional to the bias term $\frac{d\,\|\beta_{0M_r^c}\|^2}{2c\sigma^2}$, which is larger than 0 and does not fade as $n \to \infty$. This guarantees good finite sample performance of the WMF procedure and a fast vanishing rate of its underfitting issue, as confirmed in the simulation studies of Section 6.
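The multi-fold CV estimate $\hat T_n(j)$ can be sketched as follows; to stay self-contained, an OLS fit on a given support stands in for the adaptive LASSO / adaptive Elastic-Net fit at that dimension, and the function name is illustrative.

```python
import numpy as np

def cv_prediction_error(X, y, support, K=5, rng=0):
    """K-fold CV estimate of the prediction error for the least-squares fit
    restricted to `support` (a stand-in for the penalized fit at a given
    dimension). Returns the CV sum of squared errors divided by n."""
    rng = np.random.default_rng(rng)
    n = len(y)
    idx = rng.permutation(n)
    folds = np.array_split(idx, K)
    sse = 0.0
    for s in folds:
        train = np.setdiff1d(idx, s)
        Xa = X[np.ix_(train, support)]
        coef = np.linalg.lstsq(Xa, y[train], rcond=None)[0]
        sse += np.sum((y[s] - X[np.ix_(s, support)] @ coef) ** 2)
    return sse / n
```

On data with true support {0, 1}, the CV error of the underfit support {0} carries the bias term $\Delta$ and is visibly larger than the error at the true support, which is what the weight in (5.5)-(5.6) exploits.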

Computation
In the adaptive Elastic-Net, $\lambda_{n2}$ takes the same value as in the Elastic-Net used for calculating the weights $\hat\omega_j$, where the tuning parameters are chosen by minimizing the two-dimensional BIC [7]. The computational effort is then the same for the adaptive LASSO and the adaptive Elastic-Net: computing a full solution path against the $\lambda_n$'s or $\lambda_{n1}^+$'s. The computational complexity of an entire adaptive LASSO solution path is of order $O(np_n^2)$ [6]; it is $O(np_n^2 + p_n^3)$ for the adaptive Elastic-Net [7]. Since the optimal value often occurs at an early stage, we can stop the algorithms after $m$, $m < p_n$, steps. In this case, the computational cost reduces to $O(nm^2)$ for the adaptive LASSO and $O(m^3 + nm^2)$ for the adaptive Elastic-Net.
Computational cost of a WMF procedure is then B times the cost of computing an adaptive LASSO or adaptive Elastic-Net solution path.

Empirical studies
We now investigate the empirical performance of the WMF procedure and show that it outperforms the BIC, EBIC, GIC, SS, Cp, and 1se-CV (which is often recommended for variable selection) in a wide range of situations, for both the adaptive LASSO and the adaptive Elastic-Net. The Cp performed very poorly in all scenarios and is therefore excluded from the presentation. In all simulations, data were generated from model (1.1) for some constant $\kappa$, $0 \le \kappa < 1$, with $n = 100, 300, 500$. Results were averaged over 100 replications.

Simulations of the adaptive LASSO WMF procedure
Three scenarios were designed for the adaptive LASSO WMF procedure. In each scenario, $\Sigma(i,j) = 0.3^{|i-j|}$ and $\sigma = 3$. Scenario 3: high dimension, a sparse proportion of true covariates, and relatively large signals for all true covariates. In detail, set $p_n = O(n^{3/4})$, so that $p_n$ equals 32, 72, and 106, respectively. Let $p_0$ grow in the same manner as in scenario 2, but with the new elements equal to 2. Accordingly, the proportions of true covariates are 0.09, 0.11 and 0.12, and the SNRs are 2, 5.07, and 8.5.
Paired bootstrapping was used in the adaptive LASSO WMF procedure. Simulation results are summarized in Figures 2-4. In all scenarios, the proposed method has the highest accuracy in identifying the true model and also enjoys a much faster convergence rate than the compared methods. The WMF procedure has an underfitting issue, which vanishes quickly as $n$ increases. The other methods (except for the SS), however, suffer from an overfitting issue; the sparser the model, the more serious the issue tends to be. Performance of the SS relies on particular specifications of several unknown parameters. Although we followed the instructions in [17] for setting those parameters throughout the simulations, its performance remains erratic and unsatisfactory.
Simulations using residual bootstrapping in the adaptive LASSO WMF procedure were also conducted. The results, presented in Appendix B, are similar to those of the paired bootstrapping simulations above.

Simulations of the adaptive Elastic-Net WMF procedure
We also designed three scenarios for the adaptive Elastic-Net WMF procedure, each mimicking a typical structure in applications. Since the adaptive Elastic-Net fits data with grouping effects, in the following simulations true covariates are added in blocks of size 3. The SS is excluded due to its poor performance. Residual bootstrapping was used in the adaptive Elastic-Net WMF procedure. Simulation results are summarized in Figures 5-7. In scenarios 4 and 5, the proposed method performs best among the compared methods: on average it has the highest accuracy in identifying the true model, a faster convergence rate, and an underfitting issue that vanishes quickly. The other methods suffer from an overfitting issue; the sparser the model, the more serious the issue tends to be. In scenario 6, all methods do equally well because the adaptive Elastic-Net fits the highly grouped data well.
Simulation results using paired bootstrapping in the adaptive Elastic-Net WMF procedure are presented in Appendix B; they are similar to those of the residual bootstrapping simulations above.

Classification analysis of the leukaemia data
We now demonstrate the WMF procedure in a real data application. The leukaemia data [26] contain $p_n = 7129$ genes and $n = 72$ samples. Of the 72 samples, 38 form the training dataset, with 27 ALLs (acute lymphoblastic leukaemia) and 11 AMLs (acute myeloid leukaemia); the remaining 34 samples form the test dataset, with 20 ALLs and 14 AMLs. The goal of this analysis is to identify a subset of genes that can accurately predict the type of leukaemia for future data. Similar to [7], we coded the type of leukaemia as a binary response variable, denoted $y$, and defined the classification function as $I(\hat y > 0.5)$, where $I(\cdot)$ is the indicator function.
To improve computational efficiency, we selected 1000 candidate genes as the predictors using the sure independence screening (SIS) procedure [27]. The adaptive LASSO and adaptive Elastic-Net were then applied to explore the data. Screening and variable selection were carried out on the training dataset, while classification errors were examined on the test dataset. Both the LARS and LARS-EN algorithms were stopped after 200 steps to further reduce the computational cost. Since the optimal steps selected by the various methods are much smaller than the stopping step, this strategy does not affect the variable selection.
Classification results are summarized in Tables 1-2. For the adaptive LASSO, although the Cp, CV and BIC attain the minimal classification errors on both the training and test datasets, the WMF attains classification errors close to the minimum using the fewest genes. For the adaptive Elastic-Net, the WMF attains the minimal classification errors on both datasets using the fewest genes. We therefore conclude that the WMF procedure is able to find a set of "important" genes that can largely improve prediction accuracy.
[6] extended the adaptive LASSO to GLMs. Its estimator, $\hat\beta_a$, is obtained by maximizing the penalized log-likelihood
$$ \hat\beta_a = \arg\max_{\beta}\; \ell_n(\beta) - \lambda_n \sum_{j=1}^{p_n} \hat w_j |\beta_j|, \qquad (7.1) $$
where $\ell_n(\beta)$ is the log-likelihood, $\hat w_j = 1/|\hat\beta_j|^{\gamma}$, $\gamma > 0$, and $\hat\beta = (\hat\beta_1, \ldots, \hat\beta_p)^T$ is the maximum likelihood estimator. Under certain regularity conditions, $\hat\beta_a$ was shown to enjoy the oracle properties. The generalization of multi-fold CV to GLMs is straightforward [24]. Define
$$ \mathrm{CV}(\alpha) = \sum_{i=1}^{K} Q\!\left(y_{s_i}, \hat y_{s_i^c,\alpha}\right), $$
where $Q(\cdot, \cdot)$ is a loss function and $\hat y_{s_i^c,\alpha}$ is the prediction of $y_{s_i}$ under model $\alpha$ using the samples not in $s_i$.
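For logistic models, a natural choice of the loss $Q(\cdot,\cdot)$ is the binomial deviance; the source does not pin down $Q$, so the following is a sketch of one standard option.

```python
import numpy as np

def binomial_deviance(y, p_hat, eps=1e-9):
    """Binomial deviance Q(y, y_hat) for 0/1 responses y and predicted
    probabilities p_hat: -2 * log-likelihood, clipped away from 0 and 1
    for numerical stability. A standard CV loss for logistic GLMs."""
    p_hat = np.clip(p_hat, eps, 1 - eps)
    return -2.0 * np.sum(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))
```

A constant prediction of 0.5 on a balanced 0/1 vector of length $n$ gives deviance $2n\log 2$, while perfect predictions give deviance near 0.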
We can then extend the WMF procedure to GLMs for the adaptive LASSO. In this case, we draw $B$ paired bootstrap samples in step 1 of Algorithm 1. The LARS algorithm does not apply to GLMs, but the coordinate descent algorithm [29] can be used instead, which generates a solution path similar to that of the LARS. Hence, in step 2, we use the coordinate descent algorithm to fit each bootstrap sample. The rest remains the same. Asymptotic properties of the adaptive LASSO WMF procedure for GLMs can also be established using techniques similar to those for Theorem 1 of this paper and Theorem 4 of [6].
We demonstrate this extension through one simple example, where binary responses were generated from a logistic regression model. The results, summarized in Figure 8, show that the WMF procedure is much more accurate in variable selection and also enjoys a faster convergence rate than the other compared methods.
The extension of the adaptive Elastic-Net WMF procedure to GLMs is similar. Define the adaptive Elastic-Net estimator for GLMs as
$$ \hat\beta_{ae} = \arg\max_{\beta}\; \ell_n(\beta) - \lambda_{n2}\|\beta\|^2 - \lambda_{n1}^{+}\sum_{j=1}^{p_n} w_j|\beta_j|, $$
where $\ell_n(\beta)$ denotes the log-likelihood, $w_j = |\hat\beta_{ej}|^{-\gamma}$, $\gamma > 0$, and $\hat\beta_e = (\hat\beta_{e1}, \ldots, \hat\beta_{ep_n})^T$ is defined in (7.1) with $\hat w_j = 1$ for all $j$. The rest follows the same procedure as the extension of the adaptive LASSO WMF procedure.

Ultra-high dimensional data
In this section, we discuss the application of the WMF procedure to ultra-high dimensional data, in which $p_n > n$. [27] proposed the sure independence screening (SIS) method for ultra-high dimensional data, reducing dimensionality to a moderate scale $d_n$ with $d_n < n$; a lower dimensional estimation method such as the SCAD can then be applied to the reduced data. This process is called SIS+SCAD. Under some regularity conditions, they showed that the SIS has an exponentially small probability of omitting true features, and that SIS+SCAD retains the oracle properties if $d_n = o_p(n^{1/3})$. Replacing the SCAD with the adaptive Elastic-Net gives the procedure referred to as SIS+AEnet [8], which retains the oracle properties if $d_n = O_p(n^{\epsilon})$, $0 \le \epsilon < 1$. Here we recommend combining SIS with the WMF procedure when $p_n > n$: first use the SIS to reduce the dimensionality to $d_n$, $d_n < n$, then apply the WMF procedure to the reduced data. We call this procedure SIS+WMF.
Corollary 6. Suppose the conditions for Theorem 1 in [27] and Theorem 3 in this paper hold. Let $d_n = n^{\epsilon}$, $0 \le \epsilon < 1$. Then the SIS+WMF procedure is variable selection consistent.
Note that Corollary 6 is a direct conclusion of Theorem 1 in [27] and Theorem 3 in this paper.
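The screening step of SIS+WMF can be sketched as follows (an illustration of marginal-correlation screening in the spirit of [27]; the function name is ours):

```python
import numpy as np

def sis_screen(X, y, d):
    """Sure independence screening: keep the d predictors with the largest
    marginal absolute correlation with y. The surviving columns are then
    handed to the WMF procedure."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    corr = np.abs(Xc.T @ yc) / (
        np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc) + 1e-12
    )
    return np.argsort(corr)[::-1][:d]    # indices of the d kept predictors
```

With $p_n \gg n$ and a strong sparse signal, the true features survive the screen with high probability, after which the WMF procedure runs on an $n \times d_n$ problem.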

Discussion
We proposed a prediction-weighted maximum frequency procedure to estimate the amount of regularization for the adaptive LASSO and adaptive Elastic-Net. Asymptotic properties were studied with a diverging $p_n$.
The central idea of the WMF procedure is conditioning on dimension, which mitigates overfitting. Underfitting is then handled by prediction-based weights estimated via multi-fold cross-validation. This simple recipe can also be applied to other regularization methods, such as the SCAD and the fused LASSO, making the WMF procedure a unified model selection criterion for regularization problems. Their asymptotic properties, however, have yet to be studied; this will be a topic of future work.

A Proofs
Proof of Lemma 1. Assume $|\beta_i| > |\beta_j|$ and $|\beta_i| - |\beta_j| = m\sigma$, $m > 0$. There are 4 cases for the signs of $\beta_i, \beta_j$. Consider case 1: $\beta_j \ge 0$ and $\beta_i = \beta_j + m\sigma$, $m > 0$. Let $k$ be a positive constant. The point $\beta_j + k\sigma$ separates the domains of $Z_i$ and $Z_j$ into two parts: $(-\infty, \beta_j + k\sigma]$ and $(\beta_j + k\sigma, \infty)$. The cumulative probabilities of $Z_i$ and $Z_j$ on the first part of the domain are computed respectively, and the probability $P(|Z_i| > |Z_j|)$ can then be calculated from them. After some simple deductions, for $m > 0$, i.e., $|Z_i|$ centered at the larger $|\beta_i|$, and since $m, k, \beta_j, \sigma > 0$, note that the two integrals in (A.2) have integration intervals of equal length. Moreover, the integrand is a monotonically decreasing function of $x$ for $x \ge 0$ and monotonically increasing for $x < 0$. Hence the claimed inequality follows. The other three cases can be proved in the same way; we omit the repetition here.
Proof of Theorem 1. By [8], $\hat\beta_a$ enjoys the oracle properties under certain regularity conditions, and $\hat\beta_a^*$ is a paired bootstrap analog of $\hat\beta_a$ obtained by replacing $(X, y)$ with $(X^*, y^*)$ in the estimation. To simplify notation, we drop the subscript 'a' in $\hat\beta_a$ and $\hat\beta_a^*$. By the KKT conditions, $\hat\beta^*$ is the unique solution of the adaptive LASSO given (A.4), where $X_j^*$ is the $j$th column of $X^*$. Let $s_A = (\omega_j\,\mathrm{sgn}(\beta_j), j \in A)^T$ and $\hat\beta_A^* = (X_A^{*T}X_A^*)^{-1}(X_A^{*T}y^* - \lambda_n s_A)$. We show that $(\hat\beta_A^*, 0)$ satisfies (A.4) with probability tending to 1, which is equivalent to proving (A.5), where the first inequality implies $\mathrm{sgn}(\hat\beta_A^*) = \mathrm{sgn}(\beta_A)$. Note that $\omega_j = |\hat\beta_j|^{-\gamma}$, where $\hat\beta = (\hat\beta_1, \ldots, \hat\beta_{p_n})^T$ is an OLS or best ridge estimate of $\beta_0$; by Theorem 3.1 in [8], this holds under the assumption that $\lim_{n\to\infty} \lambda_{n2}/\sqrt{n} = 0$, which is satisfied automatically for the OLS estimate. Denote by $x_{iA}^*$ the $i$th row of $X_A^*$ and by $\otimes$ the element-wise product. Hence, under conditions (A1) and (A5), let $\psi = \min_{j\in A}|\beta_{0j}|$, $\hat\psi = \min_{j\in A}|\hat\beta_j|$ and $\psi^* = \min_{j\in A}|\hat\beta_j^*|$. Under conditions (A1)-(A3) and (A5), the first inequality in (A.5) can be proved from (A.6); similarly, by Theorem 3.1 in [8], for the proof of the second inequality in (A.5) it suffices to show that the bound holds with probability tending to one, where $\kappa$, $0 < \kappa < 1$, is a constant.
By definition, we have $E^*[X^T\varepsilon^*] = X^TE^*(\varepsilon^*) = 0$. Moreover, by the sum of squares inequality, $s_n \to \sigma_n$ asymptotically, and $\lim_{n\to\infty}\sigma_n^2 = \sigma^2$ with probability 1. Then, again by the sum of squares inequality, $\lim_{n\to\infty} s_n^2 = \sigma^2$ with probability 1.
Proof of Theorem 2. Let $(X, y^*)$ be a residual bootstrap sample, where $y^* = X\tilde\beta + \varepsilon^*$ and $\tilde\beta$ is the ridge estimator. Define the analogous quantities, where we drop the subscript 'ae' in $\hat\beta_{ae}^*$ for simplicity.