Randomized maximum-contrast selection: subagging for large-scale regression

We introduce a very general method for sparse and large-scale variable selection. The large-scale regression setting is one in which both the number of parameters and the number of samples are extremely large. The proposed method is based on a careful combination of penalized estimators, each applied to a random projection of the sample space into a low-dimensional space. In one special case that we study in detail, the random projections are divided into non-overlapping blocks, each consisting of only a small portion of the original data. Within each block, we select the projection yielding the smallest out-of-sample error. Our random ensemble estimator then aggregates the results according to a new maximal-contrast voting scheme to determine the final selected set. Our theoretical results illuminate the effect on performance of increasing the number of non-overlapping blocks. Moreover, we demonstrate that statistical optimality is retained alongside the computational speedup. The proposed method achieves minimax rates for approximate recovery over all estimators using the full set of samples. Furthermore, our theoretical results allow the number of subsamples to grow with the subsample size and do not require the irrepresentable condition. The estimator is also compared empirically with several other popular high-dimensional estimators in an extensive simulation study, which reveals its excellent finite-sample performance.


1. Introduction
Consider the simple linear model
(1) Y = Xβ* + ε, ε ∼ N(0, σ²I_n),
where the noise level σ is known and X ∈ R^{n×p}, with both n and p very large, possibly of the order of millions and thousands, respectively. The sample size n and feature size p are such that n ≫ p or n ≪ p, with the assumption that n is always extremely large. Important examples of this type include data collected in large-scale energy systems or online advertising, where n can easily be of the order of a hundred million or more. In such extremely large-scale settings, the sheer size of the data often prevents the implementation of even simple methods. Capturing the sparsity patterns of the original data is an important question, for which it makes sense, in this specific situation only, to consider approximations to variable selection methods obtained from subsamples of size N of much smaller scale than the original n.
Assume that the vector β* is such that its support set supp(β*) = S ⊆ {1, ..., p} describes the particular subset of important variables in the linear model (1). Without much loss of generality, we can assume that the columns of X have unit ℓ_2 norm. Various regularization methods have been developed to address this problem. A particularly attractive one is Lasso regularization [21], defined as
(2) β̂(λ_n) = arg min_{β∈R^p} { (2n)^{-1} ‖Y − Xβ‖²_2 + λ_n ‖β‖_1 }.
Theoretical properties of the Lasso estimator (2) are well understood and have been studied in a variety of high-dimensional settings (see [25, 9, 5] for more details). However, in practice there are situations in which the use of the Lasso might be statistically desirable, but its computation on the entire dataset might not be feasible due to resource constraints.
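As a concrete point of reference, here is a minimal numerical sketch of the estimator (2), using sklearn's coordinate-descent Lasso (whose objective uses the same 1/(2n) least-squares scaling as (2)); the simulated design and the constants are purely illustrative, not part of the paper:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p, s, sigma = 1000, 200, 10, 0.5
X = rng.standard_normal((n, p))            # approximately standardized columns
beta_star = np.zeros(p)
beta_star[:s] = 1.0                        # support set S = {1, ..., s}
Y = X @ beta_star + sigma * rng.standard_normal(n)

# lambda_n of the usual order sigma * sqrt(log p / n); the constant 2 is illustrative
lam_n = 2 * sigma * np.sqrt(np.log(p) / n)
# sklearn's Lasso minimizes (1/2n)||Y - X beta||_2^2 + alpha * ||beta||_1
fit = Lasso(alpha=lam_n, fit_intercept=False).fit(X, Y)
S_hat = np.flatnonzero(fit.coef_)          # estimated support supp(beta_hat(lambda_n))
```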
In such situations, we might ask whether the Lasso estimator can be well approximated by bagging a few Lasso estimators computed over a few small subsets of the full data. In this work, we provide a negative answer to this question. We also propose a randomized, weighted and smoothed modification of bagging, inspired by the recent work of [15] and [16]. While the method proposed in [15] requires only partial access to the full data and can be well suited for large-scale regression, in this paper we analyze its finite-sample theoretical properties and find that, under nonsmooth and potentially very large-scale regression assumptions, although it successfully controls false positives, it fails to control false negatives. In order to address this issue, we propose a novel method called maximal-contrast selection. Unlike stability selection [16, 20], the proposed method does not require variables to be selected in both of the paired subsamples, but rather in only one randomized subsample, which is later tailored to account for the uncertainty of the smaller-order subsampling.
To the best of our knowledge, almost no effort has been made to study approximations to the Lasso estimator by simple subsampling. Existing work includes [10], in which a parametric bootstrap of the thresholded Lasso estimator is used to stabilize the estimation; however, this method works only for very small p and N = n. In [13], the authors propose the m-out-of-n bootstrap for the choice of λ_n, but only operate in the p ≤ n, N ≥ n/2 setting. Although good empirical control of false negatives was achieved in [24] by bootstrapping rows and columns of the data simultaneously, no theoretical investigation was given. In the context of multiple hypothesis testing, [17] provides empirical evidence that subagging might have an advantage over bagging in false discovery control. Asymptotic limitations of subagging are underlined in [4] and [8], but their finite-sample equivalents, specifically in regression with non-smooth model assumptions, have not been analyzed. In particular, we show that the Bolasso [2] fails to even approximately recover the sparsity set of the original data when some of the subsampled designs are ill conditioned (Theorem 1), whereas the proposed method succeeds in these specific cases (Theorem 3). The asymptotically non-negligible mean squared error stated in Theorem 3.3 of [8] suggests that there is an asymptotically non-zero probability that the subsampled Lasso (sub-Lasso) and Lasso estimators have different sparsity patterns. In this paper, we provide a stronger version of this result through finite-sample analysis and show that the probability in question is equal to zero (Theorems 5 and 6). Moreover, in Section 5, we show that the estimated sparsity set of the subagging estimator is a discontinuous function of λ_N.
The contributions of this paper are threefold. First, we show the limitations of using traditional subagging with regularized methods to achieve exact or approximate support recovery. Interestingly, we show that the smoothing effects of bagging, unlike in prediction tasks, are not desirable for model selection tasks. Second, we propose a new method based on smoothed, weighted subagging, specifically designed to attain good support recovery in situations where previous methods fail. In this context, our main result shows that the Lasso estimator can be stochastically approximated by an aggregate of subsampled, weighted and smoothed Lasso equivalents (sub-Lassos). While not assuming that every sub-Lasso estimator must follow the same sparsity pattern, we show that the proposed subagging is able to approximately capture the sparse signal set S of the original data. Finally, we provide a novel finite-sample analysis of the sparsity patterns of the subsampled Lasso (sub-Lasso) and Lasso estimators. We show that the two estimated sparsity sets never match when N ≪ n and the classical conditions are possibly violated. We further extend the proposed methodology with an adaptive choice of tuning parameters, which allows relaxation of some of the theoretical conditions.
Let us briefly introduce the notation used throughout the paper. Let I and J be two subsets of {1, ..., n} and {1, ..., p}, respectively. For any vector x, let x_I denote its subvector with coordinates belonging to I, with x_j denoting its j-th coordinate. For any matrix A, let a_j denote its j-th column vector. We let A_{I,J} denote the matrix formed by the rows of A that belong to the set I and the columns of A that belong to the set J. The shorthand A_{Ī,J} stands for the matrix A_{I^c,J}, whereas A_I stands for A_{I,{1,...,p}}. We also use φ_min(A) = λ_min(A^TA) and φ_max(A) = λ_max(A^TA). Note that λ_N and λ_n denote the tuning parameters of the sub-Lasso and Lasso estimators, respectively.
The rest of the paper is organized as follows. Section 2 outlines the inefficiency of classical subagging of Lasso estimators, with the main results summarized in Theorems 1 and 2. Section 3 introduces the novel methodology of weighted small-sample bagging of Lasso estimators. Section 4 discusses theoretical findings on the support recovery properties of the weighted subagging, with the main result summarized in Theorem 3. A discussion of the difference between the estimated sparsity sets of the sub-Lasso and Lasso estimators is left for Section 5, in which Theorems 5 and 6 contain the main results. Section 6 discusses adaptive techniques for the choice of λ_N and outlines a non-random, optimal weighting scheme. Finally, in Section 7, we present results of the implementation of our method on both simulated and real data.

2. Inefficiency of Subagging
A bootstrap scheme is said to be efficient if it mimics the "behavior" of the original data. In this context, "behavior" can mean many things, such as inheriting rates of convergence in the central limit theorem or in large deviations. In this work, we concentrate on the efficiency of inheriting good variable selection properties. Such properties have not been investigated in great detail in the existing literature. We use the following definition.

DEFINITION 1. (a) An estimator Ŝ of the support set S achieves "exact sparse recovery" if Ŝ = S with high probability. (b) An estimator Ŝ of the support set S achieves "approximate sparse recovery" if S ⊆ Ŝ with high probability and E|Ŝ ∩ S^c| is negligible.
Let {(Y_i, X_i)}_{i=1}^n be a collection of observations from the simple linear model (1). Let I^b be the b-th bootstrap-without-replacement sample of the index set {1, ..., n}, and {I^b_1, ..., I^b_M} a collection of its M disjoint, non-overlapping subsets, ordered in any fashion, each of size N. In typical situations, similarly to the block bootstrap, we will assume that N = n^γ for γ ∈ (0, 1). The number of such subsamples, M, is determined such that MN = n. Without loss of generality, let us denote by the pairs {(Y_i, X_i)}_{i∈I^b_m} the m non-overlapping bootstrap subsamples of the original data, which have maximal distance from one another. The total number of bootstrap samples B does not have to be exhaustive, i.e., B ≪ n!. Let us denote the model-averaging, bagged Lasso estimator [6] as
(3) β̂^b(λ_N) = (BM)^{-1} Σ_{b=1}^B Σ_{m=1}^M β̂^b_m(λ_N).
Given that our focus is on studying approximations to the Lasso estimator, we need to specify conditions under which the Lasso estimator has good sparse recovery. Hence, it seems that the irrepresentable condition is unavoidable (see [22] for a detailed discussion of the necessity and sufficiency of the following condition).
Condition 1 (IR(n)). If Ŝ = Ŝ(λ_n) represents the estimated sparsity set of the Lasso estimator β̂(λ_n), then the design matrix X satisfies
‖X^T_{S^c} X_S (X^T_S X_S)^{-1} sign(β*_S)‖_∞ ≤ 1 − η for some η ∈ (0, 1].
The notation IR(n) denotes the dependence on the number of rows of the design matrix X. The empirical study of [7] suggests that the out-of-sample properties of the Lasso estimator depend linearly on the condition number of the new data matrix. Since our study of sparse recovery of bagged estimators requires tight control of the out-of-sample properties of each of the aggregated estimators, we outline a condition needed to control the condition number of the out-of-sample design matrices. It is an equivalent of the usual restricted eigenvalue condition [5].
Condition 2 (RE(X̃, ζ_N)). For each subsampled matrix X̃, there exists a positive number ζ_N such that
min_{v∈C(a), v≠0} ‖X̃v‖_2 / ‖v_S‖_2 ≥ ζ_N,
where C(a) = {v ∈ R^p : ‖v_{S^c}‖_1 ≤ a‖v_S‖_1} is a cone set, ‖·‖_2 is the ℓ_2 norm of a vector v ∈ R^p, and the condition holds for any realization of the weight vector w.
The subscript N in ζ_N denotes the number of distinct rows of the matrix X̃. As long as the tail of the distribution of X_ij decays exponentially, the column correlation is not too large and N ≥ s log n, there are no difficulties in establishing such an RE condition (see [19] for more details). The following theorem shows the inefficiency of the classical subagged estimator in support recovery.

THEOREM 1. Assume that Conditions 1 and 2 are satisfied and that p > n. Then, for every λ_n, there exists no sequence of λ^1_N with N ≤ n such that the bagged estimator β̂^b(λ^1_N) achieves exact or approximate sparse recovery of the Lasso estimator β̂(λ_n).
Let us note that Theorem 1 holds without imposing the IR(N) condition on every subsampled dataset. This contrapositive result is in line with the expected disadvantages of subagging [8], but is somewhat surprising. It extends [2] (whose result only holds for fixed, not divergent, n) and shows that even the bagged Lasso estimator (with N = n/2), for diverging n and p, fails to exactly recover the sparsity set S if at least one of the IR(N) conditions is violated. However, for p ≤ n and the choice N = n, the situation reverses and the bagged estimator recovers the true sparsity set S exactly, with probability very close to 1 (the proof is presented in the Appendix). Moreover, in the next result we show that only for choices of N ≥ n/4 does β̂^b(λ_N) attain the same predictive properties as guaranteed by the Lasso.

THEOREM 2. Assume that RE(X^b, ζ_n) holds, where X^b is the matrix obtained by bootstrap sampling the rows of the matrix X. Then the estimator β̂^b(λ_N), defined on a bootstrapped sample of size n/k with k ≤ 4 and with the choice λ^1_N = kσ√(log p/n), attains the same ℓ_2 prediction error as the Lasso estimator β̂.

3. Weighted Maximum-Contrast Subagging
In practice, for large-scale regression with n of the order of millions or more, subsampling of much smaller order becomes necessary. The previous results show that, in such situations, it is not possible to approximate the Lasso estimator and recover the signal sparsity by simply bagging sub-Lasso estimators. Hence, we propose a randomized subsampling alternative, inspired by the recent paper [15], where the number of distinct original observations N ≪ n is controlled, kept small, and not considered random. Unfortunately, Lasso estimators derived from such subsamples, although computationally feasible (as they shrink an n-scale computation to a very small N-scale one), are potentially very statistically unstable (especially as IR conditions of size N are not required). In order to address this, we propose a hybrid of randomized aggregation and maximal-contrast selection, which we call weighted maximum-contrast subagging. Even though the proposed method is based on subsamples of order much smaller than n, it approximates the sparse recovery of a Lasso estimator without requiring IR conditions on each of the subsamples.
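Before defining the estimator, a small sketch of the disjoint-block subsampling described in Section 2 (assuming N = n^γ rounded to an integer and M = ⌊n/N⌋ blocks per permutation; leftover indices, if MN < n, are simply dropped here):

```python
import numpy as np

def disjoint_blocks(n, gamma, rng):
    """Randomly permute {0, ..., n-1} and cut it into M disjoint blocks of size N = n**gamma."""
    N = max(1, int(round(n ** gamma)))
    M = n // N                                   # ideally M * N = n; leftovers are dropped
    perm = rng.permutation(n)
    return [perm[m * N:(m + 1) * N] for m in range(M)]

rng = np.random.default_rng(1)
blocks = disjoint_blocks(10_000, gamma=0.5, rng=rng)   # 100 blocks of size N = 100
```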
Define the weighted sub-Lasso estimator β̂^b_m(λ_N, w) as
(4) β̂^b_m(λ_N, w) = arg min_{β∈R^p} { (2n)^{-1} Σ_{i∈I^b_m} w_i (Y_i − X_iβ)² + λ_N ‖β‖_1 },
where w ∈ R^N is a vector of random weights and I^b_m is a random permutation of the index set I. Since N ≪ n, possible model selection inconsistencies of each of the sub-Lasso estimators might be carried over to the aggregated estimator. Therefore, we propose smoothing out the selection set by aggregating the subsampled selection sets, which we call "maximum-contrast" selection. With that in mind, for each realization of the random weight vector w, let Ŝ^b_m(λ_N, w) = {1 ≤ j ≤ p : β̂^b_{m,j}(λ_N, w) ≠ 0} denote the set of selected variables for each of the sub-Lasso estimators. In order to contrast the selected sets, we define and use a similarity measure between these bootstrap sets to determine which variables are going to be selected. Let us denote by p*_j = P(j ∈ Ŝ^b(λ_N)) the probability that variable j belongs to the set Ŝ^b(λ_N). Due to subsampling, the estimated sets Ŝ^b(λ_N) may exhibit significant variability, which may lead to severe overdispersion. Hence, we propose the following minimax-risk estimator of the selection probability p*_j:
(5) π*_j(λ_N) = (Σ_{b=1}^B 1{j ∈ Ŝ^b(λ_N)} + √B/2) / (B + √B).
Compared to the MLE, B^{-1} Σ_{b=1}^B 1{j ∈ Ŝ^b(λ_N)}, the proposed estimator (5) has larger bias but much smaller variance in two situations: for small values of B ≤ 40 and all p*_j, or for all values of p*_j close to 0.5. By reducing the effect of observations belonging to the tail, this estimator is less sensitive to overdispersion and more robust than the MLE in the presence of subsampling effects. Hence, π*_j increases the chance that variable j is selected. The estimator of the support set is Ŝ_τ(λ_N) = {1 ≤ j ≤ p : π*_j(λ_N) ≥ τ}, with an appropriately chosen thresholding parameter τ. Next, we define the weighted maximum-contrast bagged estimator β̂^a(λ_N) as
(6) β̂^a_j(λ_N) = P*_w β̂^b_{m,j}(λ_N, w) if j ∈ Ŝ_τ(λ_N), and zero otherwise.
Here P*_w is the empirical average with respect to all realizations of w. The proposed estimator (6) is a Monte Carlo approximation of the theoretical weighted small-sample bagged estimator, with E* denoting expectation with respect to the randomization, subsampling and bootstrapping. We discuss the choices of the tuning parameters λ_N, B, M and K in Section 6.
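A compact sketch of the maximum-contrast voting step, implementing the shrinkage estimator (5) as reconstructed above (the helper name is ours):

```python
import numpy as np

def maximal_contrast_support(selected_sets, p, tau):
    """Aggregate B selection sets S_hat^b into pi*_j of (5) and threshold at tau."""
    B = len(selected_sets)
    counts = np.zeros(p)
    for S_b in selected_sets:
        counts[np.asarray(list(S_b), dtype=int)] += 1.0
    # Shrink the raw selection frequencies toward 1/2 (minimax estimator of a
    # binomial proportion): larger bias, but much smaller variance for small B.
    pi_star = (counts + np.sqrt(B) / 2.0) / (B + np.sqrt(B))
    return np.flatnonzero(pi_star >= tau), pi_star
```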
3.1. Choice of the weight vector w. Note that each of the sub-Lasso estimators β̂^b_m(λ_N, w) can be recast as an n × p Lasso problem,
β̂^b_m(λ_N, w) = arg min_{β∈R^p} { (2n)^{-1} ‖Ỹ − X̃β‖²_2 + λ_N ‖β‖_1 },
where the new design matrix X̃ is obtained by repeating each row X_{i_j} of X_{I^b_m} exactly w_{i_j} times, and Ỹ is the vector of responses with each response Y_{i_j} repeated w_{i_j} times.
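Because the weights are integer-valued and sum to n, this recast estimator can be computed literally by repeating rows, as in the following sketch (the helper name is ours):

```python
import numpy as np
from sklearn.linear_model import Lasso

def weighted_sublasso(X_block, Y_block, w, lam_N):
    """Sub-Lasso (4) on the N distinct rows of a block, recast as an n x p Lasso
    by repeating the j-th row and response w_j times (w integer, sum(w) = n)."""
    w = np.asarray(w, dtype=int)
    X_tilde = np.repeat(X_block, w, axis=0)    # new design with sum(w) = n rows
    Y_tilde = np.repeat(Y_block, w)
    return Lasso(alpha=lam_N, fit_intercept=False).fit(X_tilde, Y_tilde).coef_
```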
The proposed weighted subsampling, unlike Efron's bootstrap [12], which has a random number of distinct observations, adds the constraint that the number of distinct observations N is fixed, non-random, and much smaller than n. This choice of N alleviates computational costs, whereas the random weights w are designed to provide theoretical advantages by creating ghost subsamples of provisional size n. Hence, we propose a weighted bootstrap alternative as in [15], with (7) the n-dimensional weight vector w* having exactly N nonzero coordinates, placed at random positions. Note that the randomness in this proposal is twofold: both in the "placement" of the nonzero elements of w* and in their actual (random) values. Hence, the proposed method is equivalent to a randomized delete-(n − N)-jackknife [11]. Unlike the traditional jackknife, it has randomized weights w_i, not necessarily equal to n/N, and a very large deleted set, i.e., (n − N)/n → 1. The following condition is needed for the subsequent theoretical analysis.

Condition 3. Let w = (w_1, ..., w_N) be a vector of exchangeable random weights, independent of the data, such that Σ_{i=1}^N w_i = n and P(w_1 > 0) = 1. Moreover, let ‖w‖²_{2,min} = ∫_0^∞ P²_w(min_{1≤i≤N} w_i > t) dt < ∞ and E_w‖w‖_∞ < ∞.

Condition 3 is inspired by similar conditions imposed for weighted bootstraps [18]. Unlike those, it does not require N^{-1} Σ_{i=1}^N (w_i − 1)² to converge to a positive constant c independent of n; such a condition would be violated here, since c → 0 for N ≪ n, as seen through the jackknife comparison above. The weights are chosen to have larger variance, in order to allow for smaller choices of the tuning parameter λ_N. Let us present a few examples of random weighting schemes that satisfy the proposed structure. We leave the optimal, deterministic scheme for Section 6.
Example 1. (Multinomial weights) Traditional multinomial weights can easily be adapted as in [15], with the vector (w_1, ..., w_N) ∼ Multinomial(n, (1/N, ..., 1/N)) and N ≪ n. Then Condition 3 is easily satisfied.

Example 2. (Coupling Poisson and Multinomial) Consider the following game, where we throw n balls into N ≪ n urns. If Z_j denotes the number of balls successfully thrown into the urn labeled j, then Z_1, ..., Z_N are independent Poisson random variables with rate n/N, and (Z_1, ..., Z_N) denote the occupation numbers of the urns. If we have successfully thrown all n balls, with Σ_{j=1}^N Z_j = n, then let w_{i_j} = Z_j. If, on the other hand, Σ_{i=1}^N Z_i < n, then choose, uniformly at random, n − Σ_{i=1}^N Z_i balls and throw them into the urns by choosing urns independently with probability 1/N. Let (w_1, ..., w_N) be the new occupation numbers of the urns.
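Both schemes are straightforward to simulate; in the sketch below, Example 2 is implemented as described, topping up the Poisson counts until all n balls are placed (what to do when the Poisson counts overshoot n is not specified above, so we simply resample in that case, as an assumption):

```python
import numpy as np

def multinomial_weights(n, N, rng):
    """Example 1: (w_1, ..., w_N) ~ Multinomial(n, (1/N, ..., 1/N)), so sum(w) = n."""
    return rng.multinomial(n, np.full(N, 1.0 / N))

def poisson_multinomial_weights(n, N, rng):
    """Example 2: Poisson(n/N) occupation numbers, topped up to exactly n balls."""
    while True:
        Z = rng.poisson(n / N, size=N)
        if Z.sum() <= n:                       # assumption: resample on overshoot
            break
    # throw the remaining n - sum(Z) balls, choosing urns independently w.p. 1/N
    return Z + rng.multinomial(n - Z.sum(), np.full(N, 1.0 / N))
```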
Example 3. (Balanced Pólya scheme) Consider a Pólya urn scheme with an urn containing n balls of N different colors. At the initial step, let the number of balls of each color be equal to n/N. The balanced Pólya bootstrap is an evolution of the number of balls of each color over discrete time steps that keeps the total number of balls in the urn fixed at n while the count of each color changes. At each step, a ball is drawn uniformly at random from the urn and its color is noted. If a ball of color j (j = 1, ..., N) is drawn at a given stage, it is returned to the urn together with R_jk balls of color k. It is customary to represent these changes with a replacement matrix, and we propose R_jk = B_jk for k ≠ j and R_jj = −Σ_{k≠j} B_jk, with B_jk Bernoulli random variables with success probabilities p_jk ∈ (0, 1], such that only one entry in row j takes the value 1 (the sum of the row elements must be kept at zero to preserve the total number of balls in the urn) and Σ_{k≠j} p_jk = p_jj. Then, by classical arguments, it can be shown that the number of balls of each color is a Markov chain with a stationary multinomial distribution. By Borel-Cantelli arguments, it can be seen that this scheme also guarantees that, with probability 1, each color is represented in the evolution of the scheme after some number of draws, and Condition 3 is satisfied.
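A simplified simulation of the balanced scheme, under our reading of the replacement matrix (each draw of a ball of color j moves exactly one ball from color j to another color k, keeping the total at n; the uniform choice of k is an illustrative special case of the p_jk):

```python
import numpy as np

def balanced_polya_weights(n, N, steps, rng):
    """Example 3 (sketch): evolve urn counts over `steps` draws, total fixed at n."""
    w = np.full(N, n // N, dtype=int)            # initially n/N balls of each color
    for _ in range(steps):
        j = rng.choice(N, p=w / w.sum())         # draw a ball; note its color j
        k = rng.integers(N)                      # destination color (uniform here)
        if k != j and w[j] > 1:                  # keep every color represented
            w[j] -= 1                            # one ball leaves color j ...
            w[k] += 1                            # ... and is added to color k
    return w
```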

4. Theoretical Properties
In this section we explore finite-sample variable selection properties of the proposed weighted scheme (7). We provide novel insights into finite-sample equivalents of the asymptotic bias of subagging (as defined in [8]) and show how weights and smoothing are instrumental in overcoming it. If the irrepresentable condition IR(N) is assumed to hold for every subsampled dataset, it is easy to verify that each of the sub-Lassos (4) perfectly recovers the set S with high probability. Since the proposed methodology requires small subsamples, with possibly N ≪ p, requiring the IR(N) condition is undesirable. Hence, a more interesting scenario allows for possible violations of such small-sample irrepresentable conditions. The major part of this section is devoted to this particular case, showing that weighted maximal-contrast subagging (6) does not require such a condition to achieve approximate sparse recovery (according to Definition 1).
In situations when the IR(N) condition is possibly violated, we show that for every realization of the random weights w and λ_N ≥ λ_n,
(8) P(β̂^b_{m,j}(λ_N, w) = 0) < P(β̂_j(λ_n) = 0), j ∈ S^c,
(see Theorem 5). Therefore, there is little hope that each of the sub-Lasso estimators performs the same as the Lasso estimator (2). However, the expectation is that, by controlling out-of-sample properties and aggregating information over multiple subsamples, we can approximate the Lasso.

THEOREM 3. Let Conditions 1, 2 and 3 be satisfied. Assume that the distribution of {1(j ∈ Ŝ^b_m(λ_N)) : j ∈ S^c} is exchangeable and that the randomized selection is no worse than a random guess, for some constant c_2 > 1 and all 1 ≤ j ≤ p. Then, there exists a bounded positive constant c_N > 1 such that, for λ_N of the order σ√(log p/n), the weighted maximum-contrast estimator (6) achieves approximate sparse recovery of the set S.

Studying the estimator (6) is complicated by the fact that we allow deviations from the small-sample IR(N) condition. The proof of Theorem 3 is based on a tight control of the out-of-sample properties of each of the sub-Lassos and a novel method of finding an appropriate range for λ_N. The proof proceeds by allowing possible small deviations in prediction rates and finding the smallest such deviation that still allows good variable selection properties. The assumption of exchangeability of the indices j over the set ∪_w Ŝ^b_m(λ_N, w) is actually an assumption on the appropriate sets of probability distributions for the weight vectors w. It is easy to see that such a condition is not too restrictive for Examples 1 through 3 of Section 3.1 (all of which avoid trivial cases in which the exchangeability condition might be violated).
Theorem 3 suggests that even when the sub-Lassos have probability of support recovery smaller than 1/2, the proposed method (6) boosts naive aggregation and increases the probability of support recovery to be very close to 1. Sub-Lasso estimators cannot select all features as signal variables; their "goodness" is controlled through the assumptions RE(X̃, ζ_N), which guarantee that the size of the estimated set is not too large. Note that these are the only conditions required for each subsample (as the IR(n) condition applies to the complete data only).
Although this work is mainly concerned with variable selection, let us provide a result on the effect the weighting scheme w has on prediction properties. Interestingly, it shows that the weights w are necessary for controlling out-of-sample properties. We look at both the previously introduced minimal eigenvalue condition RE(X̃, ζ_N) and a relaxation of the following kind (ERE): there exists a positive number ζ̄_N such that the restricted eigenvalue bound holds with E_w‖X̃v‖_2 in place of ‖X̃v‖_2. This relaxation is related, but not equivalent, to the RE condition on the submatrices X_{I^b_m}. Since ERE holds for all data perturbations, it implies RE(X, ζ_n). By noticing that ‖X_{I^b_m}v‖_2 ≤ E_w‖X̃v‖_2, we can put forward the two prediction bounds, (i) under RE and (ii) under ERE, stated in Theorem 4. The upper bound on the constant a_N varies with the distribution of w. Its minimum is reached when the minimum of the weight vector w is maximized, i.e., for equally weighted w. In such cases, it is easy to show that a_N ≤ N/n and that the upper bounds in (i) and (ii) match despite the different RE assumptions.
Remark 1. In contrast to existing work on the Lasso estimator, where prediction bounds are obtained for λ_N > σ√(log p/N) [5], Theorem 4 holds for much smaller choices of the tuning parameter λ_N. By appropriately weighting each row of the subsample matrix X^b_{I_m}, we were able to obtain good prediction properties of the subsampled estimators for choices of the parameter λ_N of the much smaller order σ√(log p/n). Hence, the results of Theorem 4 suggest that the existence of w is crucial in eliminating false negatives (see Theorem 6).

5. Sparsity Approximation
As an important aspect of the efficiency of bootstrap subsampling, we explore how closely connected the sub-Lasso β̂^b_m(λ_N, w) (4) and Lasso β̂(λ_n) (2) estimators are in terms of their sparsity patterns. To the best of our knowledge, this important question has not been investigated in the existing literature. We aim to provide novel insights by showing that the two estimated sparsity sets have a strictly nested structure and are almost never of the same size. The main results are summarized in Theorems 5 and 6.

DEFINITION 2. We say that the sparsity pattern of an estimator β̂ is optimal for an estimator η̂ if the non-zero part of the estimator β̂ satisfies the Karush-Kuhn-Tucker optimality conditions of the non-zero part of the estimator η̂.
We find values of the tuning parameters λ_N and λ_n which allow the sparsity pattern of the sub-Lasso to match that of the Lasso estimator. We begin by showing in Lemma 2 (in the Appendix) that the signal variables identified by the sub-Lasso estimator (4) lie in the feasible set of the Lasso problem (2). Moreover, we show that the sub-Lasso estimator requires much smaller regularization (compared to the expected order of √(log p/N)) in order to reach a number of identified signal variables similar to the Lasso. This phenomenon complements the recent work of [14], in which the authors argue empirically that higher correlations, as induced by subsampling the rows of the data matrix, call for a smaller tuning parameter.

THEOREM 5. Let Conditions 1, 2 and 3 hold. Then the sparsity pattern of each of the sub-Lasso estimators (4), with λ_N ≥ c_1√(log p/n), c_1 > 0, is, with probability 1 − p^{1−c}, c > 1, optimal for the Lasso estimator (2) with λ_n satisfying a matching upper bound. Note that the choice of λ_N depends on w as well, through the last term of that bound.

Without evaluating prediction on the testing data, the sub-Lasso and Lasso estimators would not be close, and when the small-sample irrepresentable condition fails, the gap could be enormous. The core of the proof of Theorem 5 relies on evaluating finite-sample bounds for the out-of-sample properties of each of the sub-Lasso estimators. One consequence of the proof is that the out-of-sample prediction error of the sub-Lasso estimators is linearly inflated by λ_max(X^T_{Ī^b_m}X_{Ī^b_m}), i.e., by the conditioning of the new data. This phenomenon was separately noticed in [7] through detailed simulation investigations. Since the new data in our bagging procedure has many more rows than columns, n − N ≫ p, out-of-sample prediction is fairly good. This is the key to the good properties of the proposed small-sample bagging methodology.
In the remainder of this section, we show a contrapositive result: without requiring the irrepresentable condition on each of the small samples, it is not possible to obtain exact sparse recovery.

THEOREM 6. Let Conditions 1, 2 and 3 hold. Then the sparsity pattern of the Lasso estimator (2), for λ_n ≥ c_1σ√(log p/n), c_1 > 1, is, with probability 1 − p^{1−c}, c > 0, optimal for each of the sub-Lasso estimators (4), with λ_N in a suitably restricted range.

Remark 2. Let us denote by f_{λ_N} : P → R_+ the set function that maps a set to its cardinality, f_{λ_N}(S) = |S|, where S ⊂ P and P denotes the collection of all subsets of {1, ..., p}. Then, as a consequence of Theorems 5 and 6, we conclude that f_{λ_N}(∪_w Ŝ^b_m(λ_N, w)), as a function of λ_N, has a jump discontinuity of the first kind with probability close to 1. The jump lies in an interval determined by some small, but possibly random, positive constants c and c_1. Therefore, the "maximal-contrast" selection acts as a smoothing operator and allows us to recover the sparsity set S optimally.

6. Implementation
In this section, we discuss extensions of the proposed method and propose an adaptive choice of λ_N, along with a choice of the weight vector w which, remarkably, requires only 3 rounds of reweighting and 2 bootstrap resamplings in total.
6.1. Adaptive warm-start choice of tuning parameters. Since we seldom know a good value for λ_N in advance, we often compute a sequence of solutions for a (typically) decreasing sequence of values λ_{N,1} > λ_{N,2} > ... > λ_{N,q}. Warm-start, or continuation, methods use the solution at λ_i as an initial guess for the solution at λ_{i+1} and often yield better computational efficiency. When the sample size n of the original data is too large, the plain Lasso estimator (2) cannot be efficiently computed, and we have no relative comparison for choosing an appropriate sequence of λ_N. Through a careful analysis of all the factors in Theorems 5 and 6, we design the following adaptive, warm-start scheme for the choice of the tuning parameter λ_N.
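Warm starting over a decreasing λ-grid is directly supported by coordinate-descent solvers; a minimal sketch (sklearn's warm_start flag reuses coef_ from the previous fit as the initial guess, exactly the continuation idea above):

```python
import numpy as np
from sklearn.linear_model import Lasso

def warm_start_path(X_sub, Y_sub, lambdas):
    """Fit the sub-Lasso along a decreasing grid lambda_1 > ... > lambda_q,
    warm-starting each solve at the previous solution."""
    model = Lasso(alpha=lambdas[0], fit_intercept=False, warm_start=True)
    path = []
    for lam in lambdas:                # solution at lambda_i initializes lambda_{i+1}
        model.alpha = lam
        model.fit(X_sub, Y_sub)
        path.append(model.coef_.copy())
    return np.vstack(path)
```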
For each b and a given sample-index partition, define κ^b_m = φ_min(X_{I^b_m,Ŝ}) and κ̄^b_m = κ(X^T_{I^b_m,Ŝ}X_{I^b_m,Ŝ}), the condition number of the block Gram matrix. From the design of the subagging, we know that κ^1_m and κ̄^1_m will be finite but over-inflated, as all variables are included in the model. Due to the use of warm-start techniques for the choices of Â^b, both quantities will decrease in future iterations and have small fluctuations (of the order of n^{-1/2}) around their "true" values. Hence, instead of doing an extensive grid search for the optimal choice of λ_N, we propose to consider only a short adaptive range of candidate values, (11), whose upper endpoint is of the order 8√(log p/n). The constant 8 here guarantees good properties of the main Lasso estimator, whose tuning parameter is chosen of the order 8√(log p/n) (see [5]), and governs the upper bound on the choice of λ_N. The RHS of the previous inequality is always smaller than or equal to that of λ^1_N and, intuitively, we expect the difference between two subsequent bootstrap ranges to shrink as n^{-1/2}. Therefore, our procedure needs no more than two bootstrap permutations of the whole data (see Example 2 of Section 7).

6.2. Optimal Weighting Scheme. Previously, we introduced the stochastic weight vector w in order to imitate nonparametric bootstrap realizations as closely as possible. We also saw that assuming RE conditions on the whole data and on the small subsamples was unavoidable. RE conditions have been established for several broad classes of random matrices, ranging from those with dependent entries, including random matrices with subgaussian rows and non-trivial covariance structure, to matrices with independent and uniformly bounded entries [19, 26]. Such structure would be desirable at the population level. Unfortunately, the interplay between population conditions corresponding to the whole data and to the weighted subsamples is not easily interpretable; hence, we propose adapting the choice of the weight vector w. For every b and m, we propose choosing the w which maximizes ζ_N, subject to 1^Tw = n and w ≥ 0. Hence, w increases the minimal eigenvalue of the submatrix X̃^TX̃ and at the same time imitates bootstrap realizations. Unfortunately, this is an integer optimization problem, which is NP-hard. However, the following result shows that with the simple relaxation w_1, ..., w_N ∈ R^N_+, the problem can be recast as a semi-definite program (SDP) and hence efficiently implemented.

THEOREM 7. The problem of maximizing the minimum eigenvalue of X̃^TX̃ under the conditions 1^Tw = n and w ∈ R^N_+ is equivalent to solving the following SDP:
max t s.t. Σ_{j=1}^N w_j x_{i_j}x^T_{i_j} ⪰ tI, Σ_{j=1}^N w_j = n, w ≥ 0.

This result enables us to design an adaptive choice of the weight vector w that increases the minimal eigenvalue of X̃^TX̃ compared to X^T_{I^b_m}X_{I^b_m}. Let k = 1, ..., K index the realizations of the vector w. For k = 1, choose equal weights of size n/N. For the next iterations, k = 2, ..., K, for fixed b and a fixed sample splitting, take J_k = {j : β̂^b_{m,j}(λ_N, k − 1) ≠ 0} and ŝ = #J_k, and choose w_k = w to solve
(12) the SDP of Theorem 7 restricted to the columns in J_k, with w ∈ R^N_+, where x_{i_j,J_k} denotes the i_j-th row of the matrix X_{I^b_m} with columns (i.e., indices) in J_k. In principle, the stochastic optimization problem given by Theorem 7 should be solved for all possible sparsity patterns v ≠ 0 of size at most ŝ. However, this is computationally less attractive, so we relax it, take into account the sparsity structure of the previous-step estimator, and look only at those submatrices. By doing so, we make a tradeoff between computational efficiency and theoretical requirements.
Problem (12) is a regular SDP that can be plugged into any of the existing SDP solvers. The problem is not of large size, as |I^b_m| = N ≪ n, and therefore the computational costs of repeating it are not substantial. Since the problem has a unique optimum, there is no need for K to be larger than 3. Step two can still potentially produce too large an ŝ, hence to correct for it we allow one additional iteration over k.
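For illustration, the relaxed problem (12) can be stated in a few lines with a generic convex solver; a sketch using cvxpy (maximizing the minimal eigenvalue of an affine matrix expression is a standard SDP atom; X_block is assumed to be the N × |J_k| submatrix of X_{I^b_m}):

```python
import cvxpy as cp
import numpy as np

def optimal_weights(X_block, n):
    """Maximize lambda_min( sum_j w_j x_j x_j^T ) s.t. 1^T w = n, w >= 0 (Theorem 7)."""
    N = X_block.shape[0]
    w = cp.Variable(N, nonneg=True)
    # Gram matrix of the reweighted block, affine in w
    G = sum(w[j] * np.outer(X_block[j], X_block[j]) for j in range(N))
    problem = cp.Problem(cp.Maximize(cp.lambda_min(G)), [cp.sum(w) == n])
    problem.solve()
    return w.value
```

The returned weights are real-valued, reflecting the relaxation w ∈ R^N_+ discussed above; in practice they can be rounded back to integers summing to n.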

7. Simulated and Real Data Examples
Simulated and real data are analyzed in order to underline the strengths of the proposed weighted maximal-contrast subagging. We generated 3 synthetic datasets from the simple linear model (1), with varying sample size n, feature size p, signal strength and correlation structure in the design matrix X. All choices of tuning parameters followed the adaptive scheme of Section 6.

7.1. Example 1. In this example, we measure the performance of the proposed method through false positive (FP) and false negative (FN) error control. In order to compare the proposed method with the plain Lasso estimator, the sample size n was chosen to be large but computationally feasible. Therefore, we fix n = 10000 and p = 500 and generate the design matrix X from a multivariate normal distribution with correlation matrix ρ^|i−j|, with the choices of ρ: 0.8 (a), 0.2 (b), 0.2 (c), 0.8 (d). We vary the signal-to-noise ratio by setting the first 30 parameters equal to 1 in (a) and (b) and to 0.3 in (c) and (d), and the rest equal to zero. We contrast the results of the traditional Lasso (2), traditional subagging (3) and the proposed weighted maximal-contrast subagging (6), with varying choices of the parameter γ and a fixed number of bootstrap permutations and weight approximations, B = 2, K = 3. We repeat the whole procedure 100 times.
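The Example 1 design is easy to reproduce (a sketch; the ρ^|i−j| covariance is generated through its Cholesky factor):

```python
import numpy as np

def ar1_design(n, p, rho, rng):
    """Rows ~ N(0, Sigma) with Sigma_ij = rho**|i - j|, as in Example 1."""
    idx = np.arange(p)
    Sigma = rho ** np.abs(idx[:, None] - idx[None, :])
    L = np.linalg.cholesky(Sigma)
    return rng.standard_normal((n, p)) @ L.T

rng = np.random.default_rng(2)
X = ar1_design(10_000, 500, rho=0.8, rng=rng)     # setting (a)
beta_star = np.r_[np.ones(30), np.zeros(470)]     # strong signal: settings (a), (b)
```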
The results are summarized in Figures 1 and 2. Figure 1 outlines the differences among the three methods with respect to true and false positives, by plotting the average selection probability of the signal and noise variables against the size N of the subsamples. The methods are color-coded as follows: blue for subagging, green for Lasso and pink for the proposed weighted maximal-contrast subagging. In these examples, we explore the worst-case behavior of the proposed method, and chose τ = 1 in order to do so.
We expect the Lasso to be able to select the sparse set in example (b) but to fail to some extent in the other examples; examples (a) and (d) are designed to violate Condition 1, whereas examples (c) and (d) decrease the signal-to-noise ratio. Figure 1 confirms all of the above and illustrates much more. Surprisingly, subagging fails to perform variable selection at all: it selects almost all features as important in all 4 settings. We find that weighted maximal-contrast subagging has better-than-expected properties. It manages to control the number of FPs in all 4 cases at a fast logarithmic rate with respect to the size of the subsamples. The number of FNs depends on the choice of τ, but in most cases with τ ∼ 1/2 it keeps the FN rate low, even for subsamples of size √n. As expected, FN control deteriorates with smaller signal strength, requiring larger subsample sizes to obtain a low FN rate. For example (b), both a subsample size of 75 with τ ∼ 1/2 and a subsample size of 87 with τ ∼ 3/4 guarantee that both FP and FN will be close to zero. In the case of example (d), the equivalent subsample-size requirements are larger, 130 and 250 respectively. On the other hand, examples (a) and (d) show how additional strong feature correlation helps weighted maximal-contrast subagging control FN with little or no increase in FP.
Figure 2 zooms into the left-column plots of Figure 1 and outlines the true positives (TP) of weighted maximal-contrast subagging. We show box plots of the selection probabilities for each of the signal variables individually, with N = 100. Although in each individual run the sub-Lasso estimator missed a few of the first 30 important variables, by averaging over many small randomized Lassos and perturbing the data, the final estimator was able to pick up those "loose" variables in all four setups. As long as τ was bigger than 1/2, all of the significant variables were picked up. In that sense, the proposed method improves upon the finite-sample selection property of the plain Lasso estimator, given that it keeps FN = 0 even when Condition 1 is violated.

7.2. Example 2.
We are also interested in testing the sensitivity of the proposed method to the choice of the number of bootstrap permutations B. For that purpose, we generate synthetic data from the simple linear model with different numbers of bootstrap data permutations. The model specifications are those of Example 1 (b), with small column correlations and strong signal. As a measure of performance, we contrast the mean selection frequency of the first 30 important variables across different numbers of bootstrap replications. Figure 3 summarizes our findings and reports the average selection frequency and its 5% confidence interval for 4 different numbers of bootstrap replications: B = 1, 2, 5 and 10. As expected, the larger the number of bootstrap replications, the smaller the variability in estimating π_j. Interestingly, B = 2 was sufficient to guarantee π_j > 1/2. Even for B = 1 this seems to be true in the example considered, but the variability is significantly larger than for B ≥ 2, hence a general conclusion seems inappropriate.

7.3. Example 3. Figure 4 shows variability in sampling frequencies. In most modern applications we do not have control over the size of γ, or do not know the size of the whole sample (hence we only know that γ is potentially small). We consider four different scenarios, where n = 500, 1000, p = 250, 500 and γ = 0.75, 0.8. The first and third scenarios use the simple equal weighting scheme w_i = n/N, whereas the others use the optimal weights of Algorithm 1. We show that, even with relatively small subsampled datasets, the proposed method performs reasonably well (see Figures 4(b), (d)). Note that the implemented maximum-contrast selection in Figures 4(a) and (c) corresponds to the paired stability selection of [20] (since M = 1 and K = 1). Hence, we observe that paired stability selection underperforms for small subsamples and small n/p ratios.

7.4. Million Songs Dataset. The Million Song Dataset, developed with the support of the NSF, is a new, freely available collection of audio features and metadata for a million contemporary popular music tracks. The goal of this project is the development of smarter platforms for enjoying music, apps that deliver enhanced listening experiences, and tools for detailed musical analysis. To get a sense of the size of this project, note that a 1% subset of this data is 1.8 GB and is substantial on its own. Hence, Columbia University, together with The Echo Nest and LabROSA, has started the Million Songs Challenge (http://labrosa.ee.columbia.edu/millionsong/) with the goal of finding the best solutions to a number of problems related to this dataset, a few of which are: dividing a song into meaningful segments, associating appropriate keywords to a specific sound segment, predicting tags based on clusters of song titles, etc. One such task is the prediction of an album's year of release. Using special audio features like segment timbre and its covariance for every song, and possibly other information like energy, danceability, etc., our goal is to try to understand which of them has the highest correlation with the year or decade in which the song was released. A simplified dataset for this task is available in the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/datasets/YearPredictionMSD).
Even simplified, this dataset is still huge. It consists of 515345 different track instances with 90 audio features each. Simply applying the Lars package is not possible, especially given the longitudinal nature of the data. We split the data into training (the first 463715 examples) and testing (the last 51630 examples) subsets in order to avoid the producer effect, making sure no song from a given artist ends up in both the train and the test set. We apply the proposed adaptive weighted maximum-contrast subagging on the training data with a generalized linear regression model: we model the probability of the i-th observation belonging to the k-th year of release using a logit link function, with β_0 denoting an intercept. The choices of the tuning parameters are B = 2, K = 3, for the sequential grid of values of λ_N of Section 6. Among the candidate values for λ_N, we pick the one that minimizes the 5-fold cross-validation statistic.
The prediction error rate for each year of release k = 1, ..., 83 is defined with respect to a threshold c, usually taken to be larger than 0.012, and n_1, the size of the testing dataset. Since we have 83 years of release, we summarize our findings across decades of a track's release in Table 1. Due to the size of the dataset, the plain Lasso estimator could not be implemented; hence we compare the performance of the proposed weighted maximum-contrast subagging (WMCSub) with plain subagging (Sub) in terms of their prediction errors and estimated sparsity sets.
For each of the decades, we report the median misclassification rate, and its standard deviation, over the years of release in that decade (column 2 of Table 1). First, we observe differences in the rates of successful classification, with recent decades showing much better results compared to the early 20th century. This may potentially be explained by the unbalanced nature of the data, with recent years having many more observations than the early 20th century. Second, we observe that the prediction rates of the two methods are very close to each other, but the selected features are not. Subagging selects almost all of the features all the time, whereas the proposed method finds smaller subsets.
Let us comment on the structure of the selected sets and a pattern that appears consistent across decades. Interestingly, we find that the first 12 timbre features are selected repeatedly for different years of song release, whereas among the last 72 timbre-correlation features only a small portion is selected. Figure 5 tries to capture this consistency in selection across tracks from different decades.

Figure 5. Visualization of the selection of audio features for one representative year in each decade. The y-axis gives the median value of the coefficients, among the 8 possible choices of the tuning parameter λ_N in (11), for each of the timbre features. Colors black, blue, green, yellow, orange, pink, purple, magenta and grey represent the years 1924, 1934, 1944, 1954, 1964, 1974, 1984, 1994 and 2004, respectively.
Variability within the timbre audio features across the given decades is visible in Table 1. From Figure 5 we can see that, apart from the sixties and seventies, all of the first 12 timbre audio features are selected as important; the sixties and seventies have only a few of them present. Nonetheless, the relative difference in the sizes of the coefficients between the first 12 features and the last 72 correlation features is quite significant.

8. Discussion
In this paper, we have outlined the causes of failure of nonparametric bootstrap aggregation and showed its inability to successfully differentiate between signal and noise variables. We have presented a new, weighted and smoothed nonparametric subagging of Lasso estimators and demonstrated the ability of this method to effectively recover the sparsity set in the context of high-dimensional linear models. This is of particular importance when dealing with extremely large datasets in which sparse recovery is of interest and fast answers are of great significance.
In general, the proposed method can be viewed as a sequence of stochastic approximations to regularized empirical risk minimization, designed to keep computational costs low yet maintain a small approximation error. These two conflicting requirements parallel the Neyman-Pearson paradigm, in which the Type I and Type II errors cannot be simultaneously controlled. The Type I error is now replaced with a requirement of low computational cost, and the Type II error with a requirement of small statistical error. This important paradigm extends well beyond the current work and is left for future exploration.

Appendix: Detailed Proofs
This Appendix contains details of the proofs of Theorems 1, 2 and 3, whereas the supplement contains all the details of the proofs of Theorems 4, 5, 6 and 7.
A. Preliminary Lemma

LEMMA 1. Let w = (w_1, ..., w_N) be a vector of exchangeable random variables that satisfies Condition 3. Let P and P* stand for the probability measures generated by the error vector (ε_1, ..., ε_N), and generated jointly by the weights w_1, ..., w_N and the errors ε_1, ..., ε_N, respectively. Then, for any sequence u_n of real numbers bounded from below by n^{-1/2}, a Bernstein-type tail bound for the weighted sums holds.

Proof of Lemma 1. Let R denote a random permutation uniformly distributed over the set of all permutations of 1, ..., N, and let w = (w_1, ..., w_N) be a vector of exchangeable random variables that satisfies Condition 3. Note that X_1, ..., X_N are independent of (w, R). Define S to be a random permutation of {1, ..., N} by requiring that w_{S(1)} ≥ w_{S(2)} ≥ ... ≥ w_{S(N)}, and, if w_{S(j)} = w_{S(j+1)}, that S(j) < S(j + 1); this is one possible definition that is unambiguous in the presence of ties. The random permutation R is independent of (w, S). Notice that P* = P × P_w. By the exchangeability of the vector w, and since R ∘ S is distributed as R and independent of S, the right-hand side of the resulting expression can be bounded from above by a chain of inequalities. Let us denote by p_t the tail probability on the right-hand side of that bound.
By a direct application of the tail inequality for Gaussians, we obtain a simple bound on p_t. By applying the Cauchy-Schwarz and Jensen inequalities, equation (13) can be upper bounded further. To bound the last term, we use (14), where x^j_R stands for the j-th column of the matrix X_{R(I^b_m)}, ‖x^j_R‖_{∞,2} for the maximum of the squared elements, and ‖x^j_R‖_∞ for the maximum of the absolute values of the elements of the n-vector x^j_R. The claim then follows by observing simple relations among these quantities, where in the last step we used ‖w‖_{2,min} = ∫_0^∞ P²_w(min_{1≤i≤N} w_i > t) dt.

B. Proofs of Section 2
The major part of the proof of Theorem 1 lies in the following two simple lemmas and in the main results of Theorems 5 and 6, which analyze the sparsity patterns of the sub-Lasso and Lasso estimators.

B.1. Preliminary Lemmas. Let β̂^b_m be the sub-Lasso estimator and β̂ the Lasso estimator throughout this section.

LEMMA 2. For all j for which β̂^b_{m,j}(λ_N) ≠ 0, and for λ_N ≤ λ_n, the sub-Lasso estimator satisfies the KKT optimality conditions of the Lasso problem (2) at coordinate j.

Proof of Lemma 2. Since both optimization problems are convex, it suffices to analyze their respective Karush-Kuhn-Tucker (KKT) optimality conditions. For all j ∈ {1, ..., p} such that β*_j ≠ 0, the Lasso estimator needs to satisfy its KKT conditions, and the KKT conditions for the sub-Lasso estimator are analogous. We are then left to check whether, under the conditions of the lemma, the sub-Lasso estimator β̂^b_m satisfies the KKT conditions of the Lasso estimator. That calculation follows easily from a triangle-inequality bound, in which the first term on the right-hand side is bounded by λ_N and the second by λ_n − λ_N (by the assumption of the lemma).

LEMMA 3. For all j such that β̂_j(λ_n) ≠ 0, and under a corresponding condition on λ_N and λ_n, the Lasso estimator satisfies the KKT optimality conditions of the sub-Lasso problem (4) at coordinate j.

Proof of Lemma 3. Let us assume that the KKT conditions of the plain Lasso estimator, computed over the complete dataset, hold, i.e., that they hold for those j such that β̂_j ≠ 0. By analyzing the two cases individually, Case (I) and Case (II), we require the corresponding inequality to hold in both cases. With it, we can then see that the bound follows, since each w_i ≤ n for all i almost surely. Hence, to show that β̂ satisfies the KKT conditions for the sub-Lasso as well, we need the stated condition on λ_N and λ_n.

B.2. Proof of Theorem 1. The previous lemmas are stated for a general sub-Lasso estimator but can easily be adapted to the bagged estimator. With their help, and with the results of Theorems 5 and 6, we are ready to finalize the proof of Theorem 1. The equivalent of Lemma 2 requires a corresponding relation between λ_N and λ_n to hold. The proof then follows easily as a consequence of the results obtained in Theorems 3, 5 and 6.
First, on the subject of approximate sparse recovery, we conclude that it is not possible, as a consequence of the proof of Theorem 3. For the bagged estimator, there exists no feasible q different from zero which solves the equivalent of (24). The equivalent of (24) would require, on the one hand, q > √(log p/N) and, on the other, q < √(log p/n). For N ≪ n this is not possible, as √(log p/N) > √(log p/n). For the special case N = n/k, only the choice k = 1, with fixed, non-divergent s and p, allows both conditions to be satisfied.
Second, on the subject of exact sparse recovery, as a consequence of the equivalent of Lemma 2 and Theorem 5 we obtain (16), and as a consequence of the equivalent of Lemma 3 and Theorem 6 we obtain (17). If N ≪ n, then from these contradictory conditions we see that the subagged estimator cannot have exactly the same sparsity as the Lasso estimator. For the special case N = n/k, only the choice k = 1, with fixed, non-divergent s and p, allows equations (16) and (17) to be satisfied up to a constant; that is, there exist two constants 0 < c < ∞ and 0 < c_1 < ∞ for which the corresponding choices of the tuning parameters work.

B.3. Proof of Theorem 2. If the RE condition holds on the bootstrapped data matrix, then the result follows by repeating the steps of the proof of Theorem 4, with the simplification of no weighting scheme w. Following steps parallel to those in [5], we can obtain predictive bounds of the order sλ_N/ζ²_N, for N = n/k and λ_N ≥ 2σ√(2k log p/n). From the classical results on Lasso prediction bounds, we know that the optimal λ_n ≥ 2σ√(2 log p/n). The statement of the theorem follows if we are able to bound the relevant difference, say ε_n, for some ε_n ≥ 0. Then, for λ_N ≤ 2σ√(2k log p/n), the claim is equivalent to bounding the corresponding ratio by a constant. But from the RE(N) condition we know that 0 ≤ ε_n < ηζ²_n for some constant η < 1. Hence, a constant C > 1 that satisfies the above properties is (η + 1)/(1 − η). Therefore, we can conclude that δ_n ≤ 2Cσ√(2k log p/n).

C. Proof of Theorem 3
The main ingredient of the proof is based on the intermediary results (of significant importance in their own right) stated in Theorems 5 and 6. The first part of the statement follows by utilizing Theorem 6, in order to conclude that S ⊆ Ŝ_τ with probability close to 1. Unfortunately, as the conditions of Theorem 5 contradict those of Theorem 6, we cannot easily apply them to conclude the second part of the statement of Theorem 3. Therefore, we develop a new method for finding the optimal value of the tuning parameter λ_N. It is based on an optimal bias-variance tradeoff, where the bias is replaced with the variable selection error and the variance with the prediction error. It allows good, though not the best, prediction properties while obtaining the desired variable selection properties. To that end, we split the proof into two parts: the first is based on bounding the number of false positives, and the second on finding the optimal choice of λ_N that implies the results of Theorem 6 and bounds the false positives simultaneously.

C.1. Bounding False Positives. Assume that the distribution of {1(j ∈ Ŝ^b_m(λ_N)) : j ∈ S^c}, with Ŝ^b_m(λ_N) = ∪_w Ŝ^b_m(λ_N, w), is exchangeable for all λ_N. Also, assume that the approximate randomized Lasso procedure is not worse than a random guess, as in (18). Let us fix b and denote the sets I^b_m by I_m (i.e., a random split of the samples {1, ..., n} into M disjoint subsets, each of size n^γ). Then, similarly to [20], we can define a binary random variable H_λ = 1{j ∈ ∩_{m=1}^M Ŝ_m(λ_N)} for every variable j ∈ {1, ..., p}. The minimax selection probability can then be expressed as a function of the simultaneous selection probability. Since the choice of λ_N intrinsically controls the predictive properties, good selection and best prediction cannot be satisfied simultaneously on the sets A_0(λ_N). For q = 0, the best prediction is still achievable, but variable selection (Theorem 6) is not. Hence, we need to allow for possible deviations from the smallest sets A_0(λ_N) by allowing small perturbations of size q. Our goal is to find the smallest possible perturbation q which allows high-probability bounds on the selection of false negatives and simultaneously controls the size of the selected sets of the sub-Lasso estimators, to keep the number of false positives small.
Let us represent the conditions of the stochastic problem (19) in a concise way. Note that, from the optimality conditions of each of the sub-Lasso problems, we can characterize the events on which β̂^b_{m,j}(λ_N, w) ≠ 0. Then, on the event A_q from Theorem 4, we obtain the corresponding bounds. Hence, from all of the above and the tower property of expectations, we can conclude the bound (23) on E|Ŝ^b_m(λ_N) ∩ S^c|. We are now left to evaluate the size of the sets A_q. This step of the proof is based on a Bernstein inequality for exchangeable weighted bootstrap sequences, contained in the intermediary result Lemma 1. Applying Lemma 1 with a choice of t of the order λ_N − q yields the required probability bound. Hence, the solution to the stochastic optimization problem (19) also solves min q ≥ 0 subject to the constraints in (24), for some constants c_1 < c_2. The RHS of the last constraint inequality is a consequence of Theorem 4 (which is used numerous times in the steps of the proof). The first constraint of the above problem can be reformulated accordingly.
We can then see that the optimal value of q is of the order c_3σ√(log p/n). Notice that such a value allows the sets A_q(λ_N) to have large coverage probability. This optimal choice of q results in the following optimal range for the tuning parameter λ_N:
(c_1 − c_3)σ√(log p/n) ≤ λ_N ≤ (c_2 − c_3)σ√(log p/n),
which satisfies all previous constraints, hence leading to the statement of the theorem.

D. Proof of Theorem 4
Although the sub-Lasso can be reformulated as a modified Lasso estimator, sparse oracle inequalities for the Lasso cannot be applied directly, as the error terms in the modified problem are no longer independently distributed. The main results of Theorem 4 are proved using the novel preliminary result of Lemma 1. From the definition of the sub-Lasso problem, a basic inequality holds for every realization of the random weights w and for any value of β. For simplicity of notation, we have suppressed the dependence of β̂^b_m on λ_N and k. By using Y_i = X_iβ* + ε_i and setting β = β*, the previous inequality becomes an inequality in β̂^b_m − β*. In the rest of the proof we establish a result stronger than the one in the statement of Theorem 2, as such a result is needed for Theorem 1. Standard manipulations of the basic inequality, together with the bound on Σ_{j∈S}|β̂^b_{m,j} − β*_j|², lead to the conclusion that β̂^b_m − β* ∈ C(3); using the expected Restricted Eigenvalue condition on the set Ω_w, we then obtain
‖X_{I^b_m}(β* − β̂^b_m)‖_2 ≤ (4λ_N − q)√(sn) E_w√(‖w‖_∞) / (ζ_N min_{1≤i≤N} w_i).
Hence, if we define a_N such that the event {min_{1≤i≤N} w_i ≥ E_w√(‖w‖_∞)/a_N} has probability close to 1, we may consider the event A_q(λ_N), which leads to the first conclusion. From the previous result we can also obtain the corresponding bound in expectation, where E_w denotes expectation with respect to the probability measure generated by w. Using Jensen's inequality for concave functions and the independence of the weighting scheme w from the vectors X_i, and combining the previous inequalities, we obtain the result. The size of the set A_q(λ_N) can be deduced from Lemma 1 and Theorem 4 and is hence omitted.
E. Proofs of Section 5

E.1. Proof of Theorem 5. Using the results of Lemma 2, it is sufficient to show that, for every sub-Lasso estimator β̂^b_m, the event Ω_n (defined below) happens with large probability. In other words, the sub-Lasso will be one of the possible optima for the Lasso if we can evaluate its behavior on the data on which it was not built.
To that end, let us introduce the following notation. Let Ī^b_m be shorthand for the complement of I^b_m. With Ŝ^b_m the set of non-zero coefficients of β̂^b_m, let us denote Â = Ŝ^b_m ∪ Ŝ, where Ŝ is the set of non-zero coefficients of the Lasso estimator β̂. Let P_Â be the projection operator onto the space spanned by all variables in the set Â. Then
X^T_{Ī^b_m,j}(Y_{Ī^b_m} − X_{Ī^b_m}β̂^b_m) = X^T_{Ī^b_m,j}(I − P_Â)(Y_{Ī^b_m} − X_{Ī^b_m}β̂^b_m) + X^T_{Ī^b_m,j}P_Â(Y_{Ī^b_m} − X_{Ī^b_m}β̂^b_m).
Controlling the size of the set Ω_n is equivalent to upper bounding the RHS of the previous equation.

E.1.1. Step I: Controlling X^T_{Ī^b_m,j}P_Â(Y_{Ī^b_m} − X_{Ī^b_m}β̂^b_m). This term is upper bounded by two terms. Since the irrepresentable condition holds on the whole data, the first term is bounded by λ_N. The second term is bounded from above by
(26) λ_N ‖((X^T_{Ī^b_m,Â}X_{Ī^b_m,Â})^{-1} − (X^T_ÂX_Â)^{-1}) X^T_{Ī^b_m,Â}X_{Ī^b_m,j}‖_2,
where ‖·‖ stands for the operator norm of a matrix. Note that, by the definition of the operator norm, for two semi-positive-definite matrices A, B we have ‖A^{-1} − B^{-1}‖ = max{|λ_min(A^{-1} − B^{-1})|, |λ_max(A^{-1} − B^{-1})|}. Using Weyl's inequalities [3], we have
1/λ_max(A) − 1/λ_max(B) ≤ λ_max(A^{-1} − B^{-1}) ≤ 1/λ_max(A) − 1/λ_min(B),
1/λ_min(A) − 1/λ_max(B) ≤ λ_min(A^{-1} − B^{-1}) ≤ 1/λ_min(A) − 1/λ_min(B),
where we used λ_max(−B^{-1}) = −λ_min(B^{-1}) = −λ^{-1}_min(B) and λ_max(A^{-1}) = λ^{-1}_max(A). Therefore, ‖A^{-1} − B^{-1}‖ ≤ 1/λ_min(A) − 1/λ_min(B) ≤ 1/λ_min(A). We apply this with B = X^T_ÂX_Â and A = X^T_{Ī^b_m,Â}X_{Ī^b_m,Â}. We are then left to bound the last term on the RHS of (26); as X^T_ÂX_j is the projection of X_j onto the space spanned by the columns of X_Â, its norm is bounded by ‖X_j‖_2 = 1.

E.1.2. Step II: Controlling X^T_{Ī^b_m,j}(I − P_Â)(Y_{Ī^b_m} − X_{Ī^b_m}β̂^b_m). This term reduces to X^T_{Ī^b_m,j}(I − P_Â)ε, which is bounded through a maximal inequality. Since the ε_i are i.i.d. normal N(0, σ²), there exists a constant c > 0 such that
P(‖P_Âε‖_2 ≥ 2σ√(4ŝ log p/(n − N))) ≤ exp{−c log p}
(exact steps omitted). Combining all the previous results of Step I and Step II, the event Ω_n holds with probability 1 − p^{1−c}, c ≥ 1.

E.2. Proof of Theorem 6. To wrap up the proof, we need to show that the condition of Lemma 2 holds with high probability for the Lasso estimator β̂. To that end, let us define an event Ω_n analogous to the one above. The result follows by repeating the decomposition analysis developed in the proof of Theorem 5 (direct steps omitted) and using the intermediary result of the following lemma.

LEMMA 4. If the RE(X̃, ζ_N) condition holds almost surely, and λ_N ≥ c_1σ√(log p/n), then |Ŝ^b_m(λ_N)| ≤ ŝ := Csλ_max(X̃^TX̃)/(nζ_N) with probability 1 − p^{1−c}, c > 1.

Proof of Lemma 4. The result of Lemma 4 follows by repeating the steps of Theorem 3 and similar steps as in [5]; hence it is omitted.

From the decomposition above, ‖X^T_{Ī^b_m,j}(Y_{Ī^b_m} − X_{Ī^b_m}β̂)‖ is bounded by λ_n times a factor involving λ^{-1}_min, with probability 1 − p^{1−c}, c > 1. Hence, P(Ω_n) ≥ 1 − p^{1−c} for λ_n large enough to absorb this factor, which completes the proof.

F. Proof of Theorem 7
Use the representation X̃^T_kX̃_k = Σ_{i=1}^n x̃_ix̃^T_i, where x̃^T_i is the i-th row of X̃_k. However, note that the first w_1 rows of X̃_k are identical, so we add the same term w_1 times; after this, we add a different row w_2 times, and so on. We can therefore rewrite X̃^T_kX̃_k = Σ_{j=1}^N w_j x_{i_j}x^T_{i_j}. Maximizing the minimal eigenvalue of this matrix subject to 1^Tw = n and w ≥ 0 is then exactly the SDP stated in Theorem 7.


Figure 2. Distribution of selection probabilities π*_j in weighted maximum-contrast subagging among the first 30 important variables, with n = 10000, p = 500.

Figure 4. Sample variability of selection probabilities π*_j among signal variables with simultaneous changes in problem and subsample sizes.





Table 1. Summary of UCI Million Songs Dataset results by decades. WMCSub denotes weighted maximum-contrast subagging; Sub denotes traditional subagging.