Semi-analytic approximate stability selection for correlated data in generalized linear models

We consider the variable selection problem of generalized linear models (GLMs). Stability selection (SS) is a promising method proposed for solving this problem. Although SS provides practical variable selection criteria, it is computationally demanding because it needs to fit GLMs to many re-sampled datasets. We propose a novel approximate inference algorithm that can conduct SS without the repeated fitting. The algorithm is based on the replica method of statistical mechanics and vector approximate message passing of information theory. For datasets characterized by rotation-invariant matrix ensembles, we derive state evolution equations that macroscopically describe the dynamics of the proposed algorithm. We also show that their fixed points are consistent with the replica symmetric solution obtained by the replica method. Numerical experiments indicate that the algorithm exhibits fast convergence and high approximation accuracy for both synthetic and real-world data.

Modern statistics requires the handling of high-dimensional data. The term high-dimensional refers to the situation where the number of parameters to be estimated is one or several orders of magnitude larger than the number of available samples. Among the many tasks in high-dimensional statistics, variable selection of statistical models is a notoriously difficult problem. In high-dimensional settings, standard sparse regression methods, including the least absolute shrinkage and selection operator (LASSO) [1], suffer from the problem of choosing the regularization parameter. Although re-sampling methods, such as stability selection (SS) [2], can provide much more accurate variable selection criteria, they require substantial computational costs.
Unfortunately, this estimated support Ŝ(γ, D) depends strongly on the choice of the regularization parameter γ in real-world datasets. Hence, choosing the regularization parameter for variable selection can be more challenging than for prediction of the response variable, where cross-validation is guaranteed to offer the optimal choice on average if features are generated independently from an identical distribution [3]. SS was proposed for tackling this difficulty. We denote by D* = {(a*_1, y*_1), (a*_2, y*_2), ..., (a*_M, y*_M)} a resampled dataset of size M drawn with replacement from D. For this resampled dataset, the probability Π_i(γ) that the variable i is included in the estimated support is given by

Π_i(γ) = Prob_{D*}[ x̂_i(γ, D*) ≠ 0 ],  i = 1, 2, ..., N.    (4)

The probability in (4) is with respect to the random resampling; it equals the relative frequency of x̂_i(γ, D*) ≠ 0 over all M^M resampled datasets of size M. The probability in (4) can be approximated by B random samples D*_1, D*_2, ..., D*_B (B should be large):

Π_i(γ) ≈ (1/B) Σ_{b=1}^{B} 1l( x̂_i(γ, D*_b) ≠ 0 ),    (5)

where 1l(...) is the indicator function. This probability is termed the selection probability and measures the stability of each variable. SS chooses variables that have large selection probabilities. The original literature [2] combined the above resampling procedure with the randomization of the regularization parameter γ as follows:

Π_i(γ_0) = Prob_{D*,γ}[ x̂_i(γ, D*) ≠ 0 ],  i = 1, 2, ..., N,    (6)

x̂(γ, D*) = arg min_x { −Σ_{μ=1}^{M} log p_{y|z}(y*_μ | a*_μ · x) + γ ‖x‖_1 },    (7)

where the regularization parameter γ is randomized around the control parameter γ_0.

Figure 1 illustrates the comparison of the LASSO solution (2) and the selection probability (6). Here we used the colon cancer dataset [4]. The task is to distinguish cancer from normal tissue using micro-array data with N = 2000 features per example. The data were obtained from 22 normal (y_μ = −1) and 40 cancer (y_μ = 1) tissues; the total number of samples is M = 62. The left panel of figure 1 shows the LASSO solutions for various regularization parameters. The non-zero variables depend strongly on γ, so choosing the proper value of γ is difficult for the original LASSO. Although cross-validation can optimize the prediction of the response variable, this choice often includes false positive elements [5]. The right panel of figure 1 shows the selection probability for various γ_0 in (8). The figure suggests that the choice of the regularization parameter γ_0 is much less critical for the selection probability, and that the selection probability approach has a better chance of selecting truly relevant variables.
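As a reference point for what rVAMP later approximates, the naive Monte Carlo estimate (5) of the selection probability can be sketched as follows. This is a minimal illustration on synthetic data: the base solver (a hand-rolled ISTA for the linear-regression LASSO), the sizes, and the value of γ are illustrative choices, not the paper's setup.

```python
import numpy as np

def lasso_ista(A, y, gamma, n_iter=500):
    """Minimal ISTA solver for (1/2)*||y - A x||^2 + gamma*||x||_1."""
    L = np.linalg.norm(A, 2) ** 2              # Lipschitz constant of the smooth part
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        z = x - A.T @ (A @ x - y) / L          # gradient step
        x = np.sign(z) * np.maximum(np.abs(z) - gamma / L, 0.0)  # soft threshold
    return x

rng = np.random.default_rng(0)
M, N, B, gamma = 60, 200, 50, 10.0             # illustrative sizes and gamma
A = rng.standard_normal((M, N))
x_true = np.zeros(N)
x_true[:5] = 2.0                               # 5 truly relevant features
y = A @ x_true + 0.1 * rng.standard_normal(M)

hits = np.zeros(N)
for _ in range(B):
    idx = rng.integers(0, M, size=M)           # resample M points with replacement
    hits += lasso_ista(A[idx], y[idx], gamma) != 0

Pi = hits / B                                  # Monte Carlo selection probability
```

Note that this loop refits the model B times; the point of rVAMP is to avoid exactly this repeated fitting.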
A major drawback of SS is its computational cost. SS repeatedly solves the ℓ1-regularized logistic regression in (7) for multiple resampled datasets and regularization parameters. The number B of resampled datasets and regularization parameters needs to be large so that the selection probability is reliably estimated. In this study, we address this computational cost. We propose a novel approximate inference algorithm that can conduct SS without repeated fitting. The algorithm is based on the replica method [6] of statistical mechanics and vector approximate message passing (VAMP) [7,8] of information theory. We term our algorithm replicated VAMP (rVAMP).
The rest of the paper is organized as follows. In section 2, we describe stability selection in the generalized linear models (GLMs) that we will focus on, and in section 3 we derive the proposed algorithm using the replica method and VAMP. In section 4, we analyze the proposed algorithm in a large system limit under the assumption that the set of features is characterized by rotation-invariant matrix ensembles. There, we derive the state evolution that macroscopically describes the convergence dynamics of rVAMP, and show that its fixed point is consistent with the replica symmetric solution. In section 5, we apply the proposed algorithm to logistic regression. Through numerical experiments, we confirm the validity of our theoretical analysis and show that the proposed algorithm exhibits fast convergence and high approximation accuracy for both synthetic and real-world data. The final section is devoted to a summary and conclusion.

Related work
Malzahn and Opper first proposed a combination of the replica method and approximate inference to reduce the computational cost of resampling methods [9-11]. They demonstrated that employing the adaptive Thouless-Anderson-Palmer (TAP) method [12,13] as an approximate inference algorithm can accurately estimate the bootstrap generalization error for Gaussian process classification/regression. However, the poor convergence of this method is a major flaw of their approach. The adaptive TAP method is based on a naive iteration of the TAP equations, and the literature in information theory has revealed that such naive iteration schemes converge poorly [8,14,15]; one is therefore required to find a correct choice of initial conditions. As an algorithm, the adaptive TAP method is undesirable because the whole point of approximate inference is to save computation time.
The aforementioned algorithmic problem has been significantly improved by the discovery of approximate message passing (AMP) algorithms in information theory. This type of algorithm was first introduced as an efficient signal processing algorithm [16]. The work [16] analyzed its convergence dynamics in a large system limit and showed its fast convergence; it also revealed that the AMP algorithm shares its fixed points with the corresponding TAP equation, and thus AMP can be used as an efficient algorithm to solve the TAP equation. Subsequently, [14,17] developed mathematically rigorous analyses, which were further generalized in [18,19]. However, these analyses are based on the assumption that the elements of the feature vectors are independently and identically distributed (i.i.d.) zero-mean random variables, which is not realistic in the context of statistics. To go beyond such simple distributions, VAMP and similar generalizations [7,8,20] were developed based on the expectation propagation (EP) of machine learning [21,22]. Under the assumption that the feature matrices, whose rows are the feature vectors, are drawn from rotation-invariant random matrix ensembles, the VAMP algorithms were analyzed in a large system limit. These analyses derived the convergence dynamics of the VAMP algorithms and revealed that their fixed points are consistent with those of the corresponding adaptive TAP equations [7,8,23-26]. In this paper, we extend such VAMP algorithms to replicated systems in order to approximately perform SS in GLMs.
[27] proposed an AMP-based approximate resampling algorithm for SS. However, that algorithm assumes independence between the features and was developed for linear regression only. A preliminary application of VAMP to SS in linear regression was also demonstrated [28]. In the present study, we further generalize the use of VAMP to GLMs, and also carry out a theoretical analysis of this method.

Stability selection in generalized linear models
In the following, we consider SS in generalized linear regression/classification. We have a dataset D = {(a_μ, y_μ)}_{μ=1}^{M}, where each a_μ = (a_{μ1}, a_{μ2}, ..., a_{μN}) ∈ R^N is an N-dimensional vector of features or predictors, and each y_μ ∈ Y ⊂ R is the associated response variable. The domain of the response variables Y includes R for regression and {−1, 1} for classification. We also use the matrix notation A = [a_{μi}]_{1≤μ≤M, 1≤i≤N} ∈ R^{M×N}. Let D* = {(a*_1, y*_1), ..., (a*_M, y*_M)} be a resampled dataset composed of M data points drawn with replacement from D. Some data points (a_μ, y_μ) in D appear multiple times in D*, while others do not appear at all. SS in generalized linear regression/classification computes the selection probability Π ∈ [0, 1]^N by repeatedly refitting GLMs p_{y|z} for multiple resampled datasets and regularization parameters:

Π_i(γ_0) = Prob_{D*,γ}[ x̂_i(γ, D*) ≠ 0 ],  i = 1, 2, ..., N,    (9)

x̂(γ, D*) = arg min_x { −Σ_{μ=1}^{M} log p_{y|z}(y*_μ | a*_μ · x) + γ ‖x‖_1 },    (10)

where γ_0 > 0 is a control parameter that determines the amount of regularization. The goal of this paper is to develop a computationally efficient algorithm that returns Π(γ_0) for any positive γ_0.

Replicated vector approximate message passing
To approximate the computation of the selection probability Π, we will use the replica method and VAMP.This section provides a derivation of the proposed algorithm.

Occupation vector representation of sampling with replacement
For convenience, let us introduce the occupation vector representation of the resampled dataset D*. The resampled dataset D* is composed of M data points sampled from D with replacement. Hence, it can be represented by a vector of occupation numbers c = (c_1, c_2, ..., c_M) ∈ {0, 1, ..., M}^M with Σ_{μ=1}^{M} c_μ = M, where c_μ is the number of times that the data point (a_μ, y_μ) appears in D*. Although the exact distribution of c is multinomial, for large M the correlations among {c_μ}_{μ=1}^{M} are weak. By ignoring these correlations, we can approximate the distribution of c by a product of Poisson distributions with mean 1 [9]:

p(c) = Π_{μ=1}^{M} e^{−1}/c_μ!.    (12)

In this way, we can rewrite the average with respect to D* as the average over the random variable c ∈ {0, 1, ...}^M that follows the probability distribution (12), which is simple and easy to handle.
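The quality of this Poisson approximation is easy to check numerically. The following sketch draws one resampled dataset and compares the empirical frequencies of the occupation numbers with the Poisson(1) prediction P(c_μ = k) = e^{−1}/k! (the dataset size is illustrative):

```python
import math
import numpy as np

rng = np.random.default_rng(0)
M = 100_000   # size of the dataset (illustrative)

# One resampled dataset: M draws with replacement; c[mu] counts how many
# times data point mu appears (the c's are jointly multinomial).
c = np.bincount(rng.integers(0, M, size=M), minlength=M)

# The Poisson(1) approximation predicts P(c_mu = k) = e^{-1} / k!.
empirical = [np.mean(c == k) for k in range(3)]
poisson = [math.exp(-1) / math.factorial(k) for k in range(3)]
```

For M of this size, the empirical frequencies of k = 0, 1, 2 agree with e^{−1}, e^{−1}, e^{−1}/2 to within about a percent, which is the weak-correlation regime the approximation relies on.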

Statistical mechanical formulation of stability selection
The selection probability Π in (9) is defined through the optimization problem in (10). To use the techniques of statistical mechanics and approximate inference, we introduce the Boltzmann distribution

p^(β)(x, z | c, γ) = (1/Z) exp( β Σ_{μ=1}^{M} c_μ log p_{y|z}(y_μ | z_μ) − βγ ‖x‖_1 ) δ(z − Ax),    (14)

where x ∈ R^N, z ∈ R^M, β > 0 is the inverse temperature, and Z is the partition function. The random variables γ and c follow the distributions (11) and (12), respectively. The selection probability can then be written using the Boltzmann distribution in the zero-temperature limit as

x̂_i(c, γ) = lim_{β→∞} ∫ x_i p^(β)(x, z | c, γ) dx dz,  Π_i(γ_0) = E_{c,γ}[ 1l( x̂_i(c, γ) ≠ 0 ) ].

In the rest of the paper, we will omit the argument D when there is no risk of confusion, to avoid cumbersome notation. Still, note that we calculate the above quantities only for the fixed dataset D.

Replica method for semi-analytic approximate resampling method
Our purpose is to compute the selection probability Π(γ_0) for any γ_0 > 0. For this, we compute the distribution of x̂_i:

P(x̂_i) = E_{c,γ}[ δ( x̂_i − x̂_i(c, γ) ) ],    (16)

which reduces to computing the moments E_{c,γ}[ x̂_i^r(c, γ) ] for any r = 1, 2, .... We now describe how the replica method can be used for this purpose, following the approach of [9].
We use d^r x = dx_1 dx_2 ... dx_r to denote a measure over R^{N×r}, with x_1 = (x_{1,1}, ..., x_{1,N}), ..., x_r = (x_{r,1}, ..., x_{r,N}). Analogously, we denote by d^r z = dz_1 dz_2 ... dz_r a measure over R^{M×r}, with z_1 = (z_{1,1}, ..., z_{1,M}), ..., z_r = (z_{r,1}, ..., z_{r,M}). Using the definition (16), the moments E_{c,γ}[x̂_i^r(c, γ)] can be written as an average (18) that is difficult to evaluate analytically, due to the presence of the partition function, which depends on c and γ, in the denominator. The replica trick [6] bypasses this problem via the identity lim_{n→0} Z^{n−r} = Z^{−r}. The advantage of the resulting re-expression of (18) is that for integers n ≥ r, the negative power of the partition function (Z^(β)(c, γ))^{−r} is eliminated by an integral with respect to n replicated variables. More precisely, using the definition of the partition function (14), we obtain the expression (21), in which Ξ_n is the normalization constant of the probability density function (23), which we call the replicated system. The expression (21) is much easier to evaluate than the negative power of the partition function. Note that, by construction, lim_{n→0} Ξ_n = 1.
In this way, we have replaced the original problem with computing the first moments of the replicated system (23). Of course, we cannot expect to compute these moments exactly; otherwise, we could have obtained the exact solution without using replicas. The replica method evaluates a formal expression of lim_{β→∞} A^(β)_{i,n} for n = r+1, r+2, ... under appropriate approximations, and then extrapolates it as n → 0.
To obtain a formal expression of lim_{β→∞} A^(β)_{i,n}, the following observation is critical. Because the replicated system (23) is merely a product of n copied systems, it is intrinsically invariant under any permutation of {(x_1, z_1), (x_2, z_2), ..., (x_n, z_n)}. This property is termed replica symmetry. From this property, de Finetti's representation theorem [29] guarantees that the replicated system (23) can be expressed as a mixture of factorized distributions,

p^(β)(x, z) = ∫ p^(β)(η) Π_{s=1}^{n} p^(β)(x_s, z_s | η) dη,    (24)

where η is a vector of random variables that reflects the effects of c and γ. This expression indicates that A^(β)_{i,n} reduces to a considerably simpler form that can easily be extrapolated as n → 0. Thus, by obtaining tractable approximate densities for p^(β)(x, z|η) and p^(β)(η) in (24), we can obtain moments of arbitrary order without refitting ‡.

Replica symmetric Gaussian expectation propagation in the replicated system
To approximate the replicated system (23), we will use the Gaussian diagonal EP of machine learning [21,22] that is used to derive VAMP in [8].
The Gaussian diagonal EP recursively updates two approximate densities p^(β)_1 and p^(β)_2 of the replicated system, whose Gaussian factors carry natural parameters (such as Λ_{2z,μ} ∈ R^n). The first approximation is a factorized distribution that retains the original non-Gaussian factors. The second approximation is a multivariate Gaussian distribution that replaces the non-Gaussian factors by Gaussian ones.

‡ Of course, the replica symmetry may not hold for n ∉ N. In such cases, we would have to include the effect of replica symmetry breaking [6]. However, we restrict ourselves to the replica symmetric case for simplicity.

Algorithm 1 Expectation propagation
Require: approximate densities p^(β)_1 and p^(β)_2, and the number of iterations T_iter. Each iteration computes the moments (28)-(31) for the approximate density p^(β)_1, updates the parameters of p^(β)_2 by moment matching, then computes the corresponding moments for p^(β)_2 and updates the parameters of p^(β)_1 by moment matching.
The critical issue is to choose an appropriate form of the natural parameters in (26) and (27). Based on the observations in subsection 3.3, we impose the replica symmetry on these parameters, as in (32)-(39). With these parameterizations, we use Q̂_1x = (Q̂_{1x,1}, Q̂_{1x,2}, ..., Q̂_{1x,N}) for the vector notation; Q̂_2x, Q̂_1z, Q̂_2z, v_1x, v_2x, v_1z, v_2z, h_1x, h_2x, h_1z, and h_2z are defined similarly. These parameterizations allow the extrapolation n → 0. We also denote by Dx = e^{−x²/2}/√(2π) dx the standard Gaussian measure, and by Diagm(x) a diagonal matrix with [Diagm(x)]_{ii} = x_i. The replica symmetric parameterizations (32)-(39) yield expressions for the moments of the approximate densities p^(β)_1 and p^(β)_2 and for the moment-matching conditions used in lines 6-8 and 12-14 of algorithm 1. These expressions involve the quantities χ_x and χ_z, termed susceptibilities, and v_x and v_z, termed variances. In all of these expressions, the indices i and μ run as i = 1, 2, ..., N and μ = 1, 2, ..., M, respectively. Clearly, these equations can be easily extrapolated as n → 0.
Inserting the limiting forms of these quantities at n → 0, β → ∞ into algorithm 1, we obtain rVAMP in algorithm 2. There, g_1x and g_1z are denoising functions, and g'_1x and g'_1z are their derivatives. These are defined in (72) and (74).

Algorithm 2 rVAMP

Require: denoising functions g_1x, g_1z from (72) and (74), the features A ∈ R^{M×N}, the response variable y ∈ Y^M, the convergence criterion tol, and the maximum number of iterations T_iter.
The body of algorithm 2 iterates, for t = 1, 2, ..., T_iter, a factorized part (lines 4-9), a moment-matching step (1 → 2), a Gaussian part, and a moment-matching step (2 → 1), until the convergence criterion is met.

If the likelihood p_{y|z} is differentiable with respect to z, g_1z can be written in closed form in terms of the derivative of the log-likelihood, as in (80). Because the averages with respect to c and γ are incorporated in lines 4-9 of algorithm 2 as averages over one-dimensional random variables, rVAMP does not require refitting. Although the two approximate densities have the same first and second moments at a fixed point, they have different characteristics. For higher-order marginal moments, p^(β)_1 is expected to be more accurate, whereas p^(β)_2 is argued to have more accurate off-diagonal moments because it includes the interaction term correctly [22,30]. Thus, the two distributions should be used depending on the objective. Because we are interested in the distribution of the marginal moment (17), here we use p^(β)_1, which yields the selection probability Π_i(γ_0).
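As a concrete illustration of the kind of denoiser that appears in the factorized part, the scalar zero-temperature x-denoiser for the ℓ1 regularizer reduces to soft thresholding. This is a simplified sketch: the actual g_1x in (72) additionally incorporates the averages over c and γ.

```python
import numpy as np

def soft_threshold(h, v, gamma):
    """Scalar x-denoiser for the l1 regularizer at zero temperature:
    argmin_x [ gamma*|x| + (x - h)**2 / (2*v) ],  v > 0."""
    return np.sign(h) * np.maximum(np.abs(h) - gamma * v, 0.0)

def soft_threshold_deriv(h, v, gamma):
    """Derivative with respect to h; this is what enters the
    susceptibility (variance) updates."""
    return (np.abs(h) > gamma * v).astype(float)
```

The derivative is 0 inside the dead zone |h| ≤ γv and 1 outside, which is why the susceptibilities track the fraction of currently active variables.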

Implementation details
For practical implementation, we find it helpful to make several small modifications to rVAMP in algorithm 2. In this subsection, we discuss these minor modifications.
First, we address the computational complexity of the matrix inversion. Although rVAMP requires a matrix inversion in line 14, its computational cost can be reduced from O(N³) to O(M³) using the Woodbury identity [31]. Because in high-dimensional statistics the number of samples is often one or several orders of magnitude smaller than the number of parameters, the computational cost is drastically reduced by this identity. Second, for real-world datasets, introducing a damping factor η_d ∈ (0, 1] improves the convergence of the algorithm; we suggest replacing lines 20 and 21 with damped versions in which the new value is a convex combination of the update and the previous value, with weights η_d and 1 − η_d. Third, GLMs may require an intercept term z_0, so that y_μ ∼ p_{y|z}(y_μ | z_0 + a_μ · x). To incorporate the intercept term, we add an extra column to the feature matrix with A_{μ,0} = 1, μ = 1, 2, ..., M, and impose no regularization on this component.
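The Woodbury reduction can be sketched as follows; this is a minimal numerical check with illustrative diagonal matrices standing in for the algorithm's Q-parameters. The point is that the only explicit inverse is M × M.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N = 30, 400                       # M << N, as in high-dimensional statistics
A = rng.standard_normal((M, N))
q = rng.uniform(1.0, 2.0, size=N)    # diagonal of the N x N part (illustrative)
w = rng.uniform(1.0, 2.0, size=M)    # diagonal of the M x M part (illustrative)

# Direct O(N^3) inverse of (Diag(q) + A^T Diag(w) A) ...
direct = np.linalg.inv(np.diag(q) + A.T @ (w[:, None] * A))

# ... versus the Woodbury form, where the only inverse solved is M x M:
# (Q + A^T W A)^{-1} = Q^{-1} - Q^{-1} A^T (W^{-1} + A Q^{-1} A^T)^{-1} A Q^{-1}
Qi = 1.0 / q
inner = np.diag(1.0 / w) + A @ (Qi[:, None] * A.T)             # M x M
woodbury = np.diag(Qi) - (Qi[:, None] * A.T) @ np.linalg.solve(inner, A * Qi[None, :])
```

Since the diagonal parts are trivially invertible, the per-iteration cost is dominated by forming and solving the M × M system, which is the drastic saving when M ≪ N.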
The last point regards how to obtain the selection probability for various values of the regularization strength γ_0. In practice, we are often interested in finding the selection probability not only for a single fixed γ_0, but also for various regularization parameters γ_0 (as in figure 1). A reasonable approach is to begin with the largest γ_0 and run rVAMP until convergence. We then decrease γ_0 by a small amount and, using the previous parameters at the fixed point as the initial conditions (warm start), run rVAMP until convergence again. Repeating this, we can efficiently compute the selection probabilities over a grid of γ_0.
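The warm start procedure can be sketched with a generic ℓ1 solver. This is a minimal illustration using ISTA for the linear-regression case rather than the paper's GLM fit; the sizes and the γ grid are illustrative. The same pattern (carry the previous fixed point into the next, slightly smaller γ) is what is described above for rVAMP.

```python
import numpy as np

def lasso_ista(A, y, gamma, x0, tol=1e-7, max_iter=20_000):
    """ISTA for (1/2)*||y - A x||^2 + gamma*||x||_1, started from x0.
    Returns the minimizer and the number of iterations used."""
    L = np.linalg.norm(A, 2) ** 2
    x = x0.copy()
    for t in range(1, max_iter + 1):
        z = x - A.T @ (A @ x - y) / L
        x_new = np.sign(z) * np.maximum(np.abs(z) - gamma / L, 0.0)
        if np.max(np.abs(x_new - x)) < tol:
            return x_new, t
        x = x_new
    return x, max_iter

rng = np.random.default_rng(0)
M, N = 50, 100
A = rng.standard_normal((M, N))
x_true = np.zeros(N)
x_true[:5] = 1.0
y = A @ x_true + 0.1 * rng.standard_normal(M)

# Path over a decreasing grid of gamma, warm-starting each fit
# from the previous fixed point.
gammas = np.geomspace(10.0, 2.0, 8)
x, warm_iters = np.zeros(N), []
for g in gammas:
    x, t = lasso_ista(A, y, g, x0=x)
    warm_iters.append(t)

# Cold start at the smallest gamma, for comparison.
x_cold, cold_iters = lasso_ista(A, y, gammas[-1], x0=np.zeros(N))
```

Because consecutive fixed points along the γ path are close, each warm-started fit typically needs far fewer iterations than a cold start while reaching the same solution.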

Macroscopic analysis
The salient feature of VAMP algorithms is that their convergence dynamics can be macroscopically analyzed in a large system limit under specific assumptions on the distribution of the set of feature vectors. The derived dynamics are termed the state evolution (SE). In this section, we derive the SE for rVAMP and show that its fixed point is consistent with the replica symmetric solution obtained by the replica method, which is believed to be exact in the large system limit.

Setup for the macroscopic analysis
For the theoretical analysis, we assume the following data generation process. First, the true parameter vector x_0 is generated, and the response variables are generated as y_μ ∼ q_{y|z}(y_μ | a_μ · x_0). In general, the model used for the fitting and the actual generation model may be different: p_{y|z} ≠ q_{y|z} or e^{−γ|x|} ≠ q_{x0}. Additionally, we assume that the feature matrix A is drawn from a rotation-invariant random matrix ensemble; that is, for the singular value decomposition A = USVᵀ, we assume that U and V are drawn from the uniform (Haar) distributions over M × M and N × N orthogonal matrices. We are interested in the large system limit where both the number of data points and the number of parameters diverge as M, N → ∞, keeping the ratio α ≡ M/N ∈ (0, ∞) fixed. Because U and V are drawn independently from the uniform distributions over orthogonal matrices, for vectors ω ∈ R^N and φ ∈ R^M we expect that the empirical distributions of Vω and Uφ converge in this large system limit to Gaussians with mean zero and variances ‖ω‖²_2/N and ‖φ‖²_2/M, respectively.
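The Gaussianity of Vω under the Haar measure is easy to verify numerically. The following is a minimal sketch; the QR-based construction of a Haar orthogonal matrix is a standard recipe, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1000

# Haar-distributed orthogonal matrix via QR with sign correction.
G = rng.standard_normal((N, N))
Q, R = np.linalg.qr(G)
V = Q * np.sign(np.diag(R))          # fix column signs -> Haar measure

omega = rng.uniform(-1.0, 1.0, size=N)   # a fixed vector, independent of V
u = V @ omega

# The empirical distribution of V omega should approach N(0, ||omega||^2 / N).
predicted_var = omega @ omega / N
empirical_var = np.var(u)
# Fraction of entries within one predicted standard deviation (~0.683 if Gaussian).
frac_within_1sd = np.mean(np.abs(u) < np.sqrt(predicted_var))
```

Since Vω is uniform on the sphere of radius ‖ω‖_2, its squared entries sum to ‖ω‖²_2 exactly; the Gaussian shape of the empirical distribution is what emerges only as N → ∞.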

State evolution
To derive the SE of rVAMP, we make the following assumptions, following the literature [23].

Assumption 1: At each iteration t = 1, 2, ..., T_iter, the susceptibilities χ^(t)_{1x}, χ^(t)_{2x}, χ^(t)_{1z}, χ^(t)_{2z} and the variances v^(t)_{1x}, v^(t)_{2x}, v^(t)_{1z}, v^(t)_{2z} are self-averaging in the above-described large system limit; that is, at each iteration they concentrate on positive constants. Under these conditions, the natural parameters are also self-averaging when the initial choices are constants: at each iteration, the fields h^(t)_{kx} are, in empirical distribution, equal to a deterministic multiple of x_0 plus i.i.d. Gaussian noise ξ^(t)_{kx}, and the fields h^(t)_{kz} equal to a deterministic multiple of z_0 = Ax_0 plus i.i.d. Gaussian noise ξ^(t)_{kz}, where ξ^(t)_{kx}, ξ^(t)_{kz} (k = 1, 2; t = 1, 2, ..., T_iter) are mutually independent standard Gaussian variables and the equality is understood as equality of empirical distributions.
Under assumption 1, rVAMP is equivalent to the self-averaging (SA) rVAMP in algorithm 3, which we use in the following analysis.
To characterize the macroscopic behavior of rVAMP, we introduce macroscopic order parameters for t = 1, 2, ..., T_iter: the squared norms q^(t), the overlaps m^(t) with the true signal, the variances v^(t), and the susceptibilities χ^(t) of the two approximate densities. These order parameters and the susceptibilities have limiting expressions in the limit N → ∞. For example, q^(t)_{1x} is an empirical average over i = 1, ..., N; in the limit N → ∞, the summation is replaced with an average that can be evaluated using the independence between ξ^(t)_{2x}, x_0, and {λ_i}, where we denote by E_λ[...] the average with respect to the limiting eigenvalue spectrum ρ(λ) of AᵀA. The calculations for m^(t)_{2z} and v^(t)_{2z} are similar. Finally, using the singular value decomposition A = USVᵀ, the quantities T_x and T_z can be written as averages over ρ(λ), where the average over z_0 is taken with respect to a Gaussian measure based on the observation in [32]: for a vector ω ∈ R^N that is independent of A, the empirical distribution of Aω is a Gaussian with mean zero and variance E_λ[λ] ‖ω‖²_2/(αN) in the large system limit.

Algorithm 3 self-averaging rVAMP

Require: denoising functions g_1x, g_1z from (72) and (74), the features A ∈ R^{M×N}, the response variable y ∈ R^M, the convergence criterion tol, and the maximum number of iterations T_iter. The structure parallels algorithm 2: a factorized part, moment matching (1 → 2), a Gaussian part, and moment matching (2 → 1), with the microscopic quantities replaced by their self-averaged counterparts.
The moment-matching conditions also have limiting expressions. First, m̂^(t)_{2x} can be written in a closed form, where (a) follows from the definition of m̂^(t)_{2x}; (b) follows from the moment-matching condition of SA rVAMP; and (c) follows from the definitions of m^(t)_{1x} and m̂^(t)_{1x}. For χ̂^(t)_{2x}, the update rule is derived analogously, where (a) follows from the definition of χ̂^(t)_{2x}; (b) follows from the moment-matching condition of SA rVAMP and assumption 2; (c) uses the independence between x_0 and ξ^(t)_{1x}, together with the definition of m^(t)_{1x}; and (d) is obtained by integration by parts. The update rule for v̂^(t)_{2x} is derived in exactly the same way. Similar results can be obtained for m̂^(t+1)_{1z} and χ̂^(t+1)_{1z}. The above observations yield the SE of SA rVAMP: given initial conditions, for t = 1, 2, ..., T_iter the order parameters are updated through the factorized part, moment matching, the Gaussian part, and moment matching, as in equations (127)-(142), where E_c[...] is the average with respect to the probability function p(c) = e^{−1}/c!, c = 0, 1, .... At the fixed point, q_{2x}, χ_{2x}, v_{2x}, and m_{2x} are approximate values of the quantities (143)-(146); a similar interpretation is also possible for q_{2z}, χ_{2z}, v_{2z}, and m_{2z}.

Replica analysis
Generally, the typical values of the macroscopic order parameters introduced in the last section can be obtained by calculating the Helmholtz free energy f using the replica method [6]. Although the resulting formula contains nested replicas, its replica symmetric computation is formally analogous to the standard 1-step replica symmetry breaking (1-RSB) computation, by treating the inner replica number l as Parisi's breaking parameter. Because the 1-RSB computation was already described in appendix C of reference [23], we only show the final result. Rescaling the replica number as l̃ = l/β, we obtain an extremization over the variational parameters m_x, q_x, v_x, χ_x, m_z, q_z, v_z, χ_z. In the limit l̃ → 0, β → ∞, the extremum condition yields the same equations as appear in the fixed point condition of the SE equations (127)-(142). Additionally, at the extremum, the variational parameters q_x, χ_x, v_x, and m_x are in accordance with the right-hand sides of equations (143)-(146); a similar accordance also holds for q_z, χ_z, v_z, and m_z. Thus, the fixed point of the SE of SA rVAMP is consistent with the replica symmetric calculation.

Application to logistic regression
To check the validity of the results obtained so far, we applied rVAMP to logistic regression and conducted numerical experiments in order to (i) validate our SE, (ii) obtain insights about the convergence speed from the SE, and (iii) test the applicability of rVAMP to real-world problems.
In logistic regression, the domain of the response variables is Y = {−1, 1}, and the likelihood is given by

p_{y|z}(y | z) = 1 / (1 + e^{−yz}).

With this likelihood, g_1z in (80) can be written in closed form. All experiments were conducted on a single Intel(R) Core(TM) i7-8700B (3.20 GHz) CPU.
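At zero temperature, the scalar optimization underlying the z-denoiser for the logistic likelihood can be solved by Newton's method. The following sketch illustrates this one-dimensional problem with simplified notation: h and v stand in for a local Gaussian-field mean and variance, c for an occupation number; the actual g_1z in (80) additionally involves the averages over c and γ.

```python
import math

def logistic_prox(h, v, y, c, n_steps=50):
    """Newton solve of the scalar problem
       argmin_z [ c * log(1 + exp(-y*z)) + (z - h)**2 / (2*v) ],
    with y in {-1, +1}, occupation number c >= 0, and v > 0."""
    z = h
    for _ in range(n_steps):
        s = 1.0 / (1.0 + math.exp(y * z))      # sigmoid(-y*z)
        grad = -c * y * s + (z - h) / v
        hess = c * s * (1.0 - s) + 1.0 / v     # uses y**2 == 1
        z -= grad / hess
    return z
```

The problem is strictly convex (the Hessian is bounded below by 1/v), so the Newton iteration is well behaved; for c = 0 the data point drops out of the resample and the solution is simply z = h.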

Comparing with SE using synthetic data
Synthetic data were generated under the settings described in subsection 4.1. The true parameters were drawn from a sparse Bernoulli-Gaussian distribution, with each element zero with probability 1 − ρ and otherwise drawn from N(μ, σ²), where N(μ, σ²) is the Gaussian measure with mean μ and variance σ², and ρ ∈ [0, 1] is the sparsity; the responses were generated from the logistic likelihood. The system size N, the measurement ratio α = M/N, and the sparsity ρ were set to N = 10000, α = 0.2, and ρ = 0.01, respectively. Additionally, the feature matrix A was drawn from the row-orthogonal ensemble [33], for which the limiting eigenvalue distribution of AᵀA is ρ(λ) = αδ(λ − 1) + (1 − α)δ(λ).
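A finite-size instance of the row-orthogonal ensemble can be generated as follows (a minimal sketch with a smaller N than in the experiments):

```python
import numpy as np

rng = np.random.default_rng(0)
N, alpha = 400, 0.2
M = int(alpha * N)

# Haar-distributed orthogonal matrix via QR with sign correction.
G = rng.standard_normal((N, N))
Q, R = np.linalg.qr(G)
O = Q * np.sign(np.diag(R))

# Its first M rows form a row-orthogonal feature matrix: A A^T = I_M.
A = O[:M, :]

# The eigenvalues of A^T A are alpha*N ones and (1 - alpha)*N zeros,
# matching rho(lambda) = alpha*delta(lambda - 1) + (1 - alpha)*delta(lambda).
evals = np.linalg.eigvalsh(A.T @ A)
```

Taking M rows of a Haar orthogonal matrix gives a matrix uniform on the Stiefel manifold, which is invariant under rotations from both sides, as the rotation-invariant assumption of subsection 4.1 requires.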
To validate the SE, we compared the iteration dynamics of SA rVAMP to those of the SE. Figure 2 plots the order parameters and the parameters of p^(β)_1 versus the iteration index t. The data for SA rVAMP were obtained from 1000 random trials; the error bars are smaller than the size of the markers. Although some systematic disagreements are present in Q̂^(t)_{1x} and v^(t)_{1x}, possibly due to finite-size effects, most of the experimental values are in good agreement with the predictions of the SE. This confirms the validity of our SE.
The iteration dynamics of the SE suggest that rVAMP converges within a few dozen iterations, indicating the fast convergence of rVAMP for the synthetic data.

Applicability of rVAMP in real world data
We explored the performance of rVAMP on the colon cancer dataset [4], which was also used in the introduction. The data are publicly available at http://genomics-pubs.princeton.edu/oncology/. The task is to distinguish cancer from normal tissues using the micro-array data with N = 2000 features per example and M = 62 samples. We pre-processed the data by carrying out a base-10 logarithmic transformation and standardizing each feature to zero mean and unit variance. Because the class labels are imbalanced, we included the intercept term. To obtain the selection probabilities over a grid of γ_0, we used the warm start procedure. Finally, the damping factor η_d was set to 0.85.
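The pre-processing pipeline can be sketched as follows, with a random stand-in for the expression matrix since only the transformations are being illustrated:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(10.0, 1000.0, size=(62, 2000))   # stand-in for the expression data

X = np.log10(X)                                  # base-10 logarithmic transformation
X = (X - X.mean(axis=0)) / X.std(axis=0)         # zero mean, unit variance per feature

# Prepend the (unregularized) intercept column of ones.
A = np.hstack([np.ones((X.shape[0], 1)), X])
```

After this step, A is the feature matrix passed to rVAMP, with the first column carrying the intercept and excluded from the ℓ1 penalty.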
First, we examined the convergence speed of rVAMP. Figure 3 shows the time evolution of the convergence criterion ‖x^(t)_1 − x^(t)_2‖²_2/M by plotting its value versus the iteration step t. For various regularization strengths, regular exponential decay is observed, demonstrating the fast convergence of rVAMP on a real-world dataset.
Next, we examined the accuracy of rVAMP. To compare the estimate of rVAMP with that of the naive refitting procedure of SS, naive refitting on 1,000,000 resampled datasets was conducted using GLMNet [34]. Figure 4 shows the intercept term plotted versus the regularization strength; for a wide range of γ_0, rVAMP accurately estimated the intercept term. Figure 5 compares the selection probabilities estimated by rVAMP and by the naive refitting over the entire grid of γ_0. For ease of viewing, we only plot these values for the 10 features that had the largest selection probabilities at the smallest γ_0. Figure 6 plots the same comparison for all of the features at a selected set of γ_0 values. Although the accuracy decreases slightly as we weaken the regularization, rVAMP successfully approximates the selection probability. The upper panel of figure 7 plots the difference between the approximated and naively calculated selection probabilities as a function of the number of resampled datasets B; these results also provide evidence for the accuracy of rVAMP. The lower panel of figure 7 plots the elapsed time needed to obtain all of the selection probabilities for the various γ_0. Although the actual computation time depends on the implementation, this figure suggests that rVAMP can provide an accurate estimate of Π in a much shorter time than the naive SS.

Summary and conclusion
We developed an approximate SS algorithm that enables SS without repeated fitting. The key concept is to combine the replica method of statistical mechanics with the VAMP algorithm of information theory; the derivation of the algorithm was based on the expectation propagation of machine learning. We also derived the state evolution that macroscopically describes the dynamics of the proposed algorithm, and showed that its fixed point is consistent with the replica symmetric solution. Through numerical experiments, we confirmed that the state evolution equations are valid and that the proposed algorithm converges within a few dozen iterations. We applied the proposed algorithm to logistic regression and demonstrated its application to a real-world dataset. Although the real-world dataset has statistical correlations among the features, the proposed algorithm achieved fast convergence and high estimation accuracy, demonstrating its utility for real-world problems.
A possible drawback of our algorithm is its computational complexity, even though it was not significant for the experiments described in section 5. Because the algorithm requires the computation of a matrix inverse at each iteration, the computational burden may increase significantly with the number of samples in the dataset. This shortcoming may be addressed by the self-averaging version of the proposed algorithm, or by the dual-decomposition-like variable augmentation used in the alternating direction method of multipliers [35,36].
A promising future research direction is to analyze the variable selection performance of the SS algorithm using the SE. Generally, the theoretical analysis of resampling techniques is difficult because the analytical form of the estimators cannot be written down explicitly, which prevents us from obtaining useful insights from quantitative theoretical analysis. The replica theory [6] may thus provide a promising analytical tool in this area.

Figure 1 .
Figure 1. Left: the LASSO solutions x̂(γ, D) based on (2) for the colon cancer dataset with M = 62 and N = 2000. The vertical line corresponds to the cross-validation optimal regularization parameter, and the red dashed lines represent the variables chosen by the cross-validation procedure. The non-zero variables depend strongly on the choice of the regularization parameter γ. Right: the selection probability Π(γ_0) based on (6). The selection probability is less dependent on the choice of γ_0, indicating that choosing the regularization parameter is less critical than for the naive LASSO.

The approximate density p^(β)_1 is used to compute Π_i(γ_0); it yields the distribution of the marginal moment. In the corresponding derivation for χ̂^(t+1)_{1x}: (a) follows from its definition; (b) follows from the moment-matching condition and assumption 2; (c) uses the independence between Vᵀx_0 and ξ^(t)_{2x}, together with the definition of m^(t)_{2x}; and (d) is obtained from the independence between Vᵀx_0 and Sᵀξ^(t)_{2z}.

Figure 2 .
Figure 2. Comparison between the iteration dynamics of SA rVAMP in algorithm 3 and the SE equations (127)-(142). The solid lines show the SE trajectories. The symbols represent the median of the SA rVAMP trajectories obtained from 1000 experiments. Top left: macroscopic variables q^(t)_{1x}, χ^(t)_{1x}, v^(t)_{1x}, and m^(t)_{1x} versus algorithm iteration. Top right: macroscopic variables q^(t)_{1z}, χ^(t)_{1z}, v^(t)_{1z}, and m^(t)_{1z} versus algorithm iteration.

Figure 3 .
Figure 3. Time evolution of the convergence criterion ‖x^(t)_1 − x^(t)_2‖²_2/M plotted versus the iteration step t. The error bars represent the standard errors. The symbols represent the median of the rVAMP trajectories obtained from 1000 experiments.

Figure 4 .
Figure 4. The intercept term of the logistic regression model plotted versus γ_0. The red line is obtained by the naive refitting procedure, while the blue line is obtained using rVAMP.

Figure 5 .
Figure 5. Comparison of the selection probabilities plotted for various values of the regularization strength γ_0. For ease of viewing, the selection probabilities are shown only for the 10 features that had the largest selection probability at the smallest γ_0. Red lines are obtained using the naive refitting procedure, while blue lines are obtained using rVAMP.

Figure 7 .
Figure 7. Upper panel: the difference between the approximated and naively calculated selection probabilities plotted versus the number of resampled datasets B. We denote by Π_approximate the selection probability obtained by rVAMP, and by Π_naive that obtained by the naive resampling procedure using B resampled datasets. The difference is measured as a q-quantile of the differences over all of the selection probabilities in the grid of γ_0. Lower panel: the elapsed time plotted versus the number of resampled datasets B.