Computing Leapfrog Regularization Paths with Applications to Large-Scale K-mer Logistic Regression

High-dimensional statistics deals with statistical inference when the number of parameters or features p exceeds the number of observations n (i.e., p≫n). In this case, the parameter space must be constrained either by regularization or by selecting a small subset of m≤n features. Feature selection through l1-regularization combines the benefits of both approaches and has proven to yield good results in practice. However, the functional relation between the regularization strength λ and the number of selected features m is difficult to determine. Hence, parameters are typically estimated for all possible regularization strengths λ. These so-called regularization paths can be expensive to compute and most solutions may not even be of interest to the problem at hand. As an alternative, an algorithm is proposed that determines the l1-regularization strength λ iteratively for a fixed m. The algorithm can be used to compute leapfrog regularization paths by subsequently increasing m.


INTRODUCTION
In statistics and machine learning, a common problem is to learn a model of the form ℓ_θ(z) = σ(θ_1 f_1(z) + … + θ_m f_m(z)), with coefficients θ = (θ_1, …, θ_m) ∈ R^m and where f_j denotes the jth feature computed from the input data z.
In statistics, σ is called the mean function, whereas in machine learning it is commonly referred to as the activation function. The model ℓ_θ is typically estimated by minimizing a given loss function on a set of n observations. For instance, in linear regression, the mean function is σ(x) = x and the parameters θ are estimated by minimizing the sum of squared residuals, also called ordinary least squares (OLS). In the case of artificial neural networks, ℓ_θ represents a single output neuron and f_j denotes the jth output of the preceding layers of the network. The activation function σ is typically a sigmoid function or some variant. Neural networks are estimated by minimizing, for instance, the mean squared or cross-entropy error.
Two fundamentally different ways of estimating the features f_j can be found in the literature. Some machine learning methods, such as neural networks, treat each f_j as a parametric function that is optimized jointly with the model parameters θ. In contrast, in statistics a fixed set of p > m features {x_j = f_j(z)} is considered, from which a subset of size m is selected. Although the former is computationally more tractable, it often leads to nonconvex optimization problems. The focus of this study lies on the latter. In particular, it is assumed that the feature space is very large and that only a small subset of size m must be selected (p ≫ m). Ideally, the parameters of a statistical model would be estimated for a given loss function ω : R^p → R by minimizing ω(θ) subject to the constraint ‖θ‖₀ = m. However, this approach is generally not feasible, since it requires testing all subsets of size m.
Several strategies exist to compute approximate solutions. For instance, matching pursuit (MP) (Tropp et al., 2007) is a greedy strategy that selects one feature at a time until the desired number of m features is reached. This algorithm was found to be overly greedy in practice (Efron et al., 2004) and may provide poor subset selections. A more powerful strategy is to impose the ℓ1 constraint ‖θ‖₁ ≤ L (Tibshirani, 1996). In practice, this leads to the equivalent unconstrained optimization problem

    min_θ ω(θ) + λ‖θ‖₁,    (P0)

where the penalty strength λ ≥ 0 is a parameter that controls the sparseness of θ and is often chosen so that ‖θ‖₀ = m. Although this approach often selects nearly optimal subsets, it comes with the difficulty of determining the penalty or regularization strength λ.
Unfortunately, there exists no known functional relation between m and λ. However, grid search can be used to compute solutions of the optimization problem (P0) for a manually specified set of regularization strengths. More efficient algorithms, such as least angle regression (LARS) (Efron et al., 2004) and the homotopy algorithm (Osborne et al., 2000), start with λ = λ_max, for which all coefficients are zero. These methods decrease λ and determine breakpoints λ_1 > λ_2 > … > λ_q at which features either become active (nonzero coefficient) or inactive (zero coefficient). By interpolating between breakpoints, a regularization path {θ̂(λ) : 0 ≤ λ ≤ λ_max} is obtained, which contains solutions for all feasible regularization strengths.
Computing full regularization paths is costly for large feature spaces and complex models. In such cases, the regularization path is only computed up to a maximal m_q ≪ p. Furthermore, a solution of the optimization problem (P0) with m features will often behave very similarly to a solution with m + 1 features. Therefore, the focus of this study is to present and test an algorithm that computes leapfrog regularization paths containing solutions θ̂(λ_k) such that ‖θ̂(λ_k)‖₀ = m_k, where m_1 < m_2 < … < m_q is a given sequence of cardinalities. The main advantage of this algorithm is that m_q can be chosen much larger than with existing regularization path algorithms, because it does not compute all solutions of the regularization path. Furthermore, the sequence of cardinalities m_1 < m_2 < … < m_q can be selected so that a clear difference in performance between m_k and m_{k+1} is observed.

ALGORITHMIC BACKGROUND
Many feature selection methods were first developed for linear regression and later extended to more complex models, such as generalized linear models (GLMs). Hence, linear regression serves as a role model for introducing feature selection methods. Let X ∈ R^{n×p} denote the covariate or data matrix with n observations {x_i} (rows) and p features {f_j} (columns). For linear regression, σ is the identity function, so that ℓ_θ(x_i) = x_i θ, and the objective function ω = ω_L is the sum of squared residuals

    ω_L(θ) = ‖ε‖₂² = ‖y − Xθ‖₂²,

where y ∈ R^n is a vector of n dependent variables, also called the response, and ε = y − Xθ is referred to as the residuals. If p ≤ n and X has full rank p, then the coefficients θ can be estimated by OLS, which minimizes the sum of squared residuals ω_L. When p > n, parameters can instead be estimated by minimizing the ℓ1-penalized loss

    min_θ ‖y − Xθ‖₂² + λ‖θ‖₁,    (P1)

which is known as the Lasso (Tibshirani, 1996).
The following interpretation of OLS (Hastie et al., 2009) is essential for understanding many feature selection methods. In R^n, the product Xθ is a point in the hyperplane H spanned by the columns [f_1, …, f_p] of X, and y is a vector typically not contained in H. The point Xθ̂ in H that minimizes the length of the residual vector ε is located directly below y; that is, it is the projection of y onto H. It follows that Xθ̂ = X(XᵀX)⁻¹Xᵀy, where X(XᵀX)⁻¹Xᵀ is a projection matrix. Therefore, the OLS solution θ̂ = (XᵀX)⁻¹Xᵀy can be interpreted as the scalar projection of y onto H. If the covariates are appropriately normalized, that is, XᵀX = I, then θ̂ = Xᵀy. Most importantly, Xᵀy can be interpreted as the correlation of y with X. The OLS solution satisfies Xᵀ(y − Xθ̂) = 0, which shows that the residuals are uncorrelated with X. Many feature selection methods decrease the correlation c(θ) = Xᵀ(y − Xθ) iteratively, starting with c(0), until the residuals are orthogonal to X.
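This projection view can be checked numerically. The following is a minimal sketch (not from the original article) with orthonormal columns, so that XᵀX = I, θ̂ = Xᵀy, and the residuals are orthogonal to X; all names are illustrative:

```python
# Sketch: the OLS solution as a projection of y onto the column space of X.
# X has orthonormal columns, so X^T X = I and theta_hat = X^T y.

def transpose_times(X, v):
    """Compute X^T v for a column-major-free list-of-rows matrix X."""
    n, p = len(X), len(X[0])
    return [sum(X[i][j] * v[i] for i in range(n)) for j in range(p)]

# orthonormal columns
X = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
y = [2.0, -1.0, 5.0]

theta_hat = transpose_times(X, y)  # scalar projection of y onto H
fitted = [sum(X[i][j] * theta_hat[j] for j in range(2)) for i in range(3)]
residuals = [yi - fi for yi, fi in zip(y, fitted)]
# residuals are uncorrelated with X: X^T (y - X theta_hat) = 0
corr = transpose_times(X, residuals)
```

Here the third component of y lies outside H, so it survives entirely in the residuals, while the correlation of the residuals with X vanishes.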

Matching pursuit
MP is a greedy feature selection algorithm (Tropp et al., 2007) that selects one feature at a time. Let the columns of X have unit length, that is, ‖f_j‖₂ = 1. The first feature is then selected as j_1 = argmax_j |f_jᵀ y|. Hence, the algorithm selects the feature j with the largest scalar projection of y onto f_j, or equivalently the feature that is most correlated with y.
All remaining features are selected accordingly. Let O = {j_1, …, j_t} denote the set of active features and ε_t the current residuals. The next feature is selected as j_{t+1} = argmax_j |f_jᵀ ε_t|; that is, the algorithm selects the feature j for which the scalar projection of ε_t onto f_j is largest. Alternatively, f_jᵀ ε_t can be interpreted as the correlation of the jth feature with the residuals ε_t.
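The greedy selection rule can be sketched as follows. This is a minimal pure-Python illustration of basic MP with unit-length columns; the function name and toy data are illustrative, and the residual update subtracts the one-dimensional projection onto the selected column:

```python
# Sketch of matching pursuit: repeatedly pick the feature most correlated
# with the current residuals (columns of X are assumed to have unit length).

def matching_pursuit(X, y, m):
    n, p = len(X), len(X[0])
    residual = list(y)
    active, coeffs = [], {}
    for _ in range(m):
        # correlation of each feature with the residuals: f_j^T e
        corr = [sum(X[i][j] * residual[i] for i in range(n)) for j in range(p)]
        j = max(range(p), key=lambda jj: abs(corr[jj]))
        active.append(j)
        coeffs[j] = coeffs.get(j, 0.0) + corr[j]
        # update residuals: e <- e - (f_j^T e) f_j
        residual = [residual[i] - corr[j] * X[i][j] for i in range(n)]
    return active, coeffs
```

With orthonormal columns the procedure recovers the coefficients exactly; with correlated columns its greediness can lead to the poor subset selections noted above.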
The MP algorithm has been applied to several other models, including logistic regression (Lozano et al., 2011), although in these cases the correlation c(θ̂_O) might be less informative.

LARS and homotopy algorithm
LARS is an iterative feature selection method for linear regression (Efron et al., 2004). The algorithm is related to MP, but it is less greedy and gives only as much weight to each feature as it deserves (Hastie et al., 2009). Another method is the homotopy algorithm (Osborne et al., 2000), which is identical to LARS except that it also allows features to be removed within an iteration. It is referred to as LARS with Lasso modification in the literature (Efron et al., 2004; Donoho and Tsaig, 2008). Both methods are introduced below. X is typically assumed to be standardized so that the correlations c(θ) are not tainted by heterogeneous scaling.
The LARS and homotopy algorithms maintain a set of active features O ⊆ {1, …, p}, all equally correlated with the residuals y − Xθ̂ for the current estimate θ̂. In contrast, features not in O are less correlated with the residuals. Both algorithms initially set θ = 0. In each subsequent iteration, the coefficients θ are updated so that the correlations of all features in O uniformly shrink toward zero until some other feature j′ ∈ O^c is equally correlated with the residuals. For the next iteration, j′ is added to the active set O. The homotopy algorithm also removes a feature j from the active set O when the coefficient θ_j becomes zero. In this way, λ is incrementally reduced from its maximum value, where all features are inactive (O = ∅), to zero, where all coefficients are nonzero (O = {1, 2, …, p}). The values of λ at which a feature enters or leaves the active set O are referred to as breakpoints.
More specifically, the decrement in correlation at which a feature enters the active set is given by

    γ⁺ = min⁺_{j ∈ O^c} { (λ − c_j(θ)) / (1 − f_jᵀ X v), (λ + c_j(θ)) / (1 + f_jᵀ X v) },    (1)

where min⁺ is the minimum over positive elements, c_j(θ) denotes the jth element of c(θ), and v is the current update direction of the coefficients. For LARS without Lasso modification, the step size γ* equals γ⁺; however, the homotopy algorithm also removes a feature j from the active set when its coefficient reaches zero for some γ̃ < γ⁺. The subsequent breakpoint is given by γ* = min{γ⁺, γ̃}. LARS with Lasso modification is equivalent to solving the Lasso problem (P1) for all regularization strengths λ ≥ 0. A direct application of LARS or the homotopy algorithm to GLMs is not possible, because the path along which the correlations c_O(θ) = λ1 uniformly decline is nonlinear. As a consequence, there is no analytical solution for the position of the breakpoints. Nevertheless, a linear approximation for GLMs exists (Park and Hastie, 2007). The algorithm, however, requires several iterations until a breakpoint is reached.

COMPUTING LEAPFROG REGULARIZATION PATHS
LARS and the homotopy algorithm compute full regularization paths {θ̂(λ) : 0 ≤ λ ≤ λ_max} for all feasible regularization strengths λ. These algorithms start with λ = λ_max and decrease the regularization strength until the unconstrained model is obtained. The full and exact regularization path is computed by linear interpolation between breakpoints. Extensions of LARS to GLMs (LARS-GLM) instead compute approximate regularization paths, because θ̂(λ) is nonlinear between breakpoints and there exists no analytical expression for the values of λ at which breakpoints occur.
For certain applications, the full regularization path can be very costly to compute, especially when the number of features p is large. In such cases, the algorithms can be stopped when a maximum number of features m_q is reached. Nevertheless, because LARS-GLM successively decreases the regularization strength λ, it might take very long until the desired number of features m is reached. Furthermore, estimated models with m and m + 1 selected features might behave very similarly, for instance, in terms of classification performance. For such situations, an algorithm is introduced that computes, for an a priori defined sequence m_1 < m_2 < … < m_q, the estimates θ̂(λ_k) such that ‖θ̂(λ_k)‖₀ = m_k. The set of estimates {θ̂(λ_k) : ‖θ̂(λ_k)‖₀ = m_k}_{k=1}^q is called a leapfrog regularization path. The algorithm can be derived from proximal gradient descent (Boyd et al., 2004) or from a modification of LARS-GLM.
Let ω denote the objective function and ω′_j(θ̂) = ∂ω(θ)/∂θ_j |_{θ = θ̂} the jth partial derivative at the current estimate θ̂. Given a fixed m ∈ {m_1, m_2, …, m_q}, the following two update rules are iterated until convergence:

    O ← {j : θ̂_j ≠ 0} ∪ {j : |ω′_j(θ̂)| largest},  such that |O| = m,
    θ̂_O ← argmin_{θ : θ_{O^c} = 0} ω(θ).

The algorithm has converged when O is identical between two successive iterations. The first update step computes a new active set O: all features with nonzero coefficients remain in the active set, and additional features are added based on the absolute value of the partial derivatives. An estimate of the optimal λ parameter is obtained during this step. In the second step, the coefficients of the active set are re-estimated, while all other coefficients remain zero. The re-estimation is computationally cheap if m ≪ p, because the gradient descent method operates only on the m selected features. Once the algorithm has converged for a fixed m = m_k, the procedure is repeated for m = m_{k+1} (see Algorithm 1).
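The two update steps can be sketched as follows. This is a minimal illustrative implementation on a toy least-squares problem, not the article's implementation: the names `grad` and `leapfrog_step` and the toy data are assumptions, and plain gradient descent stands in for the stochastic method used later in the article:

```python
# Sketch of one leapfrog iteration: (1) refresh the active set O from the
# support of theta plus the largest absolute partial derivatives, then
# (2) re-estimate the coefficients restricted to O by gradient descent.

def grad(X, y, theta):
    """Gradient of the squared loss w(theta) = ||y - X theta||_2^2."""
    n, p = len(X), len(X[0])
    r = [sum(X[i][j] * theta[j] for j in range(p)) - y[i] for i in range(n)]
    return [2 * sum(X[i][j] * r[i] for i in range(n)) for j in range(p)]

def leapfrog_step(X, y, theta, m, gamma=0.1, iters=200):
    p = len(theta)
    # Step 1: active set = support of theta, filled up to size m with the
    # features that have the largest absolute partial derivatives.
    g = grad(X, y, theta)
    O = {j for j in range(p) if theta[j] != 0.0}
    for j in sorted(range(p), key=lambda jj: -abs(g[jj])):
        if len(O) >= m:
            break
        O.add(j)
    # Step 2: re-estimate coefficients on O; all others remain zero.
    theta = [theta[j] if j in O else 0.0 for j in range(p)]
    for _ in range(iters):
        g = grad(X, y, theta)
        theta = [theta[j] - gamma * g[j] if j in O else 0.0
                 for j in range(p)]
    return O, theta

# Toy data: y depends only on features 0 and 2 (y = 2*f0 + 3*f2).
X = [[1, 0, 1, 0], [1, 1, 0, 0], [0, 0, 1, 1], [1, 0, 0, 1]]
y = [5.0, 2.0, 3.0, 2.0]
O, theta = leapfrog_step(X, y, [0.0] * 4, m=2)
```

On this toy problem a single pass already selects O = {0, 2} and recovers the coefficients; in general the two steps are repeated until O no longer changes.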
Algorithm 1 Computation of a leapfrog regularization path
 1: procedure Leapfrog(m_1 < m_2 < … < m_q)
 2:     r ← ∅, θ̂ ← 0
 3:     for m ∈ {m_1, …, m_q} do
 4:         repeat
 5:             update the active set O and re-estimate θ̂_O
 6:         until O remains unchanged
 7:         r ← r ∪ {θ̂_O}
 8:     end for
 9:     return r
10: end procedure

Algorithm 1 relies on ℓ1-regularization for feature selection. Compared with MP it is much less greedy, because it does not simply select one feature per iteration (Hastie et al., 2009). Instead, the active set is refined until a solution to the optimization problem (P0) is found. The algorithm is more similar to LARS-GLM, but it uses a different method for determining the regularization strength λ. Whereas LARS estimates the position of subsequent breakpoints, Algorithm 1 iteratively updates λ until the given number of m features is selected.
Despite the extremely high-dimensional feature space, the estimation of parameters is relatively cheap when m ≪ p, because the gradient descent method operates only on a small subset of m features and not on the full feature space of size p. In other words, estimating the coefficients of the logistic regression with a gradient method takes O(nmT) steps, where T is the number of iterations of the gradient descent method. This is much faster than estimating the parameters on the full feature space, which would take O(npT) steps.

Derivation from proximal gradient descent
The optimization problem (P0) is commonly solved using proximal gradient descent (Boyd et al., 2004). In general, proximal gradient descent can be used to compute minimizers of functions f(x) = g(x) + h(x), where g is convex and differentiable and h is convex but not differentiable. The update equation is given by

    x ← prox_{γh}(x − γ∇g(x)),

where γ is the step size. The proximal operator prox_{γh} can be evaluated analytically for h(x) = λ‖x‖₁. Note that the expression inside the proximal operator is a simple gradient descent step applied to g.
For the optimization problem (P0), the following update equations are obtained:

    ϑ = θ̂ − γ∇ω(θ̂),    θ̂_j ← sign(ϑ_j) · max{|ϑ_j| − γλ, 0},

where ϑ is the result after the gradient descent step and before applying the proximal operator. Assume that for a feature j the coefficient θ_j is zero. Because of the application of the proximal operator, the coefficient remains zero during subsequent update steps unless the gradient update is large enough. More specifically, for any step size γ > 0, a coefficient θ_j remains zero unless |ω′_j(θ̂)| > λ.
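As a sketch, the soft-thresholding form of the proximal operator and one proximal gradient step might look as follows; the function names are illustrative, and h(x) = λ‖x‖₁ is assumed:

```python
# Sketch: proximal operator of t*|.| (soft-thresholding) and one proximal
# gradient step: gradient step on the smooth part, then coordinate-wise prox.

def soft_threshold(v, t):
    """prox of t*|.| at v: sign(v) * max(|v| - t, 0)."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0

def proximal_gradient_step(theta, gradient, gamma, lam):
    # theta - gamma * gradient is the plain gradient step on g; the
    # l1 prox with threshold gamma * lam is then applied coordinate-wise.
    return [soft_threshold(th - gamma * gj, gamma * lam)
            for th, gj in zip(theta, gradient)]
```

The second test below illustrates the statement above: a zero coefficient moves only if the absolute gradient exceeds λ, independently of the step size γ.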

Derivation from LARS
In each iteration, LARS either removes a feature from the active set or selects a new one. The focus here is on the selection of new features, that is, on Equation (1). By assuming that f_jᵀ X v = 0, a much simplified step size is obtained, which shows that λ − γ⁺ = max_{j ∈ O^c} |c_j(θ)|. Recall that for linear and logistic regression c_j(θ) = ω′_j(θ). Hence, m features can be selected by setting λ equal to the mth largest absolute correlation or partial derivative.
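The resulting rule for choosing λ can be sketched in one line; the helper name is illustrative:

```python
# Sketch: choose lambda as the m-th largest absolute partial derivative
# (or correlation), so that exactly m features can become active.

def lambda_for_m(partial_derivs, m):
    return sorted((abs(g) for g in partial_derivs), reverse=True)[m - 1]
```

For the gradient [-18, 4, 16, -10] and m = 2, this yields λ = 16: only the two features with absolute gradients of at least 16 pass the threshold |ω′_j| ≥ λ.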

APPLICATIONS TO LARGE-SCALE LOGISTIC REGRESSION
Algorithm 1 can easily be applied to GLMs, such as logistic regression, as demonstrated in the following. The closest method in the literature is LARS-GLM, and the GNU-R library glmpath is used to compare it with Algorithm 1. Decisive for the computational cost is the number of steps required to compute a regularization path. A step here refers to the estimation of parameters for a fixed regularization strength λ, that is, solving the optimization problem (P0). For Algorithm 1, this corresponds to one evaluation of the two update equations. In addition, the number of passes of the gradient descent method through the entire data set (epochs) is informative. However, because different gradient descent methods are used for Algorithm 1 and LARS-GLM, a direct comparison in terms of epochs is not possible.

BENNER
Enhancers are regulatory elements in the genome that control the cell type-specific expression of genes. The DNA sequence of an enhancer is predictive of the cell types in which it actively controls gene expression (Lee et al., 2011; Hashimoto et al., 2016; Kelley et al., 2016; Yang et al., 2017). Training and test data sets were compiled from the DNA sequences of 56,880 enhancers active in embryonic mice at day 15.5 in brain (positive set) and nonbrain tissues (negative set). Enhancers were detected using ModHMM (Benner and Vingron, 2020), and the data are available online (Benner, 2020). For each enhancer sequence, the K-mer content is determined. K-mers are short DNA subsequences of length K; in this example, K-mers of variable length are considered. The resulting data matrix X = (x_ij) has n rows (observations) and p columns (features), where x_ij is the number of occurrences of the jth K-mer in the ith enhancer sequence. A logistic regression model is used to discriminate between brain and nonbrain enhancers. The average negative log-likelihood (average loss) is given by

    ω(θ) = −(1/n) Σ_{i=1}^n [ y_i log σ(x_i θ) + (1 − y_i) log(1 − σ(x_i θ)) ],

where σ is the sigmoid function and y ∈ {0, 1}^n is a vector of n class labels. Analogously to linear regression, when p > n, parameters are estimated by minimizing the ℓ1-penalized average logistic loss

    G_λ(θ) = ω(θ) + λ‖θ‖₁.

For logistic regression, the objective function G_λ is convex and the optimum θ̂ must satisfy Xᵀ(y − σ(Xθ̂)) ∈ λ ∂‖θ̂‖₁. SAGA (Defazio et al., 2014) is used as the gradient descent method to compute solutions of this optimization problem. To guarantee stable convergence of the gradient descent method, the data are normalized to unit variance.
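The average logistic loss can be sketched in pure Python; this is an illustrative reference implementation only (the article's implementation uses SAGA on sparse K-mer counts), and the function names are assumptions:

```python
import math

def sigmoid(x):
    """Logistic activation / mean function."""
    return 1.0 / (1.0 + math.exp(-x))

def average_logistic_loss(X, y, theta):
    """Average negative log-likelihood of a logistic regression model."""
    n = len(X)
    total = 0.0
    for xi, yi in zip(X, y):
        p = sigmoid(sum(a * b for a, b in zip(xi, theta)))
        total += -(yi * math.log(p) + (1 - yi) * math.log(1 - p))
    return total / n
```

A useful sanity check: at θ = 0 every predicted probability is 0.5, so the average loss equals log 2 regardless of the data, which is the natural baseline for the loss curves discussed below.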
In a first example, K-mers of length 2 to 7 are used, that is, K ∈ {2, 3, …, 7}. This leads to a feature space of size p = 21,840, where only K-mers that actually occur in the data set are considered. Figure 1 shows the leapfrog regularization path computed with Algorithm 1 for m ∈ {1, 10, 500}. In addition, a regularization path computed with LARS-GLM is shown, where the algorithm was stopped when it reached 500 features. The example shows that the computation of the leapfrog path requires far fewer steps, at the expense of not evaluating the full path up to 500 features. For larger feature spaces and larger m, the LARS-GLM algorithm quickly becomes computationally too expensive.
In a second example, the set of K-mers is extended to lengths of 2 to 12, that is, K ∈ {2, 3, …, 12}, which results in a feature space of size p = 16,876,344. Figure 2 shows the estimation of parameters for a leapfrog regularization path with Algorithm 1 and targets m ∈ {10, 100, 1000, 5000}. The algorithm successively decreases the penalty strength λ until a target is reached. After only 1500 epochs, corresponding to roughly 50 steps of Algorithm 1, the leapfrog regularization path is computed. In contrast, extracting the 5000 most important K-mers with the LARS-GLM algorithm is computationally not feasible.

[FIG. 1. Comparison of regularization paths. Leapfrog regularization path (left) with m ∈ {1, 10, 500} and LARS-GLM regularization path (right). Vertical dashed gray lines show steps of the path algorithms, that is, λ values for which parameters θ̂ were estimated.]

Bias-variance trade-off
In statistics and machine learning, one is often confronted with the task of choosing a model family H that captures the underlying structure of the observed data but also generalizes well to new observations. The bias-variance trade-off suggests that both cannot be accomplished with arbitrary precision. When the capacity of H is too small, the estimated models perform poorly on the observed data as well as on new observations. This underfitting is caused by a strong (inductive) bias of H. As the capacity of H increases, the model is capable of extracting more complex patterns from the training data, but the variance of the parameter estimates across samples increases. If the capacity of H is too large, it causes overfitting; that is, the estimated models perform very well on the training data but generalize poorly. A model family that generalizes well must minimize both sources of error, the bias and the variance.
For instance, in the case of linear or logistic regression, the capacity of H can be increased by using more features (i.e., increasing p). Regularization, in contrast, constrains the parameter space and, therefore, reduces the capacity of H (Hastie et al., 2009). In practice, it is often observed that models generalize best if regularization is used in combination with very large p (Bühlmann and Van De Geer, 2011).
In this section, the effect of regularization and of the size of the feature space p on the capacity of H is studied. A balanced set of n = 10,000 brain and nonbrain enhancers is used. The size of the feature space p is controlled by the maximal length of the K-mers. Figure 3 shows the training and test error for a fixed penalty

λ and increasing p. Whereas the training error decreases monotonically, the test error starts to decrease after reaching a maximum. This double descent has been observed before (Vallet et al., 1989; Opper et al., 1990; Duin, 2000; Belkin et al., 2019) and is theoretically expected (Cheema and Sugiyama, 2020). It shows that the capacity of H results from a complex interplay between the size of the feature space p and the regularization strength λ.
The dependency between λ and p makes it difficult in practice to find optimal values for both parameters, especially when double descents are possible, because an extensive search is required to rule out the possibility of being stuck in a local optimum. Instead of adding more parameters by increasing p, Algorithm 1 is used to control the number of active features m for a fixed p. Figure 4 shows that in this case no double descent is observed, suggesting a monotonic relation between m and the capacity of H, as expected from the bias-variance trade-off. Furthermore, the test error increases rapidly as the interpolation threshold is crossed, that is, the point at which the model is able to perfectly discriminate between the positive and negative class. Figure 5 shows the converse case: the average loss for fixed m but variable p. In this case as well, no double descent of the test error is observed. However, the training error increases with p after reaching a minimum, because the regularization strength must increase to keep m fixed. The generalization gap, that is, the difference between test and training loss, steadily decreases, showing that the capacity of H declines.

CONCLUSION
Several iterative feature selection methods were discussed in this study. MP is among the simplest methods, but can lead to poor subset selection because of its overly greedy strategy (Hastie et al., 2009). More advanced methods, such as LARS and the homotopy algorithm, often yield better results in practice.

The estimates of LARS with Lasso modification are equivalent to solving the Lasso problem (P1) for all regularization strengths λ. Extensions to GLMs are more expensive to compute, because there exists no analytical solution for the λ values at which the active set of features changes. As a result, many more iterations are required to compute a full regularization path. In practice, one is often not interested in the full solution, because estimates with m and m + 1 features provide very similar predictions. Algorithm 1 instead computes, for an a priori defined sequence m_1 < m_2 < … < m_q, a leapfrog regularization path {θ̂(λ_k) : ‖θ̂(λ_k)‖₀ = m_k}_{k=1}^q. This algorithm can be derived from proximal gradient descent or from a simple modification of LARS for GLMs. It is very effective for high-dimensional problems where only a small subset of features is required for making accurate predictions. It is less greedy than MP, because it does not simply select one feature per iteration. As opposed to LARS, which computes subsequent breakpoints, Algorithm 1 iteratively refines the regularization strength λ until the given number of m features is selected.
On a data set of DNA sequences, the algorithm computes the leapfrog regularization path very efficiently despite the high-dimensional feature space, for which LARS-GLM is computationally too expensive. In addition, the number of selected features m allows good control of the model complexity. An implementation of the algorithm is available at: https://github.com/pbenner/kmerLr