Feature-space selection with banded ridge regression

Encoding models provide a powerful framework to identify the information represented in brain recordings. In this framework, a stimulus representation is expressed within a feature space and is used in a regularized linear regression to predict brain activity. To account for a potential complementarity of different feature spaces, a joint model is fit on multiple feature spaces simultaneously. To adapt regularization strength to each feature space, ridge regression is extended to banded ridge regression, which optimizes a different regularization hyperparameter per feature space. The present paper proposes a method to decompose over feature spaces the variance explained by a banded ridge regression model. It also describes how banded ridge regression performs a feature-space selection, effectively ignoring non-predictive and redundant feature spaces. This feature-space selection leads to better prediction accuracy and to better interpretability. Banded ridge regression is then mathematically linked to a number of other regression methods with similar feature-space selection mechanisms. Finally, several methods are proposed to address the computational challenge of fitting banded ridge regressions on large numbers of voxels and feature spaces. All implementations are released in an open-source Python package called Himalaya.

To quantify the number of feature spaces effectively used in the model, one candidate metric is to count the number of feature spaces with a non-zero effect on the prediction. However, because the hyperparameters optimized by banded ridge regression never reach infinity, the coefficients b* are never exactly zero, even for feature spaces that are effectively ignored. For this reason, all feature spaces have non-zero values in the variance decomposition, and one cannot simply count the number of non-zero values. To fix this issue, it is possible to define a positive threshold, and to count the number of values above the threshold. Yet, this threshold would have an arbitrary value, and would artificially separate values that are slightly above from values that are slightly below, leading to a discontinuity in the metric. Instead, a better metric of the number of feature spaces effectively used in the model would vary continuously as some feature space contribution goes to zero.
To address this issue, we leverage a metric used to compute the rank of a linear system. The rank of a linear system is the number of independent equations in the system, and corresponds to the number of non-zero eigenvalues λ_i. The rank is thus a discontinuous function of the eigenvalues. To remove the discontinuity, the rank can be approximated by the effective rank. Several effective rank definitions exist, such as e [Roy and Vetterli, 2007], r_0 [Bartlett et al., 2020], R_0 [Langeberg et al., 2019, Bartlett et al., 2020], and h (proposed in this work). Assuming the eigenvalues are non-negative and sorted in decreasing order (λ_0 ≥ λ_1 ≥ … ≥ λ_{m−1} ≥ 0), all four metrics (e, r_0, R_0, and h) are defined as functions of these eigenvalues, and they share several useful properties. They are continuous functions of the m eigenvalues, with values in the interval [1, m]. Because they are continuous, they do not require defining any threshold. They are also not affected by additional unused dimensions (dimensions i with λ_i = 0). They are equal to k when k eigenvalues are equal (and non-zero) and all the other eigenvalues are equal to zero. The four metrics only differ in the way these integer values are interpolated. To describe how they interpolate these integer values, Figure 1 presents the metric values on different 2D and 3D spaces.
To use an effective rank metric to quantify the number of feature spaces effectively used in the model, the following procedure is used. First, the variance decomposition ρ ∈ R^m is computed, for example using the product measure as described in Section 2.6. Then, negative values are clipped to zero, and an effective rank metric is computed on the vector ρ. Although all four metrics give similar results, the definition e was used throughout this paper for its intuitive link to the entropy H = −∑_i λ_i log(λ_i) [Roy and Vetterli, 2007, Araki and Lieb, 2002, Bach, 2022]. Because the effective rank estimates the number of feature spaces effectively used over m feature spaces, it is noted m̂.
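As an illustration, the following Python sketch applies this procedure with the entropy-based metric e. It is not part of the Himalaya API; the function name and the example vectors are ours.

```python
import numpy as np

def effective_rank_e(rho):
    """Entropy-based effective rank e = exp(H) of a non-negative vector
    [Roy and Vetterli, 2007], applied here to a variance decomposition."""
    rho = np.clip(rho, 0, None)        # clip negative values to zero
    p = rho / rho.sum()                # normalize to a probability vector
    p = p[p > 0]                       # by convention, 0 * log(0) = 0
    return np.exp(-np.sum(p * np.log(p)))

# Example with m = 4 feature spaces.
print(effective_rank_e([0.5, 0.5, 0.0, 0.0]))    # 2.0: two feature spaces used
print(effective_rank_e([0.7, 0.2, 0.1, -0.01]))  # about 2.2: intermediate value
```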
Figure 1: (Top) The variance decompositions are defined on the simplex ∆^{m−1}, the space of vectors in dimension m with positive values that sum to one. With m = 3, the simplex is a 2-dimensional surface in the shape of a triangle. All four effective rank metrics are equal to 1 at the triangle corners, to 2 at the edge midpoints, and to 3 at the center. The four metrics differ in the way they interpolate these integer values. The color mapping is discrete to highlight the structures of the different metrics. (Bottom) Values of the four metrics on different 1-dimensional trajectories, linearly interpolating between two points in dimension 2 or 3. All four metrics are continuous, and equal to k when the values are equally split over k dimensions (depicted by the black dots). Moreover, adding unused dimensions (second versus first panel) does not alter the metric values. All four metrics are thus reasonable choices for quantifying the number of feature spaces used in banded ridge regression. The definition e was used throughout this paper for its intuitive link to entropy.
A.2 List of the 22 feature spaces from the short-films dataset
In Sections 3.6 and 4.8, banded ridge regression is fit on 22 feature spaces from the "short-films" dataset [Nunez-Elizalde et al., 2018]. Here, a short description of each feature space is given.
• Stimulus motion energy: spatiotemporal Gabor filters computed on the stimulus luminosity (11,845 features; [Nishimoto et al., 2011]).
• Retinotopic motion energy: spatiotemporal Gabor filters, computed on the stimulus luminosity, corrected for eye movements (11,845 features).
• Eye position: cubic polynomial expansion of the eye position during movie watching (36 features).
• Visual semantic: semantic labels of objects and actions present in the scene [Huth et al., 2012], projected in a semantic space (985 features; [Huth et al., 2016]).
• Visual thematics (5 feature spaces): categorization of verb classes into categories based on the type of events described. VerbNet (85 features; [Kipper et al., 2006]) was used to construct a thematic roles feature space (36 features), a selective restrictions feature space (38 features) and their interaction (71 features), and a verb categories feature space (101 features).
• Spectrogram: spectral power density of the sound wave (2,388 features).
• Wavenet: auditory features related to musical instruments, extracted using a pretrained convolutional neural network (512 features; [Oord et al., 2016]).
• Word rate: cubic polynomial expansion of word rate (13 features).
• Speech syntax position: part-of-speech tags of each word in the sentences, extracted using a pre-trained neural network (12 features; [Andor et al., 2016]).
• Speech syntax labels: word dependencies in the sentence, extracted using a pre-trained neural network (43 features; [Andor et al., 2016]).
• Speech semantic: spoken words projected in a semantic space (985 features; [Huth et al., 2016]).
• Speech thematics (7 feature spaces): categorization of verb classes into categories based on the type of events described. VerbNet [Kipper et al., 2006] labels were used to construct a thematic role feature space (39 features), a verb class feature space (274 features), and a verb category feature space (102 features). Four thematic roles (verb, actor, recipient, place) were further decomposed by projecting words in a semantic space (985 features each).

A.3 Banded ridge regression is equivalent to multiple-kernel ridge regression
As stated in Section 3.8, banded ridge regression is equivalent to multiple-kernel ridge regression, up to a subtle difference. This subsection first demonstrates the equivalence, then discusses the source and implications of the difference. Note that a deep understanding of the difference is not critical to the rest of the paper.
Equivalence. To reformulate banded ridge regression as a multiple-kernel ridge regression, one can note that banded ridge regression is equivalent to a ridge regression with regularization µ applied to weighted feature spaces X_i √(µ/λ_i) (see Section 4.3). This ridge regression is in turn equivalent to a kernel ridge regression [Saunders et al., 1998] (see Section 4.2) with a linear kernel K = ∑_i (µ/λ_i) X_i X_i^T = ∑_i γ_i K_i, where K_i = X_i X_i^T and γ_i = µ/λ_i. Multiple-kernel ridge regression corresponds to a kernel ridge regression with a weighted kernel K = ∑_i γ_i K_i and a regularization hyperparameter µ. Because µ is arbitrary in the change of variables, one can choose it such that ∑_i γ_i = ∑_i µ/λ_i = 1, which leads to µ = (∑_i 1/λ_i)^{−1}. In this way, γ is restricted to a subspace called the simplex ∆^{m−1} = {γ ∈ R^m_+ ; ∑_i γ_i = 1}, which corresponds to the standard formulation in the multiple-kernel learning literature [Gönen and Alpaydın, 2011]. Banded ridge regression is thus equivalent to multiple-kernel ridge regression. In Sections 4.5 and 4.6, this equivalence is used to solve banded ridge regression efficiently when the number of features is larger than the number of samples.
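As a numerical sanity check of this equivalence, the following sketch (illustrative only, not the Himalaya implementation; the variable names X_list, lambdas, and mu are ours) fits banded ridge regression in closed form and verifies that the corresponding multiple-kernel ridge regression gives the same training predictions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, dims = 50, [10, 20, 5]                  # n samples, m = 3 feature spaces
X_list = [rng.standard_normal((n, d)) for d in dims]
y = rng.standard_normal(n)
lambdas = np.array([0.1, 10.0, 1000.0])    # one hyperparameter per feature space

# Banded ridge regression, primal closed form.
X = np.hstack(X_list)
penalty = np.repeat(lambdas, dims)         # lambda_i repeated for each feature of space i
b = np.linalg.solve(X.T @ X + np.diag(penalty), X.T @ y)

# Equivalent multiple-kernel ridge regression with gamma_i = mu / lambda_i.
mu = 1.0 / np.sum(1.0 / lambdas)           # choice of mu that makes the gamma_i sum to one
gamma = mu / lambdas
K = sum(g * Xi @ Xi.T for g, Xi in zip(gamma, X_list))
w = np.linalg.solve(K + mu * np.eye(n), y)

print(np.allclose(X @ b, K @ w))           # same training predictions: True
```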
Key difference. The equivalence between banded ridge regression and multiple-kernel ridge regression hides a subtle difference in the way hyperparameters are optimized. In banded ridge regression, all hyperparameters λ ∈ R^m_+ are optimized through cross-validation. Thus, in the equivalent multiple-kernel ridge regression, both the kernel weights γ ∈ ∆^{m−1} and the regularization µ > 0 must be optimized through cross-validation. On the contrary, in the multiple-kernel learning literature, the kernel weights γ are usually learned within the training set [Gönen and Alpaydın, 2011], and only µ is learned through cross-validation. Some rare papers have proposed learning kernel weights through cross-validation [Tsuda et al., 2004, Keerthi et al., 2007, Klatzer and Pock, 2015], but this practice is not common in the multiple-kernel learning literature. This difference is important, and is further discussed in the rest of this subsection. To highlight the difference between the two strategies, the following notation is used:
• MKRR-cv is a multiple-kernel ridge regression with kernel weights learned over cross-validation.
• MKRR-in is a multiple-kernel ridge regression with kernel weights learned within the training set.
Using these notations, banded ridge regression is equivalent to MKRR-cv, whereas the standard formulation in multiple kernel learning is MKRR-in. Additionally, the squared group lasso is equivalent to MKRR-in [Bach, 2008, Rakotomamonjy et al., 2008].
Group scalings. To understand the difference between MKRR-cv and MKRR-in, it is useful to consider the concept of group scalings. When feature spaces are unbalanced (for example in terms of number of features), the regularization can have a different strength on each feature space. To fix this issue, multiple studies have proposed to even out the regularization strength on each feature space by adding group scalings to the regularization. In the group lasso literature, group scalings are positive weights d_i defined for each feature space i, which create a scaled L_{2,1} norm ∑_i d_i ||b_i||_2. Similarly, in the multiple-kernel learning literature, group scalings modify the optimization by constraining the kernel weights to a scaled simplex {γ ∈ R^m_+ ; ∑_i γ_i d_i^2 = 1} [Bach, 2008]. Note that group scalings are defined in addition to the kernel weights γ_i.
Because feature spaces can be unbalanced in different ways, many group scalings have been proposed in the group lasso and multiple-kernel learning literature: the square root of the number of features [Yuan and Lin, 2006], the kernel trace [Lanckriet et al., 2004], the square root of the kernel trace [Bach et al., 2004], the kernel empirical rank [Bach et al., 2005], or more sophisticated group scalings (e.g. [Bach, 2008, Obozinski et al., 2011, Pan and Zhao, 2016]). However, no group scaling emerged as being superior to the others. In MKRR-in, it is thus necessary to test different group scalings, and select the best one through cross-validation.
On the contrary, in MKRR-cv the kernel weights γ ∈ ∆^{m−1} are learned by cross-validation. Thus, group scalings would have no effect on MKRR-cv. In other words, MKRR-cv is equivalent to using MKRR-in with group scalings optimized via cross-validation.
Summary. To summarize this subsection, MKRR-cv and MKRR-in are two variants of multiple-kernel ridge regression. In MKRR-cv, the kernel weights are learned using cross-validation, whereas in MKRR-in, the kernel weights are learned within the training set. Banded ridge regression is equivalent to MKRR-cv, whereas the squared group lasso is equivalent to MKRR-in. In MKRR-in, one can add group scalings to help balance the feature spaces, but the choice of group scaling needs to be cross-validated. In MKRR-cv, group scalings are less important because optimal kernel weights are already learned through cross-validation.

A.4 Derivation of the hyperparameter gradient
In Section 4.6, hyperparameter gradient descent is used to optimize hyperparameters in multiple-kernel ridge regression. Hyperparameter gradient descent is a method that iteratively improves hyperparameters. At each iteration, the update is based on the gradient of the cross-validation loss L_val with respect to the hyperparameters δ. Here, the computation of this gradient is derived using the implicit differentiation framework.
As explained in Section 4.6, multiple-kernel ridge regression is reparameterized with hyperparameters δ ∈ R^m. The training and validation loss functions are defined as

    L_train(w, δ) = (1/2) ||y_train − ∑_i e^{δ_i} K_train,i w||²_2 + (1/2) ∑_i e^{δ_i} w^T K_train,i w,
    L_val(w, δ) = (1/2) ||y_val − ∑_i e^{δ_i} K_val,i w||²_2,

where w*(δ) = argmin_w L_train(w, δ), and for each feature space i, K_train,i = X_train,i X_train,i^T is the training kernel, and K_val,i = X_val,i X_train,i^T is the validation cross-kernel.
Implicit differentiation framework. At each iteration, gradient descent updates the hyperparameters δ in the direction of the gradient of the cross-validation loss. The cross-validation loss L_val(w*(δ), δ) has two dependencies in δ, one direct, and one indirect through the dual weights w*(δ). The gradient of L_val with respect to δ is thus the sum of the direct gradient ∂L_val/∂δ and the indirect gradient (∂L_val/∂w*)(∂w*/∂δ). In the indirect gradient, the Jacobian ∂w*/∂δ can be computed with implicit differentiation [Larsen et al., 1996, Chapelle et al., 2002, Foo et al., 2007], which leads to the full gradient expression

    dL_val/dδ = ∂L_val/∂δ − (∂L_val/∂w*) (∂²L_train/∂w∂w)^{−1} (∂²L_train/∂w∂δ).

In this expression, the most expensive term to compute is the inverse Hessian (∂²L_train/∂w∂w)^{−1}. To reduce this computational bottleneck, three separate approximations are considered. The first approximation, proposed in [Pedregosa, 2016], uses a conjugate gradient method to solve (∂L_val/∂w*)(∂²L_train/∂w∂w)^{−1} up to a limited precision. The second approximation, proposed in [Lorraine et al., 2019], uses a Neumann series to estimate the inverse Hessian with H^{−1} ≈ ∑_{j=0}^{k} (I_n − H)^j, typically with k = 5. The third approximation is more aggressive, and approximates the full gradient with the direct gradient ∂L_val/∂δ. This last approximation is equivalent to assuming the inverse Hessian to be 0. In many applications, this approximation is not available because the direct gradient is equal to zero [Lorraine et al., 2019]. However, the direct gradient is not equal to zero in multiple-kernel ridge regression, so it can be used to approximate the full gradient.
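For concreteness, here is a minimal sketch of the second approximation, applied to an inverse-Hessian-vector product. The function name and the optional scaling factor f (used in Appendix A.6 to make the series converge) are ours.

```python
import numpy as np

def neumann_inverse_hvp(H, v, k=5, f=1.0):
    """Approximate H^{-1} v with the truncated Neumann series
    f * sum_{j=0}^{k} (I - f H)^j v [Lorraine et al., 2019]."""
    result = np.zeros_like(v)
    term = v.copy()
    for _ in range(k + 1):
        result += term
        term = term - f * (H @ term)   # term <- (I - f H) @ term
    return f * result
```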
Application to multiple-kernel ridge regression. The implicit differentiation framework can be immediately applied to multiple-kernel ridge regression. Yet, some improvements can be made to maximize efficiency. For example, the gradient of the training loss at the optimum, ∂L_train/∂w |_{w*} = K_train((K_train + I_n) w* − y_train) = 0, can be simplified to (K_train + I_n) w* − y_train = 0 by using the gradient with respect to the scalar product ⟨x, y⟩_K = x^T K y. This change improves gradient conditioning, and leads to the same solutions when the kernel has full rank. In addition, this change largely simplifies the following derivative expressions and decreases their computational cost:

    ∂²L_train/∂w∂w = K_train + I_n,        ∂²L_train/∂δ_i∂w = e^{δ_i} K_train,i w.

The other derivative expressions are not affected by this change, and are equal to

    ∂L_val/∂w = K_val^T (K_val w − y_val),        ∂L_val/∂δ_i = (e^{δ_i} K_val,i w)^T (K_val w − y_val).

Once the gradient ∂L*_val/∂δ is computed, gradient descent improves the hyperparameters δ with the update

    δ ← δ − η ∂L*_val/∂δ,

where η is called the step size. To estimate the optimal step size, one could use the Lipschitz constant of ∂L*_val/∂δ, but it is hard to estimate [Pedregosa, 2016]. Fortunately, the Lipschitz constant of the direct gradient ∂L_val/∂δ can be estimated. Noting χ_i = e^{δ_i} K_val,i w, the validation loss reads L_val = (1/2) ||∑_{i=1}^m χ_i − y_val||²_2. Using ∂χ_i/∂δ_i = χ_i, the gradient and Hessian are derived as

    ∂L_val/∂δ_i = χ_i^T (∑_j χ_j − y_val),        ∂²L_val/∂δ_i∂δ_j = χ_i^T χ_j + [i = j] χ_i^T (∑_k χ_k − y_val),

where [i = j] is equal to 1 if i = j and 0 otherwise. The local Lipschitz constant L is given by the maximum eigenvalue of this Hessian, which can be estimated with power iteration. In our experiments, a step size η = 1/L was used, not only for the direct gradient, but also for the full gradient approximations. Several adaptive step-size strategies were also explored, but none were computationally as efficient as using the estimated Lipschitz constant of the direct gradient. The entire algorithm is listed in pseudo-code in Appendix A.7.
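The following sketch (ours, not the Himalaya implementation) illustrates one hyperparameter update using the direct gradient only, with the step size η = 1/L estimated by power iteration on the Hessian above. The argument names (the list of validation cross-kernels Ks_val and the current dual weights w) are illustrative assumptions.

```python
import numpy as np

def direct_gradient_step(delta, Ks_val, w, y_val, n_power_iter=20):
    # chi_i = exp(delta_i) K_val,i w, stacked as a (m, n_val) array.
    chi = np.stack([np.exp(d) * K @ w for d, K in zip(delta, Ks_val)])
    residual = chi.sum(axis=0) - y_val

    # Direct gradient and Hessian of L_val = 0.5 * ||sum_i chi_i - y_val||^2.
    gradient = chi @ residual
    hessian = chi @ chi.T + np.diag(chi @ residual)

    # Local Lipschitz constant: largest eigenvalue of the Hessian (power iteration).
    v = np.ones(len(delta))
    for _ in range(n_power_iter):
        v = hessian @ v
        v /= np.linalg.norm(v)
    lipschitz = v @ hessian @ v

    return delta - gradient / lipschitz    # update with step size eta = 1 / L
```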

A.5 Hyperparameter Bayesian search
In Sections 4.5 and 4.6, two methods are proposed to solve banded ridge regression. Another method proposed in the past to solve banded ridge regression is hyperparameter Bayesian search [Nunez-Elizalde et al., 2019]. Hyperparameter Bayesian search [Mockus et al., 1978] is a method similar to hyperparameter random search. Both search methods compare a number of hyperparameter candidates and select the candidates with the lowest cross-validation error. The specificity of Bayesian search is that each candidate is selected conditionally on the performance of previously tested candidates. The Tikreg Python package [Nunez-Elizalde et al., 2019], published previously by our group, implements Bayesian search to solve banded ridge regression. Specifically, Tikreg implements a Bayesian search using trees of Parzen estimators [Bergstra et al., 2011] through the Hyperopt Python package [Bergstra et al., 2013].
A major limitation of hyperparameter Bayesian search is that it cannot optimize multiple losses simultaneously. For this reason, Tikreg cannot optimize a separate loss per voxel. Instead, Tikreg optimizes the average validation loss over all voxels. This approximation introduces a bias towards the preferred hyperparameters of the voxel majority. To mitigate this bias, Tikreg later selects the best hyperparameters per voxel out of all candidates explored by Bayesian search. However, this mitigation will only be successful if Bayesian search explores the hyperparameter space broadly and does not find its optimum quickly.
Another limitation of Tikreg's implementation of Bayesian search is that it does not use all available computational tricks to solve the ridge regression sub-problems. As described in Section 4.1, ridge regression can be solved efficiently by reusing computations for multiple regularization hyperparameters. To leverage this computational trick in random search (Section 4.5), hyperparameters are separated into γ ∈ ∆^{m−1} and µ > 0. For each randomly sampled candidate γ, a ridge regression is solved efficiently for a large number of values of µ. This factorization is key to exploring the hyperparameter space efficiently, but it is not present in Tikreg's implementation.
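A minimal sketch of this factorization is given below (illustrative, not the Himalaya or Tikreg implementation; variable names are ours). For one candidate γ, a single eigendecomposition of the weighted kernel is reused to solve kernel ridge regression for every value of µ.

```python
import numpy as np

def kernel_ridge_many_mus(Ks, gamma, y, mus):
    """Return the dual weights (K + mu I)^{-1} y for each mu, with K = sum_i gamma_i K_i."""
    K = sum(g * Ki for g, Ki in zip(gamma, Ks))
    eigvals, eigvecs = np.linalg.eigh(K)      # computed once per candidate gamma
    projected = eigvecs.T @ y
    # Each additional mu only costs one rescaling and one matrix-vector product:
    # (K + mu I)^{-1} y = V diag(1 / (eigvals + mu)) V^T y.
    return [eigvecs @ (projected / (eigvals + mu)) for mu in mus]
```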

A.6 Algorithms for weighted-kernel ridge regression
To solve multiple-kernel ridge regression, the hyperparameter gradient descent solver updates the kernel weights δ ∈ R^m at each iteration. Then, at each iteration, the regression weights w must be optimized for the current (fixed) kernel weights δ ∈ R^m. We call this sub-problem weighted-kernel ridge regression. Specifically, for a fixed δ ∈ R^m, weighted-kernel ridge regression is defined as

    w*(δ) = argmin_w (1/2) ||y − ∑_{i=1}^m e^{δ_i} K_i w||²_2 + (1/2) ∑_{i=1}^m e^{δ_i} w^T K_i w.    (10)

This section describes how to solve this sub-problem.
This problem has a closed-form solution w* = (K + I_n)^{−1} y, where K is the weighted kernel sum K = ∑_{i=1}^m e^{δ_i} K_i. However, if different voxels have different kernel weights δ, this closed-form solution is not efficient, as it requires computing K and inverting (K + I_n) for each set of kernel weights δ. During hyperparameter gradient descent, a precise solution of the weighted-kernel ridge regression sub-problem is not necessary, and an approximation is sufficient [Pedregosa, 2016]. Thus, an approximate solution can be obtained with gradient descent (Algorithm 1), conjugate gradient descent (Algorithm 3), or even a finite Neumann series (Algorithm 2). (The Neumann series approximates (K + I_n)^{−1} ≈ f ∑_{j=0}^{k} (I_n − f(K + I_n))^j, where the factor 0 < f < 1 needs to be tuned to make the series converge.) All three algorithms are presented here for a single voxel, yet they can be straightforwardly extended to multiple voxels using matrices and tensors.
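As a rough illustration (ours, not the Himalaya implementation; the scipy dependency and the names are assumptions), the following sketch contrasts the exact closed-form solution with an early-stopped conjugate-gradient approximation in the spirit of Algorithm 3.

```python
import numpy as np
from scipy.sparse.linalg import cg

def weighted_kernel_ridge(Ks, delta, y, approximate=False, maxiter=10):
    """Solve w = (K + I_n)^{-1} y for the weighted kernel K = sum_i exp(delta_i) K_i."""
    K = sum(np.exp(d) * Ki for d, Ki in zip(delta, Ks))
    A = K + np.eye(len(y))
    if not approximate:
        return np.linalg.solve(A, y)          # exact closed-form solution
    # Early-stopped conjugate gradient: a coarse solution is enough during
    # hyperparameter gradient descent (and for the gradient term of Appendix A.4).
    w, _ = cg(A, y, maxiter=maxiter)
    return w
```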
Interestingly, another sub-problem of hyperparameter gradient descent can be solved as a weighted-kernel ridge regression. During hyperparameter gradient descent, the gradient computation requires computing the term (K_train + I_n)^{−1} K_val^T (K_val w − y_val). This expression can be considered as the solution of equation (10), with y = K_val^T (K_val w − y_val). Therefore, the three approximate solvers of weighted-kernel ridge regression presented in this section can also be used to compute this element of the hyperparameter gradient computation.
In the example presented in Section 4.8, the ridge dual weights were updated with Algorithm 1, because Algorithm 2 was too coarse and Algorithm 3 led to oscillations in the convergence curves. For the hyperparameter gradient term, because this term did not require a precise approximation, all three algorithms (Algorithms 1, 2, and 3) were tested and compared.

A.7 Algorithms for multiple-kernel ridge regression
This section describes algorithms to solve multiple-kernel ridge regression, optimizing δ ∈ R^m over the validation loss

    min_δ L_val(w*(δ), δ) = (1/2) ||y_val − ∑_{i=1}^m e^{δ_i} K_val,i w*(δ)||²_2,

where w*(δ) is the solution of the weighted-kernel ridge regression (10). This problem can be solved either with hyperparameter random search (Algorithm 4) or with hyperparameter gradient descent (Algorithm 5).
Algorithm 4 Multiple-kernel ridge regression solver – hyperparameter random search
for all cross-validation splits s do
    K_train, K_val, y_train, y_val = split_s(K, y)
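As a complement to the pseudo-code, the following Python sketch (ours, with illustrative names; Dirichlet sampling is one possible way to draw candidates γ on the simplex) outlines the overall random-search strategy: each candidate γ is evaluated on every cross-validation split, the ridge sub-problem is solved for many values of µ by reusing one eigendecomposition per split, and the best pair (γ, µ) is kept independently for each voxel.

```python
import numpy as np

def random_search(Ks, Y, cv_splits, mus, n_candidates=100, seed=0):
    rng = np.random.default_rng(seed)
    n_voxels = Y.shape[1]
    best_error = np.full(n_voxels, np.inf)
    best_params = [None] * n_voxels

    for _ in range(n_candidates):
        gamma = rng.dirichlet(np.ones(len(Ks)))          # random point on the simplex
        K = sum(g * Ki for g, Ki in zip(gamma, Ks))
        errors = np.zeros((len(mus), n_voxels))
        for train, val in cv_splits:
            K_train = K[np.ix_(train, train)]
            K_val = K[np.ix_(val, train)]
            eigvals, eigvecs = np.linalg.eigh(K_train)   # reused for all values of mu
            projected = eigvecs.T @ Y[train]
            for j, mu in enumerate(mus):
                W = eigvecs @ (projected / (eigvals + mu)[:, None])
                errors[j] += ((K_val @ W - Y[val]) ** 2).sum(axis=0)
        # Keep the best (gamma, mu) pair independently for each voxel.
        best_mu = errors.argmin(axis=0)
        candidate_error = errors[best_mu, np.arange(n_voxels)]
        better = candidate_error < best_error
        best_error[better] = candidate_error[better]
        for v in np.where(better)[0]:
            best_params[v] = (gamma, mus[best_mu[v]])
    return best_params
```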