Approximate multiple kernel learning with least-angle regression

Kernel methods provide a principled approach to general data representations. Multiple kernel learning and kernel approximation are often treated as separate tasks, although considerable savings in time and memory can be expected if the two are performed simultaneously. Our proposed Mklaren algorithm selectively approximates multiple kernel matrices in regression. It uses Incomplete Cholesky Decomposition and least-angle regression (LAR) to select basis functions, achieving linear complexity both in the number of data points and the number of kernels. Since it approximates kernel matrices rather than kernel functions, it can combine an arbitrary set of kernels. Compared to single-kernel approximations, it selectively approximates different kernels in different regions of the input space. The LAR criterion provides robust selection of inducing points in noisy settings and accurate modelling of regression functions in continuous and discrete input spaces. Among general kernel matrix decompositions, Mklaren achieves the minimal approximation rank required for performance comparable to using the exact kernel matrix, at a cost lower than 1% of the required operations. Finally, we demonstrate scalability and interpretability in settings with millions of data points and thousands of kernels.


Introduction
Kernel functions provide general data representations and are used to numerically represent many object types, such as vectors, strings or trees. Multiple kernel learning (MKL) determines the relevance of predefined kernels for a particular task. Kernel-based algorithms often involve evaluating the kernel for all pairs of data points (the kernel matrix), implying at least quadratic time and storage complexity [30]. In this work, we present a principled way to approximate a relevant subset of kernels in regression.
Kernel approximation methods can be split into two groups. First, approximations of the kernel function are data-independent, space- and time-efficient, but constrained to kernels of particular functional forms, characterized by smoothness, differentiability, stationarity or input domain-based properties. Functional decompositions rely on approximating the implicit feature mapping or work through spectral densities [25,26,34,36]. The second major group are the data-dependent kernel matrix approximations. A subset of the input data points, termed the active set or inducing points, is selected; the kernel matrix is then approximated by projections onto the basis functions centered on the active set. Our contributions are the following:
1. Existing kernel matrix approximation methods rarely address the multiple kernel scenario. Indeed, the linear span of a sum of kernel matrices is equivalent to the concatenation of the implicit feature spaces; however, we show that this property is lost when the kernel matrix sum is approximated (see A.5).
2. Standard feature selection methods in linear regression do not scale efficiently to kernel basis function selection. We provide robust approximation and scoring of the basis functions.
3. The most efficient kernel approximations pose significant limitations on the input data and the types of kernels. Approximating the kernel matrix suffers no loss of generality and is useful when different families of kernels need to be used. We show several such examples.
The Mklaren algorithm is derived in Section 2. Related methods are described in Section 3 and compared experimentally to evaluate the differences in output function spaces, inducing point selection, multiple kernel learning, compactness and time complexity (Section 4). Auxiliary experiments and derivations are presented in the Appendix. The implementations of all evaluated algorithms and experiments are provided as a Python-based MKL library, mstrazar/mklaren on GitHub.1

Notation. Lower-case bold letters denote vectors (e.g. g), upper-case bold letters denote matrices (e.g. G), and parentheses index vectors and matrices (e.g. G(i, j) is the scalar at row i and column j). Subscripts link vectors to data points or kernels, i.e. g_{q,i} is a vector related to kernel q and data point i. Kernel functions are denoted k. The notation k(x_i, ·) is a basis function of k centered at data point x_i. Kernel matrices K and kernel functions k(·, ·) are used interchangeably as appropriate, keeping in mind that the kernel matrix need never be computed entirely.

The Mklaren algorithm
We start by introducing the basic terminology of multiple kernel regression and present ideas underpinning Mklaren, Incomplete Cholesky Decomposition and Least-angle regression. Derivation of particular steps and complexity analysis conclude this section.

Multiple kernel regression
Let { x 1 , x 2 , . . . , x n } be a set of input data points associated with target output values y ∈ R n . Let X 1 , X 2 , . . . , X p be inner product spaces, each containing { x 1 , x 2 , . . . , x n } and endowed with respective kernel functions k 1 , k 2 , . . . k p . The X 1 , X 2 , . . . , X p can all be equal (e.g. real-valued vector spaces) or may contain different representations of the same objects (e.g. a text document can be seen both as a vector of word counts or a long unstructured string).
The kernels k_q are positive definite mappings from X_q × X_q to R. In practice, they enable learning non-linear output functions f that map from X_1, X_2, ..., X_p to R and thus approximate y. Each input point x_i generates a basis function k_q(x_i, ·) in terms of kernel k_q. Examples of kernels and their underlying prior assumptions are listed in Section 3.2.
The output functions f assume the following parametric form:

f(·) = Σ_{q=1}^{p} μ_q Σ_{i=1}^{n} α_i k_q(x_i, ·).   (1)

Here, the vector μ ∈ R^p represents the kernel weights, proportional to the relevance of each kernel, while α are standard regression parameters.

1 Available at https://github.com/mstrazar/mklaren.
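A minimal sketch of evaluating a function of the form in Eq. (1) may help fix the roles of μ and α. The kernels, weights and coefficients below are illustrative assumptions, not values produced by Mklaren:

```python
import numpy as np

def rbf(gamma):
    """A one-dimensional exponentiated-quadratic kernel."""
    return lambda a, b: np.exp(-gamma * (a - b) ** 2)

def predict(x, X, kernels, mu, alpha):
    """Evaluate f(x) = sum_q mu_q * sum_i alpha_i * k_q(X_i, x)."""
    return sum(
        mu[q] * sum(alpha[i] * kernels[q](X[i], x) for i in range(len(X)))
        for q in range(len(kernels))
    )

X = np.array([0.0, 1.0, 2.0])            # training inputs
kernels = [rbf(0.5), rbf(5.0)]           # two length scales
mu = np.array([0.7, 0.3])                # kernel weights
alpha = np.array([1.0, -0.5, 0.25])      # regression parameters

y_hat = predict(1.0, X, kernels, mu, alpha)
```

With all weight on one kernel and a single active coefficient, the prediction reduces to a single basis function evaluation, which makes the structure of Eq. (1) easy to check by hand.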
Common MKL algorithms solve for μ, α or both, and differ in computational complexity and in the constraints imposed on μ and α. For instance, μ can be learned independently, restricted to an unconstrained, convex or linear combination [9], or be subject to norm constraints [8,27].
The parameters μ, α can also be constrained jointly using matrix norms [15] . Alternatively, the kernel matrix can be optimized directly with convex optimization techniques [18] .

Approximate multiple kernel learning with Mklaren
Computational complexity can be reduced by selecting a subset of all basis functions in Eq. (1). Evaluating k_q(x_i, x_j) for each pair x_i, x_j in the training set determines the kernel matrices K_q ∈ R^{n×n}. These can be approximated with matrices G_1, G_2, ..., G_p, where G_q ∈ R^{n×r_q} and the total rank r = Σ_q r_q < n.
Mklaren solves the above problem in a regression setting. Briefly, Incomplete Cholesky decomposition (ICD) is used to construct each G_q. The selection of basis functions is guided by least-angle regression (LAR). A principal challenge in adapting LAR to kernel learning is avoiding the evaluation of all possible k_q(x_i, x_j). This adaptation is our principal contribution and hinges on the approximation of the correlations between k_q(x_i, ·) and y. An example scenario and the steps of Mklaren are illustrated in Fig. 1. The complete pseudocode is listed in Algorithm 1, below.

Algorithm 1: The Mklaren algorithm (excerpt).

Result: Cholesky factors G_1 ∈ R^{n×r_1}, G_2 ∈ R^{n×r_2}, ..., G_p ∈ R^{n×r_p}; combined feature space H ∈ R^{n×r}; active sets of pivot indices A_1, A_2, ..., A_p; regression estimate f ∈ R^n on the training set; regression coefficients w ∈ R^r determining the function estimator f̂.

3  Compute initial ICDs with δ columns for G_1, G_2, ..., G_p.
4  while Σ_q r_q < r do
5    Compute â_{q,i} and ĉ_{q,i} for each kernel q and pivot i ∉ A_q (Eq. 8).

Fig. 1.
A hypothetical model with three kernels, q ∈ {1, 2, 3}. The data is accessed only through kernels and the kernel matrices are never computed explicitly. The markers circle, rectangle and triangle represent the selected pivot columns for kernels 1, 2 and 3, respectively.

Incomplete Cholesky decomposition
Incomplete Cholesky decomposition (ICD, ref. [12]) is a numerical method to approximate symmetric positive definite matrices. Compared to simpler approximations, e.g. random sampling of basis functions, it implements a cheap heuristic to search unexplored input regions. For simplicity, we introduce the method assuming a single kernel matrix.
A positive semidefinite symmetric matrix K ∈ R^{n×n} is approximated by Cholesky factors G ∈ R^{n×r}, computed in time O(nr^2) < O(n^2), such that K ≈ G G^T. Candidate pivots are listed in the remaining set J = {1, 2, ..., n}; r of them will be added to the active set, initialized as A = ∅. Initially, G is set to 0. For each candidate pivot i, a lower bound on the reduction in absolute error is given by the residual diagonal, and the maximizing pivot is selected:

i* = argmax_{i ∈ J} d(i),   d(i) = K(i, i) − Σ_{m<j} G(i, m)^2.   (2)

Its pivot column g_{i*} = G(:, j) is computed as

G(i*, j) = √d(i*),   G(l, j) = (K(l, i*) − Σ_{m<j} G(l, m) G(i*, m)) / G(i*, j), for l ≠ i*.   (3)

The selected pivot is added to A and the auxiliary variables are updated. Since K is determined by a kernel k, the pivot columns g_i can also be seen through basis functions k(x_i, ·) with fixed inducing points x_i. The pivot selection in Eq. (2) can be tailored to a specific task, e.g. kernel regression.
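The greedy ICD procedure above can be sketched compactly. This is a generic textbook-style implementation under the greedy max-diagonal pivot rule, not the paper's exact code; the RBF test matrix is an illustrative assumption:

```python
import numpy as np

def icd(K, r):
    """Incomplete Cholesky decomposition: K ~= G @ G.T with rank-r factor G.
    Pivots are chosen greedily by the largest residual diagonal, a lower
    bound on the reduction in approximation error."""
    n = K.shape[0]
    G = np.zeros((n, r))
    d = np.diag(K).astype(float).copy()   # residual diagonal d(i)
    active = []
    for j in range(r):
        i = int(np.argmax(d))             # greedy pivot selection, Eq. (2)
        active.append(i)
        G[i, j] = np.sqrt(d[i])
        rest = [l for l in range(n) if l != i]
        # column values for all remaining rows, Eq. (3)
        G[rest, j] = (K[rest, i] - G[rest, :j] @ G[i, :j]) / G[i, j]
        d -= G[:, j] ** 2                 # update residual diagonal
        d[i] = 0.0
    return G, active

# a small PSD kernel matrix from an RBF kernel on 1-D points
x = np.linspace(0, 1, 20)
K = np.exp(-5.0 * (x[:, None] - x[None, :]) ** 2)
G, active = icd(K, 5)
err = np.linalg.norm(K - G @ G.T)
```

For a smooth kernel, a handful of pivots already captures most of the matrix, and the residual K − G G^T remains positive semidefinite on its diagonal.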

Basis function selection based on least-angle regression
Least-angle regression (LAR) is a feature selection method and an approximation to the Lasso solution path [11]. With multiple kernels, each K_q corresponds to its approximation G_q ∈ R^{n×r_q}. Concatenating the Cholesky factors G_q forms a combined kernel approximation H. Note that H is not equivalent to a Cholesky decomposition of the uniform kernel sum K_unif = Σ_{q=1}^{p} K_q. As the matrix H is constructed iteratively, it requires a nontrivial adaptation of the LAR-based basis function selection. Let h_{q,i} denote a scaled and centered pivot column g_{q,i} (Eq. 6), corresponding to kernel k_q and data point x_i, such that ‖h_{q,i}‖ = 1 and 1^T h_{q,i} = 0. We describe how the h_{q,i} are selected to form H. The exact computation of all h_{q,i} is infeasible, but for simplicity, suppose that the h_{q,i} are known. In Section 2.6, we then describe how to approximate them.
The auxiliary variables for computing the Cholesky factors of each K q are initialized as in Section 2.3 , above. The feature matrix H , regression estimate f , and the residual e are initialized as H = 0 , f = 0 , and e = y .
A unique pair (kernel, pivot) (q, i) is selected for each column j. At j = 1, the first vector is chosen to maximize the correlation

c_{q,i} = e^T h_{q,i} = ρ(e, h_{q,i}).   (4)

After iteration j, H contains j vectors, through which we can draw a bisector u with ‖u‖ = 1, making equal angles of less than 90 degrees with e (A.1). Updating the regression estimate f along the direction u and the residual e accordingly, f ← f + αu and e ← e − αu, causes the correlations c_j = ρ(e, H(:, j)) to change equally for all columns H(:, j), for an arbitrary step size α ∈ R. The value of α is set conservatively, such that some new vector h_{q,i} not in H attains the same correlation with e_new as all the vectors already in H: ρ(e_new, H(:, 1)) = ... = ρ(e_new, H(:, j)) = ρ(e_new, h_{q,i}).
The step size α and the entering vector h_{q,i} are selected as follows. Let C denote the common correlation of the active columns with e and A = ρ(u, H(:, j)) the correlation of the bisector with the active columns. Then

α = min+_{q,i} { (C − c_{q,i}) / (A − a_{q,i}), (C + c_{q,i}) / (A + a_{q,i}) },   (5)

where c_{q,i} = ρ(e, h_{q,i}), a_{q,i} = ρ(u, h_{q,i}), and min+ is the minimum over positive arguments for each q, i. The minimizer of Eq. (5) determines H(:, j+1) ← h_{q,i}. The variables j, c_j, f, e, A_q are updated as above, according to α. In the last iteration j = r, the step simplifies to α = C/A, yielding the full least-squares solution for H and y.
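The min+ rule of Eq. (5) can be sketched in a few lines, following the canonical least-angle regression step of Efron et al.; the scalar inputs below are illustrative assumptions rather than quantities taken from a real run:

```python
import numpy as np

def lar_step(C, A, c, a):
    """Conservative LAR step size, a sketch of the min+ rule.
    C: common correlation of active columns with the residual e.
    A: correlation of the bisector u with the active columns.
    c[k], a[k]: correlations of candidate k with e and u.
    Returns (alpha, k): smallest positive step and the entering candidate."""
    cands = []
    for k in range(len(c)):
        for val in ((C - c[k]) / (A - a[k]), (C + c[k]) / (A + a[k])):
            if val > 0:                     # min over positive arguments only
                cands.append((val, k))
    return min(cands)

alpha, k = lar_step(C=0.9, A=1.0, c=np.array([0.5, 0.8]), a=np.array([0.2, 0.1]))
```

Here the second candidate wins: it is already highly correlated with the residual, so only a small step is needed before it matches the active set's common correlation.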

Scaling and centering the basis functions
Since LAR assumes centered and scaled basis functions, for each pivot column g_{q,i} we define

h_{q,i} = s_{q,i} · P g_{q,i} / ‖P g_{q,i}‖,   (6)

with the centering operator P = I − (1/n) 1 1^T and the sign of the correlation with the residual, s_{q,i} = sign((P g_{q,i})^T e). Importantly, the transformed basis functions and their correlations with the residual e are directly comparable among kernels.
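The centering, scaling and sign orientation of Eq. (6) amount to three lines of code. A minimal sketch, with an arbitrary example column and residual:

```python
import numpy as np

def center_scale(g, e):
    """Transform a pivot column g into h = s * P g / ||P g||, where
    P = I - (1/n) 11^T removes the mean and s orients h so that its
    correlation with the residual e is nonnegative."""
    pg = g - g.mean()                 # P g: subtract the mean component
    s = np.sign(pg @ e)               # sign of correlation with the residual
    return s * pg / np.linalg.norm(pg)

g = np.array([3.0, 1.0, 2.0, 6.0])    # example pivot column
e = np.array([1.0, -1.0, 0.5, 0.5])   # example residual
h = center_scale(g, e)
```

The resulting h is zero-mean and unit-norm, so correlations e^T h are directly comparable across kernels, as required by LAR.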

Look-ahead approximation
Explicit calculation of all g_{q,i} and corresponding h_{q,i} would yield a complexity of order O(n^3). The issue is addressed by using approximations ĝ_{q,i} and ĥ_{q,i} that are less expensive to compute. The approach is based on selecting δ look-ahead columns, spanning a look-ahead approximation to the kernel matrix [4,7]. By the definition of ICD in Eq. (3), the values of a candidate pivot column g_{q,i} at step j = r_q and pivot i ∉ A_q depend on the entries of K_q. In place of the complete kernel matrix K_q, a look-ahead approximation with δ additional columns is used,

L_q = G_q(:, 1 : r_q + δ) G_q(:, 1 : r_q + δ)^T,   (7)

where the additional columns are computed as δ standard Cholesky steps (Section 2.3). This yields ĝ_{q,i} and consequently ĥ_{q,i}. Since the normalization of the pivot column cancels out in the correlation, the approximate ĉ_{q,i} is computed from

‖P ĝ_{q,i}‖^2 = G_q(i, r_q+1 : r_q+δ) A_1 G_q(i, r_q+1 : r_q+δ)^T,
where A_1 = G_q(:, r_q+1 : r_q+δ)^T G_q(:, r_q+1 : r_q+δ) − (1/n) G_q(:, r_q+1 : r_q+δ)^T 1 1^T G_q(:, r_q+1 : r_q+δ),   (8)

and

(P ĝ_{q,i})^T e = A_2 G_q(i, r_q+1 : r_q+δ)^T,
where A_2 = e^T G_q(:, r_q+1 : r_q+δ) − (1/n) e^T 1 1^T G_q(:, r_q+1 : r_q+δ).   (9)
Computation of ˆ a q,i is analogous. Correctly ordering the computations yields a computational complexity of O ( δ 2 ) per column. Note that matrices G T q G q , G T q 11 T G q , e T G q , e T 11 T G q are the same for all columns (independent of i ) and need to be computed only once per iteration.
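The saving behind the look-ahead trick can be checked numerically. In the sketch below, W plays the role of the δ look-ahead columns and a candidate column is approximated as ĝ_i = W W(i, :)^T; the shared matrices are computed once, after which each candidate costs only O(δ^2). The random data and variable names are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n, delta = 50, 4
W = rng.normal(size=(n, delta))           # look-ahead Cholesky columns
e = rng.normal(size=n)                    # current residual
P = np.eye(n) - np.ones((n, n)) / n       # centering operator

# shared across all candidates i: computed once per iteration
A1 = W.T @ P @ W                          # delta x delta
A2 = e @ P @ W                            # length delta

i = 7                                     # an arbitrary candidate pivot
g_hat = W @ W[i, :]                       # approximate pivot column

# O(n) direct computation vs. O(delta^2) precomputed form
norm_direct = np.linalg.norm(P @ g_hat) ** 2
norm_fast = W[i, :] @ A1 @ W[i, :]

corr_direct = (P @ g_hat) @ e
corr_fast = A2 @ W[i, :]
```

Both pairs agree exactly because P is symmetric and idempotent (P^T P = P), which is what lets the centered norms and correlations be folded into the precomputed A1 and A2.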
The pair q, i is selected according to Eq. (5) , but with ˆ a q,i and ˆ c q,i in place of exact values. A Cholesky step ( Eq. (3) ) is then performed to compute the exact g q,i and corresponding h q,i . The vector h q,i is then treated as the only remaining candidate for computing the correct step size α in Eq. (5) with exact values for a q,i and c q,i .
Crucially, as the newly discovered h q,i might have a higher correlation with the residual than the previously selected vectors -a situation that cannot happen in canonical LAR -the value of α is allowed to be negative for this update.
Finally, a new set of δ look-ahead columns is recomputed using standard Cholesky steps. Note that increasing δ monotonically increases the approximation accuracy and is limited only by the available computational resources. See Section 2.8 for further discussion.

The Mklaren algorithm
The steps described in the previous sections complete the Mklaren algorithm ( Algorithm 1 ). Given a sample of n data points with targets y and p kernel functions, the model hyperparameters are: the maximum rank r of the combined feature matrix, the number of look-ahead columns δ and L 2 norm regularization parameter λ (constraining f , discussed in A.4).
The variables related to the regression estimate (f, residual e and bisector u) and the individual decompositions G_q (active sets A_q, column counters r_q) are initialized in lines 1-2. Each G_q is initialized using standard ICD with δ look-ahead columns, as described in Section 2.6 (line 3).
The main loop is executed for r iterations, until the number of selected pivot columns reaches Σ_q r_q = r. At each iteration, a kernel k_q and a pivot column i ∉ A_q are selected and added to G_q and consequently to the combined feature matrix H. For each kernel k_q and each pivot i ∉ A_q, â_{q,i} and ĉ_{q,i} are computed; based on these approximate values, the kernel k_q and pivot i are selected. Given the optimal k_q and pivot i, the pivot column g_{q,i} and h_{q,i} are computed exactly. The new pivot column g_{q,i} is added to G_q(:, r_q), r_q is incremented, and the δ columns at G_q(:, r_q+1 : r_q+δ) are recomputed using standard ICD (lines 5-11). Having computed the exact g_{q,i}, the true values a_{q,i} and c_{q,i} can be computed, and the regression estimate f and the residual e are updated (lines 12-16).
The regression coefficients w solving for Hw = f and required for out-of-sample prediction are obtained by constructing H and solving the linear system discussed in A.2 (line 17).
Instead of stopping after consuming the predefined time/memory budget determined by r, another possibility is to estimate the expected generalization error. This type of approach is sketched in A.6. Other types of high-probability bounds related to kernel approximation can also be readily applied to Mklaren [1,3].

Computational complexity
The Mklaren algorithm scales linearly both in the number of data points n and the number of kernels p. The initial look-ahead Cholesky decompositions are standard Cholesky decompositions with δ pivots, with complexity O(nδ^2) per kernel. The main loop is executed r times. The selection of kernel and pivot pairs includes inverting H^T H of size r × r, nominally of complexity O(r^3); however, as each step is a rank-one modification of H^T H, the Sherman-Morrison-Woodbury lemma can be used to achieve complexity O(r^2) per update [21]. The computation of the correlations in Eq. (8) costs O(δ^2) per candidate column once the shared matrices are precomputed, i.e. O(nδ^2) per kernel per iteration. Assuming r < n, the total computational complexity is therefore linear in both n and p.
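The rank-one inverse update invoked above can be illustrated with the Sherman-Morrison identity, (A + uv^T)^{-1} = A^{-1} − A^{-1}u v^T A^{-1} / (1 + v^T A^{-1} u). This is a generic sketch of the O(r^2) update, not the paper's exact bookkeeping:

```python
import numpy as np

def sherman_morrison(A_inv, u, v):
    """Return (A + u v^T)^{-1} given A^{-1}, in O(r^2) operations
    instead of an O(r^3) re-inversion."""
    Au = A_inv @ u
    vA = v @ A_inv
    return A_inv - np.outer(Au, vA) / (1.0 + v @ Au)

rng = np.random.default_rng(1)
r = 6
B = rng.normal(size=(r, r))
A = B @ B.T + r * np.eye(r)       # symmetric positive definite, like H^T H
u = rng.normal(size=r)

fast = sherman_morrison(np.linalg.inv(A), u, u)    # update A + u u^T
direct = np.linalg.inv(A + np.outer(u, u))
```

For a symmetric rank-one update A + uu^T with A positive definite, the denominator 1 + u^T A^{-1} u exceeds 1, so the update is always well defined.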

Related work
The experiments elucidate the differences between methods implementing approximate multiple-kernel Ridge regression, which are reviewed here.

A note on compared methods
All the experiments assume a data set with points {x_1, x_2, ..., x_n} in p inner product spaces X_q with corresponding kernel functions k_q and real-valued targets y = (y_1, y_2, ..., y_n). The dimension of the reduced feature space equals r < n for all methods, listed here in order of the increasing number of assumptions they make about the data or kernels.
Cholesky with Side Information (CSI), Incomplete Cholesky Decomposition (ICD) and the Nyström method all approximate the kernel matrix using r inducing points defining the active set A ⊂ {1, 2, ..., n}, with |A| = r, and the basis functions k_q(x_i, ·) for i ∈ A, for a kernel k_q. Importantly, no further restrictions on the kernels or input spaces are assumed. The methods can be seen as discrete optimization and differ in the sampling of A: ICD approximates the reduction in error (Section 2.3); the Nyström method uses approximate leverage scores, which determine the sampling distribution; CSI uses a matching-pursuit-based heuristic.2 In contrast to Mklaren, only a single kernel is assumed; to implement MKL, we approximate an unweighted sum of all p kernels up to rank r (K_unif; see Section 2.5).
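As a concrete instance of this family, the Nyström construction can be sketched in a few lines: K ≈ K(:, A) K(A, A)^+ K(A, :). Uniform sampling of A is used here purely for illustration; the methods above differ precisely in how A is chosen:

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.sort(rng.uniform(0, 1, size=100))
K = np.exp(-5.0 * (x[:, None] - x[None, :]) ** 2)   # example RBF kernel matrix

A = rng.choice(100, size=10, replace=False)         # active set (uniform sample)
K_nA = K[:, A]                                      # n x r cross-kernel block
K_AA = K[np.ix_(A, A)]                              # r x r block on the active set
K_approx = K_nA @ np.linalg.pinv(K_AA) @ K_nA.T     # rank-r Nystrom reconstruction

rel_err = np.linalg.norm(K - K_approx) / np.linalg.norm(K)
```

Only the n × r and r × r blocks of K are ever needed, which is what makes this family of approximations attractive when evaluating the full matrix is too expensive.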
Random Fourier Features (RFF) approximate the frequency components of the basis functions [26] . The matching pursuit optimization is used; δ basis functions are generated for each k q , and selected greedily up to rank r [37] . Non-stationary RFF (RFF-NS) are obtained by sampling from the spectral densities proposed by Ton et al. [35] .
Sparse Pseudo-Input Gaussian Processes (SPGPs) explore a space of models with the largest capacity by continuously optimizing both the locations of the inducing points and the kernel hyperparameters [32]. This limits their application to real vector spaces X_q = R^d of finite (and not too large) dimension d with continuous and differentiable kernels. Although the locations of the pseudo inputs are optimized, the zero-mean prior over function values on the pseudo inputs tends to a uniform, well-spread placement.

2 With matching pursuit, we refer to the approach which chooses the basis functions in a greedy manner and solves for the full least-squares estimate at each step.
As an empirical lower bound on prediction error we used the L 2 -regularized Kernel learning (L2-KRR), optimizing both the kernel weights and the regression coefficients given the exact kernel matrix [8] . To evaluate multiple kernel learning regardless of approximation, we used the state-of-the-art methods Align (independently learned weights for each kernel), Linear (linear combination) and Convex (convex combination) from Cortes et al. [9] .

Kernels used in this study
We briefly review kernels used in this section, with emphasis on the interpretation of respective hyperparameters.
Linear kernel. The linear kernel corresponds to the canonical inner product,

k_LIN(x, x') = x^T x'.   (11)

This kernel is included in order to perform feature selection experiments.
Exponentiated-quadratic kernel. The infinite differentiability of the exponentiated-quadratic kernel generates smooth output functions:

k_EXP(x, x') = exp(−γ ‖x − x'‖^2),   (12)

with the frequency hyperparameter γ, or equivalently the length scale ℓ = 1/√(2γ), which determines the distance ‖x − x'‖ at which the output functions covary.
The Matérn kernel. Both the smoothness ν and the length scale ℓ of the output functions are controlled [28]. With ν of the form t + 1/2, for integer t, the kernel is a product of a t-degree polynomial and an exponential of the distance ‖x − x'‖. We use the Matérn 3/2 kernel as an example:

k_MAT(x, x') = (1 + √3 ‖x − x'‖ / ℓ) exp(−√3 ‖x − x'‖ / ℓ).   (13)

Periodic-exponential kernel. The periodic-exponential kernel is useful for modeling periodic signals, such as time series. It is defined as multiple exponentiated-quadratic kernels with length scale ℓ, repeated at frequency ω:

k_PER(x, x') = exp(−2 sin^2(π ω ‖x − x'‖) / ℓ^2).   (14)
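The continuous kernels above can be sketched directly from their textbook forms (Rasmussen & Williams); the exact parameterizations in the paper's implementation may differ slightly:

```python
import numpy as np

def expquad(x, y, gamma):
    """Exponentiated-quadratic kernel on scalars, exp(-gamma (x - y)^2)."""
    return np.exp(-gamma * (x - y) ** 2)

def matern32(x, y, ell):
    """Matern 3/2 kernel: (1 + sqrt(3) d / ell) exp(-sqrt(3) d / ell)."""
    d = np.abs(x - y)
    return (1.0 + np.sqrt(3) * d / ell) * np.exp(-np.sqrt(3) * d / ell)

def periodic_exp(x, y, ell, omega):
    """Periodic-exponential pattern repeated at frequency omega."""
    return np.exp(-2.0 * np.sin(np.pi * omega * np.abs(x - y)) ** 2 / ell ** 2)
```

All three evaluate to 1 at zero distance, and the periodic kernel returns to 1 whenever the distance is a multiple of the period 1/ω, which is the structure exploited when modeling time series.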
String kernels. The data points are strings of length up to L, sampled from a finite alphabet B. The spectrum kernel between strings s, s' ∈ B^L is an inner product of substring counts:

k_SPEC(s, s') = Σ_{u ∈ B^ℓ} count(u, s) count(u, s'),   (15)

with the substring length ℓ < L, where count(u, s) equals the number of occurrences of u in s. Explicitly constructing the feature space corresponding to k_SPEC is impractical, as its dimension grows as |B|^ℓ. The substring length ℓ is analogous to the bandwidth of the exponentiated-quadratic kernel, as increasing ℓ implies a more stringent similarity threshold.
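Eq. (15) can be computed directly from substring counts without ever materializing the |B|^ℓ-dimensional feature space. A minimal sketch:

```python
from collections import Counter

def spectrum_kernel(s, t, ell):
    """Spectrum kernel: inner product of length-ell substring counts of s and t."""
    cs = Counter(s[i:i + ell] for i in range(len(s) - ell + 1))
    ct = Counter(t[i:i + ell] for i in range(len(t) - ell + 1))
    # only substrings present in s can contribute to the inner product
    return sum(cs[u] * ct[u] for u in cs)

k = spectrum_kernel("GATTACA", "ATTAC", 2)   # shared 2-mers: AT, TT, TA, AC
```

Each string contributes O(L) substrings, so a single kernel evaluation costs O(L) expected time via hashing, independent of the alphabet size.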

Experimental results and discussion
This section presents an extensive empirical comparison of Mklaren against approximate kernel methods, multiple kernel learning methods and hybrid methods. The methods are first compared on a simple 1D real line; Section 4.1 discusses the differences in output function shapes and their effect on performance in interpolation and extrapolation. The applicability of kernel matrix approximations to non-numerical, string data is discussed in Section 4.2. A systematic comparison with related kernel approximations follows in Section 4.3, multiple kernel learning methods and interpretability in Section 4.4, and inducing point selection approaches in B.2. Finally, the empirical execution times are assessed in Section 4.5.

Modeling one-dimensional functions
One of the simplest input spaces is arguably a regularly sampled interval on the real line R. It helps to illustrate the differences in the output functions and the choice of inducing points. We perform two sets of experiments: interpolation (test points are intermixed with training points) and extrapolation (test points lie beyond the training points). Time series are natural scenarios where these tasks are relevant.
The Appliances energy prediction data set 3 contains time series measurements of temperature in nine building compartments, denoted T1-T9. The measurements are recorded at regular 10-minute intervals over 18 weeks. Each week thus yields a temperature time series of length 1008.

Interpolation
Signals were split into training, validation and test sets by assigning consecutive time points to the three sets in a round-robin fashion. The training set was used to fit the data on exponentiated-quadratic kernels (Eq. (12)) with the hyperparameter γ in {10^−4, 10^−2, 1, 10^2, 10^4} and the Matérn 3/2 kernel (Eq. (13)) with the length scale hyperparameter assuming equivalent values. The validation set was used to tune the regularization hyperparameter λ within {0, 10^−1, 10^−0.5, 1, 10^0.5, 10^1}. The look-ahead parameter for Mklaren, CSI, RFF and RFF-NS was set to δ = 10, while the rank was set to the number of days in a week (r = 7). The performance evaluation metric was the root mean-squared error (RMSE) on the test sets, averaged over all 18 weeks.
The supervised methods selecting components from different length scales (i.e. Mklaren, SPGP, RFF and RFF-NS) performed best (Table 1). The differences between Mklaren and the other kernel matrix approximations (ICD, CSI and Nyström) are statistically significant at P < 0.05 (Friedman rank test, Fig. 2). With the exponentiated-quadratic kernel, Mklaren performed best in 4 out of 9 data sets, and in 8 out of 9 cases with the Matérn kernel.
Strikingly, Mklaren slightly outperformed the SPGP model with continuous optimization of inducing points. Since Mklaren imposes no prior on inducing point locations, it can sometimes appear more flexible by placing the inducing points non-uniformly, while SPGP will favour uniform placement. This is confirmed in Fig. 3, highlighting the differences in method-specific output function spaces. Low-rank approximations of a single kernel matrix K_unif (the unweighted sum of kernel matrices; Section 2.5), i.e. CSI, ICD and Nyström, all suffer from the same effect: the basis functions are combinations of low- and high-frequency signals, caused by summing the kernels prior to kernel matrix approximation (Eq. (5), A.5). Constraining the output functions in this way leads to sub-optimal performance, regardless of the supervised selection of the inducing input points. Instead, Mklaren selects the basis functions and kernels from a larger function space (Eq. (4), A.5). Most similar to Mklaren, RFF samples functions with different frequencies, approximated with periodic components. However, input region-specific dependencies on different length scales are not entailed by the stationary RFF. The improved performance of Mklaren is likely attributable to direct exploitation of the target signal, rather than the two-step approach of initially generating random features and subsequently using matching pursuit. The non-stationary RFF-NS, somewhat unexpectedly, performed slightly worse than the stationary RFF. We speculate that this is due to the specific way the optimization is implemented; as both approaches sample periodic basis functions and the signal variance appears constant across the input region, stationary features are not at a notable disadvantage. Since both methods are allowed the same number of δ look-ahead draws, and as RFF samples from a smaller space, the probability of finding useful basis functions appears slightly larger.

Extrapolation
The true potential of time series prediction lies in extrapolation to future times. Here, we compared the approximate multiple kernel learning methods to an autoregressive integrated moving average (Arima) baseline [22]. We used the same data as above, but employed a different splitting strategy. The first half of the points (n_tr = 504) was used for training, the following quarter (n_val = 252) for selecting the regularization parameter λ, and the last quarter (n_te = 252) for measuring the final performance.
Together with k_lin, this results in 170 kernels. We used Mklaren, CSI, ICD and the Nyström method with rank r = 7 and look-ahead δ = 70. With Arima, we fit multiple linear models f_1, f_2, ..., f_504, one for each possible delay τ ∈ {1, ..., 504} in the training set. To predict the value y_{t+τ}, we used the feature representation of the last 7 time points, x_t = (y_{t−6}, y_{t−5}, ..., y_t), with model f_τ. This resulted in fitting a total of 504 × 7 = 3528 coefficients with L2-regularized (Ridge) regression parameterized by λ.
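The lagged feature representation x_t used by the delay-specific baseline above can be sketched as follows; the toy series and function name are illustrative assumptions:

```python
import numpy as np

def lagged_features(y, width=7):
    """Build rows x_t = (y[t-width+1], ..., y[t]) for t = width-1, ..., len(y)-1."""
    return np.stack([y[t - width + 1:t + 1] for t in range(width - 1, len(y))])

y = np.arange(20, dtype=float)     # toy time series
X = lagged_features(y)             # one row of 7 lagged values per time point
```

Each row of X would then serve as the input to the delay-τ Ridge model predicting y[t + τ], one model per delay.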
An example fit for each method is shown in Fig. 4. While the Arima model performs perfectly within the training set, its accuracy suffers a heavy drop at longer delays (test region in cyan). Similarly, the single-kernel approximations CSI, ICD and the Nyström method are unable to follow a clear future trend, due to a uniform mixture of all frequencies and length scales. Mklaren identified a small number of frequencies and a slight linear trend, an advantage that is also confirmed numerically. As the mean-squared error varies substantially, we select the method with the lowest RMSE and count the number of wins within all combinations of compartments and weeks (Table 2). Here, Mklaren is the best performing method in more than half of the cases, with the other methods sharing the wins nearly equally. The low performance of Arima can be explained by the test data being far in the future, with a low number of corresponding delays in the training set. In summary, Mklaren delivers competitive performance even when compared to flexible inducing point and hyperparameter optimization methods. The performance benefits from using different kernels in different input regions and from the adaptable placement of inducing points.

Non-numerical input data: regression with strings
String data are common in text mining and bioinformatics [33]. The data points s_1, s_2, ..., s_n are strings of length L, sampled from a finite alphabet, e.g. B = {A, C, G, T} representing DNA bases. Real-valued targets y ∈ R^n are associated with each s_i. We focus on the spectrum kernel (Eq. (15)), the inner product of common substring occurrences. The exponential complexity of constructing the explicit feature space renders SPGP and RFF infeasible, so general kernel matrix approximations need to be used. By means of synthetic data, we first examine whether correct substring lengths can be identified. Second, the utility of our approach is shown with a real biological data set of RNA sequences.
Experiments on synthetic data. Random strings of length L = 30 are generated by uniformly sampling from B^L. The sizes of the training, validation and test sets were set to n_tr = 500, n_val = n_te = 5000. Ten spectrum kernels with ℓ ∈ {1, 2, ..., 10} were used. A ground-truth active set A of r = 7 inducing points was sampled from the training set. The underlying true kernel matrix was the spectrum kernel matrix with a fixed ℓ = 4. Targets were sampled from a degenerate Gaussian process, y ∼ N(0, L). Out of 30 cross-validation comparisons, Mklaren achieved the lowest RMSE in 23 cases (76.6%), while CSI did in the remaining 7 cases (23.3%) (P < 0.0017, Wilcoxon paired signed-rank test). The critical distance plots showing the differences in ranks at P < 0.05 are given in Fig. 5a. The differences in the results can be explained by the effect of approximating the kernel sum, as discussed in Section 4.1, but this time with a discrete rather than continuous input space. Recall that different substring lengths are analogous to signal frequency components (Section 3.2); we investigate how different ℓ affect the shape of the output functions. Due to the d = Σ_{ℓ=1}^{10} 4^ℓ = 1,398,100-dimensional feature space, model visualization is challenging; we devise a visualization approach in B.1. Fig. 5b shows a typical example where Mklaren outperforms the kernel matrix sum (K_unif) approximations and most closely matches the profile of the true output. The methods approximating the kernel sum are prone to short substring lengths (i.e. ℓ = 1, 2, 3), which are analogous to the high-frequency continuous signals in Section 4.1. This confirms that Mklaren is suitable for modelling discrete string data and identifying suitable substring lengths.
Predicting RNA-binding protein binding affinities. A biological case study contains RNA sequences of length L = 30 over B = {A, C, G, U}, along with binding affinities of nine RNA-binding proteins [16]. There are 18 data sets, 2 replicates for each protein, with the number of sequences n in the tens of thousands. We used n_tr = n_val = 3000 randomly selected sequences for training and validation; all the remaining sequences were used for testing. The training-validation-testing process was repeated 30 times. For all methods, the rank was set to r = 10 and the look-ahead to δ = 10 (for Mklaren and CSI).

Table 3. Root-mean-squared error (RMSE) on the RBP binding affinity prediction data sets using low-rank matrix reconstruction on string kernels.
Mklaren and CSI achieve the lowest average RMSE on all data sets (Table 3). The differences are slight but statistically significant: Mklaren achieves the minimal average RMSE in 16 out of 18 data sets (P = 0.0028, Wilcoxon paired signed-rank test versus CSI). Low-rank approximations are suitable for data sets where the objects are accessed only through the kernel, and Mklaren achieves competitive performance in this setting.

Savings due to low-rank approximation
The principal difference between Mklaren and other approximations is the simultaneous approximation of multiple kernels and the guarantee that all basis functions are equally correlated with the residual at each step. We evaluate the minimal rank necessary to achieve error equivalent to the full-rank MKL method L2-KRR, considered an empirical lower bound. The experimental design is inspired by Bach and Jordan [4] and uses 28 regression data sets.⁴ We used p = 10 exponentiated-quadratic kernels ( Eq. (12) ) with varying bandwidth γ ∈ {10⁻³, 10⁻², ..., 10⁵, 10⁶}. The look-ahead δ was set to 10 for Mklaren, CSI, RFF and RFF-NS. All features were standardized and the targets y were centered. Up to 3000 data points were used per data set, ensuring a sensible time to compute the exact L2-KRR. The performance was assessed using 5-fold cross-validation; a training set (60% of the data) and a validation set (20%) were used to infer the approximation and to select the regularization parameter λ from {0, 10⁻⁵, 10⁻⁴, ..., 10¹}, respectively.
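Constructing such a bank of kernels is straightforward. The sketch below assumes the parameterization k(x, z) = exp(-γ‖x − z‖²) for Eq. (12), which is not reproduced in this section, and uses synthetic data in place of the benchmark sets:

```python
import numpy as np

def exp_quad_kernel(X, Z, gamma):
    """k(x, z) = exp(-gamma * ||x - z||^2), evaluated for all pairs."""
    d2 = (np.sum(X**2, axis=1)[:, None]
          + np.sum(Z**2, axis=1)[None, :]
          - 2.0 * X @ Z.T)
    return np.exp(-gamma * np.clip(d2, 0.0, None))  # clip guards round-off

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
X = (X - X.mean(axis=0)) / X.std(axis=0)        # standardize features

# p = 10 kernels with bandwidths gamma in {1e-3, 1e-2, ..., 1e6}
gammas = [10.0**e for e in range(-3, 7)]
kernels = [exp_quad_kernel(X, X, g) for g in gammas]
```

The geometric grid of bandwidths spans nine orders of magnitude, so the kernel bank ranges from nearly constant matrices (small γ) to nearly diagonal ones (large γ), leaving the relevance of each scale to be determined by the learning procedure.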
The performance was scored by RMSE on the remaining test data. Cases where no method achieved the target were removed from the comparison ( mv, pole, stock, concrete ). Table 4 shows the minimal setting of r at which the performance is at most one standard deviation away from the performance obtained by L2-KRR. The differences in ranks among all evaluated methods are statistically significant ( P < 2.2 × 10⁻¹⁶, Friedman rank-sum test) and are summarized in Fig. 6. The best two methods are SPGP and Mklaren, whose rankings are statistically indistinguishable at P < 0.05 (Friedman test), although SPGP consistently achieves the minimal rank. Expectedly, Mklaren, RFF and RFF-NS perform comparably, owing to the selection of basis functions from a given set of kernels centered on the training set. Only Mklaren and SPGP achieve the goal within the given rank constraints in all cases. There is a larger gap between Mklaren and the other kernel matrix approximation methods, attributable to the supervised selection of inducing points from multiple kernels and to the difference in function spaces, discussed in A.5 and Section 4.1. The difference between Mklaren and CSI is statistically significant when examined independently (Wilcoxon paired signed-rank test, P = 0.009). Expectedly, Mklaren outperforms the general kernel matrix approximations CSI, ICD and Nyström when they approximate the kernel matrix sum up to rank r. This holds even if the three methods approximate each of the p kernels individually, up to rank r/p each (data not shown).

Table 4
Minimal rank r at which the error of each method is at most one std. dev. away from L2-KRR. In bold: the lowest rank among kernel matrix approximations (Mklaren, CSI, ICD, Nyström). Ratio denotes the approximate savings in operations for solving linear systems, nr²/n³, i.e. the Mklaren cost relative to the exact solution.

Fig. 6. Differences in ranks among the methods in Table 4. The thick black lines connect statistically indistinguishable performance at significance level P < 0.05.
Note that with this scheme the minimal achievable rank is 10, which is greater than the minimum reached in half of the cases in Table 4.
In all cases, the goal can be achieved with r much lower than n , ensuring a low-complexity model in terms of degrees of freedom (Section A.6). The fraction of operations required for solving the corresponding linear systems with Mklaren is estimated as nr²/n³ (column ratio in Table 4 ). The highest ratio (worst case) was 6.4 × 10⁻³ for friedman , indicating that the solution can be found with a mere 0.64% of the operations, and orders of magnitude fewer for most other data sets.
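The ratio is a one-line computation. The values of n and r below are hypothetical, chosen only to reproduce a ratio of the same order as the reported worst case:

```python
def op_ratio(n, r):
    """Fraction of operations for the rank-r system (~ n r^2 flops)
    versus the exact n x n system (~ n^3 flops): n r^2 / n^3 = (r/n)^2."""
    return (n * r**2) / n**3

# Hypothetical sizes giving a ratio on the order of the worst case above:
ratio = op_ratio(1000, 80)   # (80/1000)^2 = 0.0064, i.e. 0.64% of the cost
```

Because the ratio simplifies to (r/n)², the savings grow quadratically as r is held small while n increases, which is why most data sets fall orders of magnitude below the worst case.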
Due to practical difficulties in ensuring consistent implementations (programming language, etc.), we opt for rank r as a measure of effective complexity. Absolute timings are nevertheless provided in Section 4.5 . In summary, Mklaren provides the most compact approximations compared to general kernel matrix approximations, solving the linear systems at a small fraction of the cost. Its performance is also competitive with highly optimized methods for exponentiated-quadratic kernels. Kernels that are not included in the model can be discarded, which is corroborated in the following section.

Multiple kernel learning and interpretability
The comparison of Mklaren to full-rank MKL methods is challenging, as it is unrealistic to expect improved predictive performance from approximations. We focus on the ability of Mklaren to select from a large set of kernels in linear time, without sacrificing the consideration of between-kernel correlations. Here, we evaluate both predictive performance and interpretability.
Inspired by Cortes et al. [9] , we use four sentiment analysis data sets compiled by Blitzer et al. [6] . The data are user reviews of books ( n = 5501 ), electronics ( n = 5901 ), kitchens ( n = 5149 ) and dvds ( n = 5118 ), with target ratings in the integer interval [1, 5]. The features are counts of the 4000 most frequent unigrams and bigrams, represented by rank-one linear kernels. The splits into training and test sets were provided by the authors of the data set.
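Representing each count feature as a rank-one linear kernel can be sketched as follows. The count matrix here is synthetic (the actual unigram/bigram extraction is not reproduced), but the construction is the same: one kernel per feature column.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic n x d matrix of n-gram counts (stand-in for unigram/bigram features).
counts = rng.poisson(1.0, size=(50, 8)).astype(float)

# One rank-one linear kernel per feature column j: K_j = x_j x_j^T.
kernels = [np.outer(counts[:, j], counts[:, j]) for j in range(counts.shape[1])]

# Their unweighted sum recovers the plain linear kernel on all features.
linear_kernel = counts @ counts.T
```

Since each kernel corresponds to exactly one n-gram, selecting a kernel is equivalent to selecting a feature, which is what makes the kernel ordering directly interpretable in this experiment.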
We compared Mklaren with MKL methods maximizing centered kernel alignment by Cortes et al. [9] : Align, Linear and Convex, described in Section 3.1 and adapted for kernel ridge regression. Prior to optimization, the features were filtered according to the descending centered alignment metric for Align, Linear and Convex, which ensured the same dimension of the feature space. When using Mklaren, the r pivot columns were selected from the complete set of p = 4000 kernels. The parameter δ was set to 1. Note that in this scenario, Mklaren is equivalent to the original LAR method, thus excluding the effect of low-rank approximation.
The performance was measured via 5-fold cross-validation. At each step, 80% of the training set was used for kernel matrix approximation (Mklaren) or for determining kernel weights (Align, Linear, Convex). The remaining 20% of the training set was used for selecting the regularization parameter λ from the set {0, 10⁻³, 10⁻², ..., 10³}, and the final performance was reported on the test set as RMSE. The results for different settings of rank r are shown in Table 5 , with the uniform kernel sum included as a baseline. Mklaren outperformed all other MKL methods that assume complete kernel matrices, and its RMSE is more than one standard deviation away from that of any other method. On all four data sets, the performance appears to saturate at r = 160.
The interpretability of kernel selection is assessed on the kitchen data set. Each of the methods infers an ordering of kernels; with Mklaren, it is defined by the pivot columns; with alignment-based methods, by the kernel weight vector. In Fig. 7 , we display the performance of up to 40 features. The full least-squares coefficient vector w_OLS,j is used for each feature subset up to step j. The arrows indicate the signs of the coefficients in w_OLS,j. The gradient in explained variance is higher for features corresponding to words associated with strong sentiments: great, good, love , etc. Not surprisingly, the order in which features are added to the model critically influences the explained variance. Due to the equivalence between LAR and Lasso, the strong sentiment features are identified early, irrespective of their magnitude. On the other hand, the regularized centered alignment methods appear to select words with a high prior frequency, e.g. prepositions. These results show that the Mklaren model can identify a sensible combination of kernels within a small number of iterations.
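The incremental explained-variance curve of Fig. 7 can be emulated by fitting ordinary least squares on growing feature prefixes. The data and ordering below are synthetic stand-ins for the sentiment features, with the informative features placed first, as LAR tends to do:

```python
import numpy as np

def incremental_r2(X, y, order):
    """Explained variance (R^2) of OLS fits on growing feature prefixes,
    following a given ordering of the features."""
    y = y - y.mean()
    tss = float(y @ y)                       # total sum of squares
    scores = []
    for j in range(1, len(order) + 1):
        Xj = X[:, order[:j]]                 # first j features in the ordering
        w, *_ = np.linalg.lstsq(Xj, y, rcond=None)
        resid = y - Xj @ w
        scores.append(1.0 - float(resid @ resid) / tss)
    return scores

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
y = 2.0 * X[:, 3] - 1.0 * X[:, 1] + 0.1 * rng.standard_normal(200)

# Ordering that places the informative features (3 and 1) first:
gains = incremental_r2(X, y, order=[3, 1, 0, 2, 4])
```

Because the fitted subspaces are nested, the curve is non-decreasing, and a good ordering shows most of its gain in the first few steps, exactly the behaviour used in Fig. 7 to compare feature orderings.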

Empirical execution times
Implementations of all discussed methods were compared in terms of training time. Data was generated as described in Section B.2, assuming exponentiated-quadratic kernels ( Eq. (12) ), rank r = 30, input dimension d = 100, regularization λ = 0.1 and look-ahead δ = 10 for Mklaren, CSI and RFF. All methods were presented with an input matrix X ∈ R^{n×d}, a target vector y ∈ R^n and kernel functions k_1, k_2, ..., k_p. The time to compute the complete kernel matrix was included where relevant. The processes were allowed 512 GB of RAM and a time limit of one hour.
Timings of different scenarios for varying n ∈ {10², 10^2.5, 10³, ..., 10^5.5, 10⁶} and p = 10 are shown in Table 6 . For up to 316 examples, the full-rank MKL methods are comparable to low-rank approximations. SPGP is the least efficient due to the optimization of inducing points. For large data sets, only low-rank matrix approximations met the time constraints, with RFF being the most efficient as it skips computation of the kernel entirely. The linear complexity of Mklaren is confirmed; it takes 3.9-5.1 times longer than Nyström due to the look-ahead steps for 10 kernels.
The performance with a varying number of kernels p is shown in Table 7 . Note that only MKL-based methods are compared. The methods with quadratic complexity in p (Linear, Convex, L2-KRR, SPGP) exceeded the resource constraints when p exceeded 100. Mklaren is the second most efficient up to p = 31, after which the overhead of selectively evaluating the kernel outweighs pre-computation of the complete kernel matrix.
With these results, we empirically demonstrate the scalability of Mklaren to large data sets in both n and p , achieving the best performance among supervised kernel matrix approximations.

Fig. 7. Increase in explained variance upon incrementally including features in an ordinary least-squares model (the kitchen data set). The order of features is determined by the magnitude of kernel weights for Align, Linear and Convex, or by the order of selection for Mklaren. Arrows indicate the positive (black) or negative (gray) sign of the feature in the model weight vector upon inclusion. Underscored are the words "great" and "not", which significantly alter the explained variance when discovered by the Align, Linear and Convex models.

Table 6
Timings (in seconds) for an increasing number of data points n . A dash (-) denotes cases where processes exceeded the given resources. The rows are sorted by 1) the number of cases not exceeding one hour and 2) the average rank.

Table 7
Timings (in seconds) for an increasing number of kernels p at n = 1000. Sorting and notation are defined as in Table 6 .

Conclusion
Subquadratic complexity in the number of training examples is essential for large-scale application of kernel methods. Learning the kernel matrix efficiently from the data and selecting the relevant portions of the data early reduce the time and storage requirements. Using a greedy low-rank approximation to multiple kernels, we achieve linear complexity in the number of kernels and data points without sacrificing the consideration of between-kernel correlations.
When used with low-rank approximations, the unweighted sum of kernel matrices is not equal to the concatenation of the feature spaces induced by the kernels. Sampling the inducing-point-centered basis functions only from the relevant kernels enables using different hyperparameters in different regions of the input space. This property proved important in matching output function shapes in one-dimensional continuous signals and string data sets, as well as in multidimensional data sets.
The Mklaren algorithm is a flexible approach to multiple kernel regression. It is independent of the kernel function, with data assumed to be accessed only through kernels. Nevertheless, its performance is comparable with methods that are optimized for specific kernels or types of input spaces. Natural extensions of the approach include different target tasks, such as classification, ranking or general linear models. If the kernels used are differentiable with respect to their hyperparameters, these could also be optimized at approximation time by an appropriate modification of the inducing point selection heuristic. With the abundance of different data representations, we expect kernel methods to remain ubiquitous in machine learning applications.

Declarations of interest
None.

Supplementary material
Supplementary material associated with this article can be found, in the online version, at doi: 10.1016/j.neucom.2019.02.030 .