Fast Learning Rate of Non-Sparse Multiple Kernel Learning and Optimal Regularization Strategies

In this paper, we give a new generalization error bound of Multiple Kernel Learning (MKL) for a general class of regularizations, and discuss what kind of regularization gives a favorable predictive accuracy. Our main target in this paper is dense type regularizations including ℓp-MKL. According to recent numerical experiments, sparse regularization does not necessarily show a good performance compared with dense type regularizations. Motivated by this fact, this paper gives a general theoretical tool to derive fast learning rates of MKL that is applicable to arbitrary mixed-norm-type regularizations in a unifying manner. This enables us to compare the generalization performances of various types of regularizations. As a consequence, we observe that the homogeneity of the complexities of candidate reproducing kernel Hilbert spaces (RKHSs) affects which regularization strategy (ℓ1 or dense) is preferred. In fact, in homogeneous complexity settings where the complexities of all RKHSs are the same, ℓ1-regularization is optimal among all isotropic norms. On the other hand, in inhomogeneous complexity settings, dense type regularizations can show a better learning rate than sparse ℓ1-regularization. We also show that our learning rate achieves the minimax lower bound in homogeneous complexity settings.


Introduction
Multiple Kernel Learning (MKL), proposed by Lanckriet et al. (2004), is one of the most promising methods for adaptively selecting the kernel function in supervised kernel learning. Kernel methods are widely used and several studies have supported their usefulness (Schölkopf and Smola, 2002; Shawe-Taylor and Cristianini, 2004). However, the performance of kernel methods critically relies on the choice of the kernel function. Many methods have been proposed to deal with the issue of kernel selection. Ong et al. (2005) studied hyperkernels as a kernel of kernel functions. Argyriou et al. (2006) considered a DC programming approach to learn a mixture of kernels with continuous parameters. Some studies tackled the problem of learning a non-linear combination of kernels, as in Bach (2009); Cortes et al. (2009a); Varma and Babu (2009). Among them, learning a linear combination of finitely many candidate kernels with non-negative coefficients is the most basic, fundamental, and commonly used approach. The seminal work on MKL by Lanckriet et al. (2004) considered learning a convex combination of candidate kernels as well as a linear combination. This work opened up the subsequent sequence of MKL studies. Bach et al. (2004) showed that MKL can be reformulated as a kernel version of the group lasso (Yuan and Lin, 2006). This formulation gives the insight that MKL can be described as an ℓ1-mixed-norm regularized method. As a generalization of MKL, ℓp-MKL, which imposes an ℓp-mixed-norm regularization, has been proposed (Micchelli and Pontil, 2005; Kloft et al., 2009). ℓp-MKL includes the original MKL as the special case ℓ1-MKL. Another direction of generalization is elasticnet-MKL (Shawe-Taylor, 2008; Tomioka and Suzuki, 2009), which imposes a mixture of ℓ1-mixed-norm and ℓ2-mixed-norm regularizations. Recently, numerical studies have shown that ℓp-MKL with p > 1 and elasticnet-MKL show better performances than ℓ1-MKL in several situations (Kloft et al., 2009; Cortes et al., 2009b; Tomioka and Suzuki, 2009). An interesting observation here is that both ℓp-MKL and elasticnet-MKL produce denser estimators than the original ℓ1-MKL while showing favorable performances. The goal of this paper is to give a theoretical justification of these experimental results favorable to the dense type MKL methods. To this aim, we give a unifying framework to derive a fast learning rate for an arbitrary norm-type regularization, and discuss which regularization is preferred depending on the problem setting.
In the pioneering paper of Lanckriet et al. (2004), a convergence rate of MKL is given as $\sqrt{M/n}$, where $M$ is the number of given kernels and $n$ is the number of samples. Srebro and Ben-David (2006) gave an improved learning bound utilizing the pseudo-dimension of the given kernel class. Ying and Campbell (2009) gave a convergence bound utilizing the Rademacher chaos and gave some upper bounds of the Rademacher chaos utilizing the pseudo-dimension of the kernel class. Cortes et al. (2009b) presented a convergence bound for a learning method with $L_2$ regularization on the kernel weight. Cortes et al. (2010) gave the convergence rate of ℓp-MKL as $\sqrt{\log(M)/n}$ for $p = 1$ and $M^{1-\frac{1}{p}}/\sqrt{n}$ for $1 < p \le 2$. Kloft et al. (2011) gave a similar convergence bound with improved constants. Kloft et al. (2010) generalized this bound to a variant of the elasticnet type regularization and widened the effective range of $p$ to all of $p \ge 1$, while $1 \le p \le 2$ had been imposed in the existing works. One concern about these bounds is that all the bounds introduced above are "global" bounds, in the sense that the bounds are applicable to all candidate estimators. Consequently, all the convergence rates presented above are of order $1/\sqrt{n}$ with respect to the number $n$ of samples. However, by utilizing localization techniques, including the so-called local Rademacher complexity (Bartlett et al., 2005; Koltchinskii, 2006) and the peeling device (van de Geer, 2000), we can derive a faster learning rate. Instead of uniformly bounding all candidate estimators, a localized inequality focuses on a particular estimator, such as the empirical risk minimizer, and thus can give a sharp convergence rate. Localized bounds of MKL have been given mainly in sparse learning settings (Koltchinskii and Yuan, 2008; Meier et al., 2009; Koltchinskii and Yuan, 2010), and there are only a few studies for non-sparse settings in which the sparsity of the ground truth is not assumed. The first localized bound of MKL was derived by Koltchinskii and Yuan (2008) in the setting of ℓ1-MKL. The second one was given by Meier et al. (2009), who gave a near optimal convergence rate for elasticnet type regularization. Recently, Koltchinskii and Yuan (2010) considered a variant of ℓ1-MKL and showed that it achieves the minimax optimal convergence rate. All these localized convergence rates were considered in sparse learning settings, and it had not been discussed how a dense type regularization outperforms the sparse ℓ1-regularization. Recently, Kloft and Blanchard (2011) gave a localized convergence bound for ℓp-MKL (see Section 4.2 for a detailed comparison).

Preliminaries
In this section we give the problem formulation, the notation, and the assumptions required for the convergence analysis.

Problem Formulation
Suppose that we are given $n$ i.i.d. samples $\{(x_i, y_i)\}_{i=1}^n$ distributed from a probability distribution $P$ on $\mathcal{X} \times \mathbb{R}$, where $\mathcal{X}$ is an input space. We denote by $\Pi$ the marginal distribution of $P$ on $\mathcal{X}$. We are given $M$ reproducing kernel Hilbert spaces (RKHSs) $\{\mathcal{H}_m\}_{m=1}^M$, each of which is associated with a kernel $k_m$. We consider a mixed-norm type regularization with respect to an arbitrary given norm $\|\cdot\|_\psi$; that is, the regularization is given by the norm $\|(\|f_m\|_{\mathcal{H}_m})_{m=1}^M\|_\psi$ of the vector $(\|f_m\|_{\mathcal{H}_m})_{m=1}^M$ for $f_m \in \mathcal{H}_m$ $(m = 1, \dots, M)$.¹ For notational simplicity, we write $\|f\|_\psi = \|(\|f_m\|_{\mathcal{H}_m})_{m=1}^M\|_\psi$ for $f = \sum_{m=1}^M f_m$ $(f_m \in \mathcal{H}_m)$.
1. We assume that the mixed norm $\|(\|f_m\|_{\mathcal{H}_m})_{m=1}^M\|_\psi$ satisfies the triangle inequality with respect to $(f_m)_{m=1}^M$, that is, $\|(\|f_m + f'_m\|_{\mathcal{H}_m})_{m=1}^M\|_\psi \le \|(\|f_m\|_{\mathcal{H}_m})_{m=1}^M\|_\psi + \|(\|f'_m\|_{\mathcal{H}_m})_{m=1}^M\|_\psi$. To satisfy this condition, it is sufficient that the norm is monotone, i.e., $\|a\|_\psi \le \|a + b\|_\psi$ for all $a, b \ge 0$.

The general formulation of MKL considered in this paper fits a function $f = \sum_{m=1}^M f_m$ $(f_m \in \mathcal{H}_m)$ to the data by solving the following regularized empirical risk minimization problem:
\[
\hat{f} = \operatorname*{argmin}_{f_m \in \mathcal{H}_m\ (m = 1,\dots,M)}\ \frac{1}{n}\sum_{i=1}^n \Big(y_i - \sum_{m=1}^M f_m(x_i)\Big)^2 + \lambda_1^{(n)} \|f\|_\psi. \qquad (1)
\]
We call this "ψ-norm MKL". This formulation covers many practically used MKL methods (e.g., ℓp-MKL, elasticnet-MKL, and variable sparsity kernel learning; see below for their definitions), and is solvable by a finite-dimensional optimization procedure due to the representer theorem (Kimeldorf and Wahba, 1971). In this paper, we mainly focus on the regression problem (the squared loss). However, the discussion can be generalized to Lipschitz continuous and strongly convex losses as in Bartlett et al. (2005) (see Section 7).
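To make the formulation concrete, here is a minimal numerical sketch of ℓp-MKL (i.e., ψ-norm MKL with the ℓp-mixed-norm) for the squared loss. It alternates a kernel ridge step with the closed-form ℓp kernel-weight update of Kloft et al. (2011); pairing that update with the penalty formulation (1) above, and all parameter values, are our illustrative assumptions rather than this paper's prescription.

```python
import numpy as np

def lp_mkl_fit(Ks, y, p=1.5, lam=0.1, n_iter=50):
    """Alternating-minimization sketch of l_p-MKL with the squared loss.
    Ks: list of M (n, n) Gram matrices.  Returns kernel weights theta and
    dual coefficients alpha; the fitted function is sum_m theta_m K_m alpha."""
    n, M = Ks[0].shape[0], len(Ks)
    theta = np.full(M, M ** (-1.0 / p))  # uniform, feasible initial weights
    alpha = np.zeros(n)
    for _ in range(n_iter):
        K = sum(t * Km for t, Km in zip(theta, Ks))
        alpha = np.linalg.solve(K + lam * n * np.eye(n), y)  # kernel ridge step
        # block norms ||f_m||_{H_m} = theta_m * sqrt(alpha' K_m alpha)
        norms = np.array([t * np.sqrt(max(alpha @ Km @ alpha, 0.0))
                          for t, Km in zip(theta, Ks)])
        if norms.max() == 0.0:
            break
        # closed-form l_p weight update (Kloft et al., 2011)
        theta = norms ** (2.0 / (p + 1.0))
        theta /= np.sum(norms ** (2.0 * p / (p + 1.0))) ** (1.0 / p)
    return theta, alpha

# Toy usage: two Gaussian kernels with different widths on 1-D inputs.
rng = np.random.default_rng(0)
X = rng.uniform(size=(80, 1))
y = np.sin(4 * X[:, 0]) + 0.1 * rng.standard_normal(80)
D2 = (X - X.T) ** 2
Ks = [np.exp(-D2 / (2 * s ** 2)) for s in (0.1, 1.0)]
theta, alpha = lp_mkl_fit(Ks, y, p=1.5)
print("kernel weights:", theta)
```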
Example 3: Variable Sparsity Kernel Learning. Variable Sparsity Kernel Learning (VSKL), proposed by Aflalo et al. (2011), divides the RKHSs into $M'$ groups $\{\mathcal{H}_{j,k}\}_{k=1}^{M_j}$ $(j = 1, \dots, M')$ and imposes the mixed-norm regularization
\[
\Big\|(\|f_{j,k}\|_{\mathcal{H}_{j,k}})_{j,k}\Big\|_{(p,q)} = \Bigg(\sum_{j=1}^{M'}\Big(\sum_{k=1}^{M_j} \|f_{j,k}\|_{\mathcal{H}_{j,k}}^p\Big)^{\frac{q}{p}}\Bigg)^{\frac{1}{q}},
\]
where $1 \le p$, $1 \le q$, and $f_{j,k} \in \mathcal{H}_{j,k}$. An advantage of VSKL is that by adjusting the parameters $p$ and $q$, various levels of sparsity can be introduced: the parameters control the level of sparsity within groups and between groups. This is beneficial especially for multi-modal tasks like object categorization.
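For concreteness, with $M' = 2$ groups the VSKL norm above reads (a toy instance of our own; the group sizes are illustrative):
\[
\Bigg(\Big(\sum_{k=1}^{M_1} \|f_{1,k}\|_{\mathcal{H}_{1,k}}^p\Big)^{\frac{q}{p}} + \Big(\sum_{k=1}^{M_2} \|f_{2,k}\|_{\mathcal{H}_{2,k}}^p\Big)^{\frac{q}{p}}\Bigg)^{\frac{1}{q}},
\]
so that $q$ controls sparsity across the two groups while $p$ controls sparsity within each group; taking $M' = 1$ recovers the plain ℓp-mixed-norm.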

Notations and Assumptions
Here we prepare the notation and assumptions used in the analysis. For $f = \sum_{m=1}^M f_m$ $(f_m \in \mathcal{H}_m)$, we use the norms introduced in the previous subsection. This is a slight abuse of notation because the decomposition $f = \sum_{m=1}^M f_m$ might not be unique as an element of $L_2(\Pi)$; however, this will not cause any confusion.
Throughout the paper, we assume the following technical conditions (see also Bach (2008)).
The first assumption in (A1) ensures that the model $\mathcal{H}^{\oplus M}$ is correctly specified, and the technical assumption $|\epsilon| \le L$ makes $\epsilon f$ Lipschitz continuous with respect to $f$. The noise boundedness can be relaxed to the unbounded situation as in Raskutti et al. (2010) if we consider Gaussian noise, but we do not pursue that direction for simplicity.
Let $T_{k_m} : L_2(\Pi) \to L_2(\Pi)$ be the integral operator associated with the kernel $k_m$, defined by $(T_{k_m} f)(x) = \int k_m(x, x') f(x')\, \mathrm{d}\Pi(x')$. It is known that this operator is compact, positive, and self-adjoint (see Theorem 4.27 of Steinwart (2008)). Thus it has at most countably many non-negative eigenvalues. We denote by $\mu_{\ell,m}$ the $\ell$-th largest eigenvalue (with possible multiplicity) of the integral operator $T_{k_m}$. By Theorem 4.27 of Steinwart (2008), the sum of the $\mu_{\ell,m}$ is bounded ($\sum_\ell \mu_{\ell,m} < \infty$), and thus $\mu_{\ell,m}$ decreases faster than $\ell^{-1}$ ($\mu_{\ell,m} = o(\ell^{-1})$). We further assume that the sequence of eigenvalues converges to zero even faster.
Assumption 3 (Spectral Assumption) There exist $0 < s_m < 1$ and $c > 0$ such that
\[
\mu_{\ell,m} \le c\, \ell^{-\frac{1}{s_m}} \qquad (\forall \ell \ge 1,\ m = 1, \dots, M),
\]
where $\{\mu_{\ell,m}\}_{\ell=1}^\infty$ is the spectrum of the operator $T_{k_m}$ corresponding to the kernel $k_m$.
It was shown that the spectral assumption (A3) is equivalent to the classical covering number assumption (Steinwart et al., 2009). Recall that the $\epsilon$-covering number $N(\epsilon, B_{\mathcal{H}_m}, L_2(\Pi))$ with respect to $L_2(\Pi)$ is the minimal number of balls with radius $\epsilon$ needed to cover the unit ball $B_{\mathcal{H}_m}$ in $\mathcal{H}_m$ (van der Vaart and Wellner, 1996). If the spectral assumption (A3) and the boundedness assumption (A2) hold, there exists a constant $C$ that depends only on $s_m$ and $c$ such that
\[
\log N(\epsilon, B_{\mathcal{H}_m}, L_2(\Pi)) \le C \epsilon^{-2 s_m},
\]
and the converse is also true (see Steinwart et al. (2009, Theorem 15) and Steinwart (2008) for details). Therefore, if $s_m$ is large, the RKHSs are regarded as "complex", and if $s_m$ is small, the RKHSs are "simple".
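The matching of the exponents can be seen through the following back-of-the-envelope argument (ours, with constants suppressed): at scale $\epsilon$, the unit ball of $\mathcal{H}_m$ behaves like a finite-dimensional ellipsoid whose effective dimension is the number of semi-axes $\sqrt{\mu_{\ell,m}}$ exceeding $\epsilon$,
\[
D(\epsilon) := \#\{\ell : \sqrt{\mu_{\ell,m}} \ge \epsilon\} \asymp \#\{\ell : \ell^{-\frac{1}{2 s_m}} \ge \epsilon\} = \epsilon^{-2 s_m},
\qquad
\log N(\epsilon, B_{\mathcal{H}_m}, L_2(\Pi)) \approx D(\epsilon) \asymp \epsilon^{-2 s_m}.
\]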
An important class of RKHSs for which $s_m$ is known is the Sobolev spaces. (A3) holds with $s_m = \frac{d}{2\alpha}$ for the Sobolev space $W^{\alpha,2}(\mathcal{X})$ of $\alpha$-times continuously differentiable functions on the Euclidean ball $\mathcal{X}$ of $\mathbb{R}^d$ (Edmunds and Triebel, 1996). Moreover, for $\alpha$-times differentiable kernels on a closed Euclidean ball in $\mathbb{R}^d$, (A3) holds with $s_m = \frac{d}{2\alpha}$ (Steinwart, 2008, Theorem 6.26). According to Theorem 7.34 of Steinwart (2008), for Gaussian kernels with a compactly supported input distribution, (A3) holds for arbitrarily small $0 < s_m$. The covering number of Gaussian kernels with an unboundedly supported input distribution is also described in Theorem 7.34 of Steinwart (2008).
The quantity $\kappa_M$ defined in Eq. (3), the smallest eigenvalue of the design matrix, represents the correlation of the RKHSs. We assume that the RKHSs are not completely correlated to each other.
Assumption 4 (Incoherence Assumption) $\kappa_M$ is strictly bounded from below; that is, there exists a constant $C_0 > 0$ such that $\kappa_M \ge C_0$. This condition is motivated by the incoherence condition (Koltchinskii and Yuan, 2008; Meier et al., 2009) considered in sparse MKL settings. It ensures the uniqueness of the decomposition $f^* = \sum_{m=1}^M f^*_m$ of the ground truth. Bach (2008) also assumed this condition to show the consistency of ℓ1-MKL.
Finally, we give a technical assumption with respect to the $\infty$-norm.
Assumption 5 (Embedded Assumption) Under the Spectral Assumption, there exists a constant $C_1 > 0$ such that
\[
\|f_m\|_\infty \le C_1 \|f_m\|_{\mathcal{H}_m}^{1 - s_m} \|f_m\|_{L_2(\Pi)}^{s_m} \qquad (\forall f_m \in \mathcal{H}_m).
\]
This condition is met when the input distribution $\Pi$ has a density with respect to the uniform distribution on $\mathcal{X}$ that is bounded away from 0 and the RKHSs are continuously embedded in a Sobolev space $W^{\alpha,2}(\mathcal{X})$, where $s_m = \frac{d}{2\alpha}$, $d$ is the dimension of the input space $\mathcal{X}$, and $\alpha$ is the "smoothness" of the Sobolev space. Many practically used kernels satisfy this condition (A5). For example, the RKHSs of Gaussian kernels can be embedded in all Sobolev spaces; therefore the condition (A5) is rather common and practical. More generally, there is a clear characterization of the condition (A5) in terms of real interpolation of spaces. One can find detailed and formal discussions of interpolation in Steinwart et al. (2009), and Proposition 2.10 of Bennett and Sharpley (1988) gives the necessary and sufficient condition for the assumption (A5).
Constants we use later are summarized in Table 1.

Convergence Rate of ψ-norm MKL
Here we derive the learning rate of ψ-norm MKL in the most general setting. We suppose that the number of kernels $M$ can increase with the number of samples $n$. The motivations of our analysis are summarized as follows: • Give a unifying framework to derive a sharp convergence rate of ψ-norm MKL.
• (Homogeneous complexity) Show the convergence rates of some examples using our general framework, prove their minimax optimality, and show the optimality of ℓ1-regularization under the condition that the complexities $s_m$ of all RKHSs are the same.
• (Inhomogeneous complexity) Discuss how dense type regularizations outperform sparse type regularization when the complexities $s_m$ of the RKHSs are not uniformly the same.
Theorem 1 Suppose Assumptions 1–5 are satisfied. Let $\{r_m\}_{m=1}^M$ be arbitrary positive reals that can depend on $n$, and assume that the regularization parameter $\lambda_1^{(n)}$ is chosen appropriately. Then there exists a constant $\phi$ depending only on $\{s_m\}_{m=1}^M$, $c$, $C_1$, $L$ such that, for all $n$ and $t'$ satisfying $\frac{\log(M)}{\sqrt{n}} \le 1$ and $4\phi\sqrt{n}\,\eta(t') \le \frac{1}{12}$, and for all $t \ge 1$, the bound (6) holds with probability $1 - \exp(-t) - \exp(-t')$; a particular choice of $\lambda_1^{(n)}$ yields a correspondingly simplified bound. The proof will be given in Appendix C. The statement of Theorem 1 itself is complicated; thus we will later show concrete learning rates for some examples such as ℓp-MKL. The convergence rate (6) depends on the positive reals $\{r_m\}_{m=1}^M$, but the choice of $\{r_m\}_{m=1}^M$ is arbitrary. Thus, by minimizing the right-hand side of Eq. (6), we obtain a tight convergence bound, Eq. (7). There is a trade-off between the first two terms $(a) := \alpha_1^2 + \beta_1^2$ and the third term $(b)$: if we take $\{r_m\}_m$ large, then the term (a) becomes small and the term (b) becomes large; on the other hand, if we take $\{r_m\}_m$ small, this results in a large (a) and a small (b). Therefore we need to balance the two terms (a) and (b) to obtain the minimum in Eq. (7).
We discuss the obtained learning rate in two situations: (i) the homogeneous complexity situation and (ii) the inhomogeneous complexity situation. (i) (Homogeneous) All $s_m$ are the same: there exists $0 < s < 1$ such that $s_m = s$ $(\forall m)$ (Sec. 4). (ii) (Inhomogeneous) Not all $s_m$ are the same: there exist $m, m'$ such that $s_m \neq s_{m'}$ (Sec. 5).

Analysis on Homogeneous Settings
Here we assume that all $s_m$ are the same, say $s_m = s$ for all $m$ (the homogeneous setting). In this section, we give a simple upper bound of the minimum of the bound (7) (Sec. 4.1), derive concrete convergence rates for some examples using this simple upper bound (Sec. 4.2), and show that the simple upper bound achieves the minimax learning rate on the ψ-norm ball if the ψ-norm is isotropic (Sec. 4.3). Finally, we discuss the optimal regularization (Sec. 4.4). In Sec. 4.2, we also discuss the difference between our bound for ℓp-MKL and existing bounds.

Simplification of Convergence Rate
If we restrict the situation so that all $r_m$ are the same ($r_m = r$ $(\forall m)$ for some $r$), then the minimization in Eq. (7) can be easily carried out, as in the following lemma. Let $\mathbf{1}$ be the $M$-dimensional vector each element of which is 1, $\mathbf{1} := (1, \dots, 1)^\top \in \mathbb{R}^M$, and let $\|\cdot\|_{\psi^*}$ be the dual norm of the ψ-norm.² Lemma 2 Suppose $s_m = s$ $(\forall m)$ for some $0 < s < 1$, and set $r$ and $\lambda_1^{(n)}$ appropriately; then for all $n$ and $t'$ that satisfy the conditions of Theorem 1 and for all $t \ge 1$, the bound (8) holds with probability $1 - \exp(-t) - \exp(-t')$, where $C$ is a constant depending on $\phi$ and $\kappa_M$. 2. The dual norm is defined as $\|b\|_{\psi^*} := \sup\{b^\top a : \|a\|_\psi \le 1\}$. The proof is given in Appendix F.1. Lemma 2 is derived by assuming $r_m = r$ $(\forall m)$, which might make the bound loose. However, when the norm $\|\cdot\|_\psi$ is isotropic (the definition will appear later), this restriction ($r_m = r$ $(\forall m)$) does not make the bound loose; that is, the upper bound obtained in Lemma 2 is tight and achieves the minimax optimal rate (the minimax optimal rate is the one that cannot be improved by any estimator). In the following, we investigate the general result of Lemma 2 through some important examples.

Convergence Rate of ℓp-MKL
Here we derive the convergence rate of ℓp-MKL. It is well known that the dual norm of the ℓp-norm is the ℓq-norm, where $q$ is the real number satisfying $\frac{1}{p} + \frac{1}{q} = 1$; substituting this into Lemma 2 yields the convergence rate (9) of ℓp-MKL. If we further assume that $n$ is sufficiently large, in the sense of condition (10), then the leading term is the first term, and thus we obtain the simplified bound (11). Note that as the complexity $s$ of the RKHSs becomes small, the convergence rate becomes fast. It is known that $n^{-\frac{1}{1+s}}$ is the minimax optimal learning rate for single kernel learning. The derived rate of ℓp-MKL is obtained by multiplying the optimal rate of single kernel learning by a coefficient depending on $M$ and $R_p$. To investigate how $R_p$ affects the learning rate, let us consider two extreme settings as in Kloft et al. (2011): the sparse setting $(\|f^*_m\|_{\mathcal{H}_m})_{m=1}^M = (1, 0, \dots, 0)$ and the dense setting $(\|f^*_m\|_{\mathcal{H}_m})_{m=1}^M = (1, \dots, 1)$. In the sparse setting, the rate is fast for small $p$, and the minimum is achieved at $p = 1$. This means that ℓ1-regularization is preferred for a sparse truth.
In the dense setting, $R_p = \|(1, \dots, 1)\|_{\ell_p} = M^{\frac{1}{p}}$, and thus the convergence rate is $M n^{-\frac{1}{1+s}}$ for all $p$. Interestingly, for a dense ground truth there is no dependency of the convergence rate on the parameter $p$ (later we will show that this is not the case in the inhomogeneous setting (Sec. 5)). That is, the convergence rate is $M$ times the optimal learning rate of single kernel learning ($n^{-\frac{1}{1+s}}$) for all $p$. This means that, in the dense setting, the complexity of solving the MKL problem is equivalent to that of solving $M$ single kernel learning problems.
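The $p$-(in)dependence in the two extreme settings can be worked out with elementary dual-norm arithmetic (our own computation, using $\|\mathbf{1}\|_{\ell_q} = M^{1/q}$ and $\frac{1}{p} + \frac{1}{q} = 1$; that the product $\|\mathbf{1}\|_{\ell_q} R_p$ drives the leading term is suggested by the duality argument of Sec. 4.4):
\[
\|\mathbf{1}\|_{\ell_q} = M^{1-\frac{1}{p}}, \qquad
R_p = \begin{cases} 1 & \text{(sparse truth)}, \\ M^{\frac{1}{p}} & \text{(dense truth)}, \end{cases}
\qquad\Longrightarrow\qquad
\|\mathbf{1}\|_{\ell_q} R_p = \begin{cases} M^{1-\frac{1}{p}}, & \text{minimized at } p = 1, \\ M, & \text{independent of } p. \end{cases}
\]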

Comparison with Existing Bounds. Here we compare the bound for ℓp-MKL derived above with the existing bounds. Let $\mathcal{H}_{\ell_p}(R_p)$ be the ℓp-mixed-norm ball with radius $R_p$. There are two types of convergence rates: global bounds and localized bounds.
(Comparison with existing global bounds) Cortes et al. (2010) and Kloft et al. (2010, 2011) gave "global" type bounds for ℓp-MKL of the form (12), where $R(f)$ and $\hat{R}(f)$ are the population risk and the empirical risk, respectively. The bounds by Cortes et al. (2010) and Kloft et al. (2011) are restricted to the situation $1 \le p \le 2$; on the other hand, our analysis and that of Kloft et al. (2010) cover all $p \ge 1$. Since our bound is specialized to the regularized risk minimizer $\hat{f}$ defined in Eq. (1), while the existing bound (12) is applicable to all $f \in \mathcal{H}_{\ell_p}(R_p)$, our bound is sharper than theirs for sufficiently large $n$, namely for $n$ in the range given by Eq. (13), where our localized bound is sharper than the global one. Interestingly, the range of $n$ presented in Eq. (13) where the localized bound beats the global bound is the same (up to a $\log M$ factor) as the range presented in Eq. (10) where the first term in our bound (9) dominates its second term so that the simplified bound (11) holds. That means that, at the "phase transition point" from the global to the localized bound, the first informative term in our bound becomes the leading term.
Finally, we note that, since $s$ can be arbitrarily close to 1 as long as the Spectral Assumption (A3) is satisfied, the bound (12) is recovered by our analysis by letting $s$ approach 1.
(Comparison with an existing localized bound) Recently, Kloft and Blanchard (2011) gave a tighter convergence rate, shown in Eq. (14), utilizing the localization technique under the strong condition $\kappa_M = 1$, which imposes that all RKHSs are completely uncorrelated with each other. Comparing our bound with their result, their bound contains the additional factors $\min_{p' \ge p}$ and $\frac{p'}{p'-1}$ (without the term $\frac{p'}{p'-1}$, the minimum of $\min_{p' \ge p}$ would be attained at $p' = p$; thus our bound is tighter). Due to this, we obtain a quite different consequence from theirs. According to our bound (11), the optimal regularization among all ℓp-norms, that is, the one giving the smallest generalization error, is ℓ1-regularization (this will be discussed later in Sec. 4.4), while their consequence says that the optimal $p$ changes depending on the "sparsity" of the true function $f^*$. Moreover, we will observe that ℓ1-regularization is optimal among all isotropic mixed-norm-type regularizations. The details of the optimality will be discussed in Sec. 4.4.

Convergence Rate of Elasticnet-MKL
Elasticnet-MKL employs a mixture of the ℓ1- and ℓ2-norms as the regularizer: $\|f\|_\psi = \tau \|f\|_{\ell_1} + (1 - \tau) \|f\|_{\ell_2}$ with $\tau \in [0, 1]$. Its dual norm $\|b\|_{\psi^*}$ then admits a closed form, and a simple calculation gives $\|\mathbf{1}\|_{\psi^*}$. Hence Eq. (8) gives the convergence rate of elasticnet-MKL. Note that when $\tau = 0$ or $\tau = 1$, this rate is identical to that of ℓ2-MKL or ℓ1-MKL obtained in Eq. (9), respectively.
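As an illustration of how the dual-norm factor behaves here, a direct computation (ours, using only the definition of the dual norm and the convention $\|f\|_\psi = \tau \|f\|_{\ell_1} + (1-\tau)\|f\|_{\ell_2}$ adopted above) gives
\[
\|\mathbf{1}\|_{\psi^*} = \sup_{a \neq 0} \frac{\langle \mathbf{1}, a \rangle}{\tau \|a\|_{\ell_1} + (1-\tau) \|a\|_{\ell_2}}
= \frac{M}{\tau M + (1-\tau)\sqrt{M}}
= \frac{\sqrt{M}}{\tau \sqrt{M} + (1-\tau)},
\]
where the supremum is attained at $a = \mathbf{1}$ by symmetry. This interpolates between $\|\mathbf{1}\|_{\ell_\infty} = 1$ at $\tau = 1$ and $\|\mathbf{1}\|_{\ell_2} = \sqrt{M}$ at $\tau = 0$, matching the endpoint behavior noted above.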
Convergence Rate of VSKL

Lemma 3 The dual of the mixed $(p, q)$-norm of VSKL is the mixed $(p', q')$-norm with $\frac{1}{p} + \frac{1}{p'} = 1$ and $\frac{1}{q} + \frac{1}{q'} = 1$. The proof will be given in Appendix F.2. Therefore the dual norm of the vector $\mathbf{1}$ can be computed explicitly, and hence, by Eq. (8), we obtain the convergence rate of VSKL. One can check that this convergence rate coincides with that of ℓp-MKL when $M' = 1$.

Minimax Lower Bound
In this section, we show that the derived learning rate (8) achieves the minimax learning rate on the ψ-norm ball when the norm is isotropic.
Definition 1 We say that a ψ-norm $\|\cdot\|_\psi$ is isotropic when there exists a universal constant $c$ such that the two conditions in Eq. (15) hold; the first condition states that $\|\mathbf{1}\|_\psi \|\mathbf{1}\|_{\psi^*} \le c M$ (note that the reverse inequality $M \le \|\mathbf{1}\|_\psi \|\mathbf{1}\|_{\psi^*}$ always holds by the definition of the dual norm).
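As a quick sanity check (our own one-line calculation), the ℓp-norms satisfy the first condition with $c = 1$:
\[
\|\mathbf{1}\|_{\ell_p}\, \|\mathbf{1}\|_{\ell_q} = M^{\frac{1}{p}} \cdot M^{\frac{1}{q}} = M^{\frac{1}{p} + \frac{1}{q}} = M .
\]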
Practically used regularizations usually satisfy the isotropic property. In fact, ℓp-MKL, elasticnet-MKL, and VSKL satisfy the isotropic property with $c = 1$. We derive the minimax learning rate in a simplified situation. First, we assume that each RKHS is identical to the others in the following sense: the input vector is decomposed into $M$ components as $x = (x^{(1)}, \dots, x^{(M)})$, where $\{x^{(m)}\}_{m=1}^M$ are $M$ i.i.d. copies of a random variable $X$, and $f_m(x) = \bar{f}_m(x^{(m)})$, where each $\bar{f}_m$ is a member of a common RKHS $\bar{\mathcal{H}}$. We denote by $\bar{k}$ the kernel associated with the RKHS $\bar{\mathcal{H}}$.
In addition to the upper bound on the spectrum (Spectral Assumption (A3)), we assume that the spectra of all the RKHSs $\mathcal{H}_m$ obey a lower bound of the same polynomial rate.
Assumption 6 (Strong Spectral Assumption) There exist $0 < s < 1$ and $c, c' > 0$ such that
\[
c'\, \ell^{-\frac{1}{s}} \le \bar{\mu}_\ell \le c\, \ell^{-\frac{1}{s}} \qquad (\forall \ell \ge 1),
\]
where $\{\bar{\mu}_\ell\}_{\ell=1}^\infty$ is the spectrum of the integral operator $T_{\bar{k}}$ corresponding to the kernel $\bar{k}$. In particular, the spectrum of $T_{k_m}$ also satisfies $\mu_{\ell,m} \sim \ell^{-\frac{1}{s}}$ $(\forall \ell, m)$.
Without loss of generality, we may assume that $\mathbb{E}[\bar{f}(X)] = 0$ $(\forall \bar{f} \in \bar{\mathcal{H}})$. Since each $f_m$ receives an i.i.d. copy of $X$, the spaces $\mathcal{H}_m$ are orthogonal to each other in $L_2(\Pi)$. We also assume that the noise $\{\epsilon_i\}_{i=1}^n$ is an i.i.d. normal sequence with standard deviation $\sigma > 0$.
Under the assumptions described above, we have the following minimax $L_2(\Pi)$-error.
where the infimum in Theorem 4 is taken over all estimators that are measurable functions of the $n$ samples $\{(x_i, y_i)\}_{i=1}^n$.
The proof will be given in Appendix E. One can see that the convergence rate derived in Eq. (8) achieves the minimax rate on the ψ-norm ball (Theorem 4) up to the term $\frac{M \log(M)}{n}$, which is negligible when the number of samples is large. Indeed, under the condition (17), the first term in Eq. (8) dominates the second term $\frac{M \log(M)}{n}$, and the upper bound coincides with the minimax optimal rate. Note that the condition (17) on the sample size $n$ is equivalent to the condition on $n$ assumed in Theorem 4 up to a factor of $\log(M)^{\frac{1+s}{s}}$ and a constant.
The fact that ψ-norm MKL achieves the minimax optimal rate (16) indicates that ψ-norm regularization is well suited to estimating functions contained in the ψ-norm ball.

Optimal Regularization Strategy
Here we discuss which regularization gives the best performance on the basis of the generalization error bound given by Lemma 2. Perhaps surprisingly, the regularization that gives the optimal performance among all isotropic ψ-norm regularizations is ℓ1-norm regularization. This can be seen as follows. According to Eq. (8), the convergence rate of ψ-norm MKL is upper bounded in terms of $\|\mathbf{1}\|_{\psi^*}$ and $\|f^*\|_\psi$, and this bound is minimax optimal on the ψ-norm ball if the ψ-norm is isotropic. Here, by the definition of the dual norm $\|\cdot\|_{\psi^*}$, we always have
\[
\|f^*\|_{\ell_1} = \sum_{m=1}^M \|f^*_m\|_{\mathcal{H}_m} \le \|\mathbf{1}\|_{\psi^*} \|f^*\|_\psi .
\]
Therefore the leading term of the convergence rate for ℓ1-norm regularization is upper bounded by that for an arbitrary other ψ-norm regularization (here it should be noticed that the dual norm of the ℓ1-norm is the ℓ∞-norm and $\|\mathbf{1}\|_{\ell_\infty} = 1$). This shows that the upper bound (8) is minimized by ℓ1-norm regularization; in other words, ℓ1-regularization is optimal among all (isotropic) ψ-norm regularizations in homogeneous settings. This consequence is different from that of Kloft and Blanchard (2011), where the optimal regularization among ℓp-MKL is discussed. Their consequence says that the best $p$ depends on the variation of the RKHS norms of $\{f^*_m\}_{m=1}^M$: if $f^*$ is close to sparse (i.e., $\|f^*_m\|_{\mathcal{H}_m}$ decays rapidly), a small $p$ is preferred; on the other hand, if $f^*$ is dense (i.e., $\{\|f^*_m\|_{\mathcal{H}_m}\}_{m=1}^M$ is uniform), then a large $p$ is preferred. That consequence seems reasonable, but ours is different: ℓ1-norm regularization is always optimal among ℓp-regularizations. The antinomy between the two consequences comes from the additional terms $\min_{p' \ge p}$ and $\frac{p'}{p'-1}$ in their bound (14) (there are no such terms in our bound). This difference makes our bound tighter than theirs, but simultaneously leads to a somewhat counter-intuitive consequence that contrasts with some experimental results supporting dense type regularization. However, such experimental observations are justified by considering inhomogeneous settings. Here we should notice that the homogeneous setting is quite restrictive and unrealistic, because it requires that the complexities of all RKHSs are uniformly the same. In real settings, it is natural to assume that the complexities vary depending on the RKHS (inhomogeneous). In the next section, we discuss how dense type regularizations can outperform ℓ1-regularization.

Analysis on Inhomogeneous Settings
In the previous sections (analysis on homogeneous settings), we have seen that ℓ1-MKL shows the best performance among isotropic ψ-norms, and we have not observed any theoretical justification for the fact that dense MKL methods such as $\ell_{4/3}$-MKL can outperform the sparse ℓ1-MKL (Cortes et al., 2010). In this section, we show that dense type regularizations can outperform the sparse regularization in inhomogeneous settings (where there exist $m, m'$ such that $s_m \neq s_{m'}$). For simplicity, we focus on ℓp-MKL and discuss the relation between the learning rate and the norm parameter $p$.
Let us consider an extreme situation where $s_1 = s$ for some $0 < s < 1$ and $s_m = 0$ $(m > 1)$.³ In this situation, the quantities $\alpha_1$, $\alpha_2$, $\beta_1$, and $\beta_2$ have no dependency on $p$. Therefore the learning bound (7) is smallest for $p = \infty$, because $\|f^*\|_{\ell_\infty} \le \|f^*\|_{\ell_p}$ for all $1 \le p < \infty$. In particular, when $(\|f^*_m\|_{\mathcal{H}_m})_{m=1}^M = \mathbf{1}$, we have $\|f^*\|_{\ell_1} = M \|f^*\|_{\ell_\infty}$, and thus the learning rate of ℓ∞-MKL given by Eq. (7) is obviously faster than that of ℓ1-MKL. In fact, through a somewhat cumbersome calculation, one can check that ℓ∞-MKL can be at least $M^{\frac{2s}{1+s}}$ times faster (up to constants) than ℓ1-MKL in the worst case. Indeed, we have the corresponding learning rates for the ℓ1-MKL and ℓ∞-MKL estimators (say $\hat{f}^{(1)}$ and $\hat{f}^{(\infty)}$).
3. In our assumption, $s_m$ should be greater than 0; however, we formally put $s_m = 0$ $(m > 1)$ for simplicity of the discussion. For a rigorous discussion, one may consider arbitrarily small $s_m \ll s$. This indicates that when the complexities of the RKHSs are inhomogeneous, the generalization ability of a dense type regularization (e.g., ℓ∞-MKL) can be better than that of a sparse type regularization (ℓ1-MKL).
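The factor $M^{\frac{2s}{1+s}}$ can be read off as follows (a sketch under the assumption, consistent with the discussion above, that in this extreme setting the $p$-dependence of the bound (7) enters only through $\|f^*\|_{\ell_p}$, raised to the power $\frac{2s}{1+s}$ in the leading term):
\[
\frac{\text{leading term for } \ell_1\text{-MKL}}{\text{leading term for } \ell_\infty\text{-MKL}}
= \left(\frac{\|f^*\|_{\ell_1}}{\|f^*\|_{\ell_\infty}}\right)^{\frac{2s}{1+s}}
= M^{\frac{2s}{1+s}}
\qquad \text{when } (\|f^*_m\|_{\mathcal{H}_m})_{m=1}^M = \mathbf{1}.
\]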
Next, we numerically evaluate the convergence rate. Here, we randomly generated $s_m$ from the uniform distribution on $[0, 1/3]$ and $\|f^*_m\|_{\mathcal{H}_m}$ from the uniform distribution on $[0, 1]$, with $n = 100$ and $M = 10$. We then calculated the minimum of Eq. (19) using a numerical optimization solver, where the ℓp-norm is employed as the regularizer (ℓp-MKL). We used the Differential Evolution technique (Price et al., 2005; Chakraborty, 2008) to obtain the minimum value. Figure 1 plots the minimum value of Eq. (19) against the parameter $p$ of the ℓp-norm. We can see that the generalization error bound first goes down and then goes up as $p$ gets large. The optimal $p$ is attained around $p = 1.4$ in this example.
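A sketch of this numerical procedure is given below, using scipy.optimize.differential_evolution. Since Eq. (19) itself is not reproduced in this text, the objective here is an illustrative stand-in with the same qualitative structure — a localization term decreasing in the radii $r_m$ plus a dual-norm regularization term increasing in them — so every exponent and constant in it should be read as an assumption.

```python
import numpy as np
from scipy.optimize import differential_evolution

rng = np.random.default_rng(0)
n, M = 100, 10
s = rng.uniform(0.0, 1.0 / 3.0, size=M)   # inhomogeneous complexities s_m
f_norms = rng.uniform(0.0, 1.0, size=M)   # ||f*_m||_{H_m}

def bound(log_r, p):
    """Illustrative stand-in for the RHS of Eq. (19) at radii r = exp(log_r)."""
    r = np.exp(log_r)
    q = np.inf if p == 1.0 else p / (p - 1.0)
    term_a = np.sum(r ** (-(1.0 - s)) * n ** (-1.0 / (1.0 + s)))  # localization
    term_b = (np.linalg.norm(r, ord=q) ** 2 *
              np.linalg.norm(f_norms, ord=p) ** 2) / n            # regularization
    return term_a + term_b

for p in (1.0, 1.2, 1.5, 2.0, 4.0):
    res = differential_evolution(bound, bounds=[(-6.0, 6.0)] * M,
                                 args=(p,), seed=0, tol=1e-8)
    print(f"p = {p:3.1f}  minimized stand-in bound = {res.fun:.5f}")
```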
In real settings, it is likely that one uses various types of kernels, so that the complexities of the RKHSs become inhomogeneous. As mentioned above, it has often been reported that ℓ1-MKL is outperformed by dense type MKL such as $\ell_{4/3}$-MKL in numerical experiments (Cortes et al., 2010). Our theoretical analysis in this section supports these experimental results well.

Numerical Comparison between Homogeneous and Inhomogeneous Settings
Here we investigate numerically how the inhomogeneity of the complexities affects the performance, using synthetic data. In particular, we numerically compare two situations: (a) all complexities of the RKHSs are the same (the homogeneous situation), and (b) one RKHS is complex and the other RKHSs are equally simple (the inhomogeneous situation).
The experimental settings are as follows. The input random variable is a 20-dimensional vector $x = (x^{(1)}, \dots, x^{(20)})$, where each element $x^{(m)}$ is independently and identically distributed from the uniform distribution on $[0, 1]$. For each coordinate $m = 1, \dots, 20$, we put one Gaussian RKHS $\mathcal{H}_m$ with a Gaussian width $\sigma_m$; thus the number of kernels is 20 ($M = 20$) and
\[
k_m(x, x') = \exp\Big(-\frac{(x^{(m)} - x'^{(m)})^2}{2\sigma_m^2}\Big)
\]
for $x = (x^{(1)}, \dots, x^{(20)})$ and $x' = (x'^{(1)}, \dots, x'^{(20)})$. To generate the ground truth $f^*$, we randomly generated 5 center points $\mu_{i,m}$ $(i = 1, \dots, 5)$ for each coordinate $m = 1, \dots, 20$, where $\mu_{i,m}$ is independently generated from the uniform distribution on $[0, 1]$. The true function then takes the form $f^*(x) = \sum_{m=1}^{20} \sum_{i=1}^{5} a_{i,m} \exp\big(-(x^{(m)} - \mu_{i,m})^2 / (2\sigma_m^2)\big)$, where each coefficient $a_{i,m}$ is independently and identically distributed from the standard normal distribution. The output $y$ is contaminated by a noise $\epsilon$ distributed from the Gaussian distribution with mean 0 and standard deviation 0.1. We generated 200 or 400 realizations $\{(x_i, y_i)\}_{i=1}^n$ ($n = 200$ or $n = 400$) and estimated $f^*$ using ℓp-MKL with $p = 1, 1.1, 1.2, \dots, 3$, where we included a bias term, that is, we fitted $f(x) + b$ to the data. The estimator was computed with various regularization parameters $\lambda_1^{(n)}$. The generalization error $\|\hat{f} - f^*\|^2_{L_2(\Pi)}$ was numerically calculated. We repeated the experiment 100 times, averaged the generalization errors over the 100 repetitions for each $p$ and each regularization parameter, and obtained the optimal average generalization error among all regularization parameters for each $p$. The true function was randomly generated for each repetition. We investigated the generalization errors in the following homogeneous and inhomogeneous settings: 1. (homogeneous) $\sigma_m = 0.5$ for $m = 1, \dots, 20$; 2. (inhomogeneous) $\sigma_1 = 0.01$ and $\sigma_m = 0.5$ for $m = 2, \dots, 20$.
The difference between the above homogeneous and inhomogeneous settings is the value of $\sigma_1$: whether $\sigma_1 = 0.5$ or $\sigma_1 = 0.01$. The inhomogeneous situation is analogous to the one investigated in Sec. 5, where we assumed that one RKHS is complex and the other RKHSs are equally simple (a small $\sigma_1$ corresponds to a complex RKHS). A sketch of the data-generating process is given below.
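The exact form of $f^*$ is reconstructed from the description above, so the coefficient convention $a_{i,m} \sim N(0, 1)$ and all helper names in this sketch are our assumptions.

```python
import numpy as np

def make_data(n, sigmas, n_centers=5, noise_sd=0.1, rng=None):
    """Synthetic data: x uniform on [0,1]^M, f* a sum of Gaussian bumps per
    coordinate (5 random centers each), y = f*(x) + N(0, noise_sd^2) noise."""
    rng = np.random.default_rng(rng)
    M = len(sigmas)                          # M = 20 in the paper's experiment
    X = rng.uniform(size=(n, M))
    mu = rng.uniform(size=(n_centers, M))    # centers mu_{i,m}
    a = rng.standard_normal((n_centers, M))  # coefficients a_{i,m} (assumed)
    f_star = np.zeros(n)
    for m in range(M):
        d2 = (X[:, None, m] - mu[None, :, m]) ** 2        # (n, n_centers)
        f_star += (np.exp(-d2 / (2 * sigmas[m] ** 2)) * a[:, m]).sum(axis=1)
    y = f_star + noise_sd * rng.standard_normal(n)
    return X, y, f_star

sig_homog = [0.5] * 20                 # homogeneous setting
sig_inhom = [0.01] + [0.5] * 19        # inhomogeneous setting
X, y, f_star = make_data(200, sig_inhom, rng=0)
print(X.shape, y.shape)
```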
Figure 2 shows the average generalization errors in the homogeneous setting with (a) $n = 200$ and (c) $n = 400$, and in the inhomogeneous setting with (b) $n = 200$ and (d) $n = 400$. Each broken line corresponds to one regularization parameter. The bold solid line shows the best (average) generalization error among all the regularization parameters. We can see that in the homogeneous setting ℓ1-regularization shows the best performance, while in the inhomogeneous setting the best performance is achieved at $p > 1$, for both $n = 200$ and $n = 400$. These experimental results beautifully match the theoretical investigations.

Generalization of the Loss Function
Here we discuss how a general loss function other than the squared loss can be incorporated into our analysis. As in the standard local Rademacher complexity argument (Bartlett et al., 2005), we consider a class of loss functions that are Lipschitz continuous and strongly convex. Suppose that the loss function $\Psi : \mathbb{R} \times \mathbb{R} \to \mathbb{R}$ satisfies Lipschitz continuity: for all $R > 0$, there exists a constant $T(R)$ such that
\[
|\Psi(y, f) - \Psi(y, f')| \le T(R)\, |f - f'| \qquad (\forall y,\ \forall f, f' \text{ with } |f|, |f'| \le R). \qquad (20)
\]
Moreover, suppose that, for all $y \in \mathbb{R}$, $\Psi(y, \cdot)$ is strongly convex with a modulus $\rho(R) > 0$:
\[
\Psi\big(y, \theta f + (1-\theta) f'\big) \le \theta \Psi(y, f) + (1-\theta) \Psi(y, f') - \frac{\rho(R)}{2}\, \theta (1-\theta) (f - f')^2 \qquad (21)
\]
for all $\theta \in [0, 1]$ and $|f|, |f'| \le R$. Some detailed discussions about these conditions, and examples, can be found in Bartlett et al. (2006); a worked example is given below. For loss functions satisfying these properties, we obtain a simplified bound in which some conditions can be omitted, as follows: • We can remove the second condition on $n$ and $t'$ imposed in Theorem 1 (the one involving $4\phi$). • The term $\exp(-t')$ is not needed in the tail probability.
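For example (with binary labels $y \in \{-1, +1\}$; this worked instance is ours, in the spirit of the examples in Bartlett et al. (2006)), the logistic loss $\Psi(y, f) = \log(1 + e^{-y f})$ satisfies both conditions on $\{f : |f| \le R\}$:
\[
\left|\frac{\partial \Psi}{\partial f}\right| = \frac{1}{1 + e^{y f}} \le 1
\;\Rightarrow\; T(R) \equiv 1,
\qquad
\frac{\partial^2 \Psi}{\partial f^2} = \frac{e^{-y f}}{(1 + e^{-y f})^2} \ge \frac{e^{-R}}{(1 + e^{-R})^2} \ge \frac{e^{-R}}{4}
\;\Rightarrow\; \rho(R) = \frac{e^{-R}}{4}.
\]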
To obtain a fast convergence rate for a general loss function $\Psi$, we move the regularization term in Eq. (1) into a constraint, and consider the following optimization problem:
\[
\hat{f} = \operatorname*{argmin}_{f \in \mathcal{H}^{\oplus M} :\, \|f\|_\psi \le R}\ \frac{1}{n} \sum_{i=1}^n \Psi\big(y_i, f(x_i)\big),
\]
where $R$ is a regularization parameter. The above optimization problem is essentially equivalent to the original formulation (1), but by considering the constraint-type regularization instead of the penalty-type regularization, the theoretical analysis of the statistical performance can be simplified. We define $Pg$ as the expectation of a function $g : \mathbb{R} \times \mathbb{R} \to \mathbb{R}$ with respect to $P$. For notational simplicity, we write $P\Psi(f) := \mathbb{E}_{(X,Y)\sim P}[\Psi(Y, f(X))]$ for a function $f$. We suppose that there exists a minimizer of $P\Psi(f)$, as follows.

Assumption 7 (Minimizer Existence Assumption)
There exists a unique minimizer $f^*$ of $P\Psi(f)$ over $\{f \in \mathcal{H}^{\oplus M} : \|f\|_\psi \le R\}$. Note that, due to the incoherence assumption (Assumption 4) and the strong convexity (21) of the loss function, if a minimizer exists, then it is automatically unique.
To bound the convergence rate for a general loss function, it is convenient to utilize the local Rademacher complexity on the ψ-norm ball. Let $\mathcal{H}_\psi(r)$ denote the corresponding local class, and consider the local Rademacher complexity $\mathbb{E}\big[\sup_{f \in \mathcal{H}_\psi(r)} \frac{1}{n} \sum_{i=1}^n \sigma_i f(x_i)\big]$, where the $\sigma_i \in \{\pm 1\}$ are i.i.d. Rademacher random variables with $P(\sigma_i = 1) = P(\sigma_i = -1) = \frac{1}{2}$. Evaluating the local Rademacher complexity is a key ingredient for showing a fast convergence rate for a general loss function. We obtain the following estimate of the local Rademacher complexity (the proof will be given in Appendix F.4).
Lemma 6 Let $\{r_m\}_{m=1}^M$ be arbitrary positive reals. Under Assumptions 2–5, there exists a constant $\phi$ depending on $\{s_m\}_{m=1}^M$, $c$, $C_1$ such that, for all $n$ satisfying $\frac{\log(M)}{\sqrt{n}} \le 1$, the stated bound on the local Rademacher complexity holds. Finally, note that the supremum norm of $f$ with $\|f\|_\psi \le R$ can be bounded as
\[
\|f\|_\infty \le \sum_{m=1}^M \|f_m\|_\infty \le \sum_{m=1}^M \|f_m\|_{\mathcal{H}_m} \le \|\mathbf{1}\|_{\psi^*} \|f\|_\psi \le \|\mathbf{1}\|_{\psi^*} R .
\]
Then we obtain the excess risk bound in the following theorem.
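The global (non-localized) part of this quantity can be estimated numerically via the elementary dual-norm identity $\sup_{\|f\|_\psi \le R} \sum_{i} \sigma_i f(x_i) = R\, \big\|\big(\sqrt{\sigma^\top K_m \sigma}\big)_{m=1}^M\big\|_{\psi^*}$. The Monte Carlo sketch below (ours) uses this identity for the ℓp case and deliberately ignores the additional $L_2(\Pi)$-localization constraints that define $\mathcal{H}_\psi(r)$:

```python
import numpy as np

def global_rademacher_lp(Ks, R=1.0, p=2.0, n_mc=200, rng=None):
    """Monte Carlo estimate of E_sigma sup_{||f||<=R} (1/n) sum_i sigma_i f(x_i)
    for the l_p-mixed norm.  Per kernel, sup_{||f_m||_{H_m}<=c} of the pairing
    equals c * sqrt(sigma' K_m sigma); the l_q dual norm aggregates over m."""
    rng = np.random.default_rng(rng)
    n = Ks[0].shape[0]
    q = np.inf if p == 1.0 else p / (p - 1.0)
    vals = []
    for _ in range(n_mc):
        sigma = rng.choice([-1.0, 1.0], size=n)
        g = np.array([np.sqrt(max(sigma @ K @ sigma, 0.0)) for K in Ks])
        vals.append(R * np.linalg.norm(g, ord=q) / n)
    return float(np.mean(vals))

# Toy usage: M coordinate-wise Gaussian kernels on uniform data.
rng = np.random.default_rng(0)
n, M = 100, 5
X = rng.uniform(size=(n, M))
Ks = [np.exp(-(X[:, [m]] - X[:, [m]].T) ** 2 / (2 * 0.5 ** 2)) for m in range(M)]
print(global_rademacher_lp(Ks, R=1.0, p=1.5))
```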
Theorem 7 Suppose Assumptions 2–5 and 7 are satisfied and the loss function $\Psi$ satisfies the conditions (20) and (21). Let $\{r_m\}_{m=1}^M$ be arbitrary positive reals that can depend on $n$, and let $T = T(\|\mathbf{1}\|_{\psi^*} R)$ and $\rho = \rho(\|\mathbf{1}\|_{\psi^*} R)$. Set $R = \|f^*\|_\psi$. Then there exists a constant $\phi'$ depending on $\{s_m\}_{m=1}^M$, $c$, $C_1$ such that, for all $n$ satisfying $\frac{\log(M)}{\sqrt{n}} \le 1$, the excess risk bound (23) holds with probability $1 - \exp(-t)$.
This can be shown by applying the bound on the local Rademacher complexity (Lemma 6) to Corollary 5.3 of Bartlett et al. (2005).⁶ Compared with the bound in Eq. (6), we notice that there is no $\exp(-t')$ term in the tail probability bound, and thus we do not need the condition involving $4\phi$. Because of this, the range of $n$ for which the error bound holds is relaxed compared with that in Theorem 1. These simplifications are due to the Lipschitz continuity of the loss function. In Theorem 1, we had to bound the discrepancy between the empirical and population means of the squared loss. Since the squared loss is not Lipschitz continuous, we required an additional bound for that discrepancy using Assumption 5 for the supremum norm, and it was shown that the discrepancy is negligible at the cost of $\exp(-t')$ in the tail probability. On the other hand, for Lipschitz continuous losses, we no longer need to bound such a quantity; thus the tail probability loss $\exp(-t')$ is not incurred.
Since the bound (23) is basically the same as Eq. (6), the same discussions as in the previous sections apply. For example, in the homogeneous setting, we obtain the following convergence bound.
6. In Corollary 5.3 of Bartlett et al. (2005), the range of the function class is assumed to be included in the interval $[-1, 1]$. Here we utilize the more general setting where the interval is $[-a, a]$, with $\|\mathbf{1}\|_{\psi^*} R$ substituted for $a$. See Lemma 9 of Kloft and Blanchard (2011).

Conclusion and Future Work
We have presented a unifying framework for deriving the learning rate of MKL with arbitrary mixed-norm-type regularization. To analyze the general result, we considered two situations: homogeneous settings and inhomogeneous settings. We have seen that the convergence rate of ℓp-MKL obtained in homogeneous settings is tighter and requires a less restrictive condition than existing results. We have also shown convergence rates for further examples (elasticnet-MKL and VSKL), and proved that the derived learning rate is minimax optimal when the ψ-norm is isotropic. An interesting consequence is that ℓ1-regularization is optimal among all isotropic ψ-norm regularizations in homogeneous settings. In the analysis of inhomogeneous settings, we have shown that dense type regularization can outperform sparse ℓ1-regularization, using both analytically obtained bounds and numerically computed bounds. We observed that our bound explains well the experimental results favorable to dense type MKL. Finally, we numerically investigated the generalization errors of ℓp-MKL in a homogeneous setting and an inhomogeneous setting. The numerical experiments supported the theoretical findings that ℓ1-regularization is optimal in homogeneous settings while dense type regularizations are preferred in inhomogeneous settings. This is the first result suggesting that the inhomogeneity of the complexities of the RKHSs justifies the favorable performance of dense type MKL.
An interesting direction of future work concerns the $\frac{M \log(M)}{n}$ term appearing in the bound of Eq. (8). Because of this term, our bound is $O(M \log(M))$ with respect to $M$, while the existing global bounds have a milder dependence on $M$. It is an interesting issue to clarify whether the term $\frac{M \log(M)}{n}$ can be replaced by a tighter one. To do so, it might be helpful to combine the technique developed in this paper with that developed by Kloft and Blanchard (2011), where the local Rademacher complexity of ℓp-MKL is derived.

Proposition 9 If there exist constants $0 < s < 1$ and $C \ge 1$ such that the entropy numbers satisfy $e_i(\mathcal{H}_m \hookrightarrow L_2(\Pi)) \le C\, i^{-\frac{1}{2s}}$, then there exists a constant $c_s > 0$ depending only on $s$ for which the displayed bound holds.

Appendix B. Basic Propositions
The following two propositions are key to proving Theorem 1. Let $\{\sigma_i\}_{i=1}^n$ be i.i.d. Rademacher random variables, i.e., $\sigma_i \in \{\pm 1\}$ and $P(\sigma_i = 1) = P(\sigma_i = -1) = \frac{1}{2}$. Proposition 10 (Steinwart, 2008, Theorem 7.16) Assume that there exist constants $0 < s < 1$ and $c_s > 0$ such that the spectral condition holds. Then there exists a constant $C'_s$ depending only on $s$ such that the displayed complexity bound holds. Proposition 11 (Talagrand's Concentration Inequality; Talagrand (1996); Bousquet (2002)) Let $\mathcal{G}$ be a function class on $\mathcal{X}$ that is separable with respect to the $\infty$-norm, and let $\{x_i\}_{i=1}^n$ be i.i.d. random variables with values in $\mathcal{X}$. Furthermore, let $B := \sup_{g \in \mathcal{G}} \mathbb{E}[(g - \mathbb{E}[g])^2]$ and $U := \sup_{g \in \mathcal{G}} \|g\|_\infty$. Then there exists a universal constant $K$ such that, for $Z := \sup_{g \in \mathcal{G}} \big|\frac{1}{n}\sum_{i=1}^n g(x_i) - \mathbb{E}[g]\big|$, the displayed concentration bound holds.

Appendix C. Proof of Theorem 1
Let $r_m > 0$ $(m = 1, \dots, M)$ be arbitrary positive reals. Given $\{r_m\}_{m=1}^M$, we determine $U^{(m)}_{n,s_m}(f_m)$ as follows. It is easy to see that $U^{(m)}_{n,s_m}(f_m)$ is an upper bound of the quantity corresponding to the RHS of Eq. (25): we used Young's inequality $a^{1-s_m} b^{s_m} \le (1 - s_m) a + s_m b$ in the second line, and similarly we used $\frac{s_m (3 - s_m)}{1 + s_m} \le 3 s_m$ in the last inequality. Now we define $\eta(t)$, where $C_*$ is a constant defined later in Lemma 16, $C_1$ is the constant introduced in Assumption 5, $K$ is the universal constant appearing in Talagrand's concentration inequality (Proposition 11), and $L$ is the constant introduced in Assumption 1 to bound the magnitude of the noise. We define the events $\mathcal{E}_1(t)$ and $\mathcal{E}_2(t')$ as in Eqs. (27) and (28). Using Lemmas 17 and 18, which will be shown in Appendix D, we see that the events $\mathcal{E}_1(t)$ and $\mathcal{E}_2(t')$ occur with probability no less than $1 - \exp(-t)$ and $1 - \exp(-t')$, respectively, as in the following lemma.
Lemma 12 Under the Basic Assumption (Assumption 1), the Spectral Assumption (Assumption 3), and the Embedded Assumption (Assumption 5), the probabilities of $\mathcal{E}_1(t)$ and $\mathcal{E}_2(t')$ are bounded as described above.
Theorem 13 Suppose Assumptions 1–4 are satisfied. Let $\{r_m\}_{m=1}^M$ be arbitrary positive reals that can depend on $n$, and assume that $\lambda_1^{(n)}$ is chosen as in Theorem 1. Then, for all $n$ and $t'$ that satisfy $\frac{\log(M)}{\sqrt{n}} \le 1$ and $4\phi\sqrt{n}\,\eta(t') \le \frac{1}{12}$, and for all $t \ge 1$, the stated bound holds with probability $1 - \exp(-t) - \exp(-t')$.
Proof [Proof of Theorem 13] By the assumption of the theorem, Lemma 12 applies; that is, the event $\mathcal{E}_1(t) \cap \mathcal{E}_2(t')$ occurs with probability $1 - \exp(-t) - \exp(-t')$.
In the following, we work on the event $\mathcal{E}_1(t) \cap \mathcal{E}_2(t')$. On the event $\mathcal{E}_2(t')$, the basic inequality gives the first displayed bound. Before proving the statements, we derive an upper bound of $\sum_{m=1}^M U_{n,s_m}(f_m)$, which is required in the proof. By definition, the sum of the first term is bounded as displayed, where we used the Cauchy–Schwarz inequality and the duality of the norm in the last inequality.
The sum of the second term of the RHS of Eq. (32) is bounded analogously, where we again used the Cauchy–Schwarz inequality and the duality of the norm in the last inequality. Finally, we obtain the corresponding bound for the third term of the RHS of Eq. (32).
By assumption, the condition involving $4\phi$ holds; hence the RHS of the above inequality is bounded accordingly. Step 2. On the event $\mathcal{E}_1(t)$, we have the second displayed bound. Step 3. Substituting the inequalities (35) and (36) into Eq. (30), we obtain the desired estimate. Now, by the triangle inequality, the term $\|\hat{f} - f^*\|^2_\psi$ can be bounded accordingly. Thus, with $\lambda_1^{(n)}$ chosen as assumed, Eq. (37) yields the stated bound. Therefore, by multiplying both sides by 2, we obtain the assertion.
Appendix D. Bounding the Probabilities of $\mathcal{E}_1(t)$ and $\mathcal{E}_2(t')$

Here we derive bounds on the probabilities of the events $\mathcal{E}_1(t)$ and $\mathcal{E}_2(t')$ (see Eq. (27) and Eq. (28) for their definitions). The goal of this section is to derive Lemmas 17 and 18.
Using Propositions 10 and 11, we obtain the following ratio-type uniform bound.
Lemma 14 Under the Spectral Assumption (Assumption 3) and the Embedded Assumption (Assumption 5), there exists a constant $C_{s_m}$ depending only on $s_m$, $c$, and $C_1$ such that the stated ratio-type bound holds. Proof Set $z = 2^{1/s_m} > 1$ and define $\tau := s_m r_m$. Then, combining Propositions 9 and 10 with Assumption 5, we obtain the displayed chain of inequalities, where we used $s_m^{-s_m} \le 3$ for $0 < s_m < 1$ in the last line.
Thus, by setting $C_{s_m}$ accordingly, we obtain the assertion.
This lemma immediately gives the following corollary.
Corollary 15 Under the Spectral Assumption (Assumption 3) and the Embedded Assumption (Assumption 5), there exists a constant $C_{s_m}$ depending only on $s_m$, $c$, and $C_1$ such that the normalized version of the bound holds. Proof By dividing the denominator and the numerator by the RKHS norm $\|f_m\|_{\mathcal{H}_m}$, the claim follows from Lemma 14. Lemma 16 If $\frac{\log(M)}{\sqrt{n}} \le 1$, then under the Spectral Assumption (Assumption 3) and the Embedded Assumption (Assumption 5) there exists a constant $C_*$ depending only on $\{s_m\}_{m=1}^M$, $c$, and $C_1$ such that the stated expectation bound holds. Proof [Proof of Lemma 16] First notice that the $L_2(\Pi)$-norm and the $\infty$-norm of $\frac{\sigma_i f_m(x_i)}{U_{n,s_m}(f_m)}$ can be evaluated using the relation (26). Let $C_* := \max_m C_{s_m}$, where $C_{s_m}$ is the constant appearing in Lemma 14. Then Talagrand's inequality and Corollary 15 imply the displayed tail bound; by setting $t \leftarrow t + \log(M)$, we obtain it for all $t \ge 0$. Consequently, the expectation of the max-sup term can be bounded as displayed, where we used $\sqrt{t + 1 + \log(M)} \le \sqrt{t} + \sqrt{1 + \log(M)}$. Lemma 17 Suppose the Basic Assumption (Assumption 1), the Spectral Assumption (Assumption 3), and the Embedded Assumption (Assumption 5) hold. Define $\phi = K L^2 C_* + 1 + C_1$. If $\frac{\log(M)}{\sqrt{n}} \le 1$, then the stated bound holds. Proof [Proof of Lemma 17] By the contraction inequality (Ledoux and Talagrand, 1991, Theorem 4.12) and Lemma 16, we obtain the first estimate, where we used $|\epsilon_i| \le L$ (Basic Assumption). Using this together with Eq. (38) and Eq. (39), Talagrand's inequality gives the tail bound. Therefore, by the definition of $\phi$ and $\eta(t)$, we obtain the assertion.
Lemma 18 Suppose the Basic Assumption (Assumption 1), the Spectral Assumption (Assumption 3), and the Embedded Assumption (Assumption 5) hold, and let $\phi'$ be defined as in the display. In the proof, we use the contraction inequality (Ledoux and Talagrand, 1991, Theorem 4.12) in the last line of the first estimate. Thus, using Eq. (39), the RHS of the inequality (40) can be bounded further, where we use the displayed relation for all $a_m \ge 0$ and $b_m \ge 0$ with the convention $\frac{0}{0} = 0$. By Lemma 16, the right-hand side is upper bounded by $\frac{2 C_1}{\sqrt{n}} C_*$. Here we again apply Talagrand's concentration inequality, and then we obtain the stated bound, where we substitute the following upper bounds of $B$ and $U$:
in the second inequality we used the stated relation, and in the third and fourth inequalities we used Eq. (39) and Eq. (38) together with Eq. (41), respectively.
Appendix E. Proof of Theorem 4 (minimax learning rate)

Proof
[Proof of Theorem 4] The proof utilizes the techniques developed by Raskutti et al. (2009, 2010), who applied the information-theoretic technique developed by Yang and Barron (1999) to the MKL setting. To simplify the notation, we write $\mathcal{H}(1)$ for the unit ball of the common RKHS. By Theorem 15 of Steinwart et al. (2009), Assumption 6 yields
\[
\log N(\varepsilon, \mathcal{H}(1)) \sim \varepsilon^{-2s}. \qquad (42)
\]
We utilize the inequality given by Lemma 3 of Raskutti et al. (2009). First, we show the assertion for the ℓ∞-norm ball (this is shown in Lemma 5 of Raskutti et al. (2010), but we give the proof in Lemma 19 for completeness). Using this expression, the minimax learning rate is bounded accordingly.
Here we choose $\varepsilon_n$ and $\delta_n$ to satisfy the relations (43)–(45). With $\varepsilon_n$ and $\delta_n$ satisfying the relations (43) and (45), we obtain the stated minimum. By Eq. (42), the relation (43) can be rewritten with a constant $C$ as in Eq. (47). Since we have assumed a lower bound on $n$, the relation (44) can be satisfied if the constant $C$ in Eq. (47) is taken sufficiently small. The relation (45) can be satisfied by taking $\delta_n = c \varepsilon_n$ with an appropriately chosen constant $c$. Thus Eq. (46) gives the assertion, with a constant $C$, for $p = \infty$.
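The balancing behind this choice is the standard Yang and Barron (1999) scheme; since the displays (43)–(45) are not reproduced in this extraction, the following schematic (ours, with constants and the noise level suppressed) records the single-kernel calculation:
\[
\log N(\varepsilon_n, \mathcal{H}(1)) \asymp n \varepsilon_n^2
\;\overset{(42)}{\Longleftrightarrow}\;
\varepsilon_n^{-2s} \asymp n \varepsilon_n^2
\;\Longleftrightarrow\;
\varepsilon_n^2 \asymp n^{-\frac{1}{1+s}},
\]
and with $\delta_n = c\,\varepsilon_n$ the resulting per-kernel minimax rate is $n^{-\frac{1}{1+s}}$, matching the single-kernel optimal rate used throughout Section 4.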
Finally, we show the assertion for a general isotropic ψ-norm $\|\cdot\|_\psi$. To do so, we prove that $\mathcal{H}_{\ell_\infty}(R \|\mathbf{1}\|_{\psi^*} / (cM)) \subset \mathcal{H}_\psi(R)$. This is true if the inclusion condition (49) holds, because of the second condition of the definition (15) of the isotropic property. By the isotropic property, the ψ-norm of any member of this ℓ∞-ball is controlled as required (cf. Eq. (49)).
For this $r$, we obtain the displayed bound on $\alpha_1$, where we used $s^{\frac{2s}{1+s}} \le 1$ and $9^{\frac{1}{1+s}} \le 9$ in the last inequality. Next, we balance the term $\beta_1^2$ against the term involving $\|\mathbf{1}\|_{\psi^*}^2 \|f^*\|_\psi^2$ under the restriction $r_m = r$ $(\forall m)$. More precisely, with $r$ given in Eq. (55), the upper bound (56) of $\alpha_1$ gives the desired estimate for $n \ge (\|\mathbf{1}\|_{\psi^*} \|f^*\|_\psi / M)^{\frac{4s}{1-s}}$. Thus, by setting $\lambda_1^{(n)}$ accordingly, Theorem 1 gives that for all $n$ and $t'$ satisfying $\frac{\log(M)}{\sqrt{n}} \le 1$ and $\sqrt{n}\, \eta(t') \le \frac{1}{12}$, and for all $t \ge 1$, the bound (57) holds with probability $1 - \exp(-t) - \exp(-t')$, where $C$ is a sufficiently large constant depending on $\phi$ and $\kappa_M$. Finally, notice that the condition $\sqrt{n}\, \eta(t') \le \frac{1}{12}$ automatically gives $\frac{\log(M)}{\sqrt{n}} \le 1$; thus we can drop the condition $\frac{\log(M)}{\sqrt{n}} \le 1$. We then obtain the assertion.
Therefore we obtain the first estimate. On the other hand, with the alternative choice of the parameters, we obtain the second estimate. Combining the two estimates and re-setting $\phi \leftarrow 6\phi$, we obtain the local Rademacher complexity upper bound.


Figure 1: The generalization error bound (19) of ℓp-MKL with respect to p.

Figure 2: The expected generalization error $\mathbb{E}[\|\hat{f} - f^*\|^2_{L_2(\Pi)}]$ against the parameter $p$ for ℓp-MKL. Each broken line corresponds to one regularization parameter. The bold solid line shows the best generalization error among all the regularization parameters.

Table 1: Summary of the constants used in this article.

$n$: the number of samples.
$M$: the number of candidate kernels.
$L$: the bound of the noise; see (A2).
$c$: the coefficient of the Spectral Assumption; see (A3).
$s_m$: the decay rate of the spectrum; see (A3).
$\kappa_M$: the smallest eigenvalue of the design matrix; see Eq. (3).
$C_1$: the coefficient of the Embedded Assumption; see (A5).