1 Introduction

In traditional machine learning, it is assumed that all data are drawn from a single distribution. However, this assumption does not always hold in real-world applications, and it is therefore imperative to develop methods capable of incorporating samples drawn from different distributions. Transfer learning provides a general framework for accommodating such situations. In transfer learning, apart from the few samples available for the task of interest, abundant samples from another domain, not necessarily drawn from the same distribution, can be used. The domain related to the objective task is called the target domain and the other domain is called the source domain. Transfer learning aims to extract useful knowledge from the source domain and apply this knowledge to achieve high task performance in the target domain.

Transfer learning is categorized by Pan and Yang (2010) into three areas: inductive transfer learning, transductive transfer learning, and unsupervised transfer learning. Inductive transfer learning corresponds to the setting in which labeled samples are available in the target domain. When, in addition, labeled samples in the source domain are unavailable, the setting is called self-taught learning (Raina et al. 2007). In particular, self-taught learning can be applied to the case in which the tasks in the source and target domains differ. Transductive transfer learning corresponds to the setting in which labeled samples are available only in the source domain. The tasks in both domains are then typically assumed to be the same, as in covariate shift (Shimodaira 2000; Sugiyama et al. 2008) and sample selection bias (Zadrozny 2004; Huang et al. 2007). Domain adaptation (Daume and Marcu 2006; Blitzer et al. 2006) can be regarded as transfer learning in which the tasks are the same in both domains; it is closely related to transductive transfer learning. Unsupervised transfer learning corresponds to the setting where labeled samples are unavailable in both domains. In this setting, the purpose is not to achieve high predictive performance but to perform an unsupervised task well in the target domain.

According to the type of knowledge that is transferred, approaches for solving transfer-learning problems can be classified into types such as instance transfer, feature representation transfer, and parameter transfer (Pan and Yang 2010). In recent years, the parameter transfer approach has attracted particular attention in the context of fine-tuning the network weights of a deep neural network trained on source domains. In the parameter transfer approach, parametric models are assumed in both domains and the transferred knowledge is encoded in parameters. Biased regularization has been studied as a typical method in the parameter transfer approach; there, the regularization term added to an empirical loss has a non-zero center (e.g., \(\Vert {\mathbf{w}}-{\mathbf{w}}_{0}\Vert ^2\) instead of \(\Vert {\mathbf{w}}\Vert ^2\)) and the center is learnt on the source domain (Ben-David and Urner 2013; Pentina and Lampert 2014; Tommasi et al. 2014). Recently, a generalization of biased regularization was proposed and theoretically analyzed (Kuzborskij and Orabona 2013, 2017). Owing to its flexibility, the parameter transfer approach can also be applied to other algorithms such as sparse coding (Raina et al. 2007; Maurer et al. 2013), multiple kernel learning (Duan et al. 2012), and deep learning (Yosinski et al. 2014).

As the parameter transfer approach typically requires many samples to accurately learn a suitable parameter in the source domain, unsupervised methods are often utilized for the learning process. In this sense, self-taught learning is compatible with the parameter transfer approach. A sparse-coding-based method was used by Raina et al. (2007), in which self-taught learning was first introduced; there, the parameter transfer approach was employed with a dictionary learnt from images as the parameter to be transferred. However, although self-taught learning has been studied in various contexts (Dai et al. 2008; Lee et al. 2009; Wang et al. 2013; Zhu et al. 2013) and many algorithms based on the parameter transfer approach have empirically demonstrated impressive performance in self-taught learning, some fundamental problems remain. First, the theoretical aspects of the parameter transfer approach have not been sufficiently studied. For example, in the context of the parameter transfer approach, a generalization error bound applicable to self-taught learning has not been considered except in a few studies (Kuzborskij and Orabona 2013, 2017). Furthermore, the existing studies treat only restricted hypothesis sets, limiting applicability in areas such as sparse coding, multiple kernel learning, and neural networks. Second, although it is believed that a large amount of unlabeled data helps improve the performance of the objective task in self-taught learning, the required sample size has not been sufficiently clarified. Third, although sparsity-based methods are typically employed in self-taught learning, it is unknown how sparsity works to guarantee the performance of self-taught learning.

In this study, we aim to shed light on the above problems.Footnote 1 In this paper, we focus on inductive transfer learning and consider a general model of parametric feature mapping in the parameter transfer approach. We newly formulate the local stability of a parametric feature mapping and the parameter transfer learnability for this mapping, and provide an excess risk bound for parameter transfer learning algorithms based on these notions. Furthermore, we consider the stability of sparse coding. Finally, we discuss parameter transfer learnability by dictionary learning under a sparse model. By applying the excess risk bound for parameter transfer learning algorithms, we derive an excess risk bound for the sparse coding algorithm in self-taught learning. Moreover, we show that the results of numerical experiments on handwritten digits datasets are in good agreement with the theoretical analysis of transfer learning with sparse coding. Note that our setting differs from the environment-based setting (Baxter 2000; Maurer 2009), where a distribution over a set of distributions on labeled samples, known as an environment, is assumed. In our formulation, the existence of an environment is not assumed and the presence of labeled data in the source domain is not required.

The remainder of the paper is organized as follows. In Sect. 2, we formulate the stability and parameter transfer learnability of a parametric feature mapping. Then, we present an excess risk bound for parameter transfer learning. In Sect. 3, we show the stability of sparse coding under perturbations of the dictionary. By imposing sparsity assumptions on samples and considering dictionary learning, we derive the parameter transfer learnability for sparse coding. In particular, an excess risk bound is obtained for sparse coding in the setting of self-taught learning. Section 4 is devoted to numerical experiments on transfer learning with sparse coding. We conclude the paper in Sect. 5.

2 Excess risk bound for parameter transfer learning

2.1 Problem setting of parameter transfer learning

We formulate parameter transfer learning in this section. We first briefly introduce the notation and terminology of transfer learning (Pan and Yang 2010). Let \(\mathcal {X}\) and \(\mathcal {Y}\) represent a sample space and a label space, respectively. In addition, let \(\mathcal {H}=\{h:\mathcal {X}\rightarrow \mathcal {Y}\}\) be a hypothesis space and \(\ell :\mathcal {Y}\times \mathcal {Y}\rightarrow {\mathbb {R}}_{\ge 0}\) represent a loss function. Then, the expected risk and the empirical risk are defined as \(\mathcal {R}(h) :=\mathbb {E}_{({\mathbf{x}},y)\sim P} \left[ \ell (y, h({\mathbf{x}}))\right] \) and \(\widehat{\mathcal {R}}_n(h) :=\frac{1}{n}\sum _{j=1}^{n} \ell (y_j, h({\mathbf{x}}_j) )\), respectively. In the transfer learning setting, it is assumed that, apart from samples from the domain of interest (i.e., the target domain), samples from another domain (i.e., the source domain) are also available. We distinguish between the target and source domains by adding a subscript \(\mathcal {T}\) or \(\mathcal {S}\) to each notation introduced above (e.g., \(P_{\mathcal {T}}\), \(\mathcal {R}_{\mathcal {S}}\)). The homogeneous setting (i.e., \(\mathcal {X}_{\mathcal {S}}=\mathcal {X}_{\mathcal {T}}\)) is not assumed in general, and thus, the heterogeneous setting (i.e., \(\mathcal {X}_{\mathcal {S}}\ne \mathcal {X}_{\mathcal {T}}\)) is allowed here. We note that self-taught learning, which is discussed in Sect. 3, corresponds to the case in which the label space \(\mathcal {Y}_{\mathcal {S}}\) of the source task is a singleton.

We consider the parameter transfer approach in which the knowledge to be transferred is encoded in a parameter. The parameter transfer approach aims to learn a hypothesis with low expected risk for the target task by obtaining some knowledge about an effective parameter in the source domain and transferring it to the target domain. We suppose that there are parametric models on both the source and target domains and their parameter spaces are partly shared. Our strategy is to learn an effective parameter in the source domain and then transfer a part of the parameter to the target domain. Next, we describe the formulation. In the target domain, we assume that \(\mathcal {Y}_{\mathcal {T}}\subset {\mathbb {R}}\) and there is a parametric feature mapping \(\psi _{{{\varvec{\theta }}}}:\mathcal {X}_{\mathcal {T}}\rightarrow {\mathbb {R}}^m\) on the target domain such that each hypothesis \(h_{\mathcal {T},{{\varvec{\theta }}},{\mathbf{w}}}:\mathcal {X}_{\mathcal {T}}\rightarrow \mathcal {Y}_{\mathcal {T}}\) is represented by

$$\begin{aligned} h_{\mathcal {T},{{\varvec{\theta }}},{\mathbf{w}}}({\mathbf{x}}):= \langle {\mathbf{w}}, \psi _{{{\varvec{\theta }}}}({\mathbf{x}}) \rangle , \end{aligned}$$
(1)

with parameters \({{\varvec{\theta }}}\in \varTheta \) and \({\mathbf{w}}\in \mathcal {W}_{\mathcal {T}}\), where \(\varTheta \) is a subset of a normed space with norm \(\Vert \cdot \Vert \) and \(\mathcal {W}_{\mathcal {T}}\) is a subset of \({\mathbb {R}}^m\). Then, the hypothesis set in the target domain is parameterized as

$$\begin{aligned} \mathcal {H}_{\mathcal {T}}=\{h_{\mathcal {T},{{\varvec{\theta }}},{\mathbf{w}}} |{{\varvec{\theta }}}\in \varTheta , {\mathbf{w}}\in \mathcal {W}_{\mathcal {T}}\}. \end{aligned}$$

In the following discussion, we simply denote \(\mathcal {R}_{\mathcal {T}}(h_{\mathcal {T},{{\varvec{\theta }}},{\mathbf{w}}})\) and \(\widehat{\mathcal {R}}_{\mathcal {T},n}(h_{\mathcal {T},{{\varvec{\theta }}},{\mathbf{w}}})\) by \(\mathcal {R}_{\mathcal {T}}({{\varvec{\theta }}}, {\mathbf{w}})\) and \(\widehat{\mathcal {R}}_{\mathcal {T},n}({{\varvec{\theta }}}, {\mathbf{w}})\), respectively. In the source domain, we suppose that there exists some kind of parametric model such as a sample distribution \(P_{\mathcal {S},{{\varvec{\theta }}},{\mathbf{w}}}\) or a hypothesis \(h_{\mathcal {S},{{\varvec{\theta }}},{\mathbf{w}}}\) with parameters \({{\varvec{\theta }}}\in \varTheta \) and \({\mathbf{w}}\in \mathcal {W}_{\mathcal {S}}\), and a part \(\varTheta \) of the parameter space is shared with the target domain. Then, let \({{\varvec{\theta }}}_{\mathcal {S}}^{*} \in \varTheta \) and \({\mathbf{w}}_{\mathcal {S}}^{*}\in \mathcal {W}_{\mathcal {S}}\) be parameters that are supposed to be effective in the source domain (e.g., the true parameter of the sample distribution, the parameter of the optimal hypothesis with respect to the expected risk \(\mathcal {R}_{\mathcal {S}}\)). Here, the parameters \({{\varvec{\theta }}}^*_S\) and \({\mathbf{w}}^*_S\) can be chosen arbitrarily from a mathematical point of view (i.e., no mathematical restrictions are imposed), and we do not use any specific property of \({{\varvec{\theta }}}^*_S\) or \({\mathbf{w}}^*_S\). The parameter transfer algorithm treated in this paper is then described as follows. Let N-samples and n-samples be available in the source and target domains, respectively. First, a parameter transfer algorithm outputs the estimator \(\widehat{{{\varvec{\theta }}}}_N\in \varTheta \) of \({{\varvec{\theta }}}_{\mathcal {S}}^{*}\) by using the N-samples. Next, for the parameter

$$\begin{aligned} {\mathbf{w}}^{*}_{\mathcal {T}}:= & {} \underset{{\mathbf{w}}\in \mathcal {W}_{\mathcal {T}}}{\mathrm{argmin}} \mathcal {R}_{\mathcal {T}}\left( {{\varvec{\theta }}}^{*}_{\mathcal {S}}, {\mathbf{w}}\right) \end{aligned}$$
(2)

in the target domain, the algorithm outputs its estimator

$$\begin{aligned} \widehat{{\mathbf{w}}}_{N,n}:= & {} \underset{{\mathbf{w}}\in \mathcal {W}_{\mathcal {T}}}{\mathrm{argmin}}\widehat{\mathcal {R}}_{\mathcal {T},n}(\widehat{{{\varvec{\theta }}}}_N,{\mathbf{w}}) + \rho r({\mathbf{w}}) \end{aligned}$$
(3)

by using n-samples, where \(r({\mathbf{w}})\) is a 1-strongly convex function with respect to \(\Vert \cdot \Vert _2\) and \(\rho >0\). If the source domain relates to the target domain in some sense, the effective parameter \({{\varvec{\theta }}}^{*}_{\mathcal {S}}\) in the source domain is also expected to be useful for the target task. In the next section, we regard \(\mathcal {R}_{\mathcal {T}}\left( {{\varvec{\theta }}}^{*}_{\mathcal {S}}, {\mathbf{w}}^{*}_{\mathcal {T}}\right) \) as the baseline of predictive performance and derive an excess risk bound. The validity of the baseline is discussed in Sect. 2.2.
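
The two-stage procedure in (2) and (3) can be summarized by the following minimal sketch. The source-domain estimator estimate_theta, the feature map psi, the squared loss, and the choice \(r({\mathbf{w}})=\Vert {\mathbf{w}}\Vert _2^2/2\) are illustrative assumptions rather than prescriptions of this paper; in particular, Theorem 1 below requires a Lipschitz convex loss (e.g., the absolute loss), and the squared loss is used here only to keep the optimization smooth.

```python
# Minimal sketch of the parameter transfer procedure (2)-(3); estimate_theta, psi,
# the squared loss, and r(w) = ||w||_2^2 / 2 are illustrative stand-ins.
import numpy as np
from scipy.optimize import minimize

def transfer_learn(estimate_theta, psi, source_samples, X_tgt, y_tgt, rho):
    # Stage 1: estimate the shared parameter theta_hat_N from the N source samples.
    theta_hat = estimate_theta(source_samples)

    # Stage 2: regularized empirical risk minimization (3) on the n target samples.
    features = np.array([psi(theta_hat, x) for x in X_tgt])  # rows: psi_theta(x_j)

    def objective(w):
        # empirical risk with squared loss + rho * r(w), where r(w) = ||w||^2 / 2
        residuals = features @ w - np.asarray(y_tgt)
        return np.mean(residuals ** 2) + rho * 0.5 * np.dot(w, w)

    w_hat = minimize(objective, np.zeros(features.shape[1]), method="L-BFGS-B").x
    # prediction on the target domain: x -> <w_hat, psi(theta_hat, x)>
    return theta_hat, w_hat
```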

2.2 Excess risk bound based on stability and learnability

We introduce two new notions, local stability and parameter transfer learnability, as described below. These notions are essential for deriving the excess risk bound in Theorem 1.

Definition 1

(Local Stability) A parametric feature mapping \(\psi _{{{\varvec{\theta }}}}\) is said to be locally stable if there exist \(\epsilon _{{{\varvec{\theta }}}}:\mathcal {X}\rightarrow {\mathbb {R}}_{>0}\) for each \({{\varvec{\theta }}}\in \varTheta \) and \(L_{\psi }>0\) such that, for \({{\varvec{\theta }}}'\in \varTheta \),

$$\begin{aligned} \Vert {{\varvec{\theta }}}- {{\varvec{\theta }}}'\Vert \le \epsilon _{{{\varvec{\theta }}}}({\mathbf{x}}) \Rightarrow \Vert \psi _{{{\varvec{\theta }}}}({\mathbf{x}}) - \psi _{{{\varvec{\theta }}}'}({\mathbf{x}})\Vert _{2} \le L_{\psi }\Vert {{\varvec{\theta }}}- {{\varvec{\theta }}}'\Vert . \end{aligned}$$

Local stability implies that the feature is not significantly affected by a small parameter shift. We call \(\epsilon _{{{\varvec{\theta }}}}({\mathbf{x}})\) the permissible radius of perturbation for \({{\varvec{\theta }}}\) at \({\mathbf{x}}\). For samples \({\mathbf{X}}^n=\{{\mathbf{x}}_1,\ldots ,{\mathbf{x}}_n\}\), we define \(\epsilon _{{{\varvec{\theta }}}}({\mathbf{X}}^n):=\min _{j\in [n]}\epsilon _{{{\varvec{\theta }}}}({\mathbf{x}}_j)\), where \([n]:=\{1,\ldots ,n\}\) for a positive integer n.

Next, we formulate the parameter transfer learnability based on the local stability.

Definition 2

(Parameter Transfer Learnability) Suppose that N-samples are available in the source domain and a sample \({\mathbf{x}}\) is available in the target domain. Let the parametric feature mapping \(\{\psi _{{{\varvec{\theta }}}}\}_{{{\varvec{\theta }}}\in \varTheta }\) be locally stable. For \(\bar{\delta }_N\in [0,1)\), \(\{\psi _{{{\varvec{\theta }}}}\}_{{{\varvec{\theta }}}\in \varTheta }\) is said to be parameter transfer learnable with probability \(1-\bar{\delta }_N\) if there exists an algorithm that depends only on N-samples in the source domain such that the output \(\widehat{{{\varvec{\theta }}}}_N\) of the algorithm satisfies

$$\begin{aligned} {\mathrm{Pr}}\left[ \Vert \widehat{{{\varvec{\theta }}}}_N - {{\varvec{\theta }}}^{*}_{\mathcal {S}}\Vert \le \epsilon _{{{\varvec{\theta }}}^{*}_{\mathcal {S}}}({\mathbf{x}}) \right] \ge 1-\bar{\delta }_N. \end{aligned}$$

The parameter \(\bar{\delta }_N\) is written as \(\bar{\delta }\) for short as long as no confusion arises.

The parameter transfer learnability describes whether the effective parameter is properly transferred to the target domain with high probability. For n-samples \({\mathbf{X}}^n=\{{\mathbf{x}}_1,\ldots ,{\mathbf{x}}_n\}\) in the target domain, the union bound ensures that the inequality \(\Vert \widehat{{{\varvec{\theta }}}}_N - {{\varvec{\theta }}}^{*}_{\mathcal {S}}\Vert \le \epsilon _{{{\varvec{\theta }}}^{*}_{\mathcal {S}}}({\mathbf{X}}^n)\) holds with probability greater than or equal to \(1-n\bar{\delta }_N\).

Given training samples \(\{({\mathbf{x}}_j,y_j):j=1,\ldots ,n\}\) in the target domain, let us consider the learning method

$$\begin{aligned} \min _{{\mathbf{w}}\in \mathcal {W}_{\mathcal {T}}}\ \frac{1}{n}\sum _{j=1}^{n}\ell (y_j,\,\langle {\mathbf{w}},\psi _{\widehat{\varvec{\theta }}_N}({\mathbf{x}}_j)\rangle )+\rho r({\mathbf{w}}), \end{aligned}$$

where \(\widehat{\varvec{\theta }}_N\) is the estimated parameter in the source domain using N training samples. The optimal parameter in \(\mathcal {W}_{\mathcal {T}}\) is denoted as \(\widehat{{\mathbf{w}}}_{N,n}\). Then, the following excess risk bound is obtained.

Theorem 1

(Excess Risk Bound) We assume the following conditions.

  1. The parametric feature mapping \(\psi _\theta ({\mathbf{x}})\) is bounded and locally stable with the parameter \(L_\psi \). Suppose that \(\sup _{{\varvec{\theta }\in \varTheta },{\mathbf{x}}\in \mathcal {X}}\Vert \psi _{\varvec{\theta }}({\mathbf{x}})\Vert _2\le R_\psi \) holds for some positive constant \(R_\psi \).

  2. The estimator \(\widehat{\varvec{\theta }}_N\) on the source domain satisfies the transfer learnability with probability \(1-\bar{\delta }\).

  3. The non-negative loss \(\ell (\cdot ,\cdot )\) on the target domain is \(L_\ell \)-Lipschitz and convex in the second argument. Moreover, we assume that \(\sup _y\ell (y,0)\) is bounded above by a positive constant \(L_0\).

  4. The non-negative regularization term \(r({\mathbf{w}})\) is 1-strongly convex and \(r({\varvec{0}})=0\) holds.

Then, the excess risk is bounded above by

$$\begin{aligned} \mathcal {R}_{\mathrm {excess}}&:= \mathcal {R}_{\mathcal {T}}(\widehat{\varvec{\theta }}_N,\,\widehat{{\mathbf{w}}}_{N,n}) - \mathcal {R}_{\mathcal {T}}({\varvec{\theta }}_{\mathcal {S}}^*,\,{\mathbf{w}}_{\mathcal {T}}^*) \nonumber \\&\le c\left\{ \frac{\Vert \widehat{\varvec{\theta }}_N-{\varvec{\theta }}_{\mathcal {S}}^*\Vert }{\sqrt{\rho }} +\frac{1}{\sqrt{n'\rho }} + \frac{\Vert \widehat{\varvec{\theta }}_N-{\varvec{\theta }}_{\mathcal {S}}^*\Vert ^{1/2}}{\rho ^{3/4}} + \frac{1}{n\rho }+\rho \right\} \end{aligned}$$
(4)

with probability \(1-\delta -(n+n')\bar{\delta }\), where \(n'\) is an arbitrary natural number and c is a positive constant expressed as a polynomial in \(L_\psi , R_\psi , L_\ell , L_0, r({\mathbf{w}}_{\mathcal {T}}^*)\), and \(\log (1/\delta )\).

Proof

In the proof, we define \(c_i\,(i=1,2,3,4,5)\) as positive constants depending on \(L_\psi , R_\psi , L_\ell , L_0\), and \(\log (1/\delta )\).

Since \(r({\mathbf{w}})\) is non-negative, 1-strongly convex, and satisfies \(r({\varvec{0}})=0\), we have \(r({\mathbf{w}})\ge \Vert {\mathbf{w}}\Vert ^2/2\). Combining this with the non-negativity of the loss \(\ell (\cdot ,\cdot )\), the bound \(\sup _y\ell (y,0)\le L_0\), and the optimality of \(\widehat{{\mathbf{w}}}_{N,n}\) in (3), we have

$$\begin{aligned} \frac{\rho }{2}\Vert \widehat{{\mathbf{w}}}_{N,n}\Vert ^2&\le \frac{1}{n}\sum _{j=1}^{n}\ell (y_j,\langle \widehat{{\mathbf{w}}}_{N,n},\,\psi _{\widehat{\varvec{\theta }}_N}({\mathbf{x}}_j)\rangle ) +\rho {r(\widehat{{\mathbf{w}}}_{N,n})}\\&\le \frac{1}{n}\sum _{j=1}^{n}\ell (y_j,0)+\rho r({\varvec{0}})\le L_0. \end{aligned}$$

Thus, \(\Vert \widehat{{\mathbf{w}}}_{N,n}\Vert \) is bounded above by \(\sqrt{2L_0/\rho }\). Let \(\widehat{{\mathbf{w}}}_n^*\) be the optimal solution of

$$\begin{aligned} \min _{{\mathbf{w}}\in \mathcal {W}_{\mathcal {T}}} \frac{1}{n}\sum _{j=1}^{n}\ell (y_j,\,\langle {\mathbf{w}},\psi _{{\varvec{\theta }}_{\mathcal {S}}^*}({\mathbf{x}}_j)\rangle )+\rho r({\mathbf{w}}). \end{aligned}$$

Likewise, we see that the norm of \(\widehat{{\mathbf{w}}}_n^*\) has the same upper bound.

The excess risk is decomposed into the following three terms.

$$\begin{aligned}&\mathcal {R}_{\mathcal {T}}(\widehat{\varvec{\theta }}_N,\,\widehat{{\mathbf{w}}}_{N,n}) - \mathcal {R}_{\mathcal {T}}({\varvec{\theta }}_{\mathcal {S}}^*,\,{\mathbf{w}}_{\mathcal {T}}^*) \\&\quad = \mathbb {E}_{({\mathbf{x}},y)\sim {P_\mathcal {T}}}\left[ \ell (y,\langle \widehat{{\mathbf{w}}}_{N,n},\psi _{\widehat{\varvec{\theta }}_N}({\mathbf{x}})\rangle )\right] -\mathbb {E}_{({\mathbf{x}},y)\sim {P_\mathcal {T}}}\left[ \ell (y,\langle \widehat{{\mathbf{w}}}_{N,n},\psi _{\varvec{\theta }_{\mathcal {S}}^*}({\mathbf{x}})\rangle )\right] \\&\qquad +\mathbb {E}_{({\mathbf{x}},y)\sim {P_\mathcal {T}}}\left[ \ell (y,\langle \widehat{{\mathbf{w}}}_{N,n},\psi _{\varvec{\theta }_{\mathcal {S}}^*}({\mathbf{x}})\rangle )\right] -\mathbb {E}_{({\mathbf{x}},y)\sim {P_\mathcal {T}}}\left[ \ell (y,\langle \widehat{{\mathbf{w}}}_n^*,\psi _{\varvec{\theta }_{\mathcal {S}}^*}({\mathbf{x}})\rangle )\right] \\&\qquad +\mathbb {E}_{({\mathbf{x}},y)\sim {P_\mathcal {T}}}\left[ \ell (y,\langle \widehat{{\mathbf{w}}}_n^*,\psi _{\varvec{\theta }_{\mathcal {S}}^*}({\mathbf{x}})\rangle )\right] -\mathbb {E}_{({\mathbf{x}},y)\sim {P_\mathcal {T}}}\left[ \ell (y,\langle {{\mathbf{w}}}_{\mathcal {T}}^*,\psi _{\varvec{\theta }_{\mathcal {S}}^*}({\mathbf{x}})\rangle )\right] . \end{aligned}$$

Let us consider the upper bound of each term.

For the first term of the excess risk, the following inequality holds:

$$\begin{aligned}&\mathbb {E}_{({\mathbf{x}},y)\sim {P_\mathcal {T}}}\left[ \ell (y,\langle \widehat{{\mathbf{w}}}_{N,n},\psi _{\widehat{\varvec{\theta }}_N}({\mathbf{x}})\rangle )\right] -\mathbb {E}_{({\mathbf{x}},y)\sim {P_\mathcal {T}}}\left[ \ell (y,\langle \widehat{{\mathbf{w}}}_{N,n},\psi _{\varvec{\theta }_{\mathcal {S}}^*}({\mathbf{x}})\rangle )\right] \nonumber \\&\quad \le L_\ell \sqrt{\frac{2L_0}{\rho }}\mathbb {E}_{({\mathbf{x}},y)\sim {P_\mathcal {T}}} \left[ \Vert \psi _{\widehat{\varvec{\theta }}_N}({\mathbf{x}}) - \psi _{\varvec{\theta }_{\mathcal {S}}^*}({\mathbf{x}}) \Vert \right] . \end{aligned}$$
(5)

Here, for an arbitrary natural number \(n'\), we introduce independent random variables \(\bar{{\mathbf{x}}}_1,...,\bar{{\mathbf{x}}}_{n'}\) (called ghost samples) such that the probability distribution of each \(\bar{{\mathbf{x}}}_j\) is the marginal distribution of \(P_{\mathcal {T}}\). Then, we have the following bound with probability greater than \(1-\delta /2\) by Hoeffding’s inequality:

$$\begin{aligned}&\mathbb {E}_{({\mathbf{x}},y)\sim {P_\mathcal {T}}} \left[ \Vert \psi _{\widehat{\varvec{\theta }}_N}({\mathbf{x}}) - \psi _{\varvec{\theta }_{\mathcal {S}}^*}({\mathbf{x}}) \Vert \right] \nonumber \\&\qquad \le \frac{1}{n'}\sum _{i=1}^{n'}\Vert \psi _{\widehat{\varvec{\theta }}_N}(\bar{{\mathbf{x}}}_i) - \psi _{\varvec{\theta }_{\mathcal {S}}^*}(\bar{{\mathbf{x}}}_i)\Vert + R_{\psi }\sqrt{\frac{2\log (2/\delta )}{n'}}. \end{aligned}$$
(6)

Moreover, since it holds that \(\Vert \psi _{\widehat{\varvec{\theta }}_N}(\bar{{\mathbf{x}}}_i) - \psi _{\varvec{\theta }_{\mathcal {S}}^*}(\bar{{\mathbf{x}}}_i)\Vert \le L_{\psi }\Vert \widehat{\varvec{\theta }}_N - \varvec{\theta }_{\mathcal {S}}^*\Vert \) with probability greater than \(1-\bar{\delta }\) by local stability and parameter transfer learnability, we have the following bound with probability greater than \(1-n'\bar{\delta }\) by the union bound:

$$\begin{aligned} \frac{1}{n'}\sum _{i=1}^{n'}\Vert \psi _{\widehat{\varvec{\theta }}_N}(\bar{{\mathbf{x}}}_i) - \psi _{\varvec{\theta }_{\mathcal {S}}^*}(\bar{{\mathbf{x}}}_i)\Vert \le L_{\psi } \Vert \widehat{\varvec{\theta }}_N - \varvec{\theta }_{\mathcal {S}}^*\Vert . \end{aligned}$$
(7)

From (5)–(7), with probability greater than \(1-\delta /2-n'\bar{\delta }\), we obtain

$$\begin{aligned}&\mathbb {E}_{({\mathbf{x}},y)\sim {P_\mathcal {T}}}\left[ \ell (y,\langle \widehat{{\mathbf{w}}}_{N,n},\psi _{\widehat{\varvec{\theta }}_N}({\mathbf{x}})\rangle )\right] -\mathbb {E}_{({\mathbf{x}},y)\sim {P_\mathcal {T}}}\left[ \ell (y,\langle \widehat{{\mathbf{w}}}_{N,n},\psi _{\varvec{\theta }_{\mathcal {S}}^*}({\mathbf{x}})\rangle )\right] \\&\qquad \le L_\ell L_{\psi }\sqrt{\frac{2L_0}{\rho }} \Vert \widehat{\varvec{\theta }}_N - \varvec{\theta }_{\mathcal {S}}^*\Vert +L_\ell R_{\psi }\sqrt{\frac{2L_0}{\rho }} \sqrt{\frac{2\log (2/\delta )}{n'}}\\&\qquad = c_1 \frac{\Vert \widehat{\varvec{\theta }}_N - \varvec{\theta }_{\mathcal {S}}^*\Vert }{\sqrt{\rho }} + c_2 \frac{1}{\sqrt{n'\rho }}. \end{aligned}$$

Next, we provide an upper bound of the second term of the decomposed excess risk. To do so, we first bound \(\Vert \widehat{{\mathbf{w}}}_n^* - \widehat{{\mathbf{w}}}_{N,n}\Vert \). The 1-strong convexity of r implies the \(\rho \)-strong convexity in \({\mathbf{w}}\) of the regularized empirical loss. Hence, applying the strong-convexity inequality at the minimizers \(\widehat{{\mathbf{w}}}_n^*\) and \(\widehat{{\mathbf{w}}}_{N,n}\) of the respective objectives, we have

$$\begin{aligned}&\frac{1}{n} \sum _{j=1}^n \ell (y_j,\langle \widehat{{\mathbf{w}}}_n^*,\psi _{\varvec{\theta }_{\mathcal {S}}^*}({\mathbf{x}}_j)\rangle ) +\rho r(\widehat{{\mathbf{w}}}_n^*) + \frac{\rho }{2} \Vert \widehat{{\mathbf{w}}}_n^* - \widehat{{\mathbf{w}}}_{N,n}\Vert ^2\\&\qquad \le \frac{1}{n} \sum _{j=1}^n \ell (y_j,\langle \widehat{{\mathbf{w}}}_{N,n},\psi _{\varvec{\theta }_{\mathcal {S}}^*}({\mathbf{x}}_j)\rangle ) +\rho r(\widehat{{\mathbf{w}}}_{N,n}) \end{aligned}$$

and

$$\begin{aligned}&\frac{1}{n} \sum _{j=1}^n \ell (y_j,\langle \widehat{{\mathbf{w}}}_{N,n},\psi _{\widehat{\varvec{\theta }}_N}({\mathbf{x}}_j)\rangle ) +\rho r(\widehat{{\mathbf{w}}}_{N,n}) + \frac{\rho }{2} \Vert \widehat{{\mathbf{w}}}_n^* - \widehat{{\mathbf{w}}}_{N,n}\Vert ^2\\&\qquad \le \frac{1}{n} \sum _{j=1}^n \ell (y_j,\langle \widehat{{\mathbf{w}}}_n^*,\psi _{\widehat{\varvec{\theta }}_N}({\mathbf{x}}_j)\rangle ) +\rho r(\widehat{{\mathbf{w}}}_n^*). \end{aligned}$$

Summing up the above two inequalities, we have

$$\begin{aligned} \rho \Vert \widehat{{\mathbf{w}}}_n^* - \widehat{{\mathbf{w}}}_{N,n}\Vert ^2\le & {} \frac{1}{n} \sum _{j=1}^n \left( \ell (y_j,\langle \widehat{{\mathbf{w}}}_n^*,\psi _{\widehat{\varvec{\theta }}_N}({\mathbf{x}}_j)\rangle ) - \ell (y_j,\langle \widehat{{\mathbf{w}}}_n^*,\psi _{\varvec{\theta }_{\mathcal {S}}^*}({\mathbf{x}}_j)\rangle ) \right) \\&+\,\frac{1}{n} \sum _{j=1}^n \left( \ell (y_j,\langle \widehat{{\mathbf{w}}}_{N,n},\psi _{\varvec{\theta }_{\mathcal {S}}^*}({\mathbf{x}}_j)\rangle ) - \ell (y_j,\langle \widehat{{\mathbf{w}}}_{N,n},\psi _{\widehat{\varvec{\theta }}_N}({\mathbf{x}}_j)\rangle ) \right) \\\le & {} 2L_{\ell }L_{\psi }\sqrt{\frac{2 L_0}{\rho }} \left\| \widehat{\varvec{\theta }}_N - \varvec{\theta }_{\mathcal {S}}^* \right\| . \end{aligned}$$

The last inequality holds with probability greater than or equal to \(1-n\bar{\delta }\) owing to the parameter transfer learnability and local stability. Thus, \(\Vert \widehat{{\mathbf{w}}}_n^* - \widehat{{\mathbf{w}}}_{N,n}\Vert \le 2^{3/4} (L_\ell L_\psi L_0^{1/2})^{1/2} \Vert \widehat{\varvec{\theta }}_N - \varvec{\theta }_{\mathcal {S}}^*\Vert ^{1/2}/ \rho ^{3/4}\) holds. Hence, the second term of the decomposed excess risk is bounded above by

$$\begin{aligned}&\mathbb {E}_{({\mathbf{x}},y)\sim {P_\mathcal {T}}}\left[ \ell (y,\langle \widehat{{\mathbf{w}}}_{N,n},\psi _{\varvec{\theta }_{\mathcal {S}}^*}({\mathbf{x}})\rangle )\right] -\mathbb {E}_{({\mathbf{x}},y)\sim {P_\mathcal {T}}}\left[ \ell (y,\langle \widehat{{\mathbf{w}}}_n^*,\psi _{\varvec{\theta }_{\mathcal {S}}^*}({\mathbf{x}})\rangle )\right] \\&\qquad \le L_\ell R_\psi \Vert \widehat{{\mathbf{w}}}_{N,n}-\widehat{{\mathbf{w}}}_n^*\Vert \\&\qquad \le 2^{3/4} L_\ell ^{3/2} L_\psi ^{1/2} L_0^{1/4} R_\psi \frac{\Vert \widehat{\varvec{\theta }}_N-{\varvec{\theta }}_{\mathcal {S}}^*\Vert ^{1/2}}{\rho ^{3/4}} =c_3 \frac{\Vert \widehat{\varvec{\theta }}_N-{\varvec{\theta }}_{\mathcal {S}}^*\Vert ^{1/2}}{\rho ^{3/4}} \end{aligned}$$

with probability \(1-n\bar{\delta }\).

For the third term of the excess risk, we obtain the following upper bound with probability \(1-\delta /2\) by Theorem 1 of Sridharan et al. (2009):

$$\begin{aligned}&\mathbb {E}_{({\mathbf{x}},y)\sim {P_\mathcal {T}}}\left[ \ell (y,\langle \widehat{{\mathbf{w}}}_n^*,\psi _{\varvec{\theta }_{\mathcal {S}}^*}({\mathbf{x}})\rangle )\right] -\mathbb {E}_{({\mathbf{x}},y)\sim {P_\mathcal {T}}}\left[ \ell (y,\langle {{\mathbf{w}}}_{\mathcal {T}}^*,\psi _{\varvec{\theta }_{\mathcal {S}}^*}({\mathbf{x}})\rangle )\right] \\&\qquad \le \frac{8L_\ell ^2 R_\psi ^2(32+\log (2/\delta ))}{n\rho }+\rho r({\mathbf{w}}_{\mathcal {T}}^*) = \frac{c_4}{n\rho }+c_5\rho . \end{aligned}$$

Combining the above results, we obtain

$$\begin{aligned} \mathcal {R}_{\mathrm {excess}} \le c\left\{ \frac{\Vert \widehat{\varvec{\theta }}_N-{\varvec{\theta }}_{\mathcal {S}}^*\Vert }{\sqrt{\rho }} +\frac{1}{\sqrt{n'\rho }} + \frac{\Vert \widehat{\varvec{\theta }}_N-{\varvec{\theta }}_{\mathcal {S}}^*\Vert ^{1/2}}{\rho ^{3/4}} + \frac{1}{n\rho }+\rho \right\} \end{aligned}$$

with probability of at least \(1-\delta -(n+n')\bar{\delta }\). \(\square \)

We mention the relation between our formulation and a fast rate result of the excess risk in “Appendix A”. The optimal \(\rho \) is obtained by minimizing the upper bound of the excess risk.

Corollary 1

Suppose that the conditions 1, 3, and 4 in Theorem 1 hold. In addition, we assume that there exist a real number \(\beta \ge 1\) and a sequence \(\tau _N\) such that

$$\begin{aligned} \mathbb {E}\big [\epsilon _{\varvec{\theta }_{S}^*}({\mathbf{x}})^{-\beta }\big ]<\infty , \quad \text {and}\quad \mathbb {E}[\Vert \widehat{\varvec{\theta }}_N-{\varvec{\theta }}_{\mathcal {S}}^*\Vert ^\beta ] \le \tau _N^\beta \longrightarrow 0\ \ (N\rightarrow \infty ). \end{aligned}$$
(8)

When \(n\tau _N^\beta \) is sufficiently small, the asymptotic upper bound of the excess risk is given as

$$\begin{aligned} \mathcal {R}_{\mathrm {excess}} \le c \max \{n^{-1/2},\,\tau _N^{2/7}\}, \end{aligned}$$

by setting \(\rho =\varTheta (\max \{n^{-1/2},\,\tau _N^{2/7}\})\).

Proof

The assumptions (8) and Markov’s inequality lead to \({\mathrm{Pr}}\left[ \Vert \widehat{{{\varvec{\theta }}}}_N - {{\varvec{\theta }}}^{*}_{\mathcal {S}}\Vert /\tau _N \ge a \right] \le a^{-\beta }\) and

$$\begin{aligned} {\mathrm{Pr}}\left[ \Vert \widehat{{{\varvec{\theta }}}}_N - {{\varvec{\theta }}}^{*}_{\mathcal {S}}\Vert \ge \epsilon _{{{\varvec{\theta }}}^{*}_{\mathcal {S}}}({\mathbf{x}}) \right] \le C\tau _N^\beta , \end{aligned}$$
(9)

where C is a positive constant. Here, the independence of the source and target samples is used. The second inequality shows that the parameter transfer learnability holds by setting \(\bar{\delta }_N=C\tau _N^\beta \). From the first inequality, we have \(\Vert \widehat{{{\varvec{\theta }}}}_N-{{\varvec{\theta }}}^{*}_{\mathcal {S}}\Vert /\sqrt{\rho }=O_p(\tau _N/\sqrt{\rho })\) and \(\Vert \widehat{{{\varvec{\theta }}}}_N-{{\varvec{\theta }}}^{*}_{\mathcal {S}}\Vert ^{1/2}/\rho ^{3/4}=O_p(\tau _N^{1/2}/\rho ^{3/4})\), where \(O_p\) denotes the probabilistic order. Let \(\delta \) be a small positive constant, and define \(n'\) by \(n'=\delta /\bar{\delta }_N=\delta /(C\tau _N^\beta )\). We have

$$\begin{aligned} \frac{\Vert \widehat{{{\varvec{\theta }}}}_N-{{\varvec{\theta }}}^{*}_{\mathcal {S}}\Vert }{\sqrt{\rho }}+\frac{1}{\sqrt{n'\rho }} = O_p(\tau _N^{\min \{1,\beta /2\}}/\sqrt{\rho }). \end{aligned}$$

Suppose that \(\rho \rightarrow 0\) and \(n\rho \rightarrow \infty \) hold as \(n\rightarrow \infty \) and \(\tau _N\rightarrow 0\) while keeping \(n \bar{\delta }_N=Cn\tau _N^\beta \) sufficiently small. For large n and small \(\tau _N\), we have \(\tau _N^{\min \{1,\beta /2\}}/\sqrt{\rho }\le \tau _N^{1/2}/\rho ^{3/4}\). Hence, we obtain

$$\begin{aligned} \mathcal {R}_{\mathrm {excess}}\le c\left\{ \frac{\tau _N^{1/2}}{\rho ^{3/4}} + \frac{1}{n\rho } + \rho \right\} \end{aligned}$$

with probability greater than \(1-2\delta -n\bar{\delta }_N\). Substituting

$$\begin{aligned} \rho =\varTheta \left( \max \left\{ n^{-1/2},\,\tau _N^{2/7}\right\} \right) \end{aligned}$$

which satisfies the above condition, we have \(\mathcal {R}_{\mathrm {excess}}\le c\max \{n^{-1/2},\, \tau _N^{2/7}\}\) with high probability. \(\square \)

The upper bound of the excess risk consists of the bias term \(\tau _N\) induced by the source domain and the sample complexity on the target domain. If \(\tau _N\) is large, additional training samples on the target domain will not help attain high prediction accuracy. Conversely, when the bias term \(\tau _N\) is sufficiently small, the excess risk is bounded above by \(\mathcal {O}(n^{-1/2})\), which is the standard asymptotic order of supervised learning using n i.i.d. samples.

Remark 1

Suppose that the bias \(\tau _N\) on the source domain is of the order \(N^{-1/2}\), which is the standard order in parameter estimation.Footnote 2 When \(n\tau _N^\beta \) is sufficiently small for some \(\beta \ge 1\), we have \(n=\mathcal {O}(N^{\beta /2})\). If \(c' N^{2/7}\le n\le c'' N^{\beta /2}\) holds for some constants \(c',c''\), the excess risk is of the order \(\mathcal {O}(N^{-1/7})\). For \(n=\mathcal {O}(N^{2/7})\), we have \(\mathcal {R}_{\mathrm {excess}}=\mathcal {O}(n^{-1/2})\). Given an acceptable level of the excess risk, the above result provides a rough estimate of the required sample sizes in both the source and target domains.
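
As a brief check of the orders in Remark 1, under the assumption \(\tau _N=\varTheta (N^{-1/2})\) and within the regime \(n=\mathcal {O}(N^{\beta /2})\) required by Corollary 1,

$$\begin{aligned} \tau _N^{2/7}=\varTheta (N^{-1/7}) \quad \text {and}\quad \mathcal {R}_{\mathrm {excess}} \le c\max \{n^{-1/2},\,N^{-1/7}\}. \end{aligned}$$

If \(n\ge c'N^{2/7}\), then \(n^{-1/2}\le c'^{-1/2}N^{-1/7}\) and the bound is of order \(N^{-1/7}\); if \(n\le c''N^{2/7}\), then \(N^{-1/7}\le \sqrt{c''}\,n^{-1/2}\) and the bound is of order \(n^{-1/2}\).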

One can regard \(\mathcal {R}_{\mathcal {T}}^{*}=\min _{{{\varvec{\theta }}},{\mathbf{w}}}\mathcal {R}_{\mathcal {T}}\left( {{\varvec{\theta }}}, {\mathbf{w}}\right) \) as the baseline instead of \(\mathcal {R}_{\mathcal {T}}\left( {{\varvec{\theta }}}^{*}_{\mathcal {S}}, {\mathbf{w}}^{*}_{\mathcal {T}}\right) \). In this case, the risk bound is decomposed into

$$\begin{aligned} \mathcal {R}_{\mathcal {T}}(\widehat{{{\varvec{\theta }}}}_N, \widehat{{\mathbf{w}}}_{N,n})-\mathcal {R}_{\mathcal {T}}^{*} = \underbrace{ \big ( \mathcal {R}_{\mathcal {T}}(\widehat{{{\varvec{\theta }}}}_N, \widehat{{\mathbf{w}}}_{N,n}) - \mathcal {R}_{\mathcal {T}}({{\varvec{\theta }}}_{\mathcal {S}}^*,{\mathbf{w}}^{*}_{\mathcal {T}}) \big )}_{\mathcal {R}_{\mathrm {excess}}} + \underbrace{\big ( \mathcal {R}_{\mathcal {T}}({{\varvec{\theta }}}_{\mathcal {S}}^*, {\mathbf{w}}^{*}_{\mathcal {T}}) -\mathcal {R}_{\mathcal {T}}^{*} \big )}_{\mathcal {R}_{\mathrm {gap}}}. \end{aligned}$$

The first term, \(\mathcal {R}_{\mathrm {excess}}\), is the excess risk of transfer learning relative to the optimal parameter on the source domain; its upper bound is presented in Theorem 1 and Corollary 1. The second term, \(\mathcal {R}_{\mathrm {gap}}\), is interpreted as the difference between the source and target domains.

In an ideal situation, transfer learning can be regarded as a method to reduce the bias of the model; this is explained next. Suppose that \(\mathcal {R}_{\mathrm {gap}}\) is close to zero and N is sufficiently large. Then, self-taught learning with the optimal parameter \({{\varvec{\theta }}}_{\mathcal {S}}^{*}\) is approximately realized. However, in the common learning setup using samples from only the target domain, the optimal feature representation \(\psi _{{{\varvec{\theta }}}_{\mathcal {S}}^{*}}\) will not be available. This is thought to be the main reason why transfer learning is advantageous over standard learning methods.

Conversely, if \(\mathcal {R}_{\mathrm {gap}}\) is much larger than \(\mathcal {R}_{\mathrm {excess}}\), negative transfer can easily occur, i.e., transfer learning actually decreases the prediction performance. This is because a parameter \({{\varvec{\theta }}}\) superior to \({{\varvec{\theta }}}_{\mathcal {S}}^{*}\) can then be found with little effort.

Example 1

As an example of \(\mathcal {R}_{\text {gap}}\), let us consider regression analysis using the basis function \(\psi _{{{\varvec{\theta }}}}\). We assume that the labels in the source and target domains are given as \(y={\mathbf{w}}_{\mathcal {S}}^{\top }\psi _{{{\varvec{\theta }}}_\mathcal {S}}({\mathbf{x}})+\xi \) and \(y={\mathbf{w}}_{\mathcal {T}}^{\top }\psi _{{{\varvec{\theta }}}_\mathcal {T}}({\mathbf{x}})+\epsilon \), respectively, where \(\xi \) and \(\epsilon \) are noise random variables with mean 0. In addition, let the loss function be \(\ell (y,y'):=|y-y'|\) and the effective parameters in the source domain be \({{\varvec{\theta }}}^*_\mathcal {S}={{\varvec{\theta }}}_\mathcal {S}\), \({\mathbf{w}}^*_\mathcal {S}={\mathbf{w}}_\mathcal {S}\). Then, it holds that

$$\begin{aligned} \mathcal {R}_{\text {gap}} :=&\,\mathcal {R}_{\mathcal {T}}({{\varvec{\theta }}}^*_\mathcal {S},{\mathbf{w}}^*_\mathcal {T}) - \mathcal {R}^*_{\mathcal {T}}\\ =&\,\, \mathbb {E}_{\mathcal {T}}[|{\mathbf{w}}_{\mathcal {T}}^{\top }\psi _{{{\varvec{\theta }}}_\mathcal {T}}({\mathbf{x}}) +\epsilon - {\mathbf{w}}_{\mathcal {T}}^{*\top }\psi _{{{\varvec{\theta }}}_\mathcal {S}}({\mathbf{x}})|] - \mathbb {E}_{\mathcal {T}}[|{\mathbf{w}}_{\mathcal {T}}^{\top }\psi _{{{\varvec{\theta }}}_\mathcal {T}}({\mathbf{x}}) +\epsilon - {\mathbf{w}}_{\mathcal {T}}^{\top }\psi _{{{\varvec{\theta }}}_\mathcal {T}}({\mathbf{x}})|]\\ \le&\,\, \mathbb {E}_{\mathcal {T}}[|{\mathbf{w}}_{\mathcal {T}}^{\top }\psi _{{{\varvec{\theta }}}_\mathcal {T}}({\mathbf{x}}) - {\mathbf{w}}_{\mathcal {T}}^{*\top }\psi _{{{\varvec{\theta }}}_\mathcal {S}}({\mathbf{x}})|] + \mathbb {E}[|\epsilon |] - \mathbb {E}[|\epsilon |]\\ \le&\,\, \mathbb {E}_{\mathcal {T}}[|{\mathbf{w}}_{\mathcal {T}}^{\top }(\psi _{{{\varvec{\theta }}}_\mathcal {T}}({\mathbf{x}}) - \psi _{{{\varvec{\theta }}}_\mathcal {S}}({\mathbf{x}}))|] + \mathbb {E}_{\mathcal {T}}[|({\mathbf{w}}_{\mathcal {T}} - {\mathbf{w}}^*_{\mathcal {T}})^{\top } \psi _{{{\varvec{\theta }}}_\mathcal {S}}({\mathbf{x}})|] \\ \le&\,\, \Vert {\mathbf{w}}_\mathcal {T}\Vert \mathbb {E}_{\mathcal {T}}[\Vert \psi _{{{\varvec{\theta }}}_\mathcal {T}}({\mathbf{x}}) - \psi _{{{\varvec{\theta }}}_\mathcal {S}}({\mathbf{x}})\Vert ] + R_\psi \Vert {\mathbf{w}}_{\mathcal {T}} - {\mathbf{w}}^*_{\mathcal {T}}\Vert . \end{aligned}$$

Thus, this upper bound of \(\mathcal {R}_{\text {gap}}\) shows that, if the parameter \({{\varvec{\theta }}}_{\mathcal {S}}\) of the optimal feature map in the source domain is distant from the corresponding parameter \({{\varvec{\theta }}}_{\mathcal {T}}\) in the target domain, the first term can be large; accordingly, the second term can also be large because \({\mathbf{w}}^*_{\mathcal {T}}\) depends on \({{\varvec{\theta }}}_{\mathcal {S}}\).

A simple way to avoid negative transfer is to assess \(\mathcal {R}_{\mathrm {gap}}\). A naive statistic,

$$\begin{aligned} \widehat{\mathcal {R}}_{\mathrm {gap}}=\widehat{\mathcal {R}}_{\mathcal {T},n}(\widehat{{{\varvec{\theta }}}}_N, \widehat{{\mathbf{w}}}_{N,n}) -\min _{{{\varvec{\theta }}},{\mathbf{w}}}\widehat{\mathcal {R}}_{\mathcal {T},n}({{\varvec{\theta }}},{\mathbf{w}}), \end{aligned}$$

is available to estimate \(\mathcal {R}_{\mathrm {gap}}\). When \(\widehat{\mathcal {R}}_{\mathrm {gap}}\) is significantly larger than the order of \(\mathcal {O}(n^{-1/2})\), we will need more elaborate learning on the source domain or fine-tuning (Goodfellow et al. 2016, Sec. 8.7.4) of the parameter \({{\varvec{\theta }}}\) using samples on the target domain. Domain adaptation is another promising method to avoid a large \(\mathcal {R}_{\mathrm {gap}}\) when samples in the source and target domains are simultaneously available. We do not go into the details of this case here. In this paper, we assume that \(\mathcal {R}_{\mathrm {gap}}\) is sufficiently small and focus on the excess risk \(\mathcal {R}_{\mathrm {excess}}\) via local stability and parameter transfer learnability.
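
A hedged sketch of how the naive statistic \(\widehat{\mathcal {R}}_{\mathrm {gap}}\) might be computed is given below. The joint minimization over \(({{\varvec{\theta }}},{\mathbf{w}})\) is approximated by re-estimating the feature parameter on the target inputs and then fitting \({\mathbf{w}}\); the helpers estimate_theta and fit_w, as well as the generic loss and feature map psi, are assumptions of this sketch and not part of the paper.

```python
# Sketch of the naive gap statistic; estimate_theta, fit_w, loss, and psi are
# assumed to be supplied by the user.
import numpy as np

def empirical_risk(loss, w, theta, psi, X, y):
    # \hat{R}_{T,n}(theta, w) = (1/n) * sum_j loss(y_j, <w, psi_theta(x_j)>)
    return np.mean([loss(yj, w @ psi(theta, xj)) for xj, yj in zip(X, y)])

def gap_statistic(loss, psi, estimate_theta, fit_w, theta_hat_N, w_hat_Nn, X, y):
    # Empirical risk of the transferred parameters on the target samples ...
    transferred = empirical_risk(loss, w_hat_Nn, theta_hat_N, psi, X, y)
    # ... minus an approximation of min_{theta, w} \hat{R}_{T,n}(theta, w),
    # obtained by re-learning theta on the target inputs and refitting w.
    theta_tgt = estimate_theta(X)
    w_tgt = fit_w(theta_tgt, X, y)
    return transferred - empirical_risk(loss, w_tgt, theta_tgt, psi, X, y)
```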

3 Stability and learnability in sparse coding

In this section, we consider sparse coding in self-taught learning, where the source domain essentially consists of the sample space \(\mathcal {X}_{\mathcal {S}}\) without the label space \(\mathcal {Y}_{\mathcal {S}}\). We assume that the sample spaces in both domains are \({\mathbb {R}}^d\). Then, the sparse coding method considered here consists of a two-stage procedure, where a dictionary is learnt on the source domain, and then sparse coding with the learnt dictionary is used for a predictive task in the target domain.

First, we show that sparse coding satisfies local stability in Sect. 3.1, and then explain how appropriate dictionary learning algorithms satisfy parameter transfer learnability in Sect. 3.3. As a consequence of Theorem 1, we obtain an excess risk bound for self-taught learning algorithms based on sparse coding. We note that the results in this section are useful independently of transfer learning.

Next, we summarize the notation used in this section. Let \(\Vert \cdot \Vert _p\) be the p-norm on \({\mathbb {R}}^d\). We define \({\mathrm{supp}}({\mathbf{a}}):=\{i\in [m]|a_i\ne 0\}\) for \({\mathbf{a}}\in {\mathbb {R}}^m\). We denote the number of elements of a set S by |S|. When a vector \({\mathbf{a}}\) satisfies \(\Vert {\mathbf{a}}\Vert _0=|{\mathrm{supp}}({\mathbf{a}})|\le k\), \({\mathbf{a}}\) is said to be k-sparse. We set \(\mathcal {D}:= \{ {\mathbf{D}}=[{\mathbf{d}}_1,\ldots ,{\mathbf{d}}_m]\in {\mathbb {R}}^{d\times m}|~\Vert {\mathbf{d}}_j\Vert _2=1~(j=1,\ldots ,m)\}\), and each \({\mathbf{D}}\in \mathcal {D}\) represents a dictionary of size m.

Definition 3

(Induced matrix norm)Footnote 3 For an arbitrary matrix \({\mathbf{E}}=[{\mathbf{e}}_1,\ldots ,{\mathbf{e}}_m]\in {\mathbb {R}}^{d\times m}\), the induced matrix norm is defined by \(\Vert {\mathbf{E}}\Vert _{1,2} := \max _{i\in [m]} \Vert {\mathbf{e}}_i\Vert _2\).

We adopt \(\Vert \cdot \Vert _{1,2}\) to measure the difference between dictionaries since it is typically used in the framework of dictionary learning. We note that \(\Vert {\mathbf{D}}- \tilde{{\mathbf{D}}}\Vert _{1,2}\le 2\) holds for arbitrary dictionaries \({\mathbf{D}}, \tilde{{\mathbf{D}}}\in \mathcal {D}\).
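
For reference, the induced norm of Definition 3 is simply the maximum column-wise Euclidean norm; a minimal NumPy helper (assuming the matrix is stored with the atoms as columns, as in the definition) is the following.

```python
import numpy as np

def norm_1_2(E):
    # ||E||_{1,2} = max_i ||e_i||_2 over the columns e_1, ..., e_m of E
    return np.linalg.norm(E, axis=0).max()
```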

3.1 Local stability of sparse representation

In this section, we show the local stability of sparse representation under a sparse model. A sparse representation with dictionary parameter \({\mathbf{D}}\) of a sample \({\mathbf{x}}\in {\mathbb {R}}^d\) is expressed as follows:

$$\begin{aligned} {\varphi }_{{\mathbf{D}}}({\mathbf{x}}) :=\underset{{\mathbf{z}}\in {\mathbb {R}}^m}{\mathrm{argmin}} \frac{1}{2}\Vert {\mathbf{x}}- {\mathbf{D}}{\mathbf{z}}\Vert _2^2 + \lambda \Vert {\mathbf{z}}\Vert _1, \end{aligned}$$
(10)

where \(\lambda >0\) is a regularization parameter that induces sparsity. This situation corresponds to the case where \({{\varvec{\theta }}}={\mathbf{D}}\) and \(\psi _{{{\varvec{\theta }}}} = {\varphi }_{{\mathbf{D}}}\) in the setting of Sect. 2.1.
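
Equation (10) is an ordinary LASSO problem in which the dictionary plays the role of the design matrix. The following sketch computes \({\varphi }_{{\mathbf{D}}}({\mathbf{x}})\) with scikit-learn's Lasso solver as a stand-in for the implementation used later in the experiments; the rescaling alpha \(=\lambda /d\) accounts for the \(1/(2d)\) factor in scikit-learn's objective, and the random dictionary and sample below are purely illustrative.

```python
# Sketch of the sparse representation phi_D(x) in Eq. (10) via scikit-learn's Lasso.
import numpy as np
from sklearn.linear_model import Lasso

def sparse_code(D, x, lam):
    # Solves argmin_z (1/2) * ||x - D z||_2^2 + lam * ||z||_1, where D is d x m.
    # scikit-learn's Lasso minimizes (1/(2*d)) * ||x - D z||_2^2 + alpha * ||z||_1,
    # so alpha = lam / d yields the same minimizer.
    d = D.shape[0]
    lasso = Lasso(alpha=lam / d, fit_intercept=False, max_iter=10000)
    lasso.fit(D, x)
    return lasso.coef_

# illustrative usage: random unit-norm dictionary and a k-sparse sample plus noise
rng = np.random.default_rng(0)
d, m, k, lam = 64, 128, 3, 0.1
D = rng.normal(size=(d, m))
D /= np.linalg.norm(D, axis=0)          # normalize atoms so that ||d_j||_2 = 1
x = D[:, :k] @ rng.normal(size=k) + 0.01 * rng.normal(size=d)
z = sparse_code(D, x, lam)
print(np.count_nonzero(z), "non-zero coefficients out of", m)
```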

We define some notions used in the discussion on stability of sparse representation. The following k-margin was introduced by Mehta and Gray (2013).

Definition 4

(k-margin) Given a dictionary \({\mathbf{D}}=[{\mathbf{d}}_1,\ldots ,{\mathbf{d}}_m]\in \mathcal {D}\) and a point \({\mathbf{x}}\in {\mathbb {R}}^d\), the k-margin of \({\mathbf{D}}\) on \({\mathbf{x}}\) is

$$\begin{aligned} {\mathcal {M}}_{k}({\mathbf{D}},{\mathbf{x}}) :=\max _{\mathcal {I}\subset [m], |\mathcal {I}|=m-k} \min _{j\in \mathcal {I}} \left\{ \lambda - |\langle {\mathbf{d}}_j, {\mathbf{x}}-{\mathbf{D}}{\varphi }_{{\mathbf{D}}}({\mathbf{x}})\rangle |\right\} . \end{aligned}$$

The following \(\mu \)-incoherence differs from the k-incoherence defined in Mehta and Gray (2013), although the two are related to each other as stated in Remark 2.

Definition 5

(\(\mu \)-incoherence) A dictionary matrix \({\mathbf{D}}=[{\mathbf{d}}_1,\ldots ,{\mathbf{d}}_m] \in \mathcal {D}\) is said to be \(\mu \)-incoherent if \(|\langle {\mathbf{d}}_i, {\mathbf{d}}_j \rangle |\le \mu /\sqrt{d}\) for all \(i\ne j\).

Then, the following theorem is obtained.

Theorem 2

(Local Stability of Sparse Coding) Let \({\mathbf{D}}\in \mathcal {D}\) be \(\mu \)-incoherent for \(\mu <\sqrt{d}/k\) and \(\Vert {\mathbf{D}}-\tilde{{\mathbf{D}}}\Vert _{1,2}\le \lambda \). When

$$\begin{aligned} \Vert {\mathbf{D}}- \tilde{{\mathbf{D}}}\Vert _{1,2} \le \epsilon _{k,{\mathbf{D}}}({\mathbf{x}}) := \frac{{\mathcal {M}}_{k}({\mathbf{D}},{\mathbf{x}})^2 \lambda }{64\max \{1, \Vert {\mathbf{x}}\Vert \}^4}, \end{aligned}$$
(11)

the following stability bound holds.

$$\begin{aligned} \left\| {\varphi }_{{\mathbf{D}}}({\mathbf{x}}) - {\varphi }_{\tilde{{\mathbf{D}}}}({\mathbf{x}})\right\| _2 \le \frac{2\sqrt{k}\left( 1+{2 \Vert {\mathbf{x}}\Vert _2}/{\lambda } \right) \Vert {\mathbf{x}}\Vert _2}{ 1-\mu k/\sqrt{d}}\Vert {\mathbf{D}}- \tilde{{\mathbf{D}}}\Vert _{1,2} \end{aligned}$$

From Theorem 2, \(\epsilon _{k,{\mathbf{D}}}({\mathbf{x}})\) becomes the permissible radius of perturbation in Definition 1.
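
As a computational illustration of Definition 4 and the radius in (11): the maximizing index set \(\mathcal {I}\) keeps the \(m-k\) largest values of \(\lambda - |\langle {\mathbf{d}}_j, {\mathbf{x}}-{\mathbf{D}}{\varphi }_{{\mathbf{D}}}({\mathbf{x}})\rangle |\), so the k-margin equals the \((k+1)\)-th smallest of these values. A minimal sketch, with the sparse code \({\varphi }_{{\mathbf{D}}}({\mathbf{x}})\) assumed to be precomputed (e.g., by the LASSO sketch after (10)), is as follows.

```python
# Sketch of the k-margin (Definition 4) and the permissible radius (11);
# phi_x denotes a precomputed sparse code phi_D(x).
import numpy as np

def k_margin(D, x, phi_x, lam, k):
    # The maximizing I of size m - k keeps the m - k largest values of
    # lam - |<d_j, x - D phi_D(x)>|, so M_k(D, x) is the (k+1)-th smallest value.
    values = lam - np.abs(D.T @ (x - D @ phi_x))
    return np.sort(values)[k]

def permissible_radius(D, x, phi_x, lam, k):
    # epsilon_{k,D}(x) = M_k(D, x)^2 * lam / (64 * max{1, ||x||}^4), Eq. (11)
    margin = k_margin(D, x, phi_x, lam, k)
    return margin ** 2 * lam / (64.0 * max(1.0, np.linalg.norm(x)) ** 4)
```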

Remark 2

We mention the relation between the \(\mu \)-incoherence defined above and k-incoherence of a dictionary, which is the assumption of the sparse coding stability in Mehta and Gray (2013). For \(k\in [m]\) and \({\mathbf{D}}\in \mathcal {D}\), the k-incoherence \(s_k({\mathbf{D}})\) is defined as

$$\begin{aligned} s_k({\mathbf{D}}):=(\min \{\varsigma _k({\mathbf{D}}_{\varLambda })| \varLambda \subset [m], |\varLambda |=k\})^2, \end{aligned}$$

where \(\varsigma _k({\mathbf{D}}_{\varLambda })\) is the kth singular value of \({\mathbf{D}}_{\varLambda }=[{\mathbf{d}}_{i_1},\ldots ,{\mathbf{d}}_{i_k}]\) for \(\varLambda = \{i_1,\ldots ,i_k\}\). From Lemma 9 in “Appendix B”, when a dictionary \({\mathbf{D}}\) is \(\mu \)-incoherent, the k-incoherence of \({\mathbf{D}}\) satisfies

$$\begin{aligned} s_k({\mathbf{D}}) \ge 1 - \frac{\mu k}{\sqrt{d}}. \end{aligned}$$

Thus, a \(\mu \)-incoherent dictionary has positive k-incoherence when \(d> (\mu k)^2\). On the other hand, when \(k\ge 2\), if a dictionary \({\mathbf{D}}\) has positive k-incoherence \(s_k({\mathbf{D}})\), there is \(0<\mu <\sqrt{d}\) such that the dictionary is \(\mu \)-incoherent.Footnote 4 However, we note that positive k-incoherence \(s_k({\mathbf{D}})\) does not, in general, imply that \({\mathbf{D}}\) is \(\mu \)-incoherent with \(\mu <\sqrt{d}/k\).Footnote 5

Here, we comment on the relation with the sparse coding stability (Theorem 4) of Mehta and Gray (2013), in which the difference between dictionaries was measured by \(\Vert \cdot \Vert _{2,2}\) instead of \(\Vert \cdot \Vert _{1,2}\) and the permissible radius of perturbation was given by \({\mathcal {M}}_{k}({\mathbf{D}},{\mathbf{x}})^2\lambda \) up to a constant factor. Applying the simple inequality \(\Vert {\mathbf{E}}\Vert _{2,2} \le \sqrt{m}\Vert {\mathbf{E}}\Vert _{1,2}\) for \({\mathbf{E}}\in {\mathbb {R}}^{d\times m}\), we can obtain a variant of the sparse coding stability with the norm \(\Vert \cdot \Vert _{1,2}\). However, the dictionary size m then affects both the permissible radius of perturbation and the stability bound. In contrast, the factor m does not appear in Theorem 2, and thus, the result is effective even for large m. In addition, whereas \(\Vert {\mathbf{x}}\Vert \le 1\) is assumed in Mehta and Gray (2013), Theorem 2 does not assume that \(\Vert {\mathbf{x}}\Vert \le 1\) and clarifies the dependence on the norm \(\Vert {\mathbf{x}}\Vert \). The Lipschitz constant \(L_\psi \) is obtained independently of \({\mathbf{x}}\) for a bounded sample space.

In existing studies related to sparse coding, the sparse representation \({\varphi }_{{\mathbf{D}}}({\mathbf{x}})\) is modified as \({\varphi }_{{\mathbf{D}}}({\mathbf{x}}) \otimes {\mathbf{x}}\) (Mairal et al. 2009) or \({\varphi }_{{\mathbf{D}}}({\mathbf{x}}) \otimes ({\mathbf{x}}- {\mathbf{D}}{\varphi }_{{\mathbf{D}}}({\mathbf{x}}))\) (Raina et al. 2007), where \(\otimes \) is the tensor product. Owing to the stability of sparse representation (Theorem 2), it can be shown that such modified representations also have local stability.

3.2 Sparse modeling and margin bound

In this section, we assume a sparse structure for samples \({\mathbf{x}}\in {\mathbb {R}}^d\) and specify a lower bound for the k-margin used in (11). The result obtained in this section plays an essential role in demonstrating the parameter transfer learnability in Sect. 3.3.

Assumption 1

(Model) There exists a dictionary matrix \({\mathbf{D}}^{*}\) such that every sample \({\mathbf{x}}\) is independently generated by a representation \({\mathbf{a}}\) and noise \({{\varvec{\xi }}}\) as

$$\begin{aligned} {\mathbf{x}}= {\mathbf{D}}^{*}{\mathbf{a}}+ {{\varvec{\xi }}}. \end{aligned}$$

Moreover, we impose the following three assumptions on the above model.

Assumption 2

(Dictionary) The dictionary matrix \({\mathbf{D}}^{*}=[{\mathbf{d}}_1,\ldots ,{\mathbf{d}}_m] \in \mathcal {D}\) is \(\mu \)-incoherent.

Assumption 3

(Representation) The representation \({\mathbf{a}}\) is a random variable that is k-sparse (i.e., \(\Vert {\mathbf{a}}\Vert _0\le k\)) and the non-zero entries are lower bounded by \(C>0\) (i.e., \(a_i\ne 0\) satisfies \(|a_i|\ge C\)).

Assumption 4

(Noise) The noise \({{\varvec{\xi }}}\) is independent across coordinates and Gaussian with zero mean and a maximum variance \(\sigma ^2k/d\) on each component, where \(\sigma >0\) is a constant.

Remark 3

Note that Assumption 4 is valid under Assumptions 1–3 and the condition \(\mu \le \sqrt{d}/k\) if we assume a situation where dictionary learning is possible. To learn the true dictionary \({\mathbf{D}}^*\) and true signal \({\mathbf{a}}\) from a sample \({\mathbf{x}}\), it is necessary that the noise \({{\varvec{\xi }}}\) be much smaller than the signal \({\mathbf{D}}^*{\mathbf{a}}\) with high probability. This condition is represented by \(\Vert {{\varvec{\xi }}}\Vert \le \Vert {\mathbf{D}}^*{\mathbf{a}}\Vert \). Here, it holds that

$$\begin{aligned} \Vert {{\varvec{\xi }}}\Vert ^2 \le \Vert {\mathbf{D}}^{*}{\mathbf{a}}\Vert ^2 \le {|{\mathbf{a}}|^{\top }\left( I+\frac{\mu }{\sqrt{d}}{\mathbf{1}}\right) |{\mathbf{a}}|} \le a_{max}^2 k\left( 1+\frac{\mu k}{\sqrt{d}}\right) \le 2a_{max}^2k, \end{aligned}$$
(12)

where \(|{\mathbf{a}}|\) is the vector whose components are absolute values of those of \({\mathbf{a}}\) and \(a_{max} := \max _{1\le i\le m} |a_{i}|\). Then, each component \(\xi _i\) of \({{\varvec{\xi }}}\) approximately satisfies, with high probability,

$$\begin{aligned} |\xi _i|^2 \simeq {\frac{\Vert {{\varvec{\xi }}}\Vert ^2}{d}} = \tilde{\mathcal {O}}(k/d). \end{aligned}$$
(13)

Thus, since each component is Gaussian, its variance should be \(\tilde{\mathcal {O}}(k/d)\).

In transfer learning, samples in the source and target domains are not necessarily identically distributed, and indeed they cannot be assumed to be so. Assumptions 3 and 4 allow independent but non-identically distributed samples, which is essential in this setting.

Theorem 3

(Margin Bound) Let \(0<t<1\). We set

$$\begin{aligned} \delta _{t,\lambda }:= & {} \frac{2\sigma \sqrt{k}m}{(1-t)\sqrt{d}\lambda }\exp \left( -\frac{(1-t)^2d\lambda ^2}{8\sigma ^2k} \right) + \frac{2\sigma \sqrt{k}m}{\sqrt{d}\lambda }\exp \left( -\frac{d\lambda ^2}{8\sigma ^2k}\right) \nonumber \\&+\, \frac{4\sigma k^{3/2}}{C\sqrt{d(1-\mu k/\sqrt{d})}}\exp \left( -\frac{C^2 d (1 -\mu k/\sqrt{d}) }{8\sigma ^2k}\right) \nonumber \\&+ \,\frac{8\sigma \sqrt{k}(d-k)}{\sqrt{d}\lambda }\exp \left( -\frac{d\lambda ^2}{32\sigma ^2k}\right) . \end{aligned}$$
(14)

We assume that \(d\ge \left\{ \left( 1+\frac{6}{(1-t)}\right) \mu k\right\} ^2\) and \(\lambda = d^{-\tau }\) for an arbitrary \(1/4\le \tau \le 1/2\). Under Assumptions 1–4, the following inequality holds with probability at least \(1-\delta _{t,\lambda }\).

$$\begin{aligned} {\mathcal {M}}_{k}({\mathbf{D}}^{*},{\mathbf{x}})\ge t\lambda \end{aligned}$$
(15)

We provide the proof of Theorem 3 in “Appendix C”.

Note that the failure probability of the margin bound in (14) decreases as the dimension increases since the variance of the noise gets smaller because of Assumption 4.

Next, we analyze the regularization parameter \(\lambda \). To appropriately reflect the sparsity of the samples, the regularization parameter \(\lambda \) must be set suitably. By Theorem 4 of Zhao and Yu (2006),Footnote 6 when the samples follow the sparse model of Assumptions 1–4 and \(\lambda \cong d^{-\tau }\) for \(1/4\le \tau \le 1/2\), the representation \({\varphi }_{{\mathbf{D}}}({\mathbf{x}})\) reconstructs the true sparse representation \({\mathbf{a}}\) of a sample \({\mathbf{x}}\) with a small error. In particular, when \(\tau =1/4\) (i.e., \(\lambda \cong d^{-1/4}\)) in Theorem 3, the failure probability \(\delta _{t,\lambda }\cong e^{-\sqrt{d}}\) of the margin bound is guaranteed to become sub-exponentially small with respect to the dimension d and is negligible in the high-dimensional case. On the other hand, the typical choice \(\tau =1/2\) (i.e., \(\lambda \cong d^{-1/2}\)) does not provide a useful result because \(\delta _{t,\lambda }\) is not small at all.

3.3 Parameter transfer learnability for dictionary learning

When a true dictionary \({\mathbf{D}}^*\) exists as in Assumption 1, we show that the output \(\widehat{{\mathbf{D}}}_N\) of a suitable dictionary learning algorithm from N-unlabeled samples satisfies the parameter transfer learnability for the sparse coding \({\varphi }_{{\mathbf{D}}}\). Then, Theorem 1 guarantees the excess risk bound in self-taught learning since the discussion in this section does not assume the label space in the source domain. This situation corresponds to the case where \({{\varvec{\theta }}}_{\mathcal {S}}^*={\mathbf{D}}^*\), \(\widehat{{{\varvec{\theta }}}}_N=\widehat{{\mathbf{D}}}_N\) and \(\Vert \cdot \Vert =\Vert \cdot \Vert _{1,2}\) in Sect. 2.1.

We show that an appropriate dictionary learning algorithm satisfies parameter transfer learnability for the sparse coding \({\varphi }_{{\mathbf{D}}}\) by focusing on the permissible radius of perturbation in (11) under some assumptions. When Assumptions 1–4 hold and \(\lambda = d^{-\tau }\) for \(1/4\le \tau \le 1/2\), the margin bound (15) for \({\mathbf{x}}\in \mathcal {X}\) holds with probability \(1-\delta _{t,\lambda }\), and we have

$$\begin{aligned} \epsilon _{k,{\mathbf{D}}^{*}}({\mathbf{x}}) ~\ge ~ \frac{t^2\lambda ^3}{64\max \{1,\Vert {\mathbf{x}}\Vert \}^4} ~=~ \varTheta (d^{-3\tau }). \end{aligned}$$

Thus, if a dictionary learning algorithm outputs the estimator \(\widehat{{\mathbf{D}}}_{N}\) such that

$$\begin{aligned} \Vert \widehat{{\mathbf{D}}}_{N} - {\mathbf{D}}^{*}\Vert _{1,2} ~=~ \mathcal {O}(d^{-3\tau }) \end{aligned}$$
(16)

with probability \(1-\delta _N\), then the estimator \(\widehat{{\mathbf{D}}}_{N}\) of \({\mathbf{D}}^{*}\) satisfies the parameter transfer learnability for the sparse coding \({\varphi }_{{\mathbf{D}}}\) with probability \(1-\bar{\delta }_N\), where \(\bar{\delta }_N:=\delta _N+\delta _{t,\lambda }\). Then, by the local stability of sparse representation and the parameter transfer learnability of such a dictionary learning algorithm, Theorem 1 guarantees that sparse coding in self-taught learning satisfies the excess risk bound in (4). For n-samples \({\mathbf{X}}^n=\{{\mathbf{x}}_1,\ldots ,{\mathbf{x}}_n\}\) in the target domain, a detailed analysis reveals that the inequality \(\Vert \widehat{{{\varvec{\theta }}}}_N - {{\varvec{\theta }}}^{*}_{\mathcal {S}}\Vert \le \epsilon _{{{\varvec{\theta }}}^{*}_{\mathcal {S}}}({\mathbf{X}}^n)\) holds with probability \(1-(\delta _N+n\delta _{t,\lambda })\), which is sharper than \(1-n\bar{\delta }_N\).

Theorem 1 applies to any dictionary learning algorithm as long as (16) is satisfied. For example, from Theorem 12 in Arora et al. (2015), when some conditionsFootnote 7 including Assumptions 1–4 are assumed, there is an iterative algorithm [Algorithm 5 in Arora et al. (2015)] whose output \({{\mathbf{D}}}^s\) at iteration s satisfies

$$\begin{aligned} \Vert {{\mathbf{D}}}^s - {\mathbf{D}}^{*}\Vert _{1,2}^2 ~\le ~ \gamma ^s \Vert {{\mathbf{D}}}^0 - {\mathbf{D}}^{*}\Vert _{1,2}^2 + \mathcal {O}(d^{-2}) \end{aligned}$$
(17)

for some \(1/2<\gamma <1\). When \(s\ge C\log d\) for a large constant C and the dimension d is large enough, it holds that

$$\begin{aligned} \Vert {{\mathbf{D}}}^s - {\mathbf{D}}^{*}\Vert _{1,2} ~=~ \mathcal {O}(d^{-1}). \end{aligned}$$

We note that the algorithm requires an infinite number of samples at each iteration. However, by modifying Appendix G of Arora et al. (2015), it is expected that there are a large constant \(C'\) and an alternative stochastic algorithm whose output \(\widehat{{\mathbf{D}}}^s\cong {\mathbf{D}}^s\) at each iteration \(s\ge C'\log d\) satisfies (16) for \(1/4<\tau <1/3\).

Note that, although we imposed the hard-sparsity assumption (Assumption 3) as in Arora et al. (2015), we focused on the LASSO-based encoder \({\varphi }_{{\mathbf{D}}}\) instead of the hard-sparsity encoder treated in Arora et al. (2015). Under the hard-sparsity assumption, we could derive the lower bound of the permissible radius of perturbation \(\epsilon _{k,{\mathbf{D}}}\) for the LASSO-based encoder and use the estimation-error result for dictionary learning in Arora et al. (2015).

4 Numerical experiments

We report numerical experiments using the US Postal Service (USPS) and MNIST handwritten digit datasets and compare the results with our theoretical conclusions. In particular, we investigate the relationship among \(N\), \(n\), \(\Vert \widehat{{{\varvec{\theta }}}}_{N}-{{\varvec{\theta }}}_{\mathcal {S}}^{*}\Vert \), \(\rho \), and the prediction performance of transfer learning.

The USPS dataset is composed of 7291 training images and 2007 test images of dimension \(d=256\), and each element of the data vectors ranges from \(-1\) to 1. The MNIST dataset has 60,000 training images and 10,000 test images of dimension \(d=784\), and each element of the data vectors ranges from 0 to 255. In the numerical experiments, the MNIST data were scaled so that each element takes a value in the interval [0, 1]. In both datasets, each image has \(\ell _\infty \)-norm 1 and a label in \(\{0,1,\ldots ,9\}\).

4.1 Prediction accuracy of learning with sparse representation

Let us describe the setup of the numerical experiments in which self-taught learning was applied to the USPS dataset. N images were randomly chosen out of the USPS training images as training samples on the source domain. In the experiments, N was set to 3000. The data matrix of the source domain was denoted by \({\mathbf{X}}_{\mathcal {S}}=(\widetilde{{\mathbf{x}}}_1,\ldots ,\widetilde{{\mathbf{x}}}_{N})\in \mathbb {R}^{d\times {N}}\).

The dictionary \(\widehat{{\mathbf{D}}}\in \mathbb {R}^{d\times m}\) was obtained by solving the problem

$$\begin{aligned} \min _{{\mathbf{D}},{\mathbf{Z}}}\ \frac{1}{2}\Vert {\mathbf{X}}_{\mathcal {S}}-{\mathbf{D}}{\mathbf{Z}}\Vert _2^2+\lambda \Vert {\mathbf{Z}}\Vert _1,\quad \mathrm {s.t.}\ {\mathbf{D}}\in \mathcal {D}, \ {\mathbf{Z}}\in \mathbb {R}^{m\times N}, \end{aligned}$$

where \(\Vert {\mathbf{A}}\Vert _p=(\sum _{i,j}|a_{ij}|^p)^{1/p}\) for the matrix \({\mathbf{A}}=(a_{ij})\). The feature map was defined by the sparse representation of \({\mathbf{x}}\in \mathbb {R}^{d}\) in (10). The regularization parameter for the sparse representation was set to \(\lambda =1\). The dimension m of the dictionary was varied from 16 to 512. The problem on the target domain was a 10-class classification of the digit images. In the experiments, n images were randomly chosen out of the remaining 4291 USPS training images, where n was varied from 500 to 3000. The prediction accuracy of the classifier was evaluated using all test images. The sparse representations of the training images, \(\{({\varphi }_{\widehat{{\mathbf{D}}}}({\mathbf{x}}_i),\,y_i):i=1,\ldots ,n\}\), were used to train the linear SVM (Huang and Aviyente 2006; Yang et al. 2009). The Lasso-type sparse representation (10) was employed, since it is quite popular in dictionary learning (Zhang et al. 2015). In addition, a computationally efficient implementation of dictionary learning, spams, is available as an R package (Mairal et al. 2009). The classifier was provided by the kernlab package of the R language (Karatzoglou et al. 2004; R Core Team 2016), and a one-vs-one strategy was used to deal with the multi-class classification problem.
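
The experiments described above were implemented with the spams and kernlab R packages. Purely as an illustration of the same pipeline (dictionary learning on unlabeled source samples, Lasso-type encoding, and a one-vs-one linear SVM), a scikit-learn sketch could look as follows; all function names are ours, the arrays X_source, X_target, y_target, X_test, and y_test are assumed to be preprocessed as described above, and scikit-learn's internal scaling of the \(\ell _1\) penalty means that the parameter lam is not numerically identical to the \(\lambda \) in the text.

```python
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning, sparse_encode
from sklearn.svm import SVC

def learn_dictionary(X: np.ndarray, m: int, lam: float = 1.0) -> np.ndarray:
    """Learn an m-atom dictionary on unlabeled samples (rows of X).

    scikit-learn stores the dictionary as an (m, d) matrix with unit-norm rows,
    i.e. the transpose of the D used in the text."""
    dl = MiniBatchDictionaryLearning(n_components=m, alpha=lam, random_state=0)
    dl.fit(X)
    return dl.components_

def sparse_features(X: np.ndarray, D: np.ndarray, lam: float = 1.0) -> np.ndarray:
    """Lasso-type encoder applied row-wise, analogous to the feature map in (10)."""
    return sparse_encode(X, D, algorithm="lasso_lars", alpha=lam)

# Illustrative usage (data loading and preprocessing omitted):
# D_hat   = learn_dictionary(X_source, m=512, lam=1.0)      # source domain
# Z_train = sparse_features(X_target, D_hat)                # n labeled target samples
# clf     = SVC(kernel="linear", C=rho)                     # one-vs-one linear SVM
# clf.fit(Z_train, y_target)
# test_error = 1.0 - clf.score(sparse_features(X_test, D_hat), y_test)
```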

In addition, we evaluated the influence of the dictionary shift by analyzing how the estimation error on the source domain affected learning results on the target domain. A shifted dictionary was obtained by adding a random matrix \({\mathbf{M}}\) to \(\widehat{{\mathbf{D}}}\). In the experiments, each element of \({\mathbf{M}}\) was an i.i.d. Gaussian random variable with mean 0 and standard deviation \(\sigma \). Each column of the perturbed matrix \(\widehat{{\mathbf{D}}}+{\mathbf{M}}\) was normalized to obtain the shifted dictionary \(\widetilde{{\mathbf{D}}}\in \mathcal {D}\). The feature map \({\varphi }_{\widetilde{{\mathbf{D}}}}({\mathbf{x}})\) with the shifted dictionary was used to obtain the sparse representation. Numerical experiments were conducted to reveal the relation among the test error, the regularization parameter \(\rho \), and the noise level \(\sigma \).
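
A minimal sketch of this dictionary shift (our own code, continuing the convention of the previous snippet that atoms are stored as rows):

```python
import numpy as np

def shift_dictionary(D: np.ndarray, sigma: float, seed=None) -> np.ndarray:
    """Add i.i.d. N(0, sigma^2) noise to every entry of the dictionary and
    re-normalize each atom (row) to unit l2 norm, giving the shifted dictionary."""
    rng = np.random.default_rng(seed)
    D_noisy = D + rng.normal(loc=0.0, scale=sigma, size=D.shape)
    norms = np.linalg.norm(D_noisy, axis=1, keepdims=True)
    return D_noisy / np.clip(norms, 1e-12, None)

# Example: D_tilde = shift_dictionary(D_hat, sigma=0.05, seed=0)
```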

For the MNIST dataset, N was set to 30,000 and n was varied from 500 to 10,000. The dimension of the dictionary, m, was varied from 16 to 1568. All test images were used to evaluate the test error on the target domain. Furthermore, the effect of the dictionary shift was evaluated.

The results are shown in Figs. 1 and 2. The test error on the target domain is plotted in the left column of each figure. Furthermore, the test error of the linear SVM using only samples on the target domain is also reported. The regularization parameter \(\rho \) in the linear SVM was set to the optimal value that achieved the smallest test error. In the right column, the optimal \(\rho \) of learning with the sparse representation is shown as a function of the sample size n.

When large dictionaries were used, we found that the test error of the SVM using the sparse representation was smaller than that obtained by the standard SVM trained only on target samples without the sparse representation. Hence, samples on the source domain were effectively used to improve the prediction accuracy.

Fig. 1 Plot of test errors (left column) and optimal regularization parameters (right column) on the USPS dataset. The dimension of the dictionary \({\mathbf{D}}\in \mathbb {R}^{d\times {m}}\) with \(d=256\) was set to \(m=16, 64\), and 512, and the noise level \(\sigma \) was varied from 0 to 0.1. Curves labeled "svm" in the left column show the test error of the SVM using only target samples

Fig. 2 Plot of test errors (left column) and optimal regularization parameters (right column) on the MNIST dataset. The dimension of the dictionary \({\mathbf{D}}\in \mathbb {R}^{d\times {m}}\) with \(d=784\) was set to \(m=16, 196\), and 1568, and the noise level \(\sigma \) was varied from 0 to 0.1. Curves labeled "svm" in the left column show the test error of the SVM using only target samples

Furthermore, we investigated the usefulness of samples on the source domain by comparing our method with a variant of supervised dictionary learning (SDL) proposed by Mairal et al. (2009). In standard SDL, the dictionary and the classifier are simultaneously optimized based only on samples from the target domain. In the experiments, we employed a simple variant of SDL to reduce the computational cost. In the simplified SDL, the dictionary \({\mathbf{D}}\) and the feature matrix \({\mathbf{Z}}\) were obtained using the data matrix of the target domain instead of that of the source domain. Then, the sparse feature representation \({\mathbf{Z}}\) was fed into the linear SVM as training input together with the output labels.
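
Reusing the hypothetical helpers learn_dictionary and sparse_features from the earlier scikit-learn sketch, the simplified SDL baseline can be written as follows (again only an illustration, not the implementation used in the experiments):

```python
from sklearn.svm import SVC

def simplified_sdl_error(X_target, y_target, X_test, y_test,
                         m: int = 512, lam: float = 1.0, rho: float = 1.0) -> float:
    """Simplified SDL: the dictionary and the sparse features are computed from
    the labeled target samples only, then fed to a one-vs-one linear SVM."""
    D = learn_dictionary(X_target, m=m, lam=lam)   # target data replaces source data
    clf = SVC(kernel="linear", C=rho)
    clf.fit(sparse_features(X_target, D, lam), y_target)
    return 1.0 - clf.score(sparse_features(X_test, D, lam), y_test)
```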

Table 1 shows the results. The column "source and target" shows the test error of transfer learning using information on both the source and target domains. On the other hand, the column "target (SDL)" shows the results of the simplified SDL using only samples on the target domain. For the MNIST dataset, the SDL with \(m=1568\) was dropped because the computational cost of training \({\mathbf{D}}\) in each iteration was too high. Overall, learning using both the source and target domains performed better than the simplified SDL, especially for small n. Therefore, transfer learning using \({\mathbf{D}}\) trained on the source domain is expected to be practically useful.

Table 1 Test errors and standard deviations of transfer learning in percent. The dictionary \({\mathbf{D}}\) and feature matrix \({\mathbf{Z}}\) are trained using samples on the "source and target" domains or only the "target" domain. In the latter method, only samples on the target domain are used to learn the dictionary

4.2 Setting of regularization parameters

Next, we investigate the relationship between the dictionary shift on the source domain and the regularization parameter \(\rho \) on the target domain. For \(m=16\) on the USPS data in Fig. 1, a large optimal regularization parameter \(\rho \) was required to deal with a large noise level \(\sigma \). The theoretical analysis in Sect. 2.2 showed that a large regularization parameter is needed to suppress a large perturbation of the feature map. Hence, the numerical results for small m agree with our theoretical results. In contrast, when \(m=512\), a small optimal regularization parameter was sufficient even for a large noise level \(\sigma \). The same tendency was observed in the results on the MNIST dataset in Fig. 2. In summary, when the dictionary is small, a larger regularization parameter is required to deal with a larger noise level; for a large dictionary, the opposite is true.

Next, we explain the relation between the noise level \(\sigma \) and the regularization parameter \(\rho \). Theorem 2 shows the relation between the sensitivity of the sparse representation and the dictionary shift. The upper bound of the sensitivity depends mainly on the degree of incoherence and the amount of the dictionary shift. Let \(\mu _{{\mathbf{D}}}\) be the degree of incoherence of the dictionary \({\mathbf{D}}\); see Definition 5. For \({\mathbf{D}}=[{\mathbf{d}}_1,\ldots ,{\mathbf{d}}_m ]\), \(\mu _{{\mathbf{D}}}\) is computed as \(\sqrt{d}\max \{|\langle {\mathbf{d}}_i,{\mathbf{d}}_j\rangle |\,:\,i\ne {j}\}\). Then, the difference \(\Vert {\varphi }_{\widehat{{\mathbf{D}}}}({\mathbf{x}})-{\varphi }_{\widetilde{{\mathbf{D}}}}({\mathbf{x}})\Vert _2\) is bounded above by \(\Vert \widehat{{\mathbf{D}}}-\widetilde{{\mathbf{D}}}\Vert _{1,2}/(1-\mu _{\widetilde{{\mathbf{D}}}}/\sqrt{d})\) up to a positive factor depending on \({\mathbf{x}}\) when the 1-margin is used. We numerically confirmed that the upper bound using \(\mu _{\widetilde{{\mathbf{D}}}}\) is tighter than the upper bound using \(\mu _{\widehat{{\mathbf{D}}}}\); this is because \(\widehat{{\mathbf{D}}}\) contains some similar column vectors and the random noise weakens this similarity. As shown in the proof of Theorem 1, the shift of the feature map \(\Vert {\varphi }_{\widehat{{\mathbf{D}}}}({\mathbf{x}})-{\varphi }_{\widetilde{{\mathbf{D}}}}({\mathbf{x}})\Vert _2\) directly affects the upper bound on the generalization ability. Figure 3 depicts the average value of \(\Vert \widehat{{\mathbf{D}}}-\widetilde{{\mathbf{D}}}\Vert _{1,2}/(1-\mu _{\widetilde{{\mathbf{D}}}}/\sqrt{d})\) over 20 different random matrices as a function of \(\sigma \). In general, a larger \(\sigma \) leads to a larger \(\Vert \widehat{{\mathbf{D}}}-\widetilde{{\mathbf{D}}}\Vert _{1,2}\) and a smaller \(\mu _{\widetilde{{\mathbf{D}}}}\). When the size of the dictionary, m, is small, the effect of \(\Vert \widehat{{\mathbf{D}}}-\widetilde{{\mathbf{D}}}\Vert _{1,2}\) dominates that of \(\mu _{\widetilde{{\mathbf{D}}}}\). In contrast, for a large dictionary, the decrease of \(\mu _{\widetilde{{\mathbf{D}}}}\) dominates the increase of \(\Vert \widehat{{\mathbf{D}}}-\widetilde{{\mathbf{D}}}\Vert _{1,2}\). As a result, the upper bound of \(\Vert {\varphi }_{\widehat{{\mathbf{D}}}}({\mathbf{x}})-{\varphi }_{\widetilde{{\mathbf{D}}}}({\mathbf{x}})\Vert _2\) becomes small even for a large noise level.
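
The quantities in this bound are straightforward to compute. The following sketch (our own code) evaluates \(\mu _{{\mathbf{D}}}\) and the upper bound plotted in Fig. 3, assuming unit-norm atoms stored as rows, reading \(\Vert \cdot \Vert _{1,2}\) as the maximum atom-wise \(\ell _2\) norm of the difference, and assuming the shifted dictionary is incoherent enough that the denominator is positive.

```python
import numpy as np

def incoherence(D: np.ndarray) -> float:
    """mu_D = sqrt(d) * max_{i != j} |<d_i, d_j>| for unit-norm atoms d_i (rows of D)."""
    d = D.shape[1]                      # ambient dimension
    G = np.abs(D @ D.T)                 # absolute Gram matrix of the atoms
    np.fill_diagonal(G, 0.0)            # exclude the diagonal (i == j)
    return np.sqrt(d) * G.max()

def sensitivity_bound(D_hat: np.ndarray, D_tilde: np.ndarray) -> float:
    """||D_hat - D_tilde||_{1,2} / (1 - mu_{D_tilde} / sqrt(d)), with the (1,2)-norm
    read as the maximum atom-wise l2 norm of the difference (our assumption)."""
    d = D_hat.shape[1]
    diff = np.linalg.norm(D_hat - D_tilde, axis=1).max()
    return diff / (1.0 - incoherence(D_tilde) / np.sqrt(d))

# Averaging sensitivity_bound(D_hat, shift_dictionary(D_hat, sigma, seed=r)) over
# several seeds r as a function of sigma gives curves of the kind shown in Fig. 3.
```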

Based on the above consideration, the numerical results in Figs. 1 and 2 can be interpreted as follows. Let \({\mathbf{D}}^*\) be the true dictionary. The difference \(\Vert {\varphi }_{{\mathbf{D}}^*}({\mathbf{x}})-{\varphi }_{\widetilde{{\mathbf{D}}}}({\mathbf{x}})\Vert _2\) affects the test error and the optimal regularization parameter. Consider its upper bound, \(\Vert {\varphi }_{{\mathbf{D}}^*}({\mathbf{x}})-{\varphi }_{\widehat{{\mathbf{D}}}}({\mathbf{x}})\Vert _2+\Vert {\varphi }_{\widehat{{\mathbf{D}}}}({\mathbf{x}})-{\varphi }_{\widetilde{{\mathbf{D}}}}({\mathbf{x}})\Vert _2\). In the numerical experiments, the first term \(\Vert {\varphi }_{{\mathbf{D}}^*}({\mathbf{x}})-{\varphi }_{\widehat{{\mathbf{D}}}}({\mathbf{x}})\Vert _2\) is fixed, and only the second term \(\Vert {\varphi }_{\widehat{{\mathbf{D}}}}({\mathbf{x}})-{\varphi }_{\widetilde{{\mathbf{D}}}}({\mathbf{x}})\Vert _2\) varies with the noise level and thus affects the learning results. As shown above, the effect of the dictionary shift presented in Figs. 1 and 2 agrees with our theoretical findings.

Fig. 3 Plot of the average of the upper bound \(\Vert \widehat{{\mathbf{D}}}-\widetilde{{\mathbf{D}}}\Vert _{1,2}/(1-\mu _{\widetilde{{\mathbf{D}}}}/\sqrt{d})\) for the USPS and MNIST datasets. The horizontal axis denotes the noise level \(\sigma \). The size of the dictionary was varied from 16 to 512 for USPS and from 16 to 1568 for MNIST

5 Conclusion

We derived an excess risk bound (Theorem 1) for a parameter transfer learning problem based on local stability and parameter transfer learnability, which were newly introduced in this paper. By applying the proposed framework to a sparse coding-based algorithm under a sparse model (Assumptions 1–4), we obtained the first theoretical guarantee of an excess risk bound for self-taught learning. Numerical experiments in Sect. 4 showed that transfer learning with an appropriate regularization parameter works effectively to achieve high prediction accuracy on the target domain. Moreover, we confirmed that the theoretical analysis of local stability for sparse coding is useful for understanding the relationship among the size of the dictionary, the regularization parameter, and the prediction accuracy.

The framework of parameter transfer learning covers not only sparse coding but also other promising algorithms such as multiple kernel learning and deep neural networks. Our results are expected to be useful in analyzing the theoretical performance of these algorithms. Finally, we note that our excess risk bound can be applied to settings other than self-taught learning because Theorem 1 covers the case in which labeled samples are available in the source domain.