A Transfer Learning Method for Deep Networks with Small Sample Sizes

Transfer learning refers to a machine learning model acquiring knowledge from more than one domain, and it is often applied in contexts with small sample sizes. Some approaches concentrate on determining the correlation among domains, while others pay more attention to knowledge transfer among domains. In this paper, on the basis of the SVM with hinge loss, we propose a new regularized transfer learning deep network with a specific regularization, in which a deep network learns high-level representations of the given samples. A subset of the SVM parameters is shared so that the similarity of the data distributions can be well captured. Besides, a modified regularized SVM is exploited so that gradient-based optimization is feasible, which yields a parallel implementation of the proposed method. Finally, in the experiment section, the comparison of our approach with state-of-the-art approaches demonstrates its competitive performance and feasibility in classification.


Introduction
Classification algorithms are used in a variety of areas, including image classification [1] and text categorization [2]. These classification methods are based on the assumption that all of the training data and test data are drawn independently from an identical distribution [1], and that the number of training samples is sufficient to construct a predictive classifier. However, it is worth noting that this assumption may not hold in applications. For example, the training samples may not be sufficient to construct a classifier, or the existing training data may be outdated [3]. The reasons behind these cases include the following three aspects. First, annotating data is usually an expensive labor process, and experts are not willing to annotate all images [1]. Second, it is extremely costly to obtain sufficient training samples in some fields, such as medical image analysis [4]. Third, it is sometimes unrealistic to obtain sufficient training samples at all, as in visual object tracking [5].
To address this problem, a family of methods called transfer learning has been proposed [1,3]. These methods require that the training data be drawn from multiple domains, including a target domain and related source domains, where the source domain data is related to the target domain data and the target domain data is of small size. These methods exploit the training data from the source domains to assist the learning task in the target domain. For example, in Ref. [2], a novel transfer learning framework called TrAdaBoost has been presented, which extends AdaBoost and allows users to utilize a small amount of newly labeled data together with the old data to construct a high-quality classification model for the new data. Besides, some other transfer learning methods straddle both multi-task learning [6] and transfer learning, which is referred to as multi-task transfer learning [7,8]. In multi-task learning, people care about the performance of each task, while in multi-task transfer learning, people care about the result of the learning task in the target domain [7,8]. In addition, some deep network model-based transfer learning methods have also been studied, such as fully convolutional networks (FCN) [9], weakly-shared deep transfer networks (weakly-shared DTN) [10] and learning the structure and strength of CNN filters (SSF-CNN) [11].
In this paper, motivated by the multi-task transfer learning methods [9,10], we propose a deep regularized transfer learning method named Dratle to solve the problem of training deep networks with small sample sizes. In the proposed Dratle method, we construct a support vector machine (SVM) model for each task with respect to each domain. These SVMs are embedded in a multi-task framework so that the source domain data can assist in constructing the predictive SVM in the target domain. In our approach, the SVM is modified so that the SVM and the deep network can be optimized simultaneously. Besides, a regularization term is constructed for the deep network so that the similarity of the target and source data distributions can be well captured. The main contributions of the paper can be summarized as follows: (1) We build a revised SVM for transfer learning such that the SVM model can be optimized by gradient-based methods; moreover, the SVM model and the deep network can be optimized simultaneously. (2) We propose a regularization for the deep network and a shared parameter for the SVM such that the relationship between the source domain and the target domain can be well determined. (3) We conduct experiments to investigate the performance of the proposed Dratle method, and the comparison of Dratle with existing approaches demonstrates its feasibility and competitive performance in classification.

Multi-task Transfer Learning
In multi-task transfer learning, the data are generated from multiple domains, including source domains and a target domain. The existing multi-task transfer learning methods can be summarized into two groups: non-deep network-related methods and deep network-related methods.
In non-deep network-related methods, shallow models are modified for the transfer learning setting, including logistic regression-based methods [7,12], SVM-based methods [8] and Bayesian methods [12]. In Ref. [7], Saha et al. proposed a multi-task transfer learning (MTTL) method that augments the data from the source domain to assist the classification task in the target domain. In Ref. [8], Zheng et al. proposed a multi-task-based transfer learning method with dictionary learning (DMTTL). In the DMTTL method, a dictionary learning model is exploited to learn discriminative sparse codes to enhance the classification accuracy.
The deep network-related methods embed a deep network into the multi-task transfer learning framework. For example, in Ref. [13], Kandemir et al. adopted a two-layer feed-forward deep Gaussian process as the task learner for the source and target domains. Based on the pre-training and fine-tuning strategy, several transfer learning methods have been proposed, including [9,11,14]. Besides, some parameter-sharing methods have also been proposed, such as the weakly-shared DTN [10] and SSF-CNN [11]. SSF-CNN [11] learns the structure and strength of CNN filters based on a pre-trained model, fine-tuning the coefficient of each filter separately.
The proposed method is a deep network-related method, but it differs from the existing deep network-related methods. We construct a regularized deep network such that the relationship between the source domain data and the target domain data can be well determined. Besides, we construct one SVM model for each task with respect to each domain, and these SVMs are embedded in a multi-task framework. This multi-task framework yields parameter sharing so that the source domain data can assist in constructing the predictive SVM in the target domain.

Support Vector Machine
The support vector machine was first proposed in Ref. [15] as a binary classifier, and many modifications have been proposed to improve its performance, such as the introduction of kernel functions [16]. In a binary SVM, the optimal hyperplane in feature space is formulated by w and b, and the objective of the SVM is

min_{w,b,ξ} (1/2)‖w‖² + C ∑_i ξ_i, s.t. y_i(wᵀx_i + b) ≥ 1 − ξ_i, ξ_i ≥ 0,

where the slack variables ξ_i relax the hard-margin constraint. There are a variety of SVM extensions; for example, a regularized multi-task SVM is proposed for the multi-task learning setting in Ref. [6]. For parallelization, SVMs with gradient-based optimization methods have been proposed, such as Pegasos [17] and P-packSVM [18], where Pegasos [17] considers the sub-gradient for optimization and comes with convergence and complexity analyses, and P-packSVM [18] embraces the best-known stochastic gradient descent method to optimize the primal objective, which achieves a parallel implementation.
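To make the sub-gradient approach concrete, the following is a minimal NumPy sketch of hinge-loss SVM training in the style of Pegasos [17]. The function names and the trick of folding the bias into the weight vector are our own illustration, not the exact algorithm of Ref. [17]:

```python
import numpy as np

def svm_subgradient_train(X, y, lam=0.01, epochs=20, seed=0):
    """Pegasos-style sub-gradient descent on the primal SVM objective:
    min_w  lam/2 * ||w||^2 + (1/n) * sum_i max(0, 1 - y_i * w.x_i).
    The bias b is folded into w via an appended constant feature."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    Xb = np.hstack([X, np.ones((n, 1))])      # fold bias into the weights
    w = np.zeros(d + 1)
    t = 0
    for _ in range(epochs):
        for i in rng.permutation(n):
            t += 1
            eta = 1.0 / (lam * t)             # Pegasos step size schedule
            margin = y[i] * Xb[i].dot(w)
            # sub-gradient of the regularized hinge loss at sample i
            grad = lam * w - (y[i] * Xb[i] if margin < 1 else 0.0)
            w -= eta * grad
    return w

def svm_predict(w, X):
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return np.sign(Xb @ w)
```

Because each update touches a single sample, this scheme extends naturally to the mini-batch and parallel settings discussed later.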
In this paper, we exploit one SVM model for each task with respect to each domain, and these SVMs are embedded in a multi-task framework. This multi-task framework yields parameter sharing: the shared parameters determine the similarity among the samples of the multiple domains, and the data from the source domain can assist in constructing the predictive SVM in the target domain. In addition, we modify the objective function of the SVM model so that simultaneous optimization of the deep network and the SVM models is available.
Given an arbitrary i-th sample x_i^T from the target domain D_T, its corresponding feature representation is h_i^T = f(x_i^T), where f denotes the deep network. Then the classification task in the target domain is set as an SVM-based binary classifier as follows.
where w₀ is the parameter shared by the source tasks and the target task, while v_T is the specific parameter for the target task. The classification task in the source domain is similar to (2), with the shared parameter w₀ and the specific parameter v_S for the source task. The motivation of the above formulations is as follows. First, given a sample, the deep network can learn a high-level feature representation so as to improve the classification accuracy [5]. Besides, in the multi-task transfer learning setting there is a relationship between the source domain and the target domain: the classifiers corresponding to the domains are similar, and the shared parameter w₀ captures this consistency [8]. Considering the variety of the tasks, the parameters v_S and v_T are constructed to capture the data distribution characteristics of each domain. Moreover, since the similarity of the data from the two domains is also important, we take the regularization term ‖h̄_S − h̄_T‖² into consideration, where h̄_S is the average over all h_i^S and h̄_T is the average over all h_i^T. The motivation is that a high similarity of h̄_S and h̄_T can help construct the classifier w.r.t. the target task. We also exploit ℓ2-norm regularization to limit the complexity of the model. The Dratle model is optimized by integrating the deep network, the SVMs and the regularization terms mentioned above. Then, we have the following expression:

min (λ₀/2)‖w₀‖² + (λ_S/2)‖v_S‖² + (λ_T/2)‖v_T‖² + C(∑_i ℓ_i^S + ∑_j ℓ_j^T) + μ₀Ω(f) + μ₁‖h̄_S − h̄_T‖²,

where ℓ_i^S and ℓ_j^T are the hinge losses over the source and target samples, Ω(f) regularizes the deep network, and λ₀, λ_S, λ_T, C, μ₀ and μ₁ are the trade-off parameters balancing the respective regularizations such that all regularizations are of the same order of magnitude.
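As a concrete reading of the joint objective, the following NumPy sketch evaluates its value over two batches of feature representations. The symbol names (w0, vS, vT) and the trade-off parameter names are assumptions chosen for this illustration; the deep network regularization Ω(f) is omitted since it depends only on the network weights:

```python
import numpy as np

def hinge(scores, y):
    """Element-wise hinge loss max(0, 1 - y * score)."""
    return np.maximum(0.0, 1.0 - y * scores)

def dratle_objective(hS, yS, hT, yT, w0, vS, vT,
                     lam0=1.0, lamS=1.0, lamT=1.0, C=1.0, mu1=1.0):
    """Value of the joint objective: one SVM per domain sharing w0, with
    domain-specific parts vS/vT, l2 regularization on all weight vectors,
    hinge losses over both domains, and a similarity penalty between the
    mean feature representations of the two domains."""
    loss_S = hinge(hS @ (w0 + vS), yS).sum()          # source-domain hinge loss
    loss_T = hinge(hT @ (w0 + vT), yT).sum()          # target-domain hinge loss
    reg = (lam0 / 2 * w0.dot(w0) + lamS / 2 * vS.dot(vS)
           + lamT / 2 * vT.dot(vT))
    sim = mu1 * np.sum((hS.mean(0) - hT.mean(0)) ** 2)  # ||h_bar_S - h_bar_T||^2
    return reg + C * (loss_S + loss_T) + sim
```

Since every term is differentiable almost everywhere in the features and the weights, the whole expression can be minimized end-to-end together with the network that produces hS and hT.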

Optimization and Pseudo-Codes
In this section, the optimization of Dratle is presented. The initial parameters, including the SVM parameters and the deep network weights, are set to random values, and an end-to-end optimization is utilized to minimize the objective. Considering the hinge loss and the idea of mini-batch gradient descent, the objective in (3) is optimized as described below.

Algorithm 1. Pseudo-codes of the proposed Dratle method.
Given a mini-batch of samples ℬ_T drawn from D_T and ℬ_S drawn from D_S, the gradient of ℒ with respect to the feature representation h is expressed as follows.
Once the gradient of ℒ with respect to h is available, we can implement the backpropagation algorithm to compute the gradients with respect to the parameters of the latent layers of the deep network. Then, a gradient-based optimization method can be implemented to optimize all parameters of the Dratle method. Here, we adopt the mini-batch Adam method [19] to update the parameters at each iteration, where Adam is an optimizer based on gradient descent and adaptive estimates of lower-order moments. Besides, the Adam method is straightforward to implement and is computationally efficient with little space complexity. Finally, the pseudo-codes of the proposed Dratle method are given in Algorithm 1.
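For completeness, one Adam parameter update can be sketched in NumPy as follows (a minimal illustration of the moment estimates and bias correction of Ref. [19]; the default hyper-parameters follow that paper, and the function name is our own):

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update: exponential moving averages of the gradient (m)
    and its square (v), bias-corrected before the step, as in [19]."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)            # bias-corrected first moment
    v_hat = v / (1 - b2 ** t)            # bias-corrected second moment
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```

In Dratle, this update would be applied at each iteration to every parameter group (w₀, the domain-specific SVM parameters, and the deep network weights), with the gradients supplied by backpropagation.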
For the above formulas, denoting the optimal values of the parameters with an asterisk, we can conclude that the regret bound of the proposed Dratle method with Adam is O(√T) [19]. Similar to the analysis in Ref. [17], to reach a loss within ε of the optimum, the runtime complexity is O(1/ε²).

Experiments and Data Sets
In the experiments, we study the performance of Dratle on transfer learning data sets such as 20 Newsgroups and Reuters. Detailed information about the data sets is shown in Table 1. 20 Newsgroups is a popular data set for text classification experiments. It comprises 20 sub-classes grouped into 7 classes: comp, rec, sci, misc, talk, alt, and soc. Here, we exploit 4 classes, namely comp, rec, sci and talk; these 4 classes yield the first 3 settings in Table 1.
As for the Reuters data set, there are 5 classes, namely Exchanges, Orgs, People, Places and Topics, and each class has a number of sub-classes. Here, we exploit three classes, namely Orgs, People and Places; these 3 classes yield the second 3 settings in Table 1. In addition, we also conduct transfer learning on image data sets, namely MNIST and USPS. These data sets are composed of digit images with ten labels from 0 to 9; both MNIST and USPS are grayscale image sets, and we conduct the experiment between them. We select the images belonging to one of the labels from 0 to 9 as the positive class, while the rest of the classes form the negative class. These 2 data sets yield the last 4 settings in Table 1.
The first six settings in Table 1 share the same explanation. For the first setting, C v.s. R_C, the letter C denotes the comp class and R_C denotes the rest of the classes, including rec, sci and talk. The target domain is a sub-class of comp, rec, sci and talk, and the remaining sub-classes are set as the source domain. In the last four settings, M0 v.s. R_M0 denotes that the positive class is the digit 0 in the MNIST data set while the rest of the classes, 1 to 9, are set as the negative class. Besides, the source domain and target domain are highlighted respectively in Table 1.

Parameters Settings
In this experiment, we exploit five-fold cross validation to search for the optimal trade-off parameter settings. The data from each data set are normalized into the range from 0 to 1. For the baselines, we follow their parameter settings, including the search intervals of the trade-off parameters and their parameter optimization methods. The deep network-related methods, i.e., SVM, WSDTL and the proposed Dratle method, share the same deep network structure as shown in Table 2, where fc denotes a fully connected layer. The settings of the proposed method are as follows. The regularization parameters for the SVMs, λ₀, λ_S, λ_T and C, are searched in the set {10⁻³, 10⁻², 10⁻¹, 10⁰, 10¹, 10², 10³}. The parameter μ₀ regularizing the deep network is searched in the set {10⁻⁵, 10⁻⁴, 10⁻³, 10⁻², 10⁻¹}. To enhance the similarity of h̄_S and h̄_T, the optimal value of μ₁ is searched in the set {2⁻⁴, 2⁻³, 2⁻², 2⁻¹, 2⁰, 2¹, 2², 2³}.
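The five-fold grid search described above can be sketched generically as follows. This is a minimal illustration; the `train_eval` callback and the function names are hypothetical stand-ins for whatever fitting and scoring routine a given method uses:

```python
import itertools
import numpy as np

def five_fold_grid_search(X, y, param_grid, train_eval, seed=0):
    """Five-fold cross-validation over a trade-off-parameter grid.
    `train_eval(X_tr, y_tr, X_va, y_va, params)` is any routine that fits
    a model and returns validation accuracy; `param_grid` maps parameter
    names to candidate value lists (e.g. powers of 10 or powers of 2)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    folds = np.array_split(idx, 5)               # five disjoint folds
    names = sorted(param_grid)
    best_params, best_acc = None, -1.0
    for values in itertools.product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        accs = []
        for k in range(5):
            va = folds[k]                        # fold k held out
            tr = np.concatenate([folds[j] for j in range(5) if j != k])
            accs.append(train_eval(X[tr], y[tr], X[va], y[va], params))
        acc = float(np.mean(accs))
        if acc > best_acc:
            best_params, best_acc = params, acc
    return best_params, best_acc
```

For Dratle, the grid would span the candidate sets listed above for λ₀, λ_S, λ_T, C, μ₀ and μ₁.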

Experiment Result
Let the ratio denote the percentage of training samples used from the target domain. The results of all baselines on all data sets are shown in Table 3. Furthermore, we implement these methods with different sizes of the source training set. Besides, we use the setting 0 v.s. R_0, increasing this ratio from 0.01 to 0.5, and the accuracy is shown in Figure 1. Based on these results, we have the following four observations.
(1) The proposed Dratle method delivers the highest accuracy in most cases. For example, when the ratio is 0.05, we can see that in the C v.s. R_C setting, the accuracy of Dratle is 83.7%; moreover, the accuracy on O v.s. R_O is 85.7% and that on M0 v.s. R_M0 is 87.4%. This outperformance manifests the advantage of the proposed Dratle method. The reason is that, in the Dratle method, the shared parameter and the similarity regularization in (4) work well.
(2) The deep network-related methods achieve better performance than the non-deep network methods. We can see that SVM, WSDTL and the proposed Dratle method achieve higher accuracy than DMTTL, MTSVM and MTTL. The reason is that, although they are fed with the same features, the deep network-related methods are able to learn a high-level feature representation, which improves the classification accuracy.

Figure 1. Accuracy curves on 0 v.s. R_0 for 6 methods.
(3) The proposed Dratle outperforms in the ablation experiment. Comparing the Dratle and SVM methods, we can conclude that the shared parameter and the similarity regularization in (4) are able to improve the performance of the classification task in the target domain. The reason is that these terms in (4) are able to utilize the source domain data to assist in constructing a predictive classifier in the target domain. The outperformance of Dratle over WSDTL manifests the benefit of the similarity regularization in (4): although both Dratle and WSDTL exploit the parameter sharing mechanism, Dratle also exploits the similarity regularization in (4), so that the relationship between the source domain and the target domain can be better determined.
(4) From Figure 1, we can see that as the ratio increases, the accuracy of Dratle also increases. The reason is that the target domain data then contain more information, so the classifier is able to effectively capture the target domain data distribution; these data assist in constructing a more predictive classifier in the target domain, and the generalization ability of the target domain classifier is enhanced. In addition, for all training sample sizes, Dratle consistently outperforms the other methods.

Conclusion and Future Work
In this paper, we proposed a multi-task transfer learning method called Dratle, based on the SVM and a deep network. In the proposed Dratle, we use parameter sharing and a similarity regularization to well determine the relationship between the source domain and the target domain. We also revised the SVM so that gradient-based optimization is feasible, which yields end-to-end optimization of this transfer learning-based deep network. Besides, in the experiments, the proposed method performs better on the benchmark transfer learning data sets. In the future, we will pay more attention to the Dratle method with outlier detection and data stream applications.