Abstract

The assumption that the training and testing samples are drawn from the same distribution is violated under the covariate shift setting, and most algorithms for covariate shift first estimate distributions and then reweight samples based on the estimated distributions. Because correctly estimating these distributions is difficult, previous methods cannot achieve good classification performance. In this paper, we first present two types of covariate shift problems. Rather than estimating the distributions, we then design an effective method that selects, based on a feature-space partition, a maximum subset following the target testing distribution from the auxiliary set or the target training set. Finally, we prove that our subset selection method can consistently deal with both scenarios of covariate shift. Experimental results demonstrate that a classifier trained on the selected maximum subset exhibits better generalization ability and running efficiency than traditional methods under the covariate shift setting.

1. Introduction

Traditional classification methods, such as Support Vector Machines (SVMs) [1, 2], decision trees [3, 4], and neural networks [5, 6], are based on the assumption that the training and testing samples are drawn from the same distribution. In many classification scenarios, such as the class imbalance problem [7, 8], concept drift [9, 10], and covariate shift [11], however, this assumption is violated. Informally, covariate shift refers to learning settings in which the source data sets and target data sets have the same feature attributes, label attribute, and conditional probability $P(Y \mid X)$, but different feature distributions $P(X)$. In this paper, we mainly study classification under covariate shift. Below, we give two types of classification scenarios belonging to covariate shift; they are what we focus on in this paper.

Scenario 1. Classification problems contain training samples, testing samples, and auxiliary samples, where training samples and testing samples are drawn from the same distribution, while the auxiliary samples are drawn from another distribution. In addition, the training set size is very small.
In the real world, many classification problems belong to Scenario 1. For example, suppose we want to construct a Web page classification model. The Web data used to train the model can easily become outdated when applied to the Web some time later, because the topics on the Web change frequently. New data are often expensive to label, and thus their quantity is limited due to cost. How to accurately classify the new test data by making maximum use of the old data becomes a critical problem.

Scenario 2. Classification problems contain training samples and testing samples, where training samples and testing samples are drawn from different distributions. There are no auxiliary samples.
For example, suppose we are using a learning method to induce a model that predicts the side effects of a treatment for a given patient. Because the treatment is not given randomly to individuals in the general population, the available training samples are not a random sample from the population. Therefore, the training samples and testing samples are drawn from different distributions. How to accurately classify the testing data by employing the training data becomes a critical problem.

In this paper, we address the two types of covariate shift problems by training on a newly constructed set that approximately follows the target distribution. The rest of this paper is organized as follows. In Section 2, we formally define covariate shift in machine learning terms and describe the related work on this problem. In Section 3, we propose the maximum identical distribution subset (MIDS) construction method, which matches sample numbers between the target set and the auxiliary set in each feature subspace. In Section 4, we present the corresponding data correction methods for the two scenarios, propose the corresponding classification algorithms, and analyze the time complexity of the algorithms. Our experimental results and discussion are shown in Section 5. Section 6 summarizes the main contributions of this paper and outlines some future work.

2. Covariate Shift

2.1. Related Concepts

In this section, we introduce some notations and definitions that are used in this paper. First of all, we give the definitions of a “domain” and a “task,” respectively.

Definition 1 (domain). In this paper, a domain $\mathcal{D}$ consists of two components: a feature space $\mathcal{X}$ and a marginal probability distribution $P(X)$, where $X = \{x_1, \ldots, x_n\} \in \mathcal{X}$.

Definition 2 (task). Given a specific domain $\mathcal{D} = \{\mathcal{X}, P(X)\}$, a task $\mathcal{T}$ consists of two components: a label space $\mathcal{Y}$ and an objective predictive function $f(\cdot)$ (denoted by $\mathcal{T} = \{\mathcal{Y}, f(\cdot)\}$), which is not observed but can be learned from the training data, which consist of pairs $\{x_i, y_i\}$, where $x_i \in \mathcal{X}$ and $y_i \in \mathcal{Y}$. The function $f(\cdot)$ can be used to predict the corresponding label, $f(x)$, of a new instance $x$. From a probabilistic viewpoint, $f(x)$ can be written as $P(y \mid x)$.

In this section, we denote the source domain data by $D_S = \{(x_{S_1}, y_{S_1}), \ldots, (x_{S_{n_S}}, y_{S_{n_S}})\}$, where $x_{S_i} \in \mathcal{X}_S$ is the data instance and $y_{S_i} \in \mathcal{Y}_S$ is the corresponding class label. Similarly, we denote the target domain data by $D_T = \{(x_{T_1}, y_{T_1}), \ldots, (x_{T_{n_T}}, y_{T_{n_T}})\}$, where the input $x_{T_i}$ is in $\mathcal{X}_T$ and $y_{T_i} \in \mathcal{Y}_T$ is the corresponding output. In most cases, $0 \le n_T \ll n_S$.

We now give a formal definition of covariate shift.

Definition 3 (covariate shift). Covariate shift refers to the learning settings that have the following features: (1) the source domain and the target domain have the same feature and label spaces; that is, $\mathcal{X}_S = \mathcal{X}_T$ and $\mathcal{Y}_S = \mathcal{Y}_T$. (2) The source domain and the target domain have different feature distributions; that is, $P_S(X) \ne P_T(X)$. (3) The source domain and the target domain have the same concept; that is, $f_S(\cdot) = f_T(\cdot)$ or $P_S(Y \mid X) = P_T(Y \mid X)$.

It is worthwhile to note that there can be multiple auxiliary data sets in classification problems under covariate shift, and their feature distributions can be different. In addition, from the definition of covariate shift, we can see that the two scenarios described in Section 1 do belong to covariate shift: in Scenario 1, the auxiliary samples play the role of source domain data, and in Scenario 2, the testing samples can be considered as target samples and the training samples as auxiliary (source) samples.
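To make the setting concrete, the following minimal sketch (Python with NumPy assumed; the distributions and the labeling rule are illustrative choices of ours, not taken from the paper) generates source and target data that share the conditional $P(Y \mid X)$ but differ in $P(X)$, which is exactly the situation of Definition 3:

```python
import numpy as np

rng = np.random.default_rng(0)

def label(x):
    # The shared concept: the same P(y | x) in both domains.
    return (x.sum(axis=1) > 0).astype(int)

# Source and target share the labeling rule but differ in P(X).
x_source = rng.normal(loc=-0.5, scale=1.0, size=(500, 2))
x_target = rng.normal(loc=+1.0, scale=1.0, size=(500, 2))
y_source, y_target = label(x_source), label(x_target)

# P_S(X) != P_T(X), while P(Y | X) is identical by construction.
print(x_source.mean(axis=0), x_target.mean(axis=0))
```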

2.2. Related Works

As described before, covariate shift includes the two scenarios described above. With respect to Scenario 1, auxiliary samples are utilized to improve the performance of classifiers. In previous works, Wu and Dietterich [12] proposed an image classification algorithm that uses both inadequate training data and plenty of low-quality auxiliary data. They demonstrated some improvement by using the auxiliary data; however, they did not give a quantitative study using different auxiliary examples. Liao et al. [13] improved learning with auxiliary data using active learning. Rosenstein et al. [14] proposed a hierarchical naive Bayes approach for transfer learning using auxiliary data and discussed when transfer learning would improve or decrease the performance. Dai et al. [15] proposed a covariate shift-related algorithm, TrAdaBoost, an extension of the AdaBoost algorithm, to address inductive transfer learning problems. TrAdaBoost assumes that the source and target domain data use exactly the same set of features and labels but that the distributions of the data in the two domains are different. In addition, TrAdaBoost assumes that, due to the difference in distributions between the source and target domains, some of the source domain data may be useful in learning for the target domain while some may not and could even be harmful. It attempts to iteratively reweight the source domain data to reduce the effect of the “bad” source data while encouraging the “good” source data to contribute more to the target domain. In each iteration, TrAdaBoost trains the base classifier on the weighted source and target data, and the error is calculated only on the target data. Furthermore, TrAdaBoost uses the same strategy as AdaBoost [16] to update the weights of incorrectly classified examples in the target domain while using a different strategy to update the weights of incorrectly classified examples in the source domain. However, TrAdaBoost cannot deal with the case where there are multiple auxiliary data sets coming from different distributions.

With respect to Scenario 2, unlabeled testing samples are utilized to improve the performance of classifiers. Unlike the semisupervised learning problem [17], in Scenario 2 the unlabeled testing samples follow a different distribution from the training samples and are used to correct the sample selection bias. In previous works, most approaches intend to estimate the importance $P_T(x)/P_S(x)$. If we can estimate the importance for each instance, we can solve the learning problems under covariate shift. There exist various ways to estimate $P_T(x)/P_S(x)$.

Zadrozny [18] proposed to estimate the terms $P_T(x)$ and $P_S(x)$ independently by constructing simple classification problems and then to estimate the importance by taking the ratio of the estimated densities. However, density estimation is known to be a hard problem, particularly in high-dimensional cases. Therefore, this approach may not be effective.
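As a rough illustration of this two-step idea (a sketch under our own assumptions, not Zadrozny's actual construction, which works through classification problems rather than density estimators), one can estimate the two densities separately and take their ratio; scikit-learn's KernelDensity is assumed here:

```python
import numpy as np
from sklearn.neighbors import KernelDensity

def importance_by_density_ratio(x_source, x_target, bandwidth=0.5):
    """Estimate P_T(x) / P_S(x) at the source points by fitting two
    separate density estimators and taking the ratio of the densities."""
    kde_s = KernelDensity(bandwidth=bandwidth).fit(x_source)
    kde_t = KernelDensity(bandwidth=bandwidth).fit(x_target)
    # score_samples returns log densities, so the ratio is a difference.
    log_ratio = kde_t.score_samples(x_source) - kde_s.score_samples(x_source)
    return np.exp(log_ratio)
```

The difficulty mentioned above shows up directly: the quality of the estimated importance hinges on two density estimates that degrade quickly as the dimension grows.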

Huang et al. [19] proposed a kernel-mean matching (KMM) algorithm to learn the importance $P_T(x)/P_S(x)$ directly by matching the means between the source domain data and the target domain data in a reproducing kernel Hilbert space (RKHS). KMM is shown to work well if tuning parameters such as the kernel width are chosen appropriately. Thus, the importance estimation problem is relocated to a model selection problem. Standard model selection methods such as cross-validation, however, are heavily biased under covariate shift. Therefore, KMM cannot be directly applied within the cross-validation [20] framework.
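The core of KMM can be sketched as follows (a simplified version: the full method of Huang et al. is a constrained quadratic program, and the normalization constraint on the weights is dropped here for brevity; NumPy, SciPy, and scikit-learn are assumed):

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.metrics.pairwise import rbf_kernel

def kmm_weights(x_source, x_target, gamma=1.0, B=10.0):
    """Find source weights beta that make the weighted source mean
    match the target mean in the RKHS induced by an RBF kernel."""
    ns, nt = len(x_source), len(x_target)
    K = rbf_kernel(x_source, x_source, gamma=gamma)
    kappa = (ns / nt) * rbf_kernel(x_source, x_target, gamma=gamma).sum(axis=1)

    # Minimize 0.5 * beta' K beta - kappa' beta subject to 0 <= beta <= B.
    res = minimize(lambda b: 0.5 * b @ K @ b - kappa @ b,
                   x0=np.ones(ns),
                   jac=lambda b: K @ b - kappa,
                   bounds=[(0.0, B)] * ns)
    return res.x
```

Note that gamma and B are exactly the kind of tuning parameters whose selection is problematic under covariate shift, as discussed above.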

Unlike KMM, Sugiyama et al. [21] proposed an algorithm known as Kullback-Leibler importance estimation procedure (KLIEP), which is equipped with a natural model selection procedure. KLIEP can be integrated with cross-validation to perform model selection automatically in two steps: (1) estimating the weights of the source domain data; (2) training models on the reweighted data.

In this paper, we propose a novel method that constructs a MIDS to deal with classification problems under covariate shift. The formal definition of the MIDS and its construction method are given in the next section. Unlike previous transfer learning methods, our method can consistently deal with both scenarios and with cases where there are multiple auxiliary data sets coming from different distributions. Furthermore, unlike the above sample reweighting techniques, we do not estimate distributions but match sample numbers between the target set and the auxiliary set in each feature subspace; we do not reweight samples but construct a new training set that approximately follows the target distribution.

3. The MIDS Construction Method

In this section, we use two data sets, a target set and an auxiliary set. Our objective is to design a method that can construct a MIDS from the auxiliary samples according to the target distribution. First of all, we give the formal definitions of the identical distribution subset (IDS) and the MIDS.

Definition 4 (identical distribution subset, IDS). Let $T$ be a target sample set and $S$ a source sample set. Assume that they follow different distributions but have the same feature and label spaces; that is, $\mathcal{X}_S = \mathcal{X}_T$ and $\mathcal{Y}_S = \mathcal{Y}_T$. An identical distribution subset $I$ is a subset of $S$ that follows the same distribution as $T$.

Definition 5 (maximum identical distribution subset, MIDS). An identical distribution subset $I^*$ is called a maximum identical distribution subset if there exists no other identical distribution subset with a bigger size than that of $I^*$.

3.1. Basic Idea of MIDS Construction Method

Our basic idea for MIDS construction is first to partition the feature space into several subspaces and then to construct the MIDS by matching sample numbers between the target set and the auxiliary set in each feature subspace; that is, we select a maximum amount of auxiliary samples from each subspace to compose the MIDS according to the proportion of target samples in each subspace. The detailed process of the MIDS construction method is presented in Section 3.2.

3.2. The Detailed MIDS Construction Process

Let $T = \{t_1, t_2, \ldots, t_m\}$ be the $n$-dimensional target set drawn from the distribution $P_T(X)$ and $S = \{s_1, s_2, \ldots, s_k\}$ the $n$-dimensional source set drawn from the distribution $P_S(X)$.

(1) Partitioning the Feature Space into Several Subspaces. Firstly, compute the mean $\bar{t}$ of the target set $T$ by the following formula:

$$\bar{t} = \frac{1}{m} \sum_{i=1}^{m} t_i.$$

Then, partition the $n$-dimensional space into $2^n$ subspaces. In detail, let $x = (x^{(1)}, x^{(2)}, \ldots, x^{(n)})$ be any vector in the feature space, where $x^{(i)}$ denotes the $i$th-dimensional value of the vector $x$. Comparing the $i$th-dimensional value of $x$ with the $i$th-dimensional value of $\bar{t}$ for $i = 1, \ldots, n$, we obtain $n$ inequalities. We use an $n$-dimensional binary vector to represent these inequalities; that is to say, if $x^{(i)} \le \bar{t}^{(i)}$, we label the $i$th-dimensional value of the binary vector with 0, and otherwise with 1. Thus we can divide the feature space into $2^n$ subspaces, corresponding to the binary vectors from $(0, 0, \ldots, 0)$ to $(1, 1, \ldots, 1)$, respectively. We number the subspaces with the decimal numbers corresponding to the binary vectors.
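As a concrete illustration of this numbering scheme, a vector's subspace number can be computed as follows (a minimal Python sketch with NumPy assumed; the function name is ours):

```python
import numpy as np

def subspace_index(x, t_mean):
    """Map an n-dimensional vector to its subspace number: bit i is 0
    if x[i] <= t_mean[i] and 1 otherwise, read as a binary number."""
    bits = (np.asarray(x) > np.asarray(t_mean)).astype(int)
    return int("".join(map(str, bits)), 2)
```

For example, in a 3-dimensional space, a vector lying above the target mean in the first dimension only is assigned the binary vector $(1, 0, 0)$ and hence subspace number 4.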

(2) Computing the Proportion of the Target Samples in Each Subspace. Compute the number of target samples in each subspace, and thus we obtain the proportion of target samples in each subspace.

(3) Extracting Samples from the Auxiliary Set. We first compute the number of auxiliary samples in each subspace. Then, according to these proportions and numbers, we select a maximum amount of samples from each subspace to compose the MIDS, noting that the proportion of auxiliary samples selected from each subspace should be consistent with the proportion of target samples in that subspace.

Thus we obtain the MIDS, and this subset can be considered to follow approximately the target distribution.
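Putting the three steps together, a minimal sketch of the whole construction could look as follows (Python with NumPy assumed; `subspace_index` is the helper above, and the floor-based rounding of the per-subspace quotas is our implementation choice, not fixed by the paper):

```python
import numpy as np
from collections import Counter, defaultdict

def construct_mids(source, target):
    """Select a maximum subset of `source` whose per-subspace sample
    proportions match those of `target` (a sketch of Algorithm 1)."""
    t_mean = np.mean(target, axis=0)
    t_counts = Counter(subspace_index(x, t_mean) for x in target)

    by_subspace = defaultdict(list)
    for x in source:
        by_subspace[subspace_index(x, t_mean)].append(x)

    m = len(target)
    # Largest total size N such that each target subspace j can supply
    # floor(N * m_j / m) source samples: N = min_j floor(k_j * m / m_j).
    n_max = min(len(by_subspace[j]) * m // c for j, c in t_counts.items())

    mids = []
    for j, c in t_counts.items():
        mids.extend(by_subspace[j][: n_max * c // m])  # per-subspace quota
    return mids
```

Note that only the subspaces actually containing target samples are visited, which anticipates the sparsity observation made in Section 3.3.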

3.3. The Description of the MIDS Construction Algorithm and Its Time Complexity Analysis

The pseudocode of the MIDS construction algorithm is described in Algorithm 1. Firstly, we define two 2-dimensional arrays, $A_T$ and $A_S$. The first dimension of Array $A_T$ records the subspace numbers of the samples in the target sample set. It is worth noting that Array $A_T$ only has one record for samples with the same subspace number. The second dimension of Array $A_T$ records the number of samples in the corresponding subspace. Array $A_S$ is the same as Array $A_T$, but for the source sample set.

Require: the source sample set $S$, the target sample set $T$, dimension $n$,
 size of the source sample set $k$, size of the target sample set $m$
Ensure: the MIDS $I$
 $I \leftarrow \emptyset$;
 Declare two 2-dimensional arrays, $A_T$ and $A_S$;
     /* The first dimension of Array $A_T$ records the subspace
     numbers of the samples in the target sample set. Note that
     Array $A_T$ only has one record for samples with the same
     subspace number. The second dimension of Array $A_T$
     records the number of samples in the corresponding subspace.
     Array $A_S$ is the same as $A_T$, but for the source sample set */
 Obtain Array $A_T$ from $T$ and the target mean $\bar{t}$;
 Obtain Array $A_S$ from $S$ and $\bar{t}$;
 Select a maximum amount of samples from each subspace to compose
     the MIDS $I$ according to Array $A_T$ and Array $A_S$;

Then we obtain Array $A_T$ from $T$ and the mean $\bar{t}$. We compute the subspace number of each sample by real-number comparison operations. To obtain the number of samples in each corresponding subspace, we need to scan the whole data set $T$ once. The same holds for Array $A_S$ and $S$.

Finally, we select a maximum amount of samples from each subspace to compose the MIDS according to Array $A_T$ and Array $A_S$.

We split the whole space into $2^n$ subspaces, and for large $n$ the number of subspaces is enormous, which would cause the curse of dimensionality if we computed the number of samples in every subspace. Luckily, this is not necessary, as samples are always sparse in a high-dimensional space. Thus we only need to count the samples in the subspaces that actually contain samples.

Therefore the time complexity is mainly composed of two parts, corresponding to calculating the proportions of the target samples in the occupied subspaces and calculating the numbers of the source samples in the occupied subspaces, respectively. It is worth noting that we only need to count the samples in the subspaces that actually contain samples.

Thus, if we define a one-time real-number comparison as one basic operation, we need to perform $O(n(k + m))$ operations, where $k$ denotes the size of the source sample set and $m$ denotes the size of the target sample set; each sample requires $n$ comparisons with the mean $\bar{t}$ to determine its subspace number.

4. Classification Methods under Covariate Shift by Constructing the MIDS

In Section 3, we presented a general MIDS construction method by matching sample numbers between the target set and the auxiliary set in each feature subspace. In this section, we propose the specialized MIDS construction methods corresponding to Scenarios 1 and 2, respectively. Furthermore, we propose the classification methods for the two scenarios.

4.1. The MIDS Construction of Scenario 1

Let $T_r$ be the target training set, $T_e$ the target testing set, and $S$ the auxiliary training set. Assume that $T_r$ and $T_e$ follow the same distribution $P_T(X)$ and that $S$ follows another distribution $P_S(X)$. In this section, we present two kinds of MIDS construction methods, one direct and the other indirect. Moreover, we prove that the effect of the indirect method is equivalent to that of the direct method.

4.1.1. The Direct MIDS Construction Method of Scenario 1

With respect to the direct MIDS construction, we consider the feature vector $x$ and the label $y$ as one joint vector $(x, y)$. We consider $T_r$ and $S$ as the target sample set and the source sample set, respectively. Thus we can use the above algorithm directly to obtain the MIDS. The MIDS construction method that considers the feature vector and the label as a joint vector is called the direct MIDS construction method.
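In code terms, the direct method simply appends the label as one extra coordinate before running the construction (a sketch reusing the hypothetical `construct_mids` helper from Section 3):

```python
import numpy as np

def construct_mids_direct(source_x, source_y, target_x, target_y):
    """Direct method: treat (feature vector, label) as one joint vector."""
    source_joint = np.column_stack([source_x, source_y])
    target_joint = np.column_stack([target_x, target_y])
    return construct_mids(source_joint, target_joint)  # joint (x, y) samples
```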

Since the feature vector and the label are considered as one joint vector, the dimension of the samples increases. As described in Section 3.3, as the dimension increases, the running time of the MIDS algorithm increases correspondingly. In the next section, we present the indirect MIDS construction method, which does not consider $x$ and $y$ collectively but constructs the MIDS according to the feature vector alone. Thus the indirect method can effectively reduce the running time, and moreover it can be applied to the case where the target testing set contains only feature vectors.

4.1.2. The Indirect MIDS Construction Method of Scenario 1

With respect to the case where there are no class labels in the target testing set, we can construct the MIDS according to the feature vector alone. The MIDS construction method that considers only the feature vector is called the indirect MIDS construction method. In what follows, the target testing set $T_e$ contains only feature vectors. First of all, we present the detailed process of the indirect MIDS construction method.

Process  1. Remove all the labels of the samples in $T_r$, and denote the set composed of the remaining feature vectors by $T_r^x$. Similarly, remove all the labels of the samples in $S$, and denote the set composed of the remaining feature vectors by $S^x$.

Process  2. Use the MIDS construction algorithm to obtain a subset $I^x$ of $S^x$.

Process  3. Add the class labels removed in Process 1 back to each sample of $I^x$ correspondingly, and thus we obtain a subset $I$ of $S$.
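The three processes translate directly into code; the sketch below (reusing `subspace_index` from Section 3; index bookkeeping stands in for the "remove and re-attach the labels" wording) returns the selected feature vectors together with their original labels:

```python
import numpy as np
from collections import Counter, defaultdict

def construct_mids_indirect(source_x, source_y, target_x):
    """Indirect method: run the subspace matching on feature vectors
    alone (Processes 1-2), then re-attach the labels (Process 3)."""
    t_mean = np.mean(target_x, axis=0)
    t_counts = Counter(subspace_index(x, t_mean) for x in target_x)

    by_subspace = defaultdict(list)
    for i, x in enumerate(source_x):          # labels are ignored here
        by_subspace[subspace_index(x, t_mean)].append(i)

    m = len(target_x)
    n_max = min(len(by_subspace[j]) * m // c for j, c in t_counts.items())
    selected = []
    for j, c in t_counts.items():
        selected.extend(by_subspace[j][: n_max * c // m])

    # Process 3: add the removed class labels back to the selected samples.
    sel_x = np.asarray([source_x[i] for i in selected])
    sel_y = np.asarray([source_y[i] for i in selected])
    return sel_x, sel_y
```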

Below we will prove that the effect of the indirect method is equivalent to that of the direct method.

Theorem 6. The subset $I$ obtained from the auxiliary set $S$ by the indirect MIDS method follows the same distribution as the target set $T$; that is, $P_I(X, Y) = P_T(X, Y)$, where $P_I$ and $P_T$ denote the distributions of $I$ and $T$, respectively.

Proof. Let $P_I$ and $P_S$ denote the distributions of $I$ and $S$, respectively. From the definition of conditional distribution, we have

$$P_I(X, Y) = P_I(Y \mid X) \, P_I(X).$$

As described in Process 3, $I$ is obtained by adding the original class labels to $I^x$ correspondingly. Thus the conditional probability is unchanged; that is, $P_I(Y \mid X) = P_S(Y \mid X)$. From Definition 3, we know that $P_S(Y \mid X) = P_T(Y \mid X)$. Thus we can obtain that $P_I(Y \mid X) = P_T(Y \mid X)$. Moreover, since $I^x$ is a MIDS of $S^x$ constructed according to the target feature distribution, we can obtain that $P_I(X) = P_T(X)$. Therefore we have $P_I(X, Y) = P_T(Y \mid X) P_T(X) = P_T(X, Y)$.

4.2. The MIDS Construction of Scenario 2

Let $T_r$ be the target training set and $T_e$ the target testing set. Assume that $T_r$ and $T_e$ follow the distributions $P_S(X)$ and $P_T(X)$, respectively. We consider $T_e$ and $T_r$ as the target sample set and the source sample set, respectively. Thus we can also use the above algorithm directly to obtain the MIDS.

With respect to the case where there are no class labels in the target testing set, we can also use the indirect method to construct the MIDS. In what follows, the target testing set $T_e$ contains only feature vectors. The detailed process is as follows.

Process 1. Remove all the labels of the samples in $T_r$, and denote the set composed of the remaining feature vectors by $T_r^x$.

Process 2. Use the MIDS construction algorithm to obtain a subset $I^x$ of $T_r^x$.

Process 3. Add the class labels removed in Process 1 back to each sample of $I^x$ correspondingly, and thus we obtain a subset $I$ of $T_r$.

4.3. Classification Algorithms

With the help of the MIDS construction method, we can perform effective classification with traditional classification methods. With respect to Scenario 1, we first construct the MIDS from the auxiliary set and then train a model on the set composed of the target training set and the MIDS. The pseudocodes of the two classification algorithms corresponding to the direct and indirect MIDS construction methods are shown in Algorithms 2 and 3, respectively. With respect to Scenario 2, we first construct the MIDS from the target training set and then train a model on this MIDS. The pseudocodes of the two classification algorithms corresponding to the direct and indirect MIDS construction methods are shown in Algorithms 4 and 5, respectively.

Algorithm 2 (the direct classification algorithm of Scenario 1).
Require: the target training set $T_r$, the target testing set $T_e$,
 the auxiliary sample set $S$, a base learning algorithm Learner
Ensure: classification function $f$
 $T \leftarrow T_r$;        /* $T$ denotes the target sample set */
 $I \leftarrow \mathrm{MIDS}(S, T)$;     /* constructing the MIDS */
 $D \leftarrow T_r \cup I$;     /* $D$ denotes the new training set */
 $f \leftarrow \mathrm{Learner}(D)$;    /* $f$ denotes the function implemented by a base
         learning algorithm trained on the new training set $D$ */
Algorithm 3 (the indirect classification algorithm of Scenario 1, IDC1).
Require: the target training set $T_r$, the target testing set $T_e$,
 the auxiliary sample set $S$, a base learning algorithm Learner
Ensure: classification function $f$
 $T_r^x \leftarrow$ the feature vectors of $T_r$;     /* $T_r^x$ denotes the feature vectors set of $T_r$ */
 $T^x \leftarrow T_r^x \cup T_e$;    /* $T^x$ denotes the target sample set */
 $S^x \leftarrow$ the feature vectors of $S$;   /* $S^x$ denotes the feature vectors set of $S$ */
 $I^x \leftarrow \mathrm{MIDS}(S^x, T^x)$;    /* constructing the MIDS */
 $I \leftarrow I^x$ with its labels restored;  /* add the corresponding labels to the feature
              vectors of $I^x$ */
 $D \leftarrow T_r \cup I$;   /* $D$ denotes the new training set */
 $f \leftarrow \mathrm{Learner}(D)$;  /* $f$ denotes the function implemented by a base
         learning algorithm trained on the new training set $D$ */
Algorithm 4 (the direct classification algorithm of Scenario 2).
Require: the target training set $T_r$, the target testing set
$T_e$, a base learning algorithm Learner
Ensure: classification function $f$
 $I \leftarrow \mathrm{MIDS}(T_r, T_e)$;     /* constructing the MIDS */
 $f \leftarrow \mathrm{Learner}(I)$;  /* $f$ denotes the function implemented by a base
         learning algorithm trained on the new training set $I$ */
Algorithm 5 (the indirect classification algorithm of Scenario 2, IDC2).
Require: the target training set $T_r$, the target testing set
$T_e$, a base learning algorithm Learner
Ensure: classification function $f$
 $T_r^x \leftarrow$ the feature vectors of $T_r$;     /* $T_r^x$ denotes the feature vectors set of $T_r$ */
 $I^x \leftarrow \mathrm{MIDS}(T_r^x, T_e)$;  /* constructing the MIDS */
 $I \leftarrow I^x$ with its labels restored;  /* add the corresponding labels to the feature
             vectors of $I^x$ */
 $f \leftarrow \mathrm{Learner}(I)$;  /* $f$ denotes the function implemented by a base
         learning algorithm trained on the new training set $I$ */
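For concreteness, the two indirect classification pipelines might be sketched as follows (scikit-learn's SVC is assumed as the base learner, although the algorithms above leave Learner abstract; `construct_mids_indirect` is the hypothetical helper from Section 4.1.2):

```python
import numpy as np
from sklearn.svm import SVC

def idc1(train_x, train_y, test_x, aux_x, aux_y):
    """Indirect classification for Scenario 1 (cf. Algorithm 3)."""
    target_x = np.vstack([train_x, test_x])        # the target sample set
    mids_x, mids_y = construct_mids_indirect(aux_x, aux_y, target_x)
    new_x = np.vstack([train_x, mids_x])           # the new training set
    new_y = np.concatenate([train_y, mids_y])
    return SVC(kernel="rbf").fit(new_x, new_y)

def idc2(train_x, train_y, test_x):
    """Indirect classification for Scenario 2 (cf. Algorithm 5)."""
    mids_x, mids_y = construct_mids_indirect(train_x, train_y, test_x)
    return SVC(kernel="rbf").fit(mids_x, mids_y)
```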

5. Experiments

In this section, we perform experiments to test the performance of the proposed classification algorithms. As proven in Section 4, the effect of the indirect method is equivalent to that of the direct method; thus, we test only the performance of the two indirect classification algorithms. The experimental data come from the UCI Machine Learning Repository [22]. All experiments are run on a 2.00 GHz Intel(R) Core(TM) i5-4200U CPU with 4 GB main memory under Windows 8.

5.1. The Experiment on Scenario 1

(1) Experimental Data Construction. This experiment is performed on 20 data sets, from which the target training set, the target testing set, and the auxiliary training set are constructed by the following principles.

Principle 1. The target training set and the target testing set should follow the same distribution.

Principle 2. The auxiliary training set and the target set should follow different distributions.

Principle 3. The size of the target training set is far less than that of the auxiliary training set.

We select auxiliary samples using a deliberately biased procedure (as in [19]). To describe our biased selection scheme, we need to define an additional random variable $s_i$ for each point in the pool of possible training samples, where $s_i = 1$ means the $i$th sample is included and $s_i = 0$ indicates an excluded sample. In this paper, we discuss classification problems under covariate shift, so we only consider the situation $P(s_i = 1 \mid x_i, y_i) = P(s_i = 1 \mid x_i)$. Below, we present the detailed method of experimental data construction. First of all, we select some samples randomly from the original data set to compose the target set, part of which is used for training and the rest for testing. Then, among the remaining samples, we apply a biased sampling scheme based on the input features to construct the auxiliary set. For convenience, in this paper, we only consider a biased sampling scheme based on one input feature. For example, with respect to the breast cancer data set, there are nine features with integer values from 1 to 10. We consider a biased sampling scheme based on the first feature. Since smaller feature values predominate in the unbiased data, we set the inclusion probability $P(s = 1 \mid x^{(1)})$ to be small for small values and large for large values of the first feature $x^{(1)}$.
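As an illustration of such a scheme (the threshold and the two inclusion probabilities below are placeholder values, not the ones used in our experiments):

```python
import numpy as np

rng = np.random.default_rng(0)

def biased_select(x_first_feature, threshold=5, p_low=0.2, p_high=0.8):
    """Feature bias P(s = 1 | x): include a sample with low probability
    when its first feature is small and with high probability otherwise.
    The threshold and probabilities are illustrative placeholders."""
    p = np.where(np.asarray(x_first_feature) <= threshold, p_low, p_high)
    return rng.random(len(p)) < p

# mask = biased_select(data[:, 0]); auxiliary = data[mask]
```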

(2) Experimental Methods. In the following, we compare our method (the indirect classification algorithm of Scenario 1, denoted by IDC1) against two other methods: the traditional classification algorithm (training on the set composed of the target training samples and the auxiliary samples) and the TrAdaBoost algorithm proposed in [15].

We select $C$-SVC [23] and the Radial Basis Function (RBF) kernel $K(x_i, x_j) = \exp(-\|x_i - x_j\|^2 / (2\sigma^2))$ [1] as the basic classification algorithm and kernel function, respectively, for the above three methods, where $C$ is a penalty factor, $\sigma$ is a width parameter, and $x_i$ and $x_j$ are $n$-dimensional vectors in the original feature space. With respect to the multiclass data sets among the 20 selected data sets, we adopt the one-against-all (1-v-r) approach [24], which transforms a $k$-class problem into $k$ two-class problems, where one class is separated from the remaining ones. In this experiment, the best $C$ and $\sigma$ are obtained by 10-fold cross-validation.
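The corresponding model selection can be sketched with scikit-learn (assumed here; note that sklearn parameterizes the RBF width as gamma, which corresponds to $1/(2\sigma^2)$, and the grid values are illustrative):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# One-against-all (1-v-r) wrapper around C-SVC with an RBF kernel.
ovr = OneVsRestClassifier(SVC(kernel="rbf"))
param_grid = {"estimator__C": [0.1, 1, 10, 100],
              "estimator__gamma": [0.01, 0.1, 1]}
search = GridSearchCV(ovr, param_grid, cv=10)
# search.fit(train_x, train_y)  # selects the best C and gamma by 10-fold CV
```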

(3) Result Analysis. The three methods are compared on the selected 20 data sets. Five runs of 10-fold cross-validation are performed for each algorithm, and the average results are reported in Table 1, where the numbers following “±” are the standard deviations. The running times and parameter values of the different algorithms are shown in Table 2, where $N$ denotes the number of iterations, $C$ is the penalty factor, $\sigma$ is the width parameter, and $t$ (ms) denotes the running time. $N$ is set to 100 according to the parameter setting in [15], and the best $C$ and $\sigma$ are obtained by 10-fold cross-validation.

As shown in Table 1, the precision of SVM is strictly lower than that of IDC1 and TrAdaBoost. Intuitively, this is because, unlike SVM, IDC1 and TrAdaBoost are learning techniques designed for the classification problem of Scenario 1. Furthermore, Table 1 shows that IDC1 outperforms TrAdaBoost. In detail, pairwise two-tailed $t$-tests indicate that there are 16 data sets (Australian, balance, breast, Cleveland, credit, diabetes, ionosphere, iris, page, sonar, thyroid, voting, waveform40, wine, wdbc, and wpbc) on which IDC1 is significantly more accurate than TrAdaBoost, while there is no significant difference on the remaining 4 data sets. We believe that the auxiliary set contains not only good samples but also noisy data that make the distribution of the auxiliary set different from that of the target set. The reason why IDC1 outperforms TrAdaBoost might be that IDC1 always employs the most important samples, which are included in the MIDS, to help the learner, while TrAdaBoost sometimes cannot avoid using bad samples to help the learner.

Moreover, as shown in Table 2, the running time of IDC1 is the shortest, while that of TrAdaBoost is the longest. Thus IDC1 is the most efficient model. The reason is as follows. In this experiment, the traditional SVM algorithm uses the target sample set and the whole auxiliary sample set for training, while IDC1 employs the target sample set and only a subset of the auxiliary sample set. With respect to TrAdaBoost, its running time is the longest because it performs repeated iterations for better performance.

5.2. The Experiment on Scenario 2

(1) Experimental Data Construction. Unlike the first experiment, in this experiment, we only construct the target training set and the target testing set. We should guarantee that the target training set and the target testing set follow different distributions. Like Scenario 1, we also use a deliberately biased procedure to construct the experimental data.

(2) Experimental Method. In the following, we compare our method (the indirect classification algorithm of Scenario 2, denoted by IDC2) against two other methods: the traditional classification algorithm and the KLIEP algorithm proposed in [21]. We also select $C$-SVC and the RBF kernel as the basic classification algorithm and kernel function, respectively, and adopt the 1-v-r approach for the multiclass data sets.

(3) Result Analysis. Five runs of 10-fold cross-validation are performed for each algorithm, and the average results are reported in Table 3, where the numbers following “±” are the standard deviations. The running times and parameter values of the different algorithms are shown in Table 4.

As shown in Table 3, the precision of SVM is strictly lower than that of IDC2 and KLIEP. As in the analysis of Scenario 1, this is because, unlike SVM, IDC2 and KLIEP are learning techniques designed for the classification problem of Scenario 2. Furthermore, Table 3 shows that IDC2 is comparable to KLIEP, which is a state-of-the-art algorithm. In detail, pairwise two-tailed $t$-tests indicate that there are 4 data sets (Australian, credit, page, and waveform21) on which IDC2 outperforms KLIEP and 3 data sets (heart, vehicle, and wpbc) on which IDC2 performs slightly worse than KLIEP, while there is no significant difference on the remaining 13 data sets. Therefore, IDC2 exhibits a new way of classification learning for Scenario 2.

As shown in Table 4, IDC2 costs less time than SVM and KLIEP. As in Experiment 1, the reason is that in the training process IDC2 uses only a subset of the training samples, while SVM and KLIEP employ the whole training set.

6. Conclusion

In this paper, we first propose a MIDS construction method by matching sample numbers between the target set and the auxiliary set in each feature subspace, and we then propose a novel approach for classification under covariate shift by training on a newly constructed data set that approximately follows the target distribution. Our approach consists of two methods, a direct one and an indirect one. The theoretical analysis shows that the indirect method is equivalent to the direct method for MIDS construction but needs less running time. In our experiments, the two indirect algorithms, IDC1 and IDC2, demonstrate better classification ability than traditional learning techniques. In addition, our method can consistently deal with both scenarios of covariate shift and with cases where there are multiple auxiliary data sets coming from different distributions.

We note that our method assumes that the source domain and the target domain have the same concept and cannot deal with the case where they have different concepts, that is, the problem of concept drift. In the future, we will try to extend the proposed method to address this issue.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work is sponsored by the National Natural Science Foundation of China (nos. 61402246, 61402126, 61370083, 61370086, 61303193, and 61572268), a Project of Shandong Province Higher Educational Science and Technology Program (no. J15LN38), the National Research Foundation for the Doctoral Program of Higher Education of China (no. 20122304110012), the Natural Science Foundation of Heilongjiang Province of China (no. F201101), the Science and Technology Research Project Foundation of Heilongjiang Province Education Department (no. 12531105), and Heilongjiang Province Postdoctoral Research Start Foundation (no. LBH-Q13092).