Target contrastive pessimistic risk for robust domain adaptation

In domain adaptation, a classifier trained with labeled data from a source domain is adapted to generalize to a target domain. However, an adaptive classifier can perform worse than a non-adaptive one, due to invalid assumptions, increased sensitivity to estimation errors, or model misspecification. Our goal is to develop a domain-adaptive classifier that is robust in the sense that it does not rely on restrictive assumptions on how the source and target domains relate to each other and that it does not perform worse than the non-adaptive classifier. We formulate a conservative parameter estimator that only deviates from the source classifier when a lower risk is guaranteed for all possible labelings of the given target samples. We derive the classical least-squares and discriminant analysis cases and show that these perform on par with state-of-the-art domain-adaptive classifiers in sample selection bias settings, while outperforming them in more general domain adaptation settings.


Introduction
Generalization in supervised learning relies on the fact that future samples originate from the same underlying data-generating distribution as the ones used for training. However, this is not the case in settings where data is collected from different locations, different measurement instruments are used, or there is only access to biased data [25]. In these situations the labeled data does not represent the distribution of interest. This problem setting is referred to as domain adaptation, where the distribution of the labeled data is called the source domain and the distribution of interest is called the target domain [3,15]. Most often, data in the target domain is not labeled and adapting a source domain classifier, i.e., changing its predictions to suit the target domain, is the only means by which one can make accurate predictions. Unfortunately, depending on the domain dissimilarity, adaptive classifiers can easily perform worse than non-adaptive ones. We formulate a conservative adaptive classifier that always performs at least as well as the non-adaptive one.¹

✩ Handled by Associate Editor Francesco Tortorella.
* Corresponding author. E-mail address: w.m.kouw@tue.nl (W.M. Kouw).
¹ A shortened, preliminary version was accepted for S+SSPR [16]. The current version offers a significant extension with a clearer exposition, additional technical details and references, more experiments, and a comprehensive analysis and discussion.

In the general setting, domains can be arbitrarily different, which means generalization will be extremely difficult. However, there are cases where the problem setting is more structured: in the covariate shift setting, the marginal data distributions differ but the posterior distributions are equal [5,9,28]. In such cases, a correctly specified adaptive classifier will converge to the same solution as the target classifier [9]. One way to carry out adaptation is to weigh each source sample by how important it is under the target distribution and to train on the importance-weighted labeled source data. However, such a classifier can perform poorly in settings where the covariate shift assumption is false, i.e., where the posterior distributions of the two domains are not equal [8,19]. In that case, one often observes that a few samples are given large weights while all other samples are given near-zero weights, which greatly reduces the effective sample size [23, Chapter 8]. Sensitivity to domain relationship assumptions is not restricted to covariate shift. Another adaptive algorithm, Transfer Component Analysis (TCA), assumes the existence of a latent representation common to both domains. When that does not hold, mapping both source and target data onto transfer components will result in mixing of the class-conditional distributions and performance will deteriorate [24].
Since the validity of the aforementioned assumptions is difficult, if not impossible, to check, it is of interest to design robust classifiers. Robustness to uncertainty is often achieved through minimax optimization [17]. An example of a robust adaptive classifier is Robust Covariate Shift Adjustment (RCSA), which first maximizes risk with respect to the importance-weights and subsequently minimizes risk with respect to the classifier parameters [32]. It attempts to account for estimation errors in the importance-weights. Another example is the Robust Bias-Aware (RBA) classifier, which plays a game between a risk-minimizing target classifier and a risk-maximizing target posterior distribution [19]. The adversary is constrained to pick posteriors that match the moments of the source distribution statistics, to avoid posterior probabilities that result in degenerate classifiers (e.g., assigning all posterior probabilities the value 1). Matching moments means that RBA classifiers lose predictive power in areas of feature space where the source distribution has limited support. Note that both robust methods still rely on assuming covariate shift.
Our main contribution is a parameter estimator that produces estimates with a risk that is always lower than or equal to the risk of the source classifier, with respect to the given target samples. It does so without making domain relationship assumptions such as covariate shift, but by constructing a specific type of risk that can be considered transductive in the sense originally defined by Vapnik and Chervonenkis [see 30]. Furthermore, we show that in the case of discriminant analysis, the estimator will produce strictly smaller risks on the target data. To the best of our knowledge, such performance guarantees compared to the source classifier have not been shown before.
The paper is outlined as follows: Section 3 presents the formulation of our method, with discriminant analysis in Section 4 . Section 5.1 shows experiments on two data sets involving geographical sampling bias, indicating that our estimator consistently performs among the best. We conclude with limitations and a discussion in Section 6 . To start with, the next section briefly introduces the specific domain adaptation setting that we consider and comments on the transductive nature of our particular approach.

Domain adaptation and transduction
A domain is defined here as a particular joint probability distribution over a D-dimensional input space X ⊆ R^D and a set of K classes Y [15]. Let S mark a source domain, with n samples x = (x_1, ..., x_n) and corresponding labels y = (y_1, ..., y_n) ∈ Y^n drawn from the source domain's joint distribution. Similarly, let T mark a target domain, with m samples z = (z_1, ..., z_m) and corresponding labels u = (u_1, ..., u_m) drawn from the target domain's joint distribution. The target labels u are unknown at training time and the goal is to predict them, using only the unlabeled target samples z and the labeled source samples (x, y).

The meaning of transduction
Given that the primary performance measure in this work is specifically the risk on the unlabeled data of the target domain that is available to us, our objective is essentially transductive [see 15]. This is in line with the original definition of transduction as proposed by Vapnik and Chervonenkis [see 30].
It should be pointed out that, confusingly, what is referred to as transductive for most transfer learning and domain adaptation methods just means that there is labeled data available for the source but not for the target domain [see also 15]. The classifiers considered in papers such as [1,10,13], like most in this line of work, are in fact inductive [see 15]. Works like [27,29] exploit graph methods that do not have a ready out-of-sample extension and are therefore transductive in the sense of Vapnik and Chervonenkis. As Section 3 shows, our method focuses particularly on the risk obtained on the given target data and is, as such, transductive. As it turns out, it is specifically this approach that can provide us with performance guarantees, where other techniques cannot.
We should note that, typically, our target classifiers can still be used for classifying new and unseen target domain samples. That is, they can also be used for inductive inference. This is especially the case if the samples from the target domain can be considered representative of that domain. In that case, the performance on those particular target domain instances can equally well be interpreted as a regular empirical risk, used in standard empirical risk minimization [26,31] . Just as in the supervised learning setting, it is then assumed that having a small empirical risk carries over to a small generalization error and that the classifier can be successfully employed inductively.
As a final remark, we would like to state that the benefits of transduction over induction, or vice versa, are not always easily identified, especially because in many settings inductive classifiers can be used for transduction and the other way around. Refer to Chapter 25 in [6] for further views and considerations. Fig. 1 visualizes some concepts used throughout the paper. The left panel shows samples from the source domain, labeled as points (red) versus crosses (blue). These were drawn from isotropic Gaussians centered at [−2, 0] and [+2, 0], respectively. The black lines are a contour plot of the posterior probabilities of a classifier trained on the source data. The right panel shows data from the target domain, as well as the source classifier applied to the target data. These target samples were drawn from two Gaussian distributions, both with covariance matrix [3, 2; 2, 4], but one with a mean of [−1, 2] and one with a mean of [+2, 1]. The source and target domains are therefore related to each other through an affine transformation. Note that the source classifier does not fit the target data well.
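As an illustration, the toy domains described above can be generated in a few lines of NumPy. This is a sketch only: the per-class sample sizes, the random seed, and the ±1 label encoding are our own choices, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n_per_class = 50  # illustrative sample size (not specified in the text)

# Source domain: two isotropic Gaussians centered at [-2, 0] and [+2, 0].
x_neg = rng.multivariate_normal([-2, 0], np.eye(2), n_per_class)
x_pos = rng.multivariate_normal([+2, 0], np.eye(2), n_per_class)
x = np.vstack([x_neg, x_pos])
y = np.concatenate([-np.ones(n_per_class), np.ones(n_per_class)])

# Target domain: shifted means and a shared non-isotropic covariance
# [3, 2; 2, 4], i.e., an affine transformation of the source domain.
cov_t = np.array([[3.0, 2.0], [2.0, 4.0]])
z_neg = rng.multivariate_normal([-1, 2], cov_t, n_per_class)
z_pos = rng.multivariate_normal([+2, 1], cov_t, n_per_class)
z = np.vstack([z_neg, z_pos])
u = np.concatenate([-np.ones(n_per_class), np.ones(n_per_class)])
```

A classifier fitted to (x, y) and evaluated on (z, u) then reproduces the mismatch visible in Fig. 1.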

Robust estimator for target domain
In the following, we present the construction of our estimator. First, we discuss the risk of the classifier in the target domain. Secondly, we compare the target risk of a proposal classifier with the target risk incurred by the source classifier and thirdly, we assume a worst-case labeling for the given target samples.

Target risk
The empirical risk of a classifier in the source domain is computed as the average loss with respect to the source samples (x, y):

\hat{R}(h \mid x, y) = \frac{1}{n} \sum_{i=1}^{n} \ell(h(x_i), y_i), \qquad (1)

where h is the classification function mapping inputs to labels and \ell is a loss function comparing the classifier's prediction h(x_i) with the true label y_i at training time. Since the classification error, or 0-1 loss, cannot be directly optimized over, it is customary to choose a surrogate loss function, such as the quadratic loss (h(x_i) - y_i)^2 [11]. The source classifier is the classifier found by minimizing the empirical risk with respect to the source samples:

\hat{h}_S = \underset{h \in H}{\arg\min}\; \hat{R}(h \mid x, y), \qquad (2)

where H refers to the hypothesis space.
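The source risk and its minimizer can be sketched concretely for the quadratic surrogate loss mentioned above. The linear hypothesis class, the intercept augmentation, the ±1 label encoding, and the small ridge term (added for numerical stability) are our own assumptions for this sketch; the function names are hypothetical.

```python
import numpy as np

def quadratic_risk(theta, X, y):
    """Empirical risk: average quadratic surrogate loss (h(x_i) - y_i)^2
    of a linear classifier h(x) = [x, 1] @ theta."""
    Xa = np.hstack([X, np.ones((len(X), 1))])  # append intercept term
    return np.mean((Xa @ theta - y) ** 2)

def fit_source(X, y, reg=0.01):
    """Source classifier: minimize the empirical source risk in closed
    form (ridge-regularized least squares, assumed for stability)."""
    Xa = np.hstack([X, np.ones((len(X), 1))])
    d = Xa.shape[1]
    return np.linalg.solve(Xa.T @ Xa + reg * np.eye(d), Xa.T @ y)
```

The same `quadratic_risk` function, applied to target samples, evaluates the source classifier in the target domain.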
Since the source classifier does not incorporate any part of the target domain, it is entirely naive of it. But source domains are chosen for a reason, often because they are the most similar data available, and source classifiers are subsequently regarded as the best alternative for classifying the target domain. To evaluate \hat{h}_S in the target domain, its risk with respect to the target samples (z, u) is computed:

\hat{R}(\hat{h}_S \mid z, u) = \frac{1}{m} \sum_{j=1}^{m} \ell(\hat{h}_S(z_j), u_j). \qquad (3)

We argue that adaptive classifiers should never perform worse than source classifiers. In other words, they should never achieve a larger target risk.

Contrast
We formalize the desire to never achieve a larger target risk by directly comparing the target risk of a potential alternative classifier h with the target risk of the source classifier. If we subtract the target risk of the source classifier, then we can argue that the resulting function should never be positive:

\hat{R}(h \mid z, u) - \hat{R}(\hat{h}_S \mid z, u) \leq 0. \qquad (4)

If this contrast between risk functions is used as a minimization objective, i.e., \hat{h} = \arg\min_{h \in H} \hat{R}(h \mid z, u) - \hat{R}(\hat{h}_S \mid z, u), then the target risk of the resulting classifier is bounded above by the risk of the source classifier: \hat{R}(\hat{h} \mid z, u) \leq \hat{R}(\hat{h}_S \mid z, u). Classifiers that lead to larger target risks are not valid outcomes of this minimization procedure.

Robustness
Eq. (4) still relies on the target labels u, which are unknown during training. Instead of u, we use a worst-case labeling, obtained by maximizing the risk with respect to a hypothetical labeling q. For any classifier h, the risk with respect to this worst-case labeling is always at least as large as the risk with respect to the true target labeling:

\hat{R}(h \mid z, u) \leq \max_{q} \hat{R}(h \mid z, q). \qquad (5)

Maximizing over a set of discrete labels is a combinatorial problem and, unfortunately, computationally expensive. To avoid this, we apply a relaxation by considering a soft labeling, q_{jk} \in [0, 1] with \sum_{k=1}^{K} q_{jk} = 1. This means that q_j is a vector of K elements that sum to 1; in other words, a point on the K-1 simplex \Delta_{K-1}. For m samples, the Cartesian product of m simplices is taken: q \in \Delta_{K-1}^{m}. By optimizing with respect to a worst-case labeling, the estimator becomes more robust to uncertainty over the target labels [17].
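The soft-label relaxation above can be sketched with a per-class loss matrix. Because the risk is linear in q, maximizing over the product of simplices is attained at a vertex, i.e., at a hard labeling, which is why the relaxation loses nothing. The helper names below are hypothetical, and the same functions apply when `losses` holds the contrast (proposal losses minus source losses) rather than a single classifier's losses.

```python
import numpy as np

def soft_risk(losses, q):
    """Average loss under a soft labeling.
    losses: (m, K) array; losses[j, k] is the loss on sample j if its
            label were class k.
    q:      (m, K) array of soft labels; each row lies on the simplex."""
    return np.mean(np.sum(q * losses, axis=1))

def worst_case_q(losses):
    """Since the risk is linear in q, the maximizer over the product of
    simplices sits at a vertex: each sample takes the class with the
    largest loss."""
    m, K = losses.shape
    q = np.zeros((m, K))
    q[np.arange(m), np.argmax(losses, axis=1)] = 1.0
    return q
```

For any true hard labeling u (encoded one-hot), `soft_risk(losses, worst_case_q(losses))` upper-bounds `soft_risk(losses, u)`, which is exactly the inequality in Eq. (5).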

Target Contrastive Pessimistic risk
Combining the contrast between risk functions from (4) with the worst-case labeling q from (5) produces the following risk function:

\hat{R}_{TCP}(h, q \mid z) = \hat{R}(h \mid z, q) - \hat{R}(\hat{h}_S \mid z, q). \qquad (6)

We refer to the risk in Eq. (6) as the Target Contrastive Pessimistic (TCP) risk. Minimizing with respect to a classifier h and maximizing with respect to a hypothetical labeling q produces the new TCP target classifier:

\hat{h}_{TCP} = \underset{h \in H}{\arg\min}\; \underset{q \in \Delta_{K-1}^{m}}{\max}\; \hat{R}_{TCP}(h, q \mid z). \qquad (7)

Note that the TCP risk only considers the performance on the target domain. More precisely, it considers the performance on the given samples from the target domain and is, in this sense, a transductive approach [12,30]. It is different from the risk formulations in [19,32], and those mentioned in Section 2, because those incorporate performance on the source domain as well. Our formulation focuses purely on the performance gain we can achieve over the source classifier, in terms of target risk.
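A minimal sketch of the TCP risk in Eq. (6), assuming both classifiers have been precomputed into (m, K) per-class loss matrices on the target samples (the function name is our own):

```python
import numpy as np

def tcp_risk(losses_h, losses_S, q):
    """Target Contrastive Pessimistic risk, Eq. (6): contrast between the
    proposal's and the source classifier's soft-label target risks.
    losses_h, losses_S: (m, K) arrays where entry [j, k] is the loss on
    target sample j if its label were class k; q: (m, K) soft labels."""
    return np.mean(np.sum(q * (losses_h - losses_S), axis=1))
```

A proposal whose per-class losses are nowhere larger than the source classifier's has a non-positive TCP risk for every labeling q, which is the guarantee the estimator exploits.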

Optimization
If the loss function is restricted to be globally convex and the hypothesis space H to be a convex set, then the TCP risk will be globally convex with respect to h and there will be a unique optimum for h . The TCP risk is linear with respect to q and the optimum need not be unique for q . But the combined minimax objective will be globally convex-linear, which guarantees the existence of a saddle point, i.e., a unique optimum with respect to both h and q [7] .
Finding this saddle point can be done by first performing a gradient descent step according to the partial derivative with respect to h, followed by a gradient ascent step according to the partial derivative with respect to q. However, the ascent step can cause the updated q to leave the simplex. In order to enforce the constraint, the updated q is projected back onto the simplex. The projection P maps a point a outside the simplex to the closest point b on the simplex in terms of Euclidean distance: P(a) = \arg\min_{b \in \Delta_{K-1}} \| a - b \|_2 [22]. Unfortunately, the projection step complicates the computation of the step size, which we therefore replace by a learning rate \alpha_t that decreases over iterations t. This results in the overall update:

h^{t+1} = h^{t} - \alpha_t \nabla_h \hat{R}_{TCP}(h^{t}, q^{t} \mid z), \qquad q^{t+1} = P\!\left( q^{t} + \alpha_t \nabla_q \hat{R}_{TCP}(h^{t+1}, q^{t} \mid z) \right). \qquad (8)

A gradient descent-ascent procedure for globally convex-linear objectives is guaranteed to converge to a saddle point (cf. Proposition 4.4 and Corollary 4.5 of [7]).
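The projection P and one descent-ascent update can be sketched as follows. The simplex projection uses the standard sorting-based algorithm; the flat parameter vector, the supplied gradient arrays, and the function names are assumptions made for the sketch, not details from the paper.

```python
import numpy as np

def project_simplex(a):
    """Euclidean projection of vector a onto the probability simplex,
    argmin_{b in simplex} ||a - b||_2 (sorting-based algorithm)."""
    s = np.sort(a)[::-1]                       # sort descending
    css = np.cumsum(s)
    # Largest index where the shifted coordinate stays positive.
    rho = np.nonzero(s + (1.0 - css) / np.arange(1, len(a) + 1) > 0)[0][-1]
    lam = (1.0 - css[rho]) / (rho + 1.0)       # shift to sum to one
    return np.maximum(a + lam, 0.0)

def step(theta, q, grad_theta, grad_q, alpha):
    """One descent (theta) / ascent (q) iteration with projection.
    grad_theta, grad_q: partial derivatives of the TCP risk;
    alpha: learning rate, decreasing over iterations."""
    theta = theta - alpha * grad_theta
    q = q + alpha * grad_q
    q = np.apply_along_axis(project_simplex, 1, q)  # row-wise projection
    return theta, q
```

Iterating `step` with a decaying `alpha` converges to the saddle point for the convex-linear TCP objective.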

Discriminant analysis
Interestingly, for classical discriminant analysis (DA), it can be shown that the TCP risk produces parameter estimates with a strictly smaller target risk than that of the source classifier. Discriminant analysis models the data from each class with a Gaussian distribution, weighted by a class prior: \pi_k \mathcal{N}(x \mid \mu_k, \Sigma_k) [11]. We use the shorthand notation \theta_k = (\pi_k, \mu_k, \Sigma_k) for the parameters. The model is expressed in an empirical risk minimization formulation by taking the negative log-likelihood as the loss function: \ell(\theta \mid x_i, y_i) = -\log \left[ \pi_{y_i} \mathcal{N}(x_i \mid \mu_{y_i}, \Sigma_{y_i}) \right].

Quadratic discriminant analysis
If each class is modeled with a separate covariance matrix, the resulting classifier is a quadratic function of the difference in means and covariances, and is hence called quadratic discriminant analysis (QDA). For target data z and probabilistic labels q, the loss is formulated as:

\ell(\theta \mid z_j, q_j) = -\sum_{k=1}^{K} q_{jk} \log \left[ \pi_k \mathcal{N}(z_j \mid \mu_k, \Sigma_k) \right]. \qquad (9)

Note that the loss is now expressed in terms of the classifier parameters \theta, as opposed to the classifier h. Plugging the loss from (9) into (6), the TCP-QDA risk becomes:

\hat{R}_{TCP}(\theta, q \mid z) = \frac{1}{m} \sum_{j=1}^{m} \ell(\theta \mid z_j, q_j) - \ell(\hat{\theta}_S \mid z_j, q_j), \qquad (10)

where the estimate itself is:

\hat{\theta}_{TCP} = \underset{\theta \in \Theta}{\arg\min}\; \underset{q \in \Delta_{K-1}^{m}}{\max}\; \hat{R}_{TCP}(\theta, q \mid z). \qquad (11)

Minimization with respect to \theta has a closed-form solution for discriminant analysis models. For each class, the parameter estimates are:

\pi_k = \frac{1}{m} \sum_{j=1}^{m} q_{jk}, \qquad \mu_k = \frac{\sum_{j=1}^{m} q_{jk} z_j}{\sum_{j=1}^{m} q_{jk}}, \qquad \Sigma_k = \frac{\sum_{j=1}^{m} q_{jk} (z_j - \mu_k)(z_j - \mu_k)^{\top}}{\sum_{j=1}^{m} q_{jk}}. \qquad (12)

Keeping \theta fixed, the gradient with respect to q_{jk} is:

\frac{\partial \hat{R}_{TCP}}{\partial q_{jk}} = -\frac{1}{m} \left( \log\left[ \pi_k \mathcal{N}(z_j \mid \mu_k, \Sigma_k) \right] - \log\left[ \hat{\pi}_{S,k} \mathcal{N}(z_j \mid \hat{\mu}_{S,k}, \hat{\Sigma}_{S,k}) \right] \right). \qquad (13)

Fig. 2 visualizes the difference between the source classifier and our TCP-QDA classifier. On the left, the source classifier is applied to the target data from Section 2.2. On the right, the TCP-QDA classifier is applied to the same data. Note that it has shifted upwards to better fit the target samples, achieving a smaller risk than the source classifier.
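The closed-form minimization over the QDA parameters for a given soft labeling can be sketched as follows; the function name and the optional diagonal regularizer are our own additions.

```python
import numpy as np

def qda_estimates(Z, q, reg=0.0):
    """Closed-form estimates for QDA under a soft labeling q:
    class priors, means, and per-class covariances, each weighted by the
    label probabilities q[:, k]."""
    m, D = Z.shape
    K = q.shape[1]
    pi = q.sum(axis=0) / m                        # weighted class priors
    mu = (q.T @ Z) / q.sum(axis=0)[:, None]       # weighted class means
    Sigma = np.empty((K, D, D))
    for k in range(K):
        Zc = Z - mu[k]                            # center on class mean
        Sigma[k] = (q[:, k, None] * Zc).T @ Zc / q[:, k].sum()
        Sigma[k] += reg * np.eye(D)               # optional regularization
    return pi, mu, Sigma
```

With a hard (one-hot) q these reduce to the ordinary per-class sample statistics.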

Regularization
One of the properties of a discriminant analysis model is that it requires the estimated covariance matrix \Sigma_k to be non-singular. It is possible for the maximizer over q in TCP-QDA to assign fewer samples than dimensions to one of the classes, causing the covariance matrix for that class to be singular. To prevent this, we regularize its estimation by enforcing a lower bound on the eigenvalues of the estimated covariance matrix.
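The eigenvalue lower bound can be sketched as follows; the floor value is an assumed hyperparameter and the function name is hypothetical.

```python
import numpy as np

def floor_eigenvalues(Sigma, floor=1e-3):
    """Enforce a lower bound on the eigenvalues of a symmetric covariance
    estimate, keeping it non-singular even when a class receives fewer
    effective samples than dimensions."""
    vals, vecs = np.linalg.eigh(Sigma)
    # Reassemble with eigenvalues clipped from below.
    return (vecs * np.maximum(vals, floor)) @ vecs.T
```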

Linear discriminant analysis
If the model is constrained to share a covariance matrix between classes, the resulting classifier is a linear function of the difference in means and is hence termed linear discriminant analysis (LDA). This constraint is imposed through the prior-weighted sum over the class covariance matrices: \Sigma = \sum_{k=1}^{K} \pi_k \Sigma_k.
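The shared covariance is a one-line reduction over the per-class estimates (a sketch; the function name is our own):

```python
import numpy as np

def lda_covariance(pi, Sigma):
    """Shared LDA covariance: prior-weighted sum of the per-class
    covariance estimates, Sigma = sum_k pi_k * Sigma_k."""
    return np.einsum('k,kij->ij', pi, Sigma)
```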

Performance guarantee
For the discriminant analysis model, the TCP parameter estimator obtains a strictly smaller target risk than the source classifier (Theorem 1). In other words, this parameter estimator is guaranteed to improve its performance, on the given target samples and in terms of risk, over the source classifier. This is the first domain adaptation parameter estimator for which such a guarantee can be provided.
The reader is referred to Appendix A for the proof. It follows similar steps as a guarantee for discriminant analysis in semi-supervised learning [20]. Note that as long as the same amount of regularization is added to both the source and the TCP estimator, the strictly smaller risk also holds for a regularized model.

Experiments
We see the TCP risk formulation from Section 3, together with Theorem 1, as our main contributions. Of course, it is still of interest to see how other approaches compare to ours. We compare the performance of our classifiers with that of some well-known and robust domain-adaptive classifiers. We implemented Transfer Component Analysis (TCA) [24], Kernel Mean Matching (KMM) [14], Robust Covariate Shift Adjustment (RCSA) [32] and the Robust Bias-Aware (RBA) classifier [19]. TCA and KMM make explicit assumptions. TCA assumes that there are latent factors onto which the data can be projected such that the distributions become more similar, while the original properties such as class separability are preserved; we trained a logistic regressor on the source data mapped onto the transfer components. KMM assumes that the posterior distributions in each domain are equal and that the support of the target distribution is contained within the support of the source distribution; we trained both a weighted logistic regressor and a weighted least-squares classifier using the importance-weights estimated by KMM, and report the better-performing of the two, namely least-squares. RCSA also assumes equal posterior distributions, but employs worst-case importance-weight estimation to be robust to weight estimation errors; we used the authors' implementation, which trains a weighted support vector machine using the estimated worst-case weights. RBA assumes that the moments of the source classifier's predictions match those of the target classifier; in our implementation, only the first moments are constrained to match. As baselines, we included a non-adaptive linear (S-LDA) and quadratic (S-QDA) discriminant analysis model trained on the source domain. All target samples are given, unlabeled, to the adaptive classifiers. The classifiers make predictions for those given target samples and their performance is evaluated against those target samples' true labels.
Performance is measured in terms of Area Under the ROC-curve (AUC). All methods are trained using L2-regularization. Since there is no labeled target data available for validation, we set the regularization parameter to a small value, namely 0.01.

Data sets
We performed a set of experiments on two data sets that are geographically split into domains. In the first problem, the goal is to predict whether it will rain the following day, based on 22 features including wind speed, humidity, and sunshine (the data set is part of the R package Rattle [33]). The measurements are taken over a period of 200 days from Australian weather stations located in Darwin, Perth, Brisbane, and Melbourne. Each station can be considered a domain because the feature spaces are equal but the underlying data-generating distributions are different. For instance, the average temperature is several degrees higher in Darwin than in Melbourne.
The second data set is from the UCI machine learning repository [18]. The goal is to predict heart disease in patients from 4 different hospitals. These are located in Hungary (294 patients), Switzerland (123 patients), California (200 patients) and Ohio (303 patients). Each hospital can be considered a domain because patients are measured on the same clinical features but the local patient populations differ. For example, patients in Hungary are on average younger than patients from Switzerland (48 versus 55 years). Heart disease is predicted from 13 clinical features such as age, sex, cholesterol level and chest pain type. Both data sets are preprocessed by first imputing missing values with zeros and then z-scoring each feature. Table 1 compares the AUCs of various classifiers on the WeatherAUS data set. All combinations of using one station as the source domain and another station as the target domain are taken.
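The preprocessing described above, zero imputation followed by z-scoring, can be sketched as follows (the guard for constant features is our own addition):

```python
import numpy as np

def preprocess(X):
    """Impute missing values with zeros, then z-score each feature."""
    X = np.where(np.isnan(X), 0.0, X)  # zero imputation
    mu = X.mean(axis=0)
    sd = X.std(axis=0)
    sd[sd == 0] = 1.0                  # guard against constant features
    return (X - mu) / sd
```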

Results
Firstly, as a collective, the robust methods (TCP-QDA, TCP-LDA, RBA, RCSA) rather consistently outperform the non-robust methods (TCA, KMM, S-LDA, S-QDA), though it certainly is not the case that every robust method outperforms every non-robust one. There is also one exception where S-QDA actually performs best of all. Secondly, RCSA outperforms KMM in all cases, indicating that it is either difficult to estimate appropriate importance-weights or difficult to train the importance-weighted classifier given KMM's weights. Thirdly, TCP-LDA outperforms S-LDA in eight out of twelve cases, and TCP-QDA is better than S-QDA in eleven of the twelve. Lastly, S-LDA occasionally outperforms the non-TCP adaptive classifiers, most notably in the three cases where S = M. For S-QDA this happens in all cases except S = D. When S = M and T = P, we find that S-QDA performs best overall. Particularly where S-LDA is concerned, these results indicate that adaptation strategies can also be detrimental to performance.
Table 2 lists the AUCs of each classifier on the heart disease data set. Overall, the AUCs are lower here, indicating that these settings are more difficult than those of the weather stations. Firstly, TCP-LDA generally outperforms TCP-QDA here, indicating that most problem settings are linearly separable and the additional flexibility of QDA is not helpful. Secondly, the differences in performance between S-LDA and S-QDA and their TCP versions are clearly less appreciable; in most cases the differences seem insignificant. Exceptions occur when S = S and T = O, in which case the original methods actually perform clearly better, and when S = S and T = H, in which case the TCP adaptations do. Thirdly, RCSA does not always outperform KMM, but since both KMM and RCSA perform worse than chance on a few occasions, it does seem that the assumption of equal posterior distributions is invalid in many cases.
Fourthly, TCA's performance also varies around chance level, which means that it is difficult to recover a common latent representation here. Lastly, S-LDA and S-QDA outperform the adaptive classifiers on a number of occasions again.

Discussion
Although, by construction, the TCP classifiers are never worse than the source classifier in terms of empirical risk, they will not automatically lead to improvements in the error rate. This is due to the fact that a surrogate loss function is used during training: the classifier that minimizes the surrogate loss need not be the classifier that minimizes the 0-1 loss [2,4,21]. Performance guarantees similar to those we have given with respect to empirical risk cannot be given with respect to classification error, because the 0-1 loss cannot be directly optimized.
Although our TCP estimator is guaranteed to never perform worse than the source classifier, it may not perform well if the source classifier is a poor choice to begin with. Of course, if no decent source classifier can be formed, then one can wonder whether any kind of adaptation will be able to construct a satisfactory target classifier, unless particularly reliable assumptions can be made. Given that reliable assumptions can be made, our TCP estimator could still be useful. Rather than the original supervised source classifier, one can, in principle, use any adaptive classifier in combination with TCP parameter estimation. In that case, the TCP parameter estimator would still retain its guarantee to not perform worse than the classifier it is compared against, which in this case is the adaptive classifier. Potentially, this may of course lead to even better parameter estimates. A wide range of standard classifiers that rely on the optimization of a convex loss can be incorporated, such as least-squares or support vector machines, meaning that TCP could be combined with many adaptive classifiers. Nonconvex losses, as widely employed in this era of deep learning, are a challenge and, as yet, it is an open and interesting research question to what extent our theoretical results can be salvaged in that setting.
Another possible extension to the current estimator is to use multiple source domains. Perhaps our TCP estimator could produce better estimates than the best source estimates. One could envision contrasting the proposal classifier with the classifier producing the lowest risk from among a set of source classifiers, each trained on its own source domain. Finding the best one from among the set of source classifiers would require an additional minimization step over source domains, which would increase the computational cost. Selecting a subset of source domains in advance, could limit this increase in cost and make such an approach feasible.

Conclusion
We have designed a risk minimization formulation for a domain-adaptive classifier whose performance, in terms of empirical target risk, is always at least as good as that of the non-adaptive source classifier, without making assumptions on the relationship between the domains. This is something that no other method can guarantee. Furthermore, for the discriminant analysis case, its performance is always strictly better. As demonstrated, our Target Contrastive Pessimistic discriminant analysis model consistently performs among the strongest of the robust classifiers.

Declaration of Competing Interest
The authors declare that they have no conflict of interest.

Acknowledgment
A word of thanks goes out to the two anonymous reviewers whose feedback helped us improve the presentation of our work. We gladly acknowledge their constructive remarks and comments.

Appendix A
Proof of Theorem 1. Let \{(x_i, y_i)\}_{i=1}^{n} be a data set of size n drawn iid from a continuous distribution defined over X \times Y, and let \{(z_j, u_j)\}_{j=1}^{m} be a data set of size m drawn iid from another continuous distribution defined over X \times Y.
Consider a discriminant analysis model parameterized with \theta = (\pi_1, ..., \pi_K, \mu_1, ..., \mu_K, \Sigma_1, ..., \Sigma_K) with empirical risk defined by:

\hat{R}(\theta \mid x, y) = \frac{1}{n} \sum_{i=1}^{n} \ell(\theta \mid x_i, y_i). \qquad (A.1)

The sample covariance matrix \Sigma_k is required to be non-singular, which is guaranteed when there are more unique samples than features for every class. Let \hat{\theta}_S be the parameters estimated on the labeled source data:

\hat{\theta}_S = \underset{\theta \in \Theta}{\arg\min}\; \hat{R}(\theta \mid x, y), \qquad (A.2)

and let (\hat{\theta}_T, q^*) be the parameters and worst-case labeling estimated by mini-maximizing the Target Contrastive Pessimistic risk:

(\hat{\theta}_T, q^*) = \underset{\theta \in \Theta}{\arg\min}\; \underset{q \in \Delta_{K-1}^{m}}{\arg\max}\; \hat{R}(\theta \mid z, q) - \hat{R}(\hat{\theta}_S \mid z, q). \qquad (A.3)

Firstly, keeping q fixed, the minimization over the contrast between the target risk of the proposal parameters \theta and the source parameters \hat{\theta}_S is upper bounded by 0, because both sets of parameters are elements of the same parameter space, \theta, \hat{\theta}_S \in \Theta:

\underset{\theta \in \Theta}{\min}\; \hat{R}(\theta \mid z, q) - \hat{R}(\hat{\theta}_S \mid z, q) \leq 0, \qquad (A.4)

for all choices of q. Since \theta can always be set to \hat{\theta}_S, values for \theta that would result in a larger target risk than that of \hat{\theta}_S are not valid minimizers of the contrast. Considering that the contrast is upper bounded for any labeling q, it is also upper bounded by 0 for the worst-case labeling:

\underset{\theta \in \Theta}{\min}\; \underset{q \in \Delta_{K-1}^{m}}{\max}\; \hat{R}(\theta \mid z, q) - \hat{R}(\hat{\theta}_S \mid z, q) \leq 0, \qquad (A.5)

and since \hat{\theta}_T is the minimizer of the left-hand side of (A.5):

\underset{q \in \Delta_{K-1}^{m}}{\max}\; \hat{R}(\hat{\theta}_T \mid z, q) - \hat{R}(\hat{\theta}_S \mid z, q) \leq 0. \qquad (A.6)

Secondly, keeping \theta fixed, the contrast with respect to the true labeling u is always less than or equal to the contrast with respect to the worst-case labeling:

\hat{R}(\hat{\theta}_T \mid z, u) - \hat{R}(\hat{\theta}_S \mid z, u) \leq \underset{q \in \Delta_{K-1}^{m}}{\max}\; \hat{R}(\hat{\theta}_T \mid z, q) - \hat{R}(\hat{\theta}_S \mid z, q). \qquad (A.7)

Since q^* is the maximizer for \hat{\theta}_T as parameters, we can write:

\underset{q \in \Delta_{K-1}^{m}}{\max}\; \hat{R}(\hat{\theta}_T \mid z, q) - \hat{R}(\hat{\theta}_S \mid z, q) = \hat{R}(\hat{\theta}_T \mid z, q^*) - \hat{R}(\hat{\theta}_S \mid z, q^*). \qquad (A.8)

Combining Inequalities (A.6) to (A.8) gives:

\hat{R}(\hat{\theta}_T \mid z, u) - \hat{R}(\hat{\theta}_S \mid z, u) \leq 0. \qquad (A.9)

Bringing the second term on the left-hand side to the right-hand side shows that the target risk of the TCP estimate is always less than or equal to the target risk of the source classifier:

\hat{R}(\hat{\theta}_T \mid z, u) \leq \hat{R}(\hat{\theta}_S \mid z, u). \qquad (A.10)

Equality in (A.10) occurs with probability 0, which can be shown through the parameter estimators.
The total mean for the source classifier consists of the prior-weighted combination of the class means, resulting in the overall source sample average:

\sum_{k=1}^{K} \hat{\pi}_{S,k} \hat{\mu}_{S,k} = \frac{1}{n} \sum_{i=1}^{n} x_i.

The total mean for the TCP-QDA estimator is similarly defined and, since q^* consists of probabilities whose sum over classes equals 1 for each sample, results in the overall target sample average:

\sum_{k=1}^{K} \hat{\pi}_{T,k} \hat{\mu}_{T,k} = \frac{1}{m} \sum_{j=1}^{m} \sum_{k=1}^{K} q^*_{jk} z_j = \frac{1}{m} \sum_{j=1}^{m} z_j.

Equality of the two total means would therefore require the source sample average to coincide exactly with the target sample average, an event that occurs with probability 0 for samples drawn from continuous distributions. □