Interpolation Consistency Training for Semi-Supervised Learning

We introduce Interpolation Consistency Training (ICT), a simple and computationally efficient algorithm for training deep neural networks in the semi-supervised learning paradigm. ICT encourages the prediction at an interpolation of unlabeled points to be consistent with the interpolation of the predictions at those points. In classification problems, ICT moves the decision boundary to low-density regions of the data distribution. Our experiments show that ICT achieves state-of-the-art performance when applied to standard neural network architectures on the CIFAR-10 and SVHN benchmark datasets. Our theoretical analysis shows that ICT corresponds to a certain type of data-adaptive regularization with unlabeled points, which reduces overfitting to labeled points under high confidence values.


Introduction
Deep learning achieves excellent performance in supervised learning tasks where labeled data is abundant (LeCun et al., 2015). However, labeling large amounts of data is often prohibitive due to time, financial, and expertise constraints. As machine learning permeates an increasing variety of domains, the number of applications where unlabeled data is voluminous but labels are scarce grows. For instance, consider recognizing documents in extinct languages, where a machine learning system has access to only a few labels produced by highly skilled scholars (Clanuwat et al., 2018).
The goal of Semi-Supervised Learning (SSL) (Chapelle et al., 2010) is to leverage large amounts of unlabeled data to improve the performance of supervised learning over small datasets. Often, SSL algorithms use unlabeled data to learn additional structure about the input distribution. For instance, the existence of cluster structure in the input distribution could hint at the separation of samples into different labels. This is often called the cluster assumption: if two samples belong to the same cluster in the input distribution, then they are likely to belong to the same class. The cluster assumption is equivalent to the low-density separation assumption: the decision boundary should lie in low-density regions of the input distribution.

Fig. 1. Interpolation Consistency Training (ICT) applied to the ''two moons'' dataset, when three labels per class (large dots) and a large amount of unlabeled data (small dots) are available. Compared to supervised learning (red), ICT encourages a decision boundary traversing a low-density region that better reflects the structure of the unlabeled data. Both methods employ a multilayer perceptron with three hidden ReLU layers of twenty neurons each.
Consistency-regularization methods such as Virtual Adversarial Training (VAT) (Miyato et al., 2018) enforce stable predictions under adversarial perturbations, which requires an additional gradient computation at each step. This additional computation makes VAT and other related methods, such as Park et al. (2018), less appealing in situations where unlabeled data is available in large quantities. Furthermore, recent research has shown that training with adversarial perturbations can hurt generalization performance (Nakkiran, 2019; Tsipras et al., 2018).
Our experimental results on the benchmark datasets CIFAR-10 and SVHN, using the neural network architectures CNN-13 (Laine & Aila, 2017; Luo et al., 2018; Miyato et al., 2018; Park et al., 2018; Tarvainen & Valpola, 2017) and WRN-28-2 (Oliver et al., 2018), outperform (or are competitive with) state-of-the-art methods. ICT is simpler and more computationally efficient than several of the recent SSL algorithms, making it an appealing approach to SSL. Fig. 1 illustrates how ICT learns a decision boundary traversing a low-density region in the ''two moons'' problem.
After the publication of an earlier version of this paper (Verma et al., 2019), similar ideas have been explored in Berthelot, Carlini et al. (2019) and Berthelot et al. (2020), with some additional techniques compared to those proposed in Verma et al. (2019). Although all of these methods work well in practice, their theoretical underpinning is not well understood. In this work, we additionally provide a novel theory of ICT to understand how and when ICT can succeed or fail to effectively utilize unlabeled points.

Interpolation consistency training
Given the mixup operation Mix_λ(a, b) = λ · a + (1 − λ) · b, Interpolation Consistency Training (ICT) trains a prediction model f_θ to provide consistent predictions at interpolations of unlabeled points:

f_θ(Mix_λ(u_j, u_k)) ≈ Mix_λ(f_θ′(u_j), f_θ′(u_k)),

where θ′ is a moving average of θ (Fig. 2). But why do interpolations between unlabeled samples provide a good consistency perturbation for semi-supervised training?
To begin with, observe that the most useful samples on which the consistency regularization should be applied are the samples near the decision boundary. Adding a small perturbation δ to such low-margin unlabeled samples u j is likely to push u j + δ over the other side of the decision boundary. This would violate the low-density separation assumption, making u j + δ a good place to apply consistency regularization. These violations do not occur at high-margin unlabeled points that lie far away from the decision boundary.
Back to low-margin unlabeled points u j , how can we find a perturbation δ such that u j and u j + δ lie on opposite sides of the decision boundary? Although tempting, using random perturbations is an inefficient strategy, since the subset of directions approaching the decision boundary is a tiny fraction of the ambient space.
Instead, consider interpolations u_j + δ = Mix_λ(u_j, u_k) towards a second randomly selected unlabeled example u_k. Then, the two unlabeled samples u_j and u_k can either: (1) lie in the same cluster, (2) lie in different clusters but belong to the same class, or (3) lie in different clusters and belong to different classes.
Under the cluster assumption, the probability of (1) decreases as the number of classes increases. The probability of (2) is low if we assume that the number of clusters per class is balanced. Thus, the probability of (3) is the highest. Then, assuming that one of (u_j, u_k) lies near the decision boundary (making it a good candidate for enforcing consistency), it is likely (because of the high probability of (3)) that the interpolation towards u_k points towards a region of low density, followed by the cluster of the other class. Since this is a good direction in which to move the decision boundary, the interpolation is a good perturbation for consistency-based regularization.
Our exposition has argued so far that interpolations between random unlabeled samples are likely to fall in low-density regions. Thus, such interpolations are good locations where consistency-based regularization could be applied. But how should we label those interpolations? Unlike random or adversarial perturbations of single unlabeled examples u_j, our scheme involves two unlabeled examples (u_j, u_k). Intuitively, we would like to push the decision boundary as far as possible from the class boundaries, as it is well known that decision boundaries with large margin generalize better (Shawe-Taylor et al., 1996). In the supervised learning setting, one method to achieve large-margin decision boundaries is mixup (Zhang et al., 2018). In mixup, the decision boundary is pushed far away from the class boundaries by enforcing the prediction model to change linearly between samples. This is done by training the model f_θ to predict Mix_λ(y, y′) at location Mix_λ(x, x′), for random pairs of labeled samples ((x, y), (x′, y′)). Here we extend mixup to the semi-supervised learning setting by training the model f_θ to predict the ''fake label'' Mix_λ(f_θ(u_j), f_θ(u_k)) at location Mix_λ(u_j, u_k). In order to perform a more conservative consistency regularization, we encourage the model f_θ to predict the fake label Mix_λ(f_θ′(u_j), f_θ′(u_k)) at location Mix_λ(u_j, u_k), where θ′ is a moving average of θ, also known as a mean-teacher (Tarvainen & Valpola, 2017).

Fig. 2. Interpolation Consistency Training (ICT) learns a student network f_θ in a semi-supervised manner. To this end, ICT uses a mean-teacher f_θ′, where the teacher parameters θ′ are an exponential moving average of the student parameters θ. During training, the student parameters θ are updated to encourage consistent predictions.
We are now ready to describe the proposed Interpolation Consistency Training (ICT) in detail. Consider access to labeled samples (x_i, y_i) ∼ D_L, drawn from the joint distribution P_XY(X, Y). Also, consider access to unlabeled samples u_j, u_k ∼ D_UL, drawn from the marginal distribution P_X(X) = P_XY(X, Y) / P_{Y|X}(Y|X). Our learning goal is to train a model f_θ able to predict Y from X. Using stochastic gradient descent, at each iteration t, we update the parameters θ to minimize

L(θ) = L_S(θ) + w(t) · L_US(θ),

where L_S is the usual cross-entropy supervised learning loss over labeled samples D_L, and L_US is our new interpolation consistency regularization term. These two losses are computed over (labeled and unlabeled) minibatches, and the ramp function w(t) increases the importance of the consistency regularization term L_US after each iteration. To compute L_US, first sample two minibatches of unlabeled points u_j and u_k, and compute their fake labels ŷ_j = f_θ′(u_j) and ŷ_k = f_θ′(u_k), where θ′ is a moving average of θ (Tarvainen & Valpola, 2017). Second, compute the interpolation u_m = Mix_λ(u_j, u_k), as well as the model prediction at that location, ŷ_m = f_θ(u_m). Third, update the parameters θ so as to bring the prediction ŷ_m closer to the interpolation of the fake labels, Mix_λ(ŷ_j, ŷ_k). The discrepancy between the prediction ŷ_m and Mix_λ(ŷ_j, ŷ_k) can be measured using any loss; in our experiments, we use the mean squared error. Following Zhang et al. (2018), on each update we sample a random λ from Beta(α, α).
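The steps above can be sketched in a few lines of numpy. This is a minimal illustration under our own naming (`mix`, `ict_consistency_loss`), not the authors' implementation; in the full algorithm the teacher is an exponential moving average of the student, whereas the sanity check at the end simply passes the same model twice and uses the fact that a linear model's prediction at an interpolation equals the interpolation of its predictions, so the consistency loss vanishes.

```python
import numpy as np

rng = np.random.default_rng(0)

def mix(lam, a, b):
    """Mixup operation: Mix_lambda(a, b) = lam * a + (1 - lam) * b."""
    return lam * a + (1.0 - lam) * b

def ict_consistency_loss(f_student, f_teacher, u_j, u_k, alpha=1.0):
    """One evaluation of the ICT term L_US on two unlabeled minibatches."""
    lam = rng.beta(alpha, alpha)               # lambda ~ Beta(alpha, alpha)
    y_j, y_k = f_teacher(u_j), f_teacher(u_k)  # "fake labels" from the teacher
    u_m = mix(lam, u_j, u_k)                   # interpolated inputs
    y_m = f_student(u_m)                       # student prediction at u_m
    return np.mean((y_m - mix(lam, y_j, y_k)) ** 2)  # mean squared error

# Sanity check: a linear model changes linearly between samples, so the
# prediction at the interpolation equals the interpolation of predictions,
# and the consistency loss is zero (up to floating-point rounding).
w = rng.normal(size=5)
f = lambda u: u @ w
u_j, u_k = rng.normal(size=(8, 5)), rng.normal(size=(8, 5))
loss = ict_consistency_loss(f, f, u_j, u_k)
```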
In sum, the population version of our ICT term can be written as

L_US(θ) = E_{u_j, u_k ∼ P_X, λ ∼ Beta(α, α)} ℓ( f_θ(Mix_λ(u_j, u_k)), Mix_λ(f_θ′(u_j), f_θ′(u_k)) ).

ICT is summarized in Fig. 2 and Algorithm 1.

Datasets
We follow the common practice in semi-supervised learning literature (Laine & Aila, 2017;Luo et al., 2018;Miyato et al., 2018;Park et al., 2018;Tarvainen & Valpola, 2017) and conduct experiments using the CIFAR-10, SVHN, and CIFAR-100 datasets, where only a fraction of the training data is labeled, and the remaining data is used as unlabeled data. We followed the standardized procedures laid out by Oliver et al. (2018) to ensure a fair comparison.
The CIFAR-10 dataset consists of 60 000 color images each of size 32 × 32, split between 50K training and 10K test images.
This dataset has ten classes, which include images of natural objects such as cars, horses, airplanes and deer. The CIFAR-100 is similar to the CIFAR-10 dataset, except it has 100 classes with 600 images in each class. The SVHN dataset consists of 73 257 training samples and 26 032 test samples each of size 32 × 32.
Each example is a close-up image of a house number (the ten classes are the digits 0-9). We adopt the data-augmentation and pre-processing scheme that has become standard practice in the semi-supervised learning literature (Athiwaratkun et al., 2019; Laine & Aila, 2017; Luo et al., 2018; Miyato et al., 2018; Sajjadi et al., 2016; Tarvainen & Valpola, 2017). More specifically, for CIFAR-10, we first zero-pad each image with 2 pixels on each side. Then, the resulting image is randomly cropped to produce a new 32 × 32 image. Next, the image is horizontally flipped with probability 0.5, followed by per-channel standardization and ZCA preprocessing. For SVHN, we zero-pad each image with 2 pixels on each side and then randomly crop the resulting image to produce a new 32 × 32 image, followed by zero-mean and unit-variance image whitening.
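The pad, random-crop, and random-flip steps can be sketched as follows; this is a minimal numpy sketch under our own naming (`augment`), and the standardization/ZCA steps are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(img, pad=2):
    """Zero-pad an (H, W, C) image, take a random H x W crop,
    and horizontally flip it with probability 0.5."""
    h, w, _ = img.shape
    padded = np.pad(img, ((pad, pad), (pad, pad), (0, 0)))
    top = rng.integers(0, 2 * pad + 1)    # random crop offsets in [0, 2*pad]
    left = rng.integers(0, 2 * pad + 1)
    crop = padded[top:top + h, left:left + w]
    if rng.random() < 0.5:
        crop = crop[:, ::-1]              # horizontal flip
    return crop

x = rng.normal(size=(32, 32, 3))
out = augment(x)                          # same shape as the input
```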

Models
We conduct our experiments using the CNN-13 and Wide-ResNet-28-2 architectures. The CNN-13 architecture has been adopted as the standard benchmark architecture in recent state-of-the-art SSL methods (Laine & Aila, 2017; Luo et al., 2018; Miyato et al., 2018; Park et al., 2018; Tarvainen & Valpola, 2017). We use its variant (i.e., without additive Gaussian noise in the input layer) as implemented in Athiwaratkun et al. (2019). We also removed the Dropout noise to isolate the improvement achieved through our method. The other SSL methods in Tables 1 and 2 use Dropout noise, which gives them additional regularizing capability. Despite this, our method outperforms the other methods in several experimental settings. Oliver et al. (2018) performed a systematic study using Wide-ResNet-28-2 (Zagoruyko & Komodakis, 2016), a specific residual network architecture, with an extensive hyperparameter search, to compare the performance of various consistency-based semi-supervised algorithms. We evaluate ICT using this same setup as a means toward a fair comparison with these algorithms.
Algorithm 1 (sketch): for each iteration, sample a labeled minibatch and two unlabeled minibatches; compute pseudo-labels with the mean-teacher; compute the interpolation and the consistency loss (e.g., mean squared error); update θ with a gradient step (e.g., SGD, Adam); end for; return θ.

Table 1. Error rates (%) on CIFAR-10 using the CNN-13 architecture. We ran three trials for ICT.

Implementation details
We used SGD with Nesterov momentum for all of our experiments. For the experiments in Tables 1 and 2, we trained for 400 epochs; for the experiments in Table 3, we trained for 600 epochs. The initial learning rate was set to 0.1 for CIFAR-10 and SVHN and to 0.25 for CIFAR-100, and was then annealed using the cosine annealing technique proposed in Loshchilov and Hutter (2016) and used by Tarvainen and Valpola (2017). The momentum parameter was set to 0.9. We used an L2 regularization coefficient of 0.0001 and a batch size of 100 in our experiments.
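The annealing schedule can be sketched as below; this is a minimal sketch of the cosine schedule of Loshchilov and Hutter (2016) without warm restarts, decaying from the initial rate to zero, and the function name `cosine_lr` is our own.

```python
import math

def cosine_lr(t, total_epochs, lr0):
    """Cosine annealing: lr(t) = 0.5 * lr0 * (1 + cos(pi * t / T))."""
    return 0.5 * lr0 * (1.0 + math.cos(math.pi * t / total_epochs))

start = cosine_lr(0, 400, 0.1)      # full learning rate at epoch 0
mid = cosine_lr(200, 400, 0.1)      # half the rate at the midpoint
end = cosine_lr(400, 400, 0.1)      # decays to zero at the final epoch
```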
In each experiment, we report mean and standard deviation across three independently run trials.
The consistency coefficient w(t) is ramped up from its initial value of 0.0 to its maximum value at one-fourth of the total number of epochs, using the same sigmoid schedule as Tarvainen and Valpola (2017). We used the MSE loss for computing the consistency loss, following Laine and Aila (2017) and Tarvainen and Valpola (2017). We set the decay coefficient for the mean-teacher to 0.999, following Tarvainen and Valpola (2017).
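These two updates can be sketched as follows. We assume the exp(-5(1-p)^2) sigmoid-shaped ramp of Tarvainen and Valpola (2017); the function names `ramp_up` and `ema_update` are our own.

```python
import math

def ramp_up(t, ramp_epochs, w_max):
    """Sigmoid-shaped ramp-up of the consistency weight:
    w(t) = w_max * exp(-5 * (1 - p)^2), with p = min(t, T_r) / T_r."""
    p = min(t, ramp_epochs) / ramp_epochs
    return w_max * math.exp(-5.0 * (1.0 - p) ** 2)

def ema_update(theta_teacher, theta_student, decay=0.999):
    """Mean-teacher update: theta' <- decay * theta' + (1 - decay) * theta."""
    return [decay * a + (1.0 - decay) * b
            for a, b in zip(theta_teacher, theta_student)]

w0 = ramp_up(0, 100, 100.0)      # small initial weight: w_max * exp(-5)
w_full = ramp_up(100, 100, 100.0)  # reaches w_max at the end of the ramp
teacher = ema_update([1.0], [0.0])  # 0.999 * 1.0 + 0.001 * 0.0
```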
We conducted a hyperparameter search over the two hyperparameters introduced by our method: the maximum value of the consistency coefficient w(t), and the parameter α of the distribution Beta(α, α) used to sample λ (for α, we searched over the values in {0.1, 0.2, 0.5, 1.0}). We select the best hyperparameters using a validation set of 5000 and 1000 labeled samples for CIFAR-10 and SVHN, respectively. This size of the validation set is the same as that used by the other methods compared in this work. We note that in all our experiments with ICT, to compute the supervised loss, we interpolate pairs of labeled samples and their corresponding labels (as in mixup; Zhang et al., 2018). To make sure that the improvements from ICT are not only due to the supervised mixup loss, we provide a direct comparison of ICT against supervised mixup and Manifold Mixup training in Tables 1 and 2.

Table 3. Results on CIFAR-10 (4000 labels) and SVHN (1000 labels), in test error (%). All results use the same standardized architecture (WideResNet-28-2). Each experiment was run for three trials. We did not conduct any hyperparameter search and used the best hyperparameters found in the experiments of Tables 1 and 2 for CIFAR-10 (4000 labels) and SVHN (1000 labels).

Results
We provide the results for CIFAR10 and SVHN datasets using CNN-13 architecture in Tables 1 and 2, respectively.
To justify the use of an SSL algorithm, one must compare its performance against a state-of-the-art supervised learning algorithm (Oliver et al., 2018). To this end, we compare our method against two state-of-the-art supervised learning algorithms (Zhang et al., 2018; Verma et al., 2018), denoted Supervised (Mixup) and Supervised (Manifold Mixup), respectively, in Tables 1 and 2. The ICT method passes this test by a wide margin, often resulting in a two-fold reduction of the test error in the case of CIFAR-10 (Table 1) and a four-fold reduction in the case of SVHN (Table 2). Furthermore, in Table 1, we see that ICT improves on the test error of other strong SSL methods. For example, in the case of 4000 labeled samples, it improves the test error of the best-reported method by ∼25%. The best values of the max-consistency coefficient for the 1000-, 2000- and 4000-label experiments were found to be 10.0, 100.0 and 100.0, respectively, and the best values of α were found to be 0.2, 1.0 and 1.0, respectively. In general, we observed that with fewer labeled data, lower values of the max-consistency coefficient and α obtained better validation errors.
For SVHN, the test errors obtained by ICT are competitive with other state-of-the-art SSL methods (Table 2). The best values of the max-consistency coefficient and α were found to be 100 and 0.1, respectively, for all the ICT results reported in Table 2. Oliver et al. (2018) performed an extensive hyperparameter search for various consistency-regularization SSL algorithms using WRN-28-2, and they report the best test errors found for each of these algorithms. For a fair comparison of ICT against these SSL algorithms, we conduct experiments with the WRN-28-2 architecture. The results are shown in Table 3. ICT achieves improvements over the other methods on both the CIFAR-10 and SVHN datasets.
We note that unlike the other SSL methods of Tables 1-3, we do not use the Dropout regularizer in our implementations of CNN-13 and WRN-28-2. Using Dropout along with ICT may further reduce the test error.

Ablation study
• Effect of not using the mean-teacher in ICT: We note that the Π-model, VAT and VAdD methods in Tables 1 and 2 do not use a mean-teacher to make predictions on the unlabeled data. Although the mean-teacher (Tarvainen & Valpola, 2017) used in ICT does not incur any significant computation cost, one might argue that a more direct comparison with the Π-model, VAT and VAdD methods requires not using a mean-teacher. To this end, we conduct an experiment on the CIFAR-10 dataset without the mean-teacher in ICT, i.e., the predictions on the unlabeled data come from the student network f_θ itself. We obtain test errors of 19.56 ± 0.56%, 14.35 ± 0.15% and 11.19 ± 0.14% for 1000, 2000 and 4000 labeled samples, respectively (we did not conduct any hyperparameter search for these experiments and used the best hyperparameters found in the ICT experiments of Table 1). This shows that even without a mean-teacher, ICT has a major advantage over methods such as VAT (Miyato et al., 2018) and VAdD (Park et al., 2018): it does not require an additional gradient computation, yet it achieves the same level of test error.
• Effect of not having the mixup supervised loss: In Section 3.3, we noted that to compute the supervised loss, we interpolate pairs of labeled samples and their corresponding labels (the mixup supervised loss, as in Zhang et al. (2018)). Would the performance of ICT be significantly reduced by not having the mixup supervised loss? We conducted experiments with ICT on both CIFAR-10 and SVHN with the vanilla supervised loss instead.

Table 4. Results on CIFAR-100 with 100 labels per class using the CNN-13 architecture. We ran three trials for ICT. ICT was implemented with the ReLU activation and with the softplus activation function.

Results with a larger number of classes
The main experiments above are conducted on CIFAR-10 and SVHN. Each of these datasets has 10 classes for the classification task. Accordingly, we conducted additional experiments using CIFAR-100, and reported the results with the CNN-13 architecture in Table 4. Following the previous work on semi-supervised learning with CIFAR-100 and CNN-13 (Laine & Aila, 2017), we used 100 labeled points per class in this experiment. We did not conduct any hyperparameter search and used the best hyperparameters found in the experiments of CIFAR-10 as reported in Section 3.4. As can be seen in Table 4, ICT with the ReLU activation function outperformed the previous methods as expected from the above study with the CIFAR-10 and SVHN datasets. Moreover, ICT worked the best after replacing the ReLU activation function with the softplus activation function (Dugas et al., 2001): φ(z) = ln(1 + exp(κz))/κ with κ = 100. Here, the value of κ was set without any hyperparameter search and simply taken from the previous study for supervised learning with a different architecture (pre-activation ResNet with 18 layers) (Kawaguchi & Sun, 2021).

Theoretical analysis
In this section, we establish mathematical properties of ICT for binary classification with f_θ(u) ∈ R. We begin in Section 4.1 with additional notation, an introduction to real analytic functions, and a property of the Euclidean norm of the Kronecker product. Using the real analyticity and the property of the Kronecker product, we show in Section 4.2 that ICT regularizes higher-order derivatives. We conclude in Section 4.3 by showing how ICT can reduce overfitting and lead to better generalization behaviors than those without ICT.

Preliminaries
In our experiments, we did not use the mean-teacher model f_θ′ in the ablation study for CIFAR-10, nor for the two-moons dataset in any of Figs. 1 and 3-7. Those experiments consistently show the advantages of ICT even without the mean-teacher model. Indeed, the mean-teacher is not a necessary mechanism of consistency regularization in general (e.g., see equation (1) of Han et al., 2021). Thus, to understand and disentangle the essential mechanisms of ICT, we consider the same setting as those experiments: i.e., this section focuses on the mean square loss for the unlabeled data without the mean-teacher. As an empirical measure on the finite unlabeled data points, we can write L_US as an average over pairs of unlabeled points (u_i, u_j) of the squared discrepancy (f_θ(Mix_λ(u_i, u_j)) − Mix_λ(f_θ(u_i), f_θ(u_j)))², with f_θ(u) = σ(h_θ(u)), where σ(a) = 1/(1 + e^{−a}) is the sigmoid function and h_θ(u) represents the pre-activation output of the last layer of a deep neural network.
The theory developed in this section requires the function f_θ to be real analytic, which is satisfied for a large class of deep neural networks used in practice. Since a composition of real analytic functions is real analytic, we only need to require that each operation in each layer is real analytic. The convolution, affine map, skip connection, batch normalization and average pooling are all real analytic functions. Therefore, the composition of these operations preserves real analyticity. Furthermore, many common activation functions are real analytic. For example, the sigmoid σ, the hyperbolic tangent, and the softplus activations φ(z) = ln(1 + exp(κz))/κ are all real analytic, for any hyperparameter κ > 0. Here, the softplus activation can approximate the ReLU activation to any desired accuracy, since

0 < φ(z) − relu(z) ≤ ln(2)/κ for all z ∈ R,

where relu represents the ReLU activation. Indeed, the ReLU activation and the softplus activation behave similarly in numerical experiments with a large value of κ such as κ = 100, as demonstrated in previous studies (e.g., see Kawaguchi & Sun, 2021). As a further demonstration of this fact, we used the softplus activation in Fig. 3 (and Fig. 4) and the ReLU activation in Fig. 7 (and Fig. 5): with both activation functions, ICT works well and the numerical results are consistent with our theoretical predictions. This is expected because the softplus activation can approximate the ReLU function arbitrarily well by varying the value of κ ∈ (0, ∞), and our theoretical analyses hold for any κ ∈ (0, ∞). Moreover, Table 4 shows that ICT with the softplus activation works similarly to, and can outperform, ICT with the ReLU activation. Overall, the requirement of real analyticity of the function f_θ is easily satisfied in practice without degrading practical performance.
We close this subsection by introducing additional notation and proving a theoretical property of the Kronecker product.
Let Ŝ = (u_i)_{i=1}^n be a dataset of unlabeled data points and S = ((x_i, y_i))_{i=1}^m be a dataset of labeled points. Here, Ŝ contains unlabeled data points only, and is not necessarily the whole dataset used for the unsupervised loss; e.g., one can use the concatenation of Ŝ and (x_i)_{i=1}^m for the unsupervised loss. Accordingly, Ŝ and S are independent of each other. Define R̂_m(F) to be the empirical Rademacher complexity of a set of functions F, and R_m(F) to be the Rademacher complexity of the set F. We adopt the standard convention that min_{a∈∅} Ψ(a) = ∞ for any function Ψ and the empty set ∅. We define the kth-order tensor ∂^k f_θ(u) to collect all kth-order partial derivatives of f_θ evaluated at u. For example, ∂^1 f_θ(u) is the gradient of f_θ evaluated at u, and ∂^2 f_θ(u) is the Hessian of f_θ evaluated at u. For any kth-order tensor T, we write vec(T) for its vectorization. For any vector a ∈ R^d, we define a^{⊗k} = a ⊗ a ⊗ ··· ⊗ a ∈ R^{d^k}, where ⊗ represents the Kronecker product. The following lemma proves that the Euclidean norm of the Kronecker product a^{⊗k} is the Euclidean norm of the vector a raised to the kth power.
Lemma 1. Let d ∈ N_+ and a ∈ R^d. Then, for any k ∈ N_+, ‖a^{⊗k}‖₂ = ‖a‖₂^k.

Proof. We prove the statement by induction over k ∈ N_+. For the base case k = 1, ‖a^{⊗1}‖₂ = ‖a‖₂ = ‖a‖₂^1, as desired. For the inductive step, we show the statement holds for k + 1: ‖a^{⊗(k+1)}‖₂² = ‖a ⊗ a^{⊗k}‖₂² = ‖a‖₂² ‖a^{⊗k}‖₂². Here, the inductive hypothesis ‖a^{⊗k}‖₂ = ‖a‖₂^k implies that ‖a^{⊗k}‖₂² = ‖a‖₂^{2k}, and thus ‖a^{⊗(k+1)}‖₂² = ‖a‖₂² ‖a‖₂^{2k} = ‖a‖₂^{2(k+1)}. This implies that ‖a^{⊗(k+1)}‖₂ = ‖a‖₂^{k+1}, which completes the inductive step. □
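Lemma 1 admits a quick numerical sanity check with numpy's `kron` (an illustration we add here, not part of the original development):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=4)

# Build a^{(x)3} = a (x) a (x) a via repeated Kronecker products.
kron_power = a
for _ in range(2):
    kron_power = np.kron(kron_power, a)

lhs = np.linalg.norm(kron_power)   # ||a^{(x)3}||_2
rhs = np.linalg.norm(a) ** 3       # ||a||_2^3
```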

Understanding ICT as regularizer on higher-order derivatives
Using the real analyticity and the Kronecker product, we show how ICT can act as a regularizer on higher-order derivatives. More concretely, Theorem 1 states that for any K ∈ N_+, the difference penalized by the ICT loss can be written as

f_θ(Mix_λ̄(u, u′)) − Mix_λ̄(f_θ(u), f_θ(u′)) = Σ_{k=2}^{K} ((λ̄^k − λ̄)/k!) vec(∂^k f_θ(u′))^⊤ (u − u′)^{⊗k} + R_K,

where the remainder satisfies R_K = O(‖u − u′‖₂^{K+1}) → 0 as K → ∞ if we normalize the input so that ‖u − u′‖₂ < 1. Here, Theorem 1 holds for any u, u′ ∈ R^d: for example, we can replace u by any u_i or x_i, where x_i represents the input part of the labeled dataset ((x_i, y_i))_{i=1}^n.
Proof. We define the function ϕ by ϕ(a) = f_θ(u′ + a(u − u′)). Since f_θ is real analytic and a ↦ u′ + a(u − u′) is real analytic, their composition ϕ is also real analytic. We first observe that

ϕ(λ̄) = f_θ(Mix_λ̄(u, u′)), ϕ(1) = f_θ(u), ϕ(0) = f_θ(u′). (8)

Using Eq. (8),

f_θ(Mix_λ̄(u, u′)) − Mix_λ̄(f_θ(u), f_θ(u′)) = ϕ(λ̄) − λ̄ ϕ(1) − (1 − λ̄) ϕ(0). (9)

Using Taylor's theorem with the Cauchy remainder, we have the following: for any K ∈ N_+, there exists ζ ∈ [0, λ̄] such that

ϕ(λ̄) = Σ_{k=0}^{K} (λ̄^k/k!) ϕ^{(k)}(0) + ((λ̄ − ζ)^K λ̄ / K!) ϕ^{(K+1)}(ζ). (10)

On the other hand, using Taylor's theorem with the Cauchy remainder again, for any K ∈ N_+ there exists ζ′ ∈ [0, 1] such that

ϕ(1) = Σ_{k=0}^{K} ϕ^{(k)}(0)/k! + ((1 − ζ′)^K / K!) ϕ^{(K+1)}(ζ′). (11)

Plugging Eqs. (10) and (11) into Eq. (9), the k = 0 terms cancel since ϕ(0) − λ̄ ϕ(0) − (1 − λ̄) ϕ(0) = 0, yielding

ϕ(λ̄) − λ̄ ϕ(1) − (1 − λ̄) ϕ(0) = Σ_{k=1}^{K} ((λ̄^k − λ̄)/k!) ϕ^{(k)}(0) + remainder terms.

Since λ̄^k − λ̄ = 0 when k = 1, this implies that

ϕ(λ̄) − λ̄ ϕ(1) − (1 − λ̄) ϕ(0) = Σ_{k=2}^{K} ((λ̄^k − λ̄)/k!) ϕ^{(k)}(0) + remainder terms. (12)

We now derive the formula of ϕ^{(k)}(a). By the chain rule, with Δ = u − u′ and b = u′ + aΔ, we have

ϕ^{(1)}(a) = Σ_{t₁=1}^{d} (∂f_θ(b)/∂b_{t₁}) Δ_{t₁}.

Based on this process, we consider the following formula of ϕ^{(k)}(a) as a candidate to be proven by induction over k ∈ N_+:

ϕ^{(k)}(a) = Σ_{t₁=1}^{d} ··· Σ_{t_k=1}^{d} (∂^k f_θ(b)/∂b_{t₁} ··· ∂b_{t_k}) Δ_{t₁} ··· Δ_{t_k}. (13)

For the base case, we have already shown that ϕ^{(1)}(a) = Σ_{t₁=1}^{d} (∂f_θ(b)/∂b_{t₁}) Δ_{t₁}, as desired. For the inductive step, differentiating the inductive hypothesis once more in a produces the same formula with k + 1 indices, as desired. Therefore, we have proven Eq. (13) for any k ∈ N_+. Then, by using the vectorization vec(·) of the tensor, we can rewrite Eq. (13) as

ϕ^{(k)}(a) = vec(∂^k f_θ(b))^⊤ Δ^{⊗k}, (14)

where Δ^{⊗k} = Δ ⊗ Δ ⊗ ··· ⊗ Δ ∈ R^{d^k} and Δ = u − u′. Combining Eqs. (12) and (14) at a = 0 (so that b = u′) gives the expansion in the theorem statement. Finally, by the Cauchy-Schwarz inequality,

|vec(∂^k f_θ(u′))^⊤ Δ^{⊗k}| ≤ ‖vec(∂^k f_θ(u′))‖₂ ‖Δ^{⊗k}‖₂ = ‖vec(∂^k f_θ(u′))‖₂ ‖u − u′‖₂^k,

where the last equality follows from Lemma 1. □
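The expansion underlying the proof can be checked numerically in one dimension, e.g., for the real analytic function f = exp, for which ϕ^{(k)}(0) = exp(u′) Δ^k. This sanity check is our own addition, not part of the original development; for an entire function like exp the series converges without the remainder term.

```python
import math

u, u_prime, lam = 0.3, -0.4, 0.35
delta = u - u_prime
f = math.exp

mix = lambda l, a, b: l * a + (1 - l) * b

# Left side: the difference penalized by the ICT consistency term.
lhs = f(mix(lam, u, u_prime)) - mix(lam, f(u), f(u_prime))

# Right side: sum_{k>=2} (lam^k - lam)/k! * phi^{(k)}(0), where
# phi(a) = exp(u' + a*delta) gives phi^{(k)}(0) = exp(u') * delta**k.
rhs = sum((lam**k - lam) / math.factorial(k) * math.exp(u_prime) * delta**k
          for k in range(2, 40))
```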
Theorem 1 shows that ICT acts as a regularizer on derivatives of all orders when the confidence value (1/n) Σ_{i=1}^{n} |1/2 − f_θ(u_i)| of the predictions on the unlabeled points (u_i)_{i=1}^n is high.¹ This is because increasing confidence in ICT tends to decrease the norm of the first-order derivatives, due to the following observation. By the chain rule, the first-order derivatives can be written as ∂f_θ(u) = ∂σ(h_θ(u)) ∂h_θ(u).

¹ Here, the confidence value at an unlabeled point u_i is defined as |1/2 − f_θ(u_i)|, because f_θ(u_i) ∈ (0, 1) is the output of the sigmoid function at the last layer of a deep neural network. With f_θ(u_i) = 1/2, we have the minimum confidence, where the confidence value |1/2 − f_θ(u_i)| is zero. The confidence value is maximized towards 0.5 as f_θ(u_i) → 0 or 1.

Theorem 1, along with this observation, suggests that ICT works well when the confidence on unlabeled data is high (which can be enforced by using pseudo-labels), because that is when ICT acts as a regularizer on derivatives of all orders.
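The link between confidence and the first-order derivative factor follows from the identity σ′(a) = σ(a)(1 − σ(a)) = 1/4 − (σ(a) − 1/2)², so the sigmoid's derivative shrinks exactly as the confidence |1/2 − σ(a)| grows. A quick numerical check (our own illustration):

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def sigmoid_grad(a):
    p = sigmoid(a)
    return p * (1.0 - p)    # derivative of the sigmoid at a

a = 3.0
p = sigmoid(a)
confidence = abs(0.5 - p)                 # confidence value |1/2 - sigma(a)|
grad = sigmoid_grad(a)                    # equals 1/4 - confidence^2
```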
To confirm this theoretical prediction, we conducted numerical simulations with the ''two moons'' dataset by intentionally decreasing confidence values on the unlabeled data points. Fig. 3 shows the results of this experiment in the high-confidence case and the low-confidence case. Here, the confidence value is defined as (1/n) Σ_{i=1}^{n} |1/2 − f_θ(u_i)| for the unlabeled dataset (u_i)_{i=1}^n, and it is intentionally reduced in Fig. 3(b). We used the same settings as in Fig. 1, except that we use the softplus activation function φ(z) = ln(1 + exp(κz))/κ with κ = 100 and we did not use mixup for the supervised loss, in order to isolate the essential mechanism of ICT (whereas we used mixup for the supervised loss in Fig. 1). As can be seen in Fig. 3, the numerical results are consistent with our theoretical prediction. Qualitatively the same behaviors were observed with the ReLU activation, as shown in Appendix B.3.

On overfitting
In the previous subsection, we showed that ICT acts as a regularizer on the derivatives of all orders at unlabeled data points. In this subsection, we provide a theoretical explanation of how regularizing the derivatives of all orders at unlabeled points helps reduce overfitting at labeled points.
We first recall an important lemma, Lemma 2, which bounds the possible degree of overfitting at labeled points via the Rademacher complexity of a hypothesis space (Bartlett & Mendelson, 2002; Mohri et al., 2012). Since our notion of a hypothesis space differs from standard ones, we include a proof for the sake of completeness; the proof is based on an argument in Bartlett and Mendelson (2002). A key observation in Lemma 2 is that we can define a hypothesis space based on the unlabeled dataset Ŝ, which is used later to relate the regularization at unlabeled points to the degree of overfitting at labeled points.

Lemma 2. Let F_Ŝ be a set of maps x ↦ f(x) that depends on an unlabeled dataset Ŝ. Let q ↦ ℓ(q, y) be a C-uniformly bounded function for any q ∈ {f(x) : f ∈ F_Ŝ, x ∈ X} and y ∈ Y. Then, for any δ > 0, with probability at least 1 − δ over an i.i.d. draw of m samples S = ((x_i, y_i))_{i=1}^m, the following holds for all maps f ∈ F_Ŝ:

E_{x,y}[ℓ(f(x), y)] ≤ (1/m) Σ_{i=1}^{m} ℓ(f(x_i), y_i) + 2 R_m(ℓ ∘ F_Ŝ) + C √(ln(1/δ)/(2m)).

Proof. Define ϕ(S) = sup_{f∈F_Ŝ} ( E_{x,y}[ℓ(f(x), y)] − (1/m) Σ_{i=1}^{m} ℓ(f(x_i), y_i) ). To apply McDiarmid's inequality to ϕ(S), we compute an upper bound on |ϕ(S) − ϕ(S′)|, where S and S′ are two labeled datasets differing by exactly one point of an arbitrary index i₀; i.e., S_i = S′_i for all i ≠ i₀ and S_{i₀} ≠ S′_{i₀}. Then,

ϕ(S′) − ϕ(S) ≤ sup_{f∈F_Ŝ} (1/m) ( ℓ(f(x_{i₀}), y_{i₀}) − ℓ(f(x′_{i₀}), y′_{i₀}) ) ≤ C/m,

where we used the fact that Ŝ and S are independent. Similarly, ϕ(S) − ϕ(S′) ≤ C/m. Notice that these steps fail if F_Ŝ depends on S.
Thus, by McDiarmid's inequality, for any δ > 0, with probability at least 1 − δ,

ϕ(S) ≤ E_S[ϕ(S)] + C √(ln(1/δ)/(2m)).

Moreover, E_S[ϕ(S)] ≤ 2 R_m(ℓ ∘ F_Ŝ) by a standard symmetrization argument, where the first step follows from the definitions of each term, the second step uses Jensen's inequality and the convexity of the supremum function, and the remaining steps introduce the Rademacher variables for each coordinate.

Whereas Lemma 2 is applicable to a general class of loss functions, we now recall a concrete version of Lemma 2 for binary classification. We can write the standard 0-1 loss (i.e., the classification error) for binary classification as

ℓ_{01}(f(x), y) = 1{ y(2f(x) − 1) ≤ 0 }, with y ∈ {−1, +1}.

For any y ∈ {−1, +1}, we define the margin loss ℓ_ρ(f(x), y) = ℓ_ρ(y(2f(x) − 1)) as follows: ℓ_ρ(a) = 1 if a ≤ 0; ℓ_ρ(a) = 1 − a/ρ if 0 < a ≤ ρ; and ℓ_ρ(a) = 0 if a > ρ. Note that for any ρ > 0, the margin loss is an upper bound on the 0-1 loss: i.e., ℓ_ρ(f(x), y) ≥ ℓ_{01}(f(x), y). We can instantiate Lemma 2 for this concrete choice of loss functions by using the arguments in Mohri et al. (2012):

Lemma 3. Let F_Ŝ be a set of maps x ↦ f(x) that depends on an unlabeled dataset Ŝ. Fix ρ > 0. Then, for any δ > 0, with probability at least 1 − δ over an i.i.d. draw of m samples ((x_i, y_i))_{i=1}^m, each of the following holds for all maps f ∈ F_Ŝ:

E_{x,y}[ℓ_{01}(f(x), y)] ≤ (1/m) Σ_{i=1}^{m} ℓ_ρ(f(x_i), y_i) + (4/ρ) R_m(F_Ŝ) + √(ln(1/δ)/(2m)),

and

E_{x,y}[ℓ_{01}(f(x), y)] ≤ (1/m) Σ_{i=1}^{m} ℓ_ρ(f(x_i), y_i) + (4/ρ) R̂_m(F_Ŝ) + 3 √(ln(2/δ)/(2m)).

Proof. By combining Lemma 2 and the facts that ℓ_ρ(f(x), y) ≥ ℓ_{01}(f(x), y) and ℓ_ρ(f(x), y) ≤ 1, we have that for any δ > 0, with probability at least 1 − δ,

E_{x,y}[ℓ_{01}(f(x), y)] ≤ (1/m) Σ_{i=1}^{m} ℓ_ρ(f(x_i), y_i) + 2 R_m(ℓ_ρ ∘ F_Ŝ) + √(ln(1/δ)/(2m)).

Using Lemma 5 in Appendix A and the fact that the map a ↦ ℓ_ρ(y(2a − 1)) is (2/ρ)-Lipschitz, we obtain 2 R_m(ℓ_ρ ∘ F_Ŝ) ≤ (4/ρ) R_m(F_Ŝ).
This proves the first statement. For the second statement, we replace δ by δ/2, use Lemma 4 in Appendix A, and take a union bound, yielding that the bound holds with probability at least 1 − δ/2 − δ/2 = 1 − δ. This proves the second statement. □
Given these lemmas, we are ready to investigate how the overfitting at labeled points can be mitigated by using the hypothesis space F_Ŝ,τ with the regularization on the derivatives of all orders at unlabeled points, where τ = {τ_S̄ ∈ R : S̄ ⊆ Ŝ} and τ_S̄ measures the norm of the derivatives of all orders at unlabeled points. For each S̄ ⊆ Ŝ, we write S̄ = (u_{S̄,1}, …, u_{S̄,|S̄|}). We define R_{S,S̄} = max_{x∈S} min_{u∈S̄} ‖x − u‖₂. The following theorem shows that regularizing τ_S̄, the norm of the derivatives of all orders at unlabeled points, can help reduce overfitting at labeled points:

Theorem 2. Fix F_Ŝ,τ and ρ > 0. Then, for any δ > 0, with probability at least 1 − δ over an i.i.d. draw of m samples ((x_i, y_i))_{i=1}^m, each of the following holds for all maps f ∈ F_Ŝ,τ.

Proof. Let S̄ ⊆ Ŝ be arbitrary such that R_{S,S̄} < 1. Using Lemma 1 and the Cauchy–Schwarz inequality, we obtain a bound that holds for any (f, u) ∈ F_Ŝ,τ × S̄, and therefore, combining the above inequalities, for any (f, u). Let t̄(x) = argmin_{t∈{1,…,|S̄|}} ‖x − u_{S̄,t}‖₂. Then,

Thus, Eq. (18) implies that
where the series converges by Eq. (18), since min_{u∈S̄} ‖x − u‖₂ ≤ R_{S,S̄} < 1. Using the definition of the function class F_Ŝ,τ, we obtain the next bound, where the last line follows from the linearity of expectation. By using Jensen's inequality for the concave function, where the last line follows from the fact that the Rademacher variables ξ₁, …, ξ_m are independent. Similarly, by using Jensen's inequality for the concave function, where the last line again follows from the independence of the Rademacher variables ξ₁, …, ξ_m. Since E_ξ[ξ_i²] = 1, we can use Lemma 1. Combining inequalities (19)–(21) yields the desired bound. Since R_{S,S̄} < 1, the geometric series converges as Σ_{k=0}^∞ R_{S,S̄}^k = 1/(1 − R_{S,S̄}). Since S̄ ⊆ Ŝ was arbitrary such that R_{S,S̄} < 1, this inequality holds for any S̄ with R_{S,S̄} < 1, yielding that it holds for any S = ((x₁, y₁), …, (x_m, y_m)). By combining Lemma 3 and inequality (22) and taking the expectation over S, we obtain the statement of this theorem.
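The step in the proof using E_ξ[ξ_i²] = 1 and the independence of the Rademacher variables amounts to E_ξ[(Σ_i ξ_i)²] = m (all cross terms E[ξ_i ξ_j] vanish), which by Jensen gives E_ξ|Σ_i ξ_i| ≤ √m. This can be verified by exact enumeration; an illustrative check, not part of the proof:

```python
from itertools import product

m = 10

# Enumerate all 2^m Rademacher sign patterns (each xi_i uniform on {-1, +1})
# and compute E[(sum_i xi_i)^2] exactly.
total_sq = sum(sum(xi) ** 2 for xi in product((-1, 1), repeat=m))
expected_sq = total_sq / 2 ** m

# Cross terms cancel, so E[(sum_i xi_i)^2] = sum_i E[xi_i^2] = m.
print(expected_sq)  # -> 10.0
```

By Jensen's inequality for the concave square root, E|Σ ξ_i| ≤ √(E[(Σ ξ_i)²]) = √m, which is the form of the bound used above.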
Since min_{S̄∈S_{S,Ŝ}} g(S̄) ≤ g(S̄′) for any S̄′ ∈ S_{S,Ŝ} and any function g, Theorem 2 implies that, with probability at least 1 − δ, the bound holds for any S̄ ∈ S_{S,Ŝ}. To further understand Theorem 2, let us consider a simple case where the input space is normalized such that the maximum distance between some unlabeled point u_c ∈ Ŝ and the labeled points S is bounded as max_{x∈S} ‖x − u_c‖₂ < 1/2. Then, by setting S̄ = {u_c}, Theorem 2 implies that with probability at least 1 − δ,

where C_S̄ measures the confidence values at unlabeled points. Thus, Theorem 3 also shows the benefit of increasing confidence values at unlabeled points, which is consistent with our observation in Fig. 3.
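The confidence value at an unlabeled point, as used throughout, is the distance of the binary prediction from 1/2. A minimal sketch; the helper names `confidence` and `c_bar` are ours, and aggregating by the minimum over the set is our illustrative choice:

```python
def confidence(f_u: float) -> float:
    """Confidence |1/2 - f(u)| of a binary predictor f(u) in [0, 1]:
    zero at f(u) = 1/2, approaching the maximum 1/2 as f(u) -> 0 or 1."""
    return abs(0.5 - f_u)

def c_bar(preds):
    """A C_S_bar-style summary: the smallest confidence over a set of
    predictions at unlabeled points (illustrative aggregate)."""
    return min(confidence(p) for p in preds)

assert confidence(0.5) == 0.0   # minimum confidence at the decision threshold
assert confidence(1.0) == 0.5   # maximal confidence at a saturated prediction
print(c_bar([0.9, 0.1, 0.75]))  # -> 0.25, dominated by the least confident point
```

Under this reading, increasing confidence at unlabeled points raises C_S̄, which loosens the constraint τ_S̄ < C_S̄(1 − R_{S,S̄})/R_{S,S̄} appearing below.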
We can write the 0-1 loss of the classification as ℓ_01(f(x), y) = 1{y(2f(x) − 1) ≤ 0}. By using Lemma 2 with the 0-1 loss, we have that for any δ > 0, with probability at least 1 − δ, a bound holds for all f ∈ F_Ŝ,τ. By using Lemma 4 and taking a union bound, we obtain a corresponding bound. Using Eq. (24), let S̄ ⊆ Ŝ be arbitrary such that R_{S,S̄} < 1 and τ_S̄ < C_S̄(1 − R_{S,S̄})/R_{S,S̄}. Then, for any f ∈ F_Ŝ,τ, the proof of Theorem 2 shows that we can write a bound whose last line follows from the condition on S̄ that τ_S̄ < C_S̄(1 − R_{S,S̄})/R_{S,S̄} (since τ_S̄ is defined independently of S, this condition does not necessarily hold for some S̄ ⊆ Ŝ; if it does not hold for any S̄ ⊆ Ŝ, then the statement holds vacuously with the convention that min_{a∈∅} Ψ(a) = ∞ for any function Ψ). Since |f(u_{S̄,t̄(x)}) − 1/2| ≥ C_S̄, by using Jensen's inequality for the concave function, and since S̄ ⊆ Ŝ was arbitrary such that R_{S,S̄} < 1 and τ_S̄ < C_S̄(1 − R_{S,S̄})/R_{S,S̄}, this inequality holds for any S̄ satisfying these two conditions. Taking the expectation and combining with Eqs. (25)–(26) yields that for any δ > 0, with probability at least 1 − δ, each of the following holds for all f ∈ F_Ŝ,τ.
Since min_{S̄∈S*_{S,Ŝ}} g(S̄) ≤ g(S̄′) for any S̄′ ∈ S*_{S,Ŝ} and any function g, Theorem 3 implies that, with probability at least 1 − δ, the bound holds for any S̄ ∈ S*_{S,Ŝ}; in particular, |S̄| = 1 and |I_{S,S̄,t}| = m for the singleton set S̄ = {u_c}.
Therefore, if we increase the confidence at unlabeled points and regularize the norm of the derivatives of all orders at unlabeled points, we can reduce overfitting at labeled points: i.e., the classification error (1/m) Σ_{i=1}^m ℓ_01(f(x_i), y_i) at labeled points approaches E_{x,y}[ℓ_01(f(x), y)] at the rate of O(√(ln(1/δ)/m)). Finally, we remark that our theory reflects the fact that the prediction at unlabeled points could be wrong even when the confidence is high. Indeed, in all of our experiments and theoretical analyses, we do not assume that the pseudo-labels at unlabeled points are correct or of good quality. Instead, our proofs analyze the possible cases of bad-quality pseudo-labels and show the following: a prediction of ICT at an unlabeled point is likely correct if it has high confidence at the unlabeled point and if the unlabeled point is near a correctly classified labeled point. Intuitively, this is because when the derivatives of all orders are regularized, the prediction cannot change much with respect to the input. This is one of the ways in which labeled and unlabeled points interact in the new proof techniques of this paper. We expect these proof techniques to prove useful in future studies of semi-supervised learning.
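The O(√(ln(1/δ)/m)) rate above can be made concrete with a quick numerical check; a hypothetical illustration in which the constant factors of the actual bound are omitted:

```python
import math

def rate_term(m: int, delta: float = 0.05) -> float:
    """The sqrt(ln(1/delta)/m) term governing how fast the empirical
    classification error approaches its expectation (constants omitted)."""
    return math.sqrt(math.log(1 / delta) / m)

# Quadrupling the number of labeled samples m halves the deviation term.
print(rate_term(100))   # m = 100
print(rate_term(400))   # m = 400: exactly half of the value at m = 100
assert abs(rate_term(400) - rate_term(100) / 2) < 1e-12
```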
In Theorems 2 and 3, we can see that if all unlabeled points are too far away from the labeled points, then the bounds degrade linearly in the distance between the labeled and unlabeled points. This is because the prediction is more likely to be inaccurate as this distance increases. Here, by normalizing the input space, we can guarantee that this distance is sufficiently small. As an example, the discussion after Theorem 2 covers the case where the input space is normalized such that the maximum distance between a center unlabeled point u_c ∈ Ŝ and the labeled points S is bounded as max_{x∈S} ‖x − u_c‖₂ < 1/2. More importantly, the distance between the labeled and unlabeled points used in Theorems 2 and 3 tends to decrease as we increase only the number of unlabeled points. This is because the distance is defined by R_{S,Ŝ} = max_{x∈S} min_{u∈Ŝ} ‖x − u‖₂, with the minimum taken over all the unlabeled points u ∈ Ŝ, for example, with S̄ = Ŝ. Therefore, the quality of pseudo-labels at unlabeled points can improve by increasing only the number of unlabeled points. This theoretical prediction is consistent with our main experimental results in Section 3 and is further validated via the additional experiments in Appendix B.2.
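The claim that R_{S,Ŝ} tends to decrease as unlabeled points are added can be checked directly: enlarging Ŝ can only shrink each inner minimum. A sketch with randomly drawn points; all names and data here are illustrative:

```python
import math
import random

random.seed(0)

def max_min_distance(S, S_bar):
    """R_{S, S_bar} = max over labeled x in S of the Euclidean distance
    to its nearest unlabeled point u in S_bar."""
    return max(min(math.dist(x, u) for u in S_bar) for x in S)

S = [(random.random(), random.random()) for _ in range(5)]    # labeled points
U = [(random.random(), random.random()) for _ in range(200)]  # unlabeled pool

# Growing the unlabeled set over nested prefixes can only shrink R_{S, S_hat},
# since each added point can only decrease the minimum distance from any x.
radii = [max_min_distance(S, U[:n]) for n in (10, 50, 200)]
assert radii[0] >= radii[1] >= radii[2]
print(radii)
```

Note the monotonicity here is deterministic because the three unlabeled sets are nested; across independent draws it holds only in tendency, matching the "tends to decrease" phrasing above.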

Related work
This work builds on two threads of research: consistency-regularization for semi-supervised learning and interpolation-based regularizers.
Consistency-regularization methods for semi-supervised learning (Athiwaratkun et al., 2019; Laine & Aila, 2017; Luo et al., 2018; Miyato et al., 2018; Sajjadi et al., 2016; Tarvainen & Valpola, 2017) encourage realistic perturbations u + δ of unlabeled samples u not to change the model predictions f_θ(u). Motivated by the low-density separation assumption (Chapelle et al., 2010), these methods push the decision boundary towards low-density regions of the input space, achieving larger classification margins. ICT differs from these approaches in two aspects. First, ICT chooses perturbations in the direction of another randomly chosen unlabeled sample, avoiding expensive gradient computations.
Second, when interpolating between distant points, the regularization effect of ICT applies to larger regions of the input space.
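As a minimal illustration of the consistency term, a one-dimensional sketch with stand-in predictors; the paper applies the same idea to deep networks, and the helper names here are ours:

```python
import random

random.seed(0)

def ict_consistency_loss(f, u_i, u_j, lam):
    """Squared gap between the prediction at the interpolated input and the
    interpolation of the predictions: the ICT consistency term."""
    mixed_input = lam * u_i + (1 - lam) * u_j
    mixed_pred = lam * f(u_i) + (1 - lam) * f(u_j)
    return (f(mixed_input) - mixed_pred) ** 2

linear = lambda u: 0.3 * u + 0.1  # a linear predictor is interpolation-consistent
curved = lambda u: u ** 2         # a curved predictor pays a nonzero penalty

lam = random.betavariate(1.0, 1.0)  # random mixing coefficient, as in mixup
assert ict_consistency_loss(linear, 0.2, 0.9, lam) < 1e-12
assert ict_consistency_loss(curved, 0.2, 0.9, 0.5) > 0.0
```

The gradient-free character noted above is visible here: the "perturbation" of u_i is simply a step towards another unlabeled sample u_j, requiring only forward passes.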
Interpolation-based regularizers (Tokozume et al., 2018; Verma et al., 2018; Zhang et al., 2018) have recently been proposed for supervised learning, achieving state-of-the-art performance across a variety of tasks and network architectures. While Tokozume et al. (2018) and Zhang et al. (2018) proposed to perform interpolations in the input space, Verma et al. (2018) proposed to perform interpolations also in the hidden-space representations. Furthermore, in the unsupervised learning setting,  proposes to measure the realism of latent-space interpolations from an autoencoder to improve its training.
We note that, after the publication of an earlier version of this paper (Verma et al., 2019), the methods in Berthelot et al. (2020), Berthelot, Carlini et al. (2019) and Sohn et al. (2020) have achieved state-of-the-art experimental results on benchmark datasets and architectures. Similarly to Verma et al. (2019), Berthelot et al. (2020) and Berthelot, Carlini et al. (2019) use interpolations of samples and of their predicted targets to design semi-supervised objective functions. Sohn et al. (2020) combines consistency regularization with pseudo-labeling (Lee, 2013).
Other works have approached semi-supervised learning from the perspective of generative models. Some have approached this from a consistency point of view, such as Lecouat et al. (2018), who proposed to encourage smooth changes to the predictions along the data manifold estimated by the generative model (trained on both labeled and unlabeled samples). Others have used the discriminator from a trained generative adversarial network (Goodfellow et al., 2014) as a way of extracting features for a purely supervised model (Radford et al., 2015). Still, others have used trained inference models as a way of extracting features (Dumoulin et al., 2016).
Our main objective in this paper is to improve generalization rather than causality invariance. While we mathematically analyzed the generalization properties of consistency-regularization with mixup, Han et al. (2021) recently discussed that, intuitively, consistency-regularization in general might also promote causality invariance. To understand this, consider a causality invariance such that the causal relation from an input to a class label is not destroyed under some input perturbation or augmentation. Then, intuitively, consistency-regularization with this particular input perturbation or augmentation might approximately promote the causality invariance. It would be interesting to mathematically formalize this intuition for ICT in future work, so as to quantify the degree of causality invariance that ICT can promote.

Conclusion
Machine learning is having a transformative impact on diverse areas, yet its application is often limited by the amount of available labeled data. Progress in semi-supervised learning techniques holds promise for those applications where labels are expensive to obtain. In this paper, we have proposed a simple but efficient semi-supervised learning algorithm, Interpolation Consistency Training (ICT), which has two advantages over previous approaches to semi-supervised learning. First, it uses almost no additional computation, as opposed to computing adversarial perturbations or training generative models. Second, it outperforms strong baselines on two benchmark datasets, even without extensive hyperparameter tuning. Finally, we have shown how ICT can act as a regularizer on the derivatives of all orders and reduce overfitting when confidence values on unlabeled data are high, which can be achieved by additionally using pseudo-labels on the unlabeled data. Our theoretical results predict a failure mode of ICT with low confidence values (without using pseudo-labels), which was confirmed in the experiments, providing practical guidance to use ICT with high confidence values. As for future work, extending ICT to interpolations not only of the inputs but also of hidden representations (Verma et al., 2018) could improve the performance even further.

Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

B.1. The use of the KL divergence

Each line in each subplot shows the decision boundary of the predictor f_θ (i.e., {u : f_θ(u) = 1/2}) after 1, 10, 100, and 1000 updates. As we can see in the figures, with the KL divergence (instead of the MSE loss), ICT still works as expected, and the numerical results remain consistent with our theoretical prediction regarding the importance of the confidence value. Given these preliminary results, it might be interesting to study the use of the KL divergence in ICT more comprehensively in future work.

B.2. The effect of few data points
As ICT is designed for semi-supervised learning rather than one-shot learning, we now examine how ICT could fail with too few data points in Fig. 6. Each line in each subplot shows the decision boundary of the predictor f_θ (i.e., {u : f_θ(u) = 1/2}) after 1, 10, 100, 1000, and 2000 updates. The experiments were conducted with the softplus activation function φ(z) = ln(1 + exp(κz))/κ with κ = 100.
As can be seen in the figure, if we have only one labeled point per class (Fig. 6a) or if unlabeled points are too few (Fig. 6b), then ICT does not learn the correct decision boundary. However, by increasing only the number of unlabeled points from 10 (Fig. 6b) to 100 (Fig. 6c), ICT starts learning a good decision boundary (Fig. 6c). The decision boundary is further refined by increasing only the number of unlabeled data points in Figs. 1 and 3 in the main text. Along with our result in Table 2, where ICT performed well with only 250 labeled points for SVHN, these observations further demonstrate that ICT works well with few labeled data points (but not too few, such as one per class) as long as we can increase the number of unlabeled data points. This is consistent with our theoretical prediction.

B.3. The ReLU activation in Fig. 3

In Fig. 3, we used the softplus activation function φ(z) = ln(1 + exp(κz))/κ with κ = 100. As an additional experiment, we replaced the softplus activation with the ReLU activation and report the results in Fig. 7. As can be seen in the figure, the numerical results with the ReLU activation are also consistent with our theoretical prediction. This is expected because the ReLU function can be approximated arbitrarily well by the softplus activation function by varying the value of κ ∈ (0, ∞), and our theoretical results hold for any κ ∈ (0, ∞).
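The approximation claim in the last paragraph can be checked numerically; an illustrative sketch:

```python
import math

def softplus(z: float, kappa: float) -> float:
    """Softplus phi(z) = ln(1 + exp(kappa * z)) / kappa, as in the appendix."""
    return math.log1p(math.exp(kappa * z)) / kappa

def relu(z: float) -> float:
    return max(0.0, z)

# As kappa grows, the softplus approaches the ReLU pointwise: at every test
# point the gap at kappa = 100 is no larger than the gap at kappa = 1.
for z in (-1.0, -0.1, 0.0, 0.1, 1.0):
    gap_small_kappa = abs(softplus(z, kappa=1.0) - relu(z))
    gap_large_kappa = abs(softplus(z, kappa=100.0) - relu(z))
    assert gap_large_kappa <= gap_small_kappa

print(abs(softplus(1.0, kappa=100.0) - relu(1.0)))  # negligible at kappa = 100
```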